Skip to main content
NIHPA Author Manuscripts logoLink to NIHPA Author Manuscripts
. Author manuscript; available in PMC: 2022 Sep 26.
Published in final edited form as: Adv Neural Inf Process Syst. 2021 Dec;34:16671–16685.

Sample-Efficient Reinforcement Learning for Linearly-Parameterized MDPs with a Generative Model

Bingyan Wang †,*, Yuling Yan †,*, Jianqing Fan
PMCID: PMC9512142  NIHMSID: NIHMS1782585  PMID: 36168331

Abstract

The curse of dimensionality is a widely known issue in reinforcement learning (RL). In the tabular setting where the state space S and the action space A are both finite, to obtain a nearly optimal policy with sampling access to a generative model, the minimax optimal sample complexity scales linearly with |S|×|A|, which can be prohibitively large when S or A is large. This paper considers a Markov decision process (MDP) that admits a set of state-action features, which can linearly express (or approximate) its probability transition kernel. We show that a model-based approach (resp. Q-learning) provably learns an ε-optimal policy (resp. Q-function) with high probability as soon as the sample size exceeds the order of K(1γ)3ε2(resp.K(1γ)4ε2), up to some logarithmic factor. Here K is the feature dimension and γ ∈ (0, 1) is the discount factor of the MDP. Both sample complexity bounds are provably tight, and our result for the model-based approach matches the minimax lower bound. Our results show that for arbitrarily large-scale MDP, both the model-based approach and Q-learning are sample-efficient when K is relatively small, and hence the title of this paper.

Keywords: model-based reinforcement learning, vanilla Q-learning, linear transition model, sample complexity, leave-one-out analysis

1. Introduction

Reinforcement learning (RL) studies the problem of learning and decision making in a Markov decision process (MDP). Recent years have seen exciting progress in applications of RL in real world decision-making problems such as AlphaGo [SHM+16, SSS+17] and autonomous driving [KST+21]. Specifically, the goal of RL is to search for an optimal policy that maximizes the cumulative reward, based on sequential noisy data. There are two popular approaches to RL: model-based and model-free ones.

  • The model-based approaches start with formulating an empirical MDP by learning the probability transition model from the collected data samples, and then estimating the optimal policy / value function based on the empirical MDP.

  • The model-free approaches (e.g. Q-learning) learn the optimal policy or the optimal (action-)value function from samples. As its name suggests, model-free approaches do not attempt to learn the model explicitly.

Generally speaking, model-based approaches enjoy great flexibility since after the transition model is learned in the first place, it can then be applied to any other problems without touching the raw data samples. In comparison, model-free methods, due to its online nature, are usually memory-efficient and can interact with the environment and update the estimate on the fly.

This paper is devoted to investigating the sample efficiency of both model-based RL and Q-learning (arguably one of the most commonly adopted model-free RL algorithms). It is well known that MDPs suffer from the curse of dimensionality. For example, in the tabular setting where the state space S and the action space A are both finite, to obtain a near optimal policy or value function given sampling access to a generative model, the minimax optimal sample complexity scales linearly with |S|×|A| [AMK13, AKY20]. However contemporary applications of RL often encounters environments with exceedingly large state and action spaces, whilst the data collection might be expensive or even high-stake. This suggests a large gap between the theoretical findings and practical decision-making problems where |S| and |A| are large or even infinite.

To close the aforementioned theory-practice gap, one natural idea is to impose certain structural assumption on the MDP. In this paper we follow the feature-based linear transition model studied in [YW19], where each state-action pair (s,a)S×A admits a K dimensional feature vector ϕ(s,a)K that expresses the transition dynamics (s,a)=Ψϕ(s,a) for some unknown matrix Ψ|S|×K which is common for all (s, a). This model encompasses both the tabular case and the homogeneous model in which the state space can be partitioned into K equivalent classes. Assuming access to a generative model [Kak03, KS99], under this structural assumption, this paper aims to answer the following two questions:

How many samples are needed for model-based RL and Q-learning to learn an optimal policy under the feature-based linear transition model?

In what follows, we will show that the answer to this question scales linearly with the dimension of the feature space K and is independent of |S| and |A| under the feature-based linear transition model. With the aid of this structural assumption, model-based RL and Q-learning becomes significantly more sample-efficient than that in the tabular setting.

Our contributions.

We focus our attention on an infinite horizon MDP with discount factor γ ∈ (0, 1). We use ε-optimal policy to indicate the policy whose expected discounted cumulative rewards are ε close to the optimal value of the MDP. Our contributions are two-fold:

  • We demonstrate that model-based RL provably learns an ε-optimal policy by performing planning based on an empirical MDP constructed from a total number of
    O˜(K(1γ)3ε2)
    samples, for all ε ∈ (0, (1 − γ)−1/2]. Here O˜() hides logarithmic factors compared to the usual O(·) notation. To the best of our knowledge, this is the first theoretical guarantee for model-based RL under the feature-based linear transition model. This sample complexity bound matches the minimax limit established in [YW19] up to logarithmic factor.
  • We also show that Q-learning provably finds an entrywise ε-optimal Q-function using a total number of
    O˜(K(1γ)4ε2)
    samples, for all ε ∈ (0, 1]. This sample complexity upper bound improves the state-of-the-art result in [YW19] and the dependency on the effective horizon (1 − γ)−4 is sharp in view of [LCC+21a].

These results taken collectively show the minimax optimality of model-based RL and the sub-optimality of Q-learning in sample complexity.

2. Problem formulation

This paper focuses on tabular MDPs in the discounted infinite-horizon setting [B+00]. Here and throughout, Δd1{vd:i=1dvi=1,vi0,i[d]} stands for the d-dimensional probability simplex and [N] ≔ {1, 2, ⋯, N} for any N+.

Discounted infinite-horizon MDPs.

Denote a discounted infinite-horizon MDP by a tuple M=(S,A,P,r,γ), where S={1,,|S|} is a finite set of states, A={1,,|A|} is a finite set of actions, P : S×AΔ|S|1 represents the probability transition kernel where P(s′|s, a) denotes the probability of transiting from state s to state s′ when action a is taken, r : S×A[0,1] denotes the reward function where r(s, a) is the instantaneous reward received when taking action aA while in state sS, and γ ∈ (0, 1) is the discount factor.

Value function and Q-function.

Recall that the goal of RL is to learn a policy that maximizes the cumulative reward, which corresponds to value functions or Q-functions in the corresponding MDP. For a deterministic policy π:SA and a starting state sS, we define the value function as

Vπ(s)E[k=0γkr(sk,ak)s0=s]

for all sS. Here, the trajectory is generated by ak = π(sk) and sk+1 ~ P(sk+1|sk, ak) for every k ≥ 0. This function measures the expected discounted cumulative reward received on the trajectory {(sk, ak)}k≥0 and the expectation is taken with respect to the randomness of the transitions sk+1 ~ P(·|sk, ak) on the trajectory. Recall that the immediate rewards lie in [0, 1], it is easy to derive that 0Vπ(s)11γ for any policy π and state s. Accordingly, we define the Q-function for policy π as

Qπ(s,a)E[k=0γkr(sk,ak)s0=s,a0=a]

for all (s,a)S×A. Here, the actions are chosen by the policy π except for the initial state (i.e. ak = π(sk) for all k ≥ 1). Similar to the value function, we can easily check that 0Qπ(s,a)11γ for any π and (s, a). To maximize the value function or Q function, previous literature [BD59,SB18] establishes that there exists an optimal policy π⋆ which simultaneously maximizes Vπ(s) (resp. Qπ(s, a)) for all sS (resp. (s,a)S×A). We define the optimal value function V⋆ and optimal Q-function Q⋆ respectively as

V(s)maxπVπ(s)=Vπ(s),Q(s,a)maxπQπ(s,a)=Qπ(s,a)

for any state-action pair (s,a)S×A.

Linear transition model.

Given a set of K feature functions ϕ1,ϕ2,,ϕK:S×A, we define ϕ to be a feature mapping from S×A to K such that

ϕ(s,a)=[ϕ1(s,a),,ϕK(s,a)]K.

Then we are ready to define the linear transition model [YW19] as follows.

Definition 1 (Linear transition model).

Given a discounted infinite-horizon MDP M=(S,A,P,r,γ) and a feature mapping ϕ:S×AK, M admits the linear transition model if there exists some (unknown) functions ψ1,,ψK:S, such that

P(ss,a)=k=1Kϕk(s,a)ψk(s) (2.1)

for every (s,a)S×A and sS.

Readers familiar with linear MDP literatures might immediately recognize that the above definition is the same as the structure imposed on the probability transition kernel P in the linear MDP model [YW19,JYWJ20,ZBB+20,HZG21,TV20,WDYS20,WJLJ21]. However unlike linear MDP which also requires the reward function r(s, a) to be linear in the feature mapping ϕ(s, a), here we do not impose any structural assumption on the reward.

Example 1 (Tabular MDP).

Each tabular MDP can be viewed as a linear transition model with feature mapping ϕ(s,a)=e(s,a)|S|×|A| (i.e. the vector with all entries equal to 0 but the one corresponding to (s, a) equals to 1) for all (s,a)S×A. To see this, we can check that Definition 1 is satisfied with K=|S|×|A| and ψ(s,a)(s)=(ss,a) for each s, sS and aA. This example is a sanity check of Definition 1, which also shows that our results (Theorem 1 and 2) can recover previous results on tabular MDP [AKY20,LCC+21a] by taking K=|S|×|A|.

Example 2 (Simplex Feature Space).

If all feature vectors {ϕ(s,a)}(s,a)S×A fall in the probability simplex ΔK−1, a linear transition model can be constructed by taking ψk(·) to be any probability measure over S for all k ∈ [K].

A key observation is that the model size of linear transition model with known feature mapping ϕ is |S|K (the number of coefficients ψk (s′) in (2.1)), which is still large when the state space S is large. In contrast, it will be established later that to learn a near-optimal policy or Q-function, we only need a much smaller number of samples, which depends linearly on K and is independent of |S|.

Next, we introduce a critical assumption employed in prior literature [YW19,ZLKB19,SS20].

Assumption 1 (Anchor state-action pairs).

Assume there exists a set of anchor state-action pairs KS×A with |K|=K 1 such that for any (s,a)S×A, its corresponding feature vector can be expressed as a convex combination of the feature vectors of anchor state-action pairs {(s,a) : (s,a)K}:

ϕ(s,a)=i:(si,ai)Kλi(s,a)ϕ(si,ai)fori=1Kλi(s,a)=1andλi(s,a)0. (2.2)

Further, we assume that the vectors in {ϕ(s,a):(s,a)K} are linearly independent.

We pause to develop some intuition of this assumption using Examples 1 and 2. In Example 1, it is straightforward to check that tabular MDPs satisfies Assumption 1 with K=S×A. In terms of Example 2, without loss of generality we can assume that the subspace spanned by the features has full rank, i.e. span{ϕ(s,a) : (s,a)S×A}=K (otherwise we can reduce the dimension of feature space). Then we can also check that Example 2 satisfies Assumption 1 with arbitrary KS×A such that the vectors in {ϕ(s,a):(s,a)K} are linearly independent. In fact, this sort of “anchor” notion appears widely in the literature: [AGM12] considers “anchor word” in topic modeling; [DS04] defines “separability” in their study of non-negative matrix factorization; [SJJ95] introduces “aggregate” in reinforcement learning; [DKW19] studies “anchor state” in soft state aggregation models. These concepts all bear some kind of resemblance to our definition of anchor state-action pairs here.

Throughout this paper, we assume that the feature mapping ϕ is known, which is a widely adopted assumption in previous literature [YW19,JYWJ20,ZHG21,HZG21,TV20,WDYS20,WJLJ21]. In practice, large scale RL usually makes use of representation learning to obtain the feature mapping ϕ. Furthermore, the learned representations can be selected to satisfy the anchor state-action pairs assumption by design.

A useful implication of Assumption 1 is that we can represent the transition kernel as

P(s,a)=i:(si,ai)Kλi(s,a)P(si,ai), (2.3)

This follows simply from substituting (2.2) into (2.1) (see (A.4) in Appendix A for a formal proof).

3. Model-based RL with a generative model

We start with studying model-based RL with a generative model in this section. We propose a model-based planning algorithm and show that it returns an ε-optimal policy with minimax optimal sample size.

3.1. Main results

A generative model and an empirical MDP.

We assume access to a generative model that provides us with independent samples from M. For each anchor state-action pair (si,ai)K, we collect N independent samples si(j)~P(si,ai),j[N]. This allows us to construct an empirical transition kernel P^ where

P^(ss,a)=i=1Kλi(s,a)(1Nj=1N1{si(j)=s}), (3.1)

for each (s,a)S×A. Here, 1Nj=1N1{si(j)=s} is an empirical estimate of P(s′|si, ai) and then (2.3) is employed. With P^ in hand, we can construct an empirical MDP M^=(S,A,P^,r,γ). Our goal here is to derive the sample complexity which guarantees that the optimal policy of M^ is an ε-optimal policy for the true MDP M. The algorithm is summarized below.

3.1.

Careful readers may note that in Algorithm 1, {λ(s,a):(s,a)S×A} is used in the construction of P^, while {λ(s,a):(s,a)S×A} is not input into the algorithm. This is because given K and ϕ are known, {λ(s,a):(s,a)S×A} can be calculated explicitly. The following theorem provides theoretical guarantees for the output policy π^ of the chosen optimization algorithm on the empirical MDP M^.

Theorem 1.

Suppose that δ > 0 and ε ∈ (0, (1 − γ)−1/2]. Let π^ be the policy returned by Algorithm 1. Assume that

NClog(K/((1γ)δ))(1γ)3ε2 (3.2)

for some sufficiently large constant C > 0. Then with probability exceeding 1 − δ,

Q(s,a)Qπ^(s,a)ε+4εopt1γ, (3.3)

for every (s,a)S×A. Here εopt is the target algorithmic error level in Algorithm 1.

We first remark that the two terms on the right hand side of (3.3) can be viewed as statistical error and algorithmic error, respectively. The first term ε denotes the statistical error coming from the deviation of the empirical MDP M^ from the true MDP M. As the sample size N grows, ε could decrease towards 0. The other term 4εopt/(1 − γ) represents the algorithmic error where εopt is the target accuracy level of the planning algorithm applied to M^. Note that εopt can be arbitrarily small if we run the planning algorithm (e.g. value iteration) for enough iterations. A few implications of this theorem are in order.

  • Minimax-optimal sample complexity. Assume that εopt is made negligibly small, e.g. εopt = O((1 − γ)ε) to be discussed in the next point. Note that we draw N independent samples for each state-action pair (s,a)K, therefore the requirement (3.2) for finding an O(ε)-optimal policy translates into the following sample complexity requirement
    O˜(K(1γ)3ε2).
    This matches the minimax optimal lower bound (up to a logarithm factor) established in [YW19, Theorem 1] for feature-based MDP. In comparison, for tabular MDP the minimax optimal sample complexity is Ω˜((1γ)3ε2|SA|) [AMK13,AKY20]. Our sample complexity scales linearly with K instead of |S||A| for tabular MDP as desired.
  • Computational complexity. An advantage of Theorem 1 is that it incorporates the use of any efficient planning algorithm applied to the empirical MDP M^. Classical algorithms include Q-value iteration (QVI) or policy iteration (PI) [Put14]. For example, QVI achieves the target level εopt in O((1γ)1logεopt1) iterations, and each iteration takes time proportional to O(NK+|SA|K). To learn an O(ε)-optimal policy, which requires sample complexity (3.2) and the target level εopt = O((1 − γ)ε), the overall running time is
    O˜(|S||A|K1γ+K(1γ)4ε2).
    In comparison, for the tabular MDP the corresponding running time is O˜((1γ)4ε2|S||A|) [AKY20]. This suggests that under the feature-based linear transition model, the computational complexity is min{|S||A|/K,(1γ)3ε2/K} times lower than that for the tabular MDP (up to logarithm factors), which is significantly more efficient when K is not too large.
  • Stability vis-à-vis model misspecification. A more general version of Theorem 1 (Theorem 3 in Appendix B) shows that when P approximately (instead of exactly) admits the linear transition model, we can still achieve some meaningful result. Specifically, if there exists a linear transition kernel P˜ obeying max(s,a)S×AP˜(s,a)P(s,a)1ξ for some ξ ≥ 0, we can show that π^ returned by Algorithm 1 (with slight modification) satisfies
    Q(s,a)Qπ^(s,a)ε+4εopt1γ+22ξ(1γ)2,
    for every (s,a)S×A. This shows that the model-based method is stable vis-á-vis model misspecification. Interested readers are referred to Appendix B for more details.

In Algorithm 1, the reward function r is assumed to be known. If the information of r is unavailable, an alternative is to assume that r is linear with respect to the feature mapping ϕ, i.e. r(s, a) = θϕ(s, a) for every (s,a)S×A, which is widely adopted in linear MDP literature [HZG21,JYWJ20,WDYS20,WJLJ21]. Under this linear assumption, one can obtain θ by solving the following linear system of equations

r(s,a)=θϕ(s,a),(s,a)K, (3.4)

which can be constructed by the observed reward r(s, a) for all anchor state-action pairs.

4. Model-free RL—vanilla Q Learning

In this section, we turn to study one of the most popular model-free RL algorithms—Q-learning. We provide tight sample complexity bound for vanilla Q-learning under the feature-based linear transition model, which shows its sample-efficiency (depends on |K| instead of |S| or |A|) and sub-optimality in the dependency on the effective horizon.

4.1. Q-learning algorithm

The vanilla Q-learning algorithm maintains a Q-function estimate Qt:S×A for all t ≥ 0, with initialization Q0 obeying 0Q0(s,a)11γ for every (s,a)S×A. Assume we have access to a generative model. In each iteration t ≥ 1, we collect an independent sample st(s,a)~P(s,a) for every anchor state-action pair (s,a)K and define QK(t):K to be

QK(t)(s,a)maxaAQt(st,a),stst(s,a)~P(s,a).

Then given the learning rate ηt ∈ (0, 1], the algorithm adopts the following update rule to update all entries of the Q-function estimate

Qt=(1ηt)Qt1+ηtTK(t)(Qt1).

Here, TK(t) is an empirical Bellman operator associated with the linear transition model M and the set K and is given by

TK(t)(Q)(s,a)r(s,a)+γλ(s,a)QK(t),

where (2.3) is used in the construction. Clearly, this newly defined operator TK(t) is an unbiased estimate of the famous Bellman operator T [Bel52] defined as

(s,a)S×A : T(Q)(s,a)r(s,a)+γEs~P(s,a)[maxaAQ(s,a)].

A critical property is that the Bellman operator T is contractive with a unique fixed point which is the optimal Q-function Q⋆ [Bel52]. To solve the fixed-point equation T(Q)=Q, Q-learning was then introduced by [WD92] based on the idea of stochastic approximation [RM51]. This procedure is precisely described in Algorithm 2.

4.

4.2. Main results

We are now ready to provide our main result for vanilla Q-learning, assuming sampling access to a generative model.

Theorem 2.

Consider any δ ∈ (0, 1) and ε ∈ (0, 1]. Assume that for any 0 ≤ tT, the learning rates satisfy

11+c1(1γ)Tlog2Tηt11+c2(1γ)tlog2T (4.1)

for some sufficiently small universal constants c1c2 > 0. Suppose that the total number of iterations T exceeds

TC3log(KT/δ)log4T(1γ)4ε2 (4.2)

for some sufficiently large universal constant C3 > 0. If the initialization obeys 0Q0(s,a)11γ for any (s,a)S×A, then with probability exceeding 1 − δ, the output QT of Algorithm 2 satisfies

max(s,a)S×A|QT(s,a)Q(s,a)|ε. (4.3)

In addition, let πT (resp. VT) to be the policy (resp. value function) induced by QT, then one has

maxsS|VπT(s)V(s)|2γε1γ. (4.4)

This theorem provides theoretical guarantees on the performance of Algorithm 2. A few implications of this theorem are in order.

  • Learning rate. The condition (4.1) accommodates two commonly adopted choice of learning rates: (i) linearly rescaled learning rates ηt = [1 + c2(1 − γ)t/log2T]−1, and (ii) iteration-invariant learning rates ηt ≡ [1 + c1(1 − γ)T/log2T]. Interested readers are referred to the discussions in [LCC+21a, Section 3.1] for more details on these two learning rate schemes.

  • Tight sample complexity bound. Note that we draw K independent samples in each iteration, therefore the iteration complexity (4.2) can be translated into the sample complexity bound TK in order for Q-learning to achieve ε-accuracy:
    O˜(K(1γ)4ε2). (4.5)
    As we will see shortly, this result improves the state-of-the-art sample complexity bound presented in [YW19, Theorem 2]. In addition, the dependency on the effective horizon (1 − γ)−4 matches the lower bound established in [LCC+21a, Theorem 2] for vanilla Q-learning using either learning rate scheme covered in the previous remark, suggesting that our sample complexity bound (4.5) is sharp.
  • Stability vis-à-vis model misspecification. Just like the model-based approach, we can also show that Q-learning is also stable vis-á-vis model misspecification when P approximately admits the linear transition model. We refer interested readers to Theorem 4 in Appendix B for more details.

Comparison with [YW19].

We compare our result with the sample complexity bounds for Q-learning under the feature-based linear transition model in [YW19].

  • We first compare our result with [YW19, Theorem 2], which is, to the best of our knowledge, the state-of-the-art theory for this problem. When there is no model misspecification, [YW19, Theorem 2] showed that in order for their Phased Parametric Q-learning2 (Algorithm 1 therein) to learn an ε-optimal policy, the sample size needs to be
    O˜(K(1γ)7ε2).
    Note that (4.5) is the sample complexity required for entrywise ε-accurate estimate of the optimal Q-function, thus a fair comparison requires to deduce the sample complexity for learning an ε-optimal policy from (4.4), which is
    O˜(K(1γ)6ε2).
    Hence, our sample complexity improves upon previous work by a factor at least on the order of (1 − γ)−1. However it is worth mentioning that [YW19, Theorem 2] is built upon weaker conditions i=1Kλi(s,a)=1 and i=1K|λi(s,a)|L for some L ≥ 1, which does not require λi(s, a) ≥ 0. Our result holds under Assumption 1, which requires i=1Kλi(s,a)=1 and λi(s, a) ≥ 0. Under the current analysis framework, it is difficult to obtain tight sample complexity bounds without assuming λi(s, a) ≥ 0.
  • Besides vanilla Q-learning, [YW19] also proposed a new variant of Q-learning called Optimal Phased Parametric Q-Learning (Algorithm 2 therein), which is essentially Q-learning with variance reduction. [YW19, Theorem 3] showed that the sample complexity for this algorithm is
    O˜(K(1γ)3ε2),
    which matches minimax optimal lower bound (up to a logarithm factor) established in [YW19, Theorem 1]. Careful reader might notice that this sample complexity bound is better than ours for vanilla Q-learning. We emphasize that as elucidated in the second implication under Theorem 2, our result is already tight for vanilla Q-learning. This observation reveals that while the sample complexity for vanilla Q-learning is provably sub-optimal, the variants of Q-learning can have better performance and achieve minimax optimal sample complexity.

We conclude this section by comparing model-based and model-free approaches. Theorem 1 shows that the sample complexity of the model-based approach is minimax optimal, whilst vanilla Q-learning, perhaps the most commonly adopted model-free method, is sub-optimal according to Theorem 2. However this does not mean that model-based method is better than model-free ones since (i) some variants of Q-learning (see [YW19, Algorithm 2] for example) also has minimax optimal sample complexity; and (ii) in many applications it might be unrealistic to estimate the model in advance.

5. A glimpse of our technical approaches

The establishment of Theorems 1 and 2 calls for a series of technical novelties in the proof. In what follows, we briefly highlight our key technical ideas and novelties.

  • For the model-based approach, we employ “leave-one-out” analysis to decouple the complicated statistical dependency between the empirical probability transition model P^ and the corresponding optimal policy. Specifically, [AKY20] proposed to construct a collection of auxiliary MDPs where each one of them leaves out a single state s by setting s to be an absorbing state and keeping everything else untouched. We tailor this high level idea to the needs of linear transition model, then the independence between the newly constructed MDP with absorbing state s and data samples collected at state s will facilitate our analysis, as detailed in Lemma 1. This “leave-one-out” type of analysis has been utilized in studying numerous problems by a long line of work, such as [EK18,MWCC18,Wai19a,CFMW19,CCF+20,CFMY20,CFWY21], just to name a few.

  • To obtain tighter sample complexity bound than the previous one O˜(K(1γ)7ε2) in [YW19] for vanilla Q-learning, we invoke Freedman’s inequality [Fre75] for the concentration of an error term with martingale structure as illustrated in Appendix C, while the classical ones used in analyzing Q-learning are Hoeffding’s inequality and Bernstein’s inequality [YW19]. The use of Freedman’s inequality helps us establish a recursive relation on {QtQ}t=0T, which consequently leads to the performance guarantee (4.3). It is worth mentioning that [LCC+21a] also studied vanilla Q-learning in the tabular MDP setting and adopted Freedman’s inequality, while we emphasize that it requires a lot of efforts and more delicate analyses in order to study linear transition model and also allow for model misspecification in the current paper, as detailed in the appendices.

6. Additional related literature

To remedy the issue of prohibitively high sample complexity, there exists a substantial body of literature proposing and studying many structural assumptions and complexity notions under different settings. This current paper focuses on linear transition model which is studied in MDP by numerous previous works [YW19,JYWJ20,YW20,ZHG21,MJTS20,HDL+21,WDYS20,TV20,HZG21,WJLJ21]. Among them, [YW19] studied linear transition model and provided tight sample complexity bounds for a new variant of Q-learning with the help of variance reduction. [JYWJ20] focused on linear MDP and designed an algorithm called “Least-Squares Value Iteration with UCB” with both polynomial runtime and polynomial sample complexity without accessing generative model. [WDYS20] extended the study of linear MDP to the framework of reward-free reinforcement learning. [ZHG21] considered a different feature mapping called linear kernel MDP and devised an algorithm with polynomial regret bound without generative model. Other popular structure assumptions include: [WVR17] studied fully deterministic transition dynamics; [JKA+17] introduced Bellman rank and proposed an algorithm which needs sample size polynomially dependent on Bellman rank to obtain a near-optimal policy in contextual decision processes; [DLWZ19] assumed that the value function has low variance compared to the mean for all deterministic policy; [MR07,PLT+08,ABA18,ZLKB20] used linear model to approximate the value function; [LCC+21b] assumed that the optimal Q-function can be linearly-parameterized by the features.

Apart from the linear transition model, another notion adopted in this work is the generative model, whose role in discounted MDP has been studied by extensive literature. The concept of generative model was originally introduced by [KS99], and then widely adopted in numerous works, including [Kak03,AMK13, YW19,Wai19b,AKY20,PW20], to name a few. Specifically, it is assumed that a generative model of the studied MDP is available and can be queried for every state-action pair and output the next state. Among previous works, [AMK13] proved that the minimax lower bound on the sample complexity to obtain an ε-optimal policy was Ω˜(|S||A|(1γ)3ε2). [AMK13] also showed that model-based approach can output an ε-optimal value function with near-optimal sample complexity for ε ∈ (0, 1). Then [AKY20] made significant progress on the challenging problem of establishing minimax optimal sample complexity in estimating an ε-optimal policy with the help of “leave-one-out” analysis.

In addition, after being proposed in [Wat89], Q-learning has become the focus of a rich line of research [WD92, BT96, KS99, EDMB03, AMGK11, JAZBJ18, Wai19a, CZD+19, LWC+20b, XG20]. Among them, [CZD+19,LWC+20b,XG20] studied Q-learning in the presence of Markovian data, i.e. a single sample trajectory. In contrast, under the generative setting of Q-learning where a fresh sample can be drawn from the simulator at each iteration, [Wai19b] analyzed a variant of Q-learning with the help of variance reduction, which was proved to enjoy minimax optimal sample complexity O˜(|SA|(1γ)3ε2). Then more recently, [LCC+21a] improved the lower bound of the vanilla Q-learning algorithm in terms of its scaling with 11γ and proved a matching upper bound O˜(|S||A|(1γ)4ε2).

7. Discussion

This paper studies sample complexity of both model-based and model-free RL under a discounted infinite-horizon MDP with feature-based linear transition model. We establish tight sample complexity bounds for both model-based approaches and Q-learning, which scale linearly with the feature dimension K instead of |S|×|A|, thus considerably reduce the required sample size for large-scale MDPs when K is relatively small. Our results are sharp, and the sample complexity bound for the model-based approach matches the minimax lower bound. The current work suggests a couple of directions for future investigation, as discussed in detail below.

  • Extension to episodic MDPs. An interesting direction for future research is to study linear transition model in episodic MDP. This focus of this work is infinite-horizon discounted MDPs, and hopefully the analysis here can be extended to study the episodic MDP as well ([DB15,DLB17,AOM17,JA18,WDYK20,HZG21]).

  • Continuous state and action space. The state and action spaces in this current paper are still assumed to be finite, since the proof relies heavily on the matrix operations. However, we expect that the results can be extended to accommodate continuous state and action space by employing more complicated analysis.

  • Accommodating entire range of ε. Since both value functions and Q-functions can take value in [0, (1 − γ)−1], ideally our theory should cover all choices of ε ∈ (0, (1 − γ)−1]. However we require that ε ∈ (0, (1 − γ)−1/2] in Theorem 1 and ε ∈ (0, 1] in Theorem 2. While most of the prior works like [AKY20,YW19] also impose these restrictions, a recent work [LWC+20a] proposed a perturbed model-based planning algorithm and proved minimax optimal guarantees for any ε ∈ (0, (1 − γ)−1]. While their work only focused on model-based RL under tabular MDP, an interesting future direction is to improve our theory to accommodate any ε ∈ (0, (1 − γ)−1].

  • General function approximation. Another future direction is to extend the study to more general function approximation starting from linear structure covered in this paper. There exists a rich body of work proposing and studying different structures, such as linear value function approximation [MR07,PLT+08,ABA18,ZLKB20], linear MDPs with infinite dimensional features [AHKS20], Eluder dimension [WSY20], Bellman rank [JKA+17] and Witness rank [SJK+19], etc. Therefore, it is hopeful to investigate these settings and improve the sample efficiency.

Acknowledgements

B. Wang is supported in part by Gordon Y. S. Wu Fellowships in Engineering. Y. Yan is supported in part by ARO grant W911NF-20-1-0097 and NSF grant CCF-1907661. Part of this work was done while Y. Yan was visiting the Simons Institute for the Theory of Computing. J. Fan is supported in part by the ONR grant N00014-19-1-2120 and the NSF grants DMS-1662139, DMS-1712591, DMS-2052926, DMS-2053832, and the NIH grant 2R01-GM072611-15.

APP1

A. Notations

In this section we gather the notations that will be used throughout the appendix.

For any vectors u=[ui]i=1nn and v=[ui]i=1nn, let uv=[uivi]i=1n denote the Hadamard product of u and v. We slightly abuse notations to use and | · | to define entry-wise operation, i.e. for any vector v=[vi]i=1n denote v[vi]i=1n and |v|[|vi|]i=1n. Furthermore, the binary notations ≤ and ≥ are both defined in entry-wise manner, i.e. uv (resp. uv) means uivi (resp. uivi) for all 1 ≤ in. For a collection of vectors v1,,vmn with vi=[vi,j]j=1nn, we define the max operator to be max1imvi[max1imvi,j]j=1n.

For any matrix Mm×n, ∥M1 is defined as the largest row-wise 1 norm of M, i.e. M1maxij|Mi,j|. In addition, we define 1 to be a vector with all the entries being 1, and I be the identity matrix. To express the probability transition function P in matrix form, we define the matrix P|S||A|×|S| to be a matrix whose (s, a)-th row Ps,a corresponds to P(·|s, a). In addition, we define Pπ to be the probability transition matrix induced by policy π, i.e. P(s,a),(s,a)π=Ps,a(s)1π(s)=a for all state-action pairs (s, a) and (s, a′). We define πt to be the policy induced by Qt, i.e. Qt(s, πt(s)) = maxa Qt(s, a) for all sS. Furthermore, we denote the reward function r by vector r|S||A|, i.e. the (s, a)-th element of r equals r(s, a). In the same manner, we define Vπ|S|, V|S|, Vt|S|, Qπ|S||A|, Q|SA| and Qt|S||A| to represent Vπ, V, Vt, Qπ, Q and Qt respectively. By using these notations, we can rewrite the Bellman equation as

Qπ=r+γPVπ=r+γPπQπ. (A.1)

Further, for any vector V|S|, let VarP(V)|S||A| be

VarP(V)P(VV)(PV)(PV), (A.2)

and define VarPs,a(V) to be

VarPs,a(V)Ps,a(VV)(Ps,aV)2, (A.3)

where Ps,a is the (s, a)-th row of P.

Next, we reconsider Assumption 1. For any state-action pair (s, a), we define vector λ(s,a)K(resp.ϕ(s,a)K) with λ(s,a)=[λi(s,a)]i=1K(resp.ϕ(s,a)=[ϕi(s,a)]i=1K) and matrix Λ|S||A|×K(resp. Φ|S||A|×K) whose (s, a)-th row corresponds to λ(s, a) (resp. ϕ(s, a)). Define vector ψ(s,a)K with ψ(s,a)=[ψi(s,a)]i=1K and matrix ΨK×|S| whose (s, a)-th column corresponds to ψ(s, a). Further, let PKK×|S|(resp. ΦKK×K) to be a submatrix of P (resp. Φ) formed by concatenating the rows {Ps,a,(s,a)K}(resp.{Φs,a,(s,a)K}). By using the previous notations, we can express the relations in Definition 1 and Assumption 1 as PK=ΦKΨ, P = ΦΨ and Φ=ΛΦK. Note that Assumption 1 suggests ΦK is invertible. Taking these equations collectively yields

P=ΦΨ=ΦΦK1PK=ΛΦKΦK1PK=ΛPK, (A.4)

which is reminiscent of the anchor word condition in topic modelling [AGM12]. In addition, for each iteration t, we denote the collected samples as {st(s,a)}(s,a)K and define a matrix P^K(t){0,1}K×|S| to be

P^K(t)((s,a),s){1,if s=st(s,a)0,otherwise (A.5)

for any (s,a)K  and sS. Further, we define P^t=ΛP^K(t). Then it is obvious to see that  P^t has nonnegative entries and unit 1 norm for each row due to Assumption 1, i.e. P^t1=1.

B. Analysis of model-based RL (Proof of Theorem 1)

In this section, we will provide complete proof for Theorem 1. As a matter of fact, our proof strategy here justifies a more general version of Theorem 1 that accounts for model misspecification, as stated below.

Theorem 3.

Suppose that δ > 0 and ε ∈ (0, (1 − γ)−1/2]. Assume that there exists a probability transition model P˜ obeying Definition 2.1 and Assumption 1 with feature vectors {ϕ(s,a)}(s,a)S×AK and anchor state-action pairs K such that

P˜P1ξ

for some ξ ≥ 0. Let π^ be the policy returned by Algorithm 1. Assume that

NClog(K/((1γ)δ))(1γ)3ε2 (B.1)

for some sufficiently large constant C > 0. Then with probability exceeding 1 − δ,

Q(s,a)Qπ^(s,a)ε+4εopt1γ+22ξ(1γ)2, (B.2)

for every state-action pair (s,a)S×A.

Theorem 3 subsumes Theorem 1 as a special case with ξ = 0. The remainder of this section is devoted to proving Theorem 3.

B.1. Proof of Theorem 3

The err Qπ^Q can be decomposed as

Qπ^Q=Qπ^Q^π^+Q^π^Q^+Q^QQπ^Q^π^+Q^π^Q^+Q^πQ(Qπ^Q^π^+Q^π^Q^+Q^πQ)1. (B.3)

For policy π^ satisfying the condition in Theorem 1, we have Q^π^Q^εopt. It boils down to control Qπ^Q^π^ and Q^πQ.

To begin with, we can use (A.1) to further decompose Qπ^Q^π^ as

Qπ^Q^π^=(IγPπ^)1r(IγP^π^)1r=(IγPπ^)1[(IγP^π^)(IγPπ^)]Q^π^=γ(IγPπ^)1(PP^)V^π^γ(IγPπ^)1(PP^)V^+γ(IγPπ^)1(PP^)(V^π^V^)γ(IγPπ^)1|(PP^)V^|+2γεopt1γ. (B.4)

Here the last inequality is due to

γ(IγPπ^)1(PP^)(V^π^V^)γ(IγPπ^)11(PP^)(V^π^V^)γ(IγPπ^)11(P1+P^1)V^π^V^2γεopt1γ,

where we use the fact that (IγPπ^)111/(1γ) and P1=P^1=1.

Similarly, for the term Q^πQ in (B.3), we have

Q^πQ=γ(IγPπ)1(PP^)V^πγ(IγPπ)1|(PP^)V^π|. (B.5)

As can be seen from (B.4) and (B.5), it boils down to bound |(PP^)V^| and |(PP^)V^π|. We have the following lemma.

Lemma 1.

With probability exceeding 1 − δ, one has

|(PP^)s,aV^|10ξ1γ+42log(4K/δ)N+4log(8K/((1γ)δ))(1γ)N+4log(8K/((1γ)δ))NVarPs,a(V^), (B.6)
|(PP^)s,aV^π|10ξ1γ+42log(4K/δ)N+4log(8K/((1γ)δ))(1γ)N+4log(8K/((1γ)δ))NVarPs,a(V^π). (B.7)
Proof.

See Appendix B.2. □

Applying (B.6) to (B.4) reveals that

Qπ^Q^π^4log(8K/((1γ)δ))Nγ(IγPπ^)1VarPs,a(V^)+γ1γ[42log(4K/δ)N+4log(8K/((1γ)δ))(1γ)N]+10γξ(1γ)2+2γεopt1γ. (B.8)

For the first term, one has

VarPs,a(V^)VarPs,a(Vπ^)+VarPs,a(Vπ^V^π^)+VarPs,a(V^π^V^)VarPs,a(Vπ^)+Vπ^V^π^+εoptVarPs,a(Vπ^)+Qπ^Q^π^+εopt,

where the first inequality comes from the fact that Var(X+Y)Var(X)+Var(Y) for any random variables X and Y. It follows that

γ(IγPπ^)1VarPs,a(V^)γ(IγPπ^)1VarPa,a(Vπ^)+γ1γ(Qπ^Q^π^+εopt)γ2(1γ)3+γ1γ(Qπ^Q^π^+εopt), (B.9)

where the second inequality utilizes [AMK13, Lemma 7].

Plugging (B.9) into (B.8) yields

Qπ^Q^π^4log(8K/((1γ)δ))N[γ2(1γ)3+γ1γ(Qπ^Q^π^+εopt)]+γ1γ[42log(4K/δ)N+4log(8K/((1γ)δ))(1γ)N]+10γξ(1γ)2+2γεopt1γ.

Then we can rearrange terms to obtain

Qπ^Q^π^10γlog(8K/((1γ)δ))N(1γ)3+11γξ(1γ)2+3γεopt1γ (B.10)

as long as NC˜log(8K/((1γ)δ))/(1γ)2 for some sufficiently large constant C˜>0.

In a similar vein, we can use (B.5) and (B.7) to obtain that

Q^πQ10γlog(8K/((1γ)δ))N(1γ)3+11γξ(1γ)2. (B.11)

Finally, we can substitute (B.10) and (B.11) into (B.3) to achieve

Qπ^Q(20γlog(8K/((1γ)δ))N(1γ)3+22γξ(1γ)2+3γεopt1γ+εopt)1.

This result implies that

Qπ^Q(ε+22ξ(1γ)2+4εopt1γ)1,

as long as

NClog(8K/((1γ)δ))(1γ)3ε2,

for some sufficiently large constant C > 0.

B.2. Proof of Lemma 1

To prove this theorem, we invoke the idea of s-absorbing MDP proposed by [AKY20]. For a state sS and a scalar u, we define a new MDP Ms,u to be identical to M on all the other states except s; on state s, Ms,u is absorbing such that PMs,u(ss,a)=1 and rMs,u(s,a)=(1γ)u for all aA. More formally, we define PMu,s and rMu,s as

PMs,u(ss,a)=1,rMs,u(s,a)=(1γ)u,for all aA,PMs,u(s,a)=P(s,a),rMs,u(s,a)=r(s,a),for all ss and aA.

To streamline notations, we will use Vs,uπ|S| and Vs,u|S| to denote the value function of Ms,u under policy π and the optimal value function of Ms,u respectively. Furthermore, we denote by M^s,u the MDP whose probability transition kernel is identical to P^ at all states except that state s is absorbing. Similar as before, we use V^s,u|S| to denote the optimal value function under M^s,u. The construction of this collection of auxiliary MDPs will facilitate our analysis by decoupling the statistical dependency between P^ and π^.

To begin with, we can decompose the quantity of interest as

|(PP^)s,aV^|=|(PP^)s,a(V^V^s,u+V^s,u)||(PP^)s,aV^s,u|+|(PP^)s,a(V^V^s,u)|(i)|(PP˜)s,aV^s,u|+|λ(s,a)(P˜KPK)V^s,u|+|λ(s,a)(PKP^K)V^s,u|+(Ps,a1+P^s,a1)V^V^s,u (B.12)
(PP˜)s,a1V^s,u+λ(s,a)1(P˜KPK)V^s,u+λ(s,a)1(PKP^K)V^s,u+2V^V^s,u (B.13)
 (ii) 2ξ1γ+max(s,a)K|(PP^)s,aV^s,u|+2V^V^s,u, (B.14)

where (i) makes use of P˜s,a=λ(s,a) P˜K and P^s,a=λ(s,a)P^K; (ii) depends on PP˜1ξ, ∥λ(s, a)∥1 = 1 and V^s,u(1γ)1. For each state s, the value of u will be selected from a set Us. The choice of Us will be specified later. Then for some fixed u in Us and fixed state-action pair (s,a)K, due to the independence between P^s,a. and V^s,u, we can apply Bernstein’s inequality (cf. [Ver18, Theorem 2.8.4]) conditional on V^s,u to reveal that with probability greater than 1 − δ/2,

|(PP^)s,aV^s,u|2log(4/δ)NVarPs,a(V^s,u)+2log(4/δ)3(1γ)N. (B.15)

Invoking the union bound over all the K state-action pairs of K and all the possible values of u in Us demonstrate that with probability greater than 1 − δ/2,

|(PP^)s,aV^s,u|2log(4K|Us|/δ)NVarPs,a(V^s,u)+2log(4K|Us|/δ)3(1γ)N, (B.16)

holds for all state-action pair (s,a)K and all uUs. Here, VarPs,a() is defined in (A.3). Then we observe that

VarPs,a(V^s,u)VarPs,a(V^V^s,u)+VarPs,a(V^)V^V^s,u+VarPs,a(V^)|V^(s)u|+VarPs,a(V^), (B.17)

where (i) is due to VarPs,a(V1+V2)VarPs,a(V1)+VarPs,a(V2) and (ii) holds since

V^V^s,u=V^s,V^(s)V^s,u|V^(s)u|, (B.18)

whose proof can be found in [AKY20, Lemma 8 and 9].

By substituting (B.16), (B.17) and (B.18) into (B.14), we arrive at

|(PP^)s,aV^|2ξ1γ+|V^(s)u|(2+2log(4K|Us|/δ)N)+2log(4K|Us|/δ)NVarPs,a(V^)+2log(4K|Us|/δ)3(1γ)N. (B.19)

Then it boils down to determining Us. The coarse bounds of Q^π and Q^ in the following lemma provide a guidance on the choice of Us.

Lemma 2.

For δ ∈ (0, 1), with probability exceeding 1 − δ/2 one has

QQ^πγ1γlog(4K/δ)2N(1γ)2+2γξ(1γ)2, (B.20)
QQ^γ1γlog(4K/δ)2N(1γ)2+2γξ(1γ)2. (B.21)
Proof.

See Appendix B.3. □

This inspires us to choose Us to be the set consisting of equidistant points in [V (s) − R(δ), V (s)+R(δ)] with |Us| = ⌈1/(1 – γ)⌉2 and

R(δ)γ1γlog(4K/δ)2N(1γ)2+2γξ(1γ)2.

Since VV^QQ^, Lemma 2 implies that V^(s)[V(s)R(δ),V(s)+R(δ)] with probability over 1 − δ/2. Hence, we have

minuUs|V^(s)u|2R(δ)|Us|+12γ2log(4K/δ)N+4γξ. (B.22)

Consequently, with probability exceeding 1 − δ, one has

|(PP^)s,aV^| (i) 2ξ1γ+minuUs|V^(s)u|(2+2log(4K|Us|/δ)N)+2log(4K|Us|/δ)NVarPs,a(V^)+2log(4K|Us|/δ)3(1γ)N (ii) 2ξ1γ+(2γ2log(4K/δ)N+4γξ)(2+4log(8K/((1γ)δ))N)+4log(8K/((1γ)δ))NVarPs,a(V^)+2log(8K/((1γ)δ))3(1γ)N10ξ1γ+42log(4K/δ)N+4log(8K/((1γ)δ))(1γ)N+4log(8K/((1γ)δ))NVarPs,a(V^),

where (i) follows from (B.19) and (ii) utilizes (B.22). This finishes the proof for the first inequality. The second inequality can be proved in a similar way and is omitted here for brevity.

B.3. Proof of Lemma 2

To begin with, one has

(P^P)VΛ(P^KPK)V+Λ(PKP˜K)V+(P˜P)VΛ1(P^KPK)V+Λ1(PKP˜K)V+P˜P1V(P^KPK)V+2ξ1γ, (B.23)

where the first line uses P^=ΛP^K and P˜=ΛP˜K; the last inequality comes from the facts that P˜P1ξ,, ∥Λ1 = 1 and ∥V ≤ (1 − γ)−1. Then we turn to bound (P^KPK)V. In view of (3.1), Hoeffding’s inequality (cf. [Ver18, Theorem 2.2.6]) implies that for (s,a)K,

(|(P^P)s,aV|t)2exp(2t2V2/N).

Hence by the standard union bound argument we have

(P^KPK)VV2log(4K/δ)2Nlog(4K/δ)2N(1γ)2, (B.24)

with probability over 1 – δ/2.

  1. Now we are ready to bound QπQ^π. One has
    QπQ^π=(IγPπ)1r(IγP^π)1r=(IγP^π)1((IγP^π)(IγPπ))Qπ=γ(IγP^π)1(PπP^π)Qπ=γ(IγP^π)1(PP^)Vπ,
    where the first equality makes use of (A.1). Then we take (B.23) and (B.24) collectively to achieve
    γ(IγP^π)1(PP^)Vγi=0γi(P^π)i(PP^)Vγi=0γi(P^π)i1(PP^)Vγ1γlog(4K/δ)2N(1γ)2+2γξ(1γ)2,

    where the last line comes from the fact that for all i1,(P^π)i is a probability transition matrix so that (P^π)i1=1. This justifies the first inequality (B.20).

  2. In terms of the second one, [AKY20, Section A.4] implies that
    QQ^γ1γ(PP^)V.
    Substitution of (B.23) and (B.24) into the above inequality yields
    QQ^γ1γlog(4K/δ)2N(1γ)2+2γξ(1γ)2.

C. Analysis of Q-learning (Proof of Theorem 2)

In this section, we will provide complete proof for Theorem 2. We actually prove a more general version of Theorem 2 that takes model misspecification into consideration, as stated below.

Theorem 4.

Consider any δ ∈ (0, 1) and ε ∈ (0, 1]. Suppose that there exists a probability transition model P˜ obeying Definition 2.1 and Assumption 1 with feature vectors {ϕ(s,a)}(s,a)S×AK and anchor state-action pairs K such that

P˜P1ξ

for some ξ ≥ 0. Assume that the initialization obeys 0Q0(s,a)11γ for any (s,a)S×A and for any 0 ≤ tT, the learning rates satisfy

11+c1(1γ)Tlog2Tηt11+c2(1γ)tlog2T, (C.1)

for some sufficiently small universal constants c1c2 > 0. Suppose that the total number of iterations T exceeds

TC3log(KT/δ)log4T(1γ)4ε2, (C.2)

for some sufficiently large universal constant C3 > 0. If there exists a linear probability transition model P˜ satisfying Assumption 1 with feature vectors {ϕ(s,a)}(s,a)S×A such that P˜P1ξ, then with probability exceeding 1 − δ, the output QT of Algorithm 2 satisfies

max(s,a)S×A|QT(s,a)Q(s,a)|ε+6γξ(1γ)2, (C.3)

for some constant C4 > 0. In addition, let πT (resp. VT) to be the policy (resp. value function) induced by QT, then one has

maxsS|VπT(s)V(s)|2γ1γ(ε+6γξ(1γ)2). (C.4)

Theorem 4 subsumes Theorem 2 as a special case with ξ = 0. The remainder of this section is devoted to proving Theorem 4.

C.1. Proof of Theorem 4

First we show that (C.4) can be easily obtained from (C.3). Since [SY94] gives rise to

VπTV2γVTV1γ,

we have

VπTV2γQTQ1γ,

due to ∥VTV ≤ ∥QTQ. Then (C.4) follows directly from (C.3).

Therefore, we are left to justify (C.3). To start with, we consider the update rule

Qt=(1ηt)Qt1+ηt(r+γP^tVt1).

By defining the error term ΔtQtQ, we can decompose Δt into

Δt=(1ηt)Qt1+ηt(r+γP^tVt1)Q=(1ηt)(Qt1Q)+ηt(r+γP^tVt1Q)=(1ηt)(Qt1Q)+γηt(P^tVt1PV)=(1ηt)Δt1+γηtΛ(P^K(t)PK)Vt1+γηtΛPK(Vt1V)+γηt(ΛPKP)V. (C.5)

Here in the penultimate equality, we make use of Q = r + γPV; and the last equality comes from P^t=ΛP^K(t) which is defined in (A.5). It is straightforward to check that ΛPK is also a probability transition matrix. We denote by P¯=ΛPK hereafter. The third term in the decomposition above can be upper and lower bounded by

P¯(Vt1V)=P¯πt1Qt1P¯πQP¯πt1Qt1P¯πt1Q=P¯πt1Δt1,

and

P¯(Vt1V)=P¯πt1Qt1P¯πQP¯πQt1P¯πQ=P¯πΔt1.

Plugging these bounds into (C.5) yields

Δt(1ηt)Δt1+γηtΛ(P^K(t)PK)Vt1+γηtP¯πt1Δt1+γηt(ΛPKP)V,Δt(1ηt)Δt1+γηtΛ(P^K(t)PK)Vt1+γηtP¯πΔt1+γηt(ΛPKP)V.

Repeatedly invoking these two recursive relations leads to

Δtη0(t)Δ0+i=1tηi(t)γ(P¯πt1Δt1+Λ(P^K(t)PK)Vt1+(ΛPKP)V), (C.6)
Δtη0(t)Δ0+i=1tηi(t)γ(P¯πΔt1+Λ(P^K(t)PK)Vt1+(ΛPKP)V), (C.7)

where

ηi(t){j=1t(1ηj), if i=0,ηij=i+1t(1ηj), if 0<i<t,ηt, if i=t.

Here we adopt the same notations as [LCC+21a].

To begin with, we consider the upper bound (C.6). It can be further decomposed as

Δtη0(t)Δ0+i=1(1α)tηi(t)γ(P¯πt1Δt1+Λ(P^K(t)PK)Vt1)θt+i=(1α)t+1tηi(t)γΛ(P^K(t)PK)Vi1νt+i=1tηi(t)γ(ΛPKP)Vωt+i=(1α)t+1tηi(t)γP¯πt1Δi1, (C.8)

where we define αC4(1 − γ)/log T for some constant C4 > 0. Next, we turn to bound θt and νt respectively for any t satisfying Tc2log11γtT with stepsize choice (4.1).

Bounding ωt.

It is straightforward to bound

ωt=(i)γ(ΛPKP)V (ii) γ(Λ1(PKP˜K)V+(P˜P)V) (iii)  2γξ1γ,

where the first equality comes from the fact that i=1tηi(t)=1 [LCC+21a, Equation (40)]; the second inequality utilizes P˜=ΛP˜K; the last line uses the facts that ∥Λ1 = 1, ∥V ≤ (1 − γ)−1 and P˜KPK1P˜P1ξ.

Bounding θt.

By similar derivation as Step 1 in [LCC+21a, Appendix A.2], we have

θtη0(t)Δ0+tmax1i(1α)tηi(t)max1i(1α)t(P¯πt1Δi1+ΛP^K(t)Vi1+ΛPKVi1) (i) η0(t)Δ0+tmax1i(1α)tηi(t)max1i(1α)t(Δi1+2Vi1)(ii)12T211γ+12T2t31γ2(1γ)T, (C.9)

where (i) is due to the fact that P¯πt11=ΛP^K(t)1=ΛPK1=1 and (ii) comes from [LCC+21a, Equation (39a)].

Bounding νt.

To control the second term, we apply the following Freedman’s inequality.

Lemma 3 (Freedman’s Inequality).

Consider a real-valued martingale {Yk : k = 0, 1, 2, ⋯} with difference sequence {Xk : k = 1, 2, 3, ⋯}. Assume that the difference sequence is uniformly bounded:

|Xk|R  and E[Xk{Xj}j=1k1]=0  for all k1.

Let

Snk=1nXi,Tnk=1nVar{Xk{Xj}j=1k1}.

Then for any given σ2 ≥ 0, one has

(|Sn|τ and Tnσ2)2exp(τ2/2σ2+Rτ/3).

In addition, suppose that Wnσ2 holds deterministically. For any positive integer K ≥ 1, with probability at least 1 − δ one has

|Sn|8max{Tn,σ22K}log2Kδ+43Rlog2Kδ.
Proof.

See [LCC+21a, Theorem 4]. □

To apply this inequality, we can express νt as

νti=(1α)t+1txi,

with

xiηi(t)γΛ(P^K(t)PK)Vi1,  and E[xiVi1,,V0]=0. (C.10)
  1. In order to calculate bound R in Lemma 3, one has
    Bmax(1α)t<ttximax(1α)t<ttηi(t)Λ(P^K(t)PK)Vi1max(1α)t<ttηi(t)(ΛP^K(t)1+ΛPK1)Vi1max(1α)t<ttηi(t)21γ4log4T(1γ)2T,

    where the last inequality comes from [LCC+21a, Eqn (39b)] and the fact that Vi111γ.

  2. Then regarding the variance term, we claim for the moment that
    Wti=(1α)t+1tdiag(Var(xiVi1,,V0))γ2i=(1α)t+1t(ηi(t))2VarP¯(Vi1). (C.11)
    Then we have
    Wtmax(1α)titηi(t)(i=(1α)t+1tηi(t))max(1α)ti<tVarP¯(Vi)2log4T(1γ)Tmax(1α)ti<tVarP¯(Vi), (C.12)
    where the second line comes from [LCC+21a, Eqns (39b), (40)]. A trivial upper bound for Wt is
    |Wt|2log4T(1γ)T1(1γ)21=2log4T(1γ)3T1,

    which uses the fact that VarP(Vi)Vi21/(1γ)2.

Then, we invoke Lemma 3 with K=2log211γ and apply the union bound argument over K to arrive at

|νt|8(Wt+σ22K1)log8KTlog11γδ+43Blog8KTlog11γδ18(Wt+2log4T(1γ)T1)log8KTδ+43Blog8KTlog11γδ132log4T(1γ)Tlog8KTδ(max(1α)ti<tVarΛPK(Vi)+1)+12log4T(1γ)2Tlog8KTδ1. (C.13)

Hence if we define

φt64log4TlogKTδ(1γ)T(maxt2itVarP¯(Vi)+1),

then (C.9) and (C.13) implies that

|θt|+|νt|+|ωt|φt+2γξ1γ1, (C.14)

with probability over 1 − δ for all 2t/3 ≤ kt, as long as Tlog4TlogKTδ/(1γ)3. Therefore, plugging (C.14) into (C.8), we arrive at the recursive relationship

Δtφt+2γξ1γ1+i=(1α)k+1kηi(k)γP¯πi1Δi1=φt+2γξ1γ1+i=(1α)kk1ηi(k)γP¯πi1Δi.

This recursion is expressed in a similar way as [LCC+21a, Eqn. (46)] so we can invoke similar derivation in [LCC+21a, Appendix A.2] to obtain that

Δt30log4TlogKTδ(1γ)4T(1+maxt2i<tΔi)1+2γξ(1γ)21. (C.15)

Then we turn to (C.7). Applying a similar argument, we can deduce that

Δt30log4TlogKTδ(1γ)4T(1+maxt2i<tΔi)12γξ(1γ)21. (C.16)

For any t satisfying Tc2log11γtT, taking (C.15) and (C.16) collectively gives rise to

Δt30log4TlogKTδ(1γ)4T(1+maxt2i<tΔi)+2γξ(1γ)2. (C.17)

Let

ukmax{Δt:2kTc2log11γtT}.

By taking supremum over t{2kT/(c2log11γ),,T} on both sides of (C.17), we have

uk30log4TlogKTδ(1γ)4T(1+uk1)+2γξ(1γ)21klog(c2log11γ). (C.18)

It is straightforward to bound u011γ. For k ≥ 1, it is straightforward to obtain from (C.18) that

uk3max{30log4TlogKTδ(1γ)4T,30log4TlogKTδ(1γ)4Tuk1,2γξ(1γ)2}, (C.19)

for 1klog(c2log11γ). We analyze (C.19) under two different cases:

  1. If there exists some integer k0 with 1k0<log(c2log11γ), such that
    uk0max{1,6γξ(1γ)2},
    then it is straightforward to check from (C.19) that
    uk0+13max{30log4TlogKTδ(1γ)4T,2γξ(1γ)2} (C.20)

    as long as TC3(1 − γ)−4 log4 T log(KT/δ) for some sufficiently large constant C3 > 0.

  2. Otherwise we have uk>max{1,6γξ(1γ)2} for all 1k<log(c2log11γ). This together with (C.19) suggests that
    max{1,6γξ(1γ)2}<3max{30log4TlogKTδ(1γ)4T,30log4TlogKTδ(1γ)4Tuk1,2γξ(1γ)2},
    and therefore
    max{30log4TlogKTδ(1γ)4T,30log4TlogKTδ(1γ)4Tuk1,2γξ(1γ)2}=30log4TlogKTδ(1γ)4Tuk1
    for all 1klog(c2log11γ). Let
    vk90log4TlogKTδ(1γ)4Tuk1.
    Then we know from (C.18) that
    ukvk1klog(c2log11γ).
    By applying the above two inequalities recursively, we know that
    ukvk=(8100log4TlogKTδ(1γ)4T)1/2uk11/2(8100log4TlogKTδ(1γ)4T)1/2vk11/2(8100log4TlogKTδ(1γ)4T)1/2+1/4uk21/4(8100log4TlogKTδ(1γ)4T)1/2+1/4vk21/4(8100log4TlogKTδ(1γ)4T)11/2ku01/2k8100log4TlogKTδ(1γ)4T(11γ)1/2k,
    where the last inequality holds as long as TC3 log4 T log(KT/δ)(1 − γ)−4 for some sufficiently large constant C3 > 0. Let k0=c˜loglog11γ for some properly chosen constant c˜>0 such that k0 is an integer between 1 and log(c2log11γ), we have
    uk08100log4TlogKTδ(1γ)4T(11γ)1/2k0=O(log4TlogKTδ(1γ)4T).

    When TC3 log4 T log(KT/δ)(1 − γ)−4 for some sufficiently large constant C3 > 0, this implies that uk01, which contradicts with the preassumption that uk>max{1,6γξ(1γ)2} for all 1kc2log11γ.

Consequently, (C.20) must hold true and then the definition of uk immediately leads to

ΔT90log4TlogKTδ(1γ)4T+6γξ(1γ)2.

Then for any ε ∈ (0, 1], one has

ΔTε+6γξ(1γ)2,

as long as

90log4TlogKTδ(1γ)4Tε.

Hence, if the total number of iterations T satisfies

TC3log4TlogKTδ(1γ)4ε2

for some sufficiently large constant C3 > 0, (4.3) would hold for Algorithm 1 with probability over 1 − δ.

Finally, we are left to justify (C.11). Recall the definition of xi (cf. (C.10)), one has

diag(Var(xiVi1,,V0))=γ2(ηi(t))2diag(Var(Λ(P^K(t)PK)Vi1Vi1))=γ2(ηi(t))2diag(ΛVar((P^K(i)PK)Vi1Vi1)Λ)=γ2(ηi(t))2{λ(s,a)2VarPK(Vi1)}s,a,

where the notation VarPK(Vi1) is defined in (A.2). Plugging this into the definition of Wt leads to

Wt=γ2i=(1α)t+1t(ηi(t))2{λ(s,a)2VarPK(Vi1)}s,a=γ2i=(1α)t+1t(ηi(t))2{λ(s,a)2(PK(Vi1Vi1)(PKVi1)(PKVi1))}s,a. (C.21)

Then we introduce a useful claim as follows. The proof is deferred to Appendix C.2.

Claim 1.

For any state-action pair (s,a)S×A and vector V|S|, one has

λ(s,a)2(PK(VV)(PKV)(PKV))λ(s,a)PK(VV)(λ(s,a)PKV)(λ(s,a)PKV). (C.22)

By invoking this claim with V = Vi−1 and taking collectively with (C.21), one has

Wtγ2i=(1β)t+1t(ηi(t))2{λ(s,a)PK(Vi1Vi1)(λ(s,a)PKVi1)(λ(s,a)PKVi1)}s,a=γ2i=(1β)t+1t(ηi(t))2[ΛPK(Vi1Vi1)(ΛPKVi1)(ΛPKVi1)]=γ2i=(1β)t+1t(ηi(t))2VarP¯(Vi1),

which is the desired result.

C.2. Proof of Claim 1

To simplify notations in this proof, we use [λi]i=1K,[Pi,j]1iK,1j|S| and [Vi]i=1|S| to denote λ(s, a), PK and V respectively. Then one has

λ(s,a)PK(VV)(λ(s,a)PKV)(λ(s,a)PKV)λ(s,a)2(PK(VV)(PKV)(PKV))=i=1Kj=1|S|λiPi,jVj2(i=1Kj=1|S|λiPi,jVj)2i=1Kj=1|S|λi2Pi,jVj2+i=1Kλi2(j=1|S|Pi,jVj)2=i=1Kj=1|S|λiPi,jVj[(1λi)Vjiij=1|S|λiPi,jVj].=i=1Kj=1|S|λiPi,jVj[(i=1Kj=1|S|λiPi,jλi)Vjiij=1|S|λiPi,jVj]=i=1Kj=1|S|iij=1|S|λiPi,jVjλiPi,j(VjVj)

where in the penultimate equality, we use the fact that

i=1Kj=1|S|λiPi,j=λ(s,a)PK1=1.

It follows that

λ(s,a)PK(VV)(λ(s,a)PKV)(λ(s,a)PKV)λ(s,a)2(PK(VV)(PKV)(PKV))=i=1K1i<ij=1|S|j=1|S|[λiPi,jVjλiPi,j(VjVj)+λiPi,jVjλiPi,j(VjVj)]=i=1K1i<iλiλi[j=1|S|j=1|S|Pi,jVjPi,j(VjVj)+j=1|S|j=1|S|Pi,jVjPi,j(VjVj)]= (i) i=1K1i<iλiλi[j=1|S|j=1|S|Pi,jVjPi,j(VjVj)+j=1|S|j=1|S|Pi,jVjPi,j(VjVj)]=i=1K1i<iλiλi[j=1|S|j=1|S|Pi,jPi,j(VjVj)2]0,

where in (i), we exchange the indices j and j′.

D. Feature dimension and the number of anchor state-action pairs

The assumption that the feature dimension (denoted by Kd) and the number of anchor state-action pairs (denoted by Kn) are equal is actually non-essential. In what follows, we will show that if KdKn, then we can modify the current feature mapping ϕ:S×AKd to achieve a new feature mapping ϕ:S×AKn that does not change the transition model P. By doing so, the new feature dimension Kn equals to the number of anchor state-action pairs.

To begin with, we recall from Definition 1 that there exists Kd unknown functions ψ1,,ψKd:S, such that

P(ss,a)=k=1Kdϕk(s,a)ψk(s),

for every (s,a)S×A and sS. In addition, we also recall from Assumption 1 that there exists KS×A with |K|=Kn such that for any (s,a)S×A,

ϕ(s,a)=i:(si,ai)Kλi(s,a)ϕ(si,ai)Kd  for i=1Knλi(s,a)=1  and λi(s,a)0.

Case 1:

Kd > Kn. In this case, the vectors in {ϕ(s,a):(s,a)K} are linearly independent. For ease of presentation and without loss of generality, we assume that Kd = Kn + 1. This indicates that the matrix ΦKd×(|S||A|) whose columns are composed of the feature vectors of all state-action pairs has rank Kn and is hence not full row rank. This suggests that there exists Kn linearly independent rows (without loss of generality, we assume they are the first Kn rows). We can remove the last row from Φ to obtain ΦΦ1:Kn,:Kn×(|S||A|) such that Φ′ is full row rank. Then we show that we can actually use the columns of Φ′ as new feature mappings. To see why this is true, note that the last row ΦKn+1,: can be represented as a linear combination of the first Kn rows, namely there must exist constants {ck}k=1Kn such that for any (s,a)S×A,

ϕKn+1(s,a)=k=1Knckϕk(s,a).

Define ψk=ψk+ckψKn+1 for k = 1, …, Kn, we have

P(ss,a)=k=1Kdϕk(s,a)ψk(s)=ϕKn+1(s,a)ψKn+1(s)+k=1Knϕk(s,a)ψk(s)=k=1Knϕk(s,a)[ψk(s)+ckψKn+1(s)]=k=1Knϕk(s,a)ψk(s),

which is linear with respect to the new Kn dimensional feature vectors. It is also straightforward to check that the new feature mapping satisfies Assumption 1 with the original anchor state-action pairs K.

Case 2:

Kd < Kn. For ease of presentation and without loss of generality, we assume that Kn = Kd + 1 and that the subspace spanned by the feature vectors of anchor state-action pairs is non-degenerate, i.e., has rank Kd (otherwise we can use similar method as in Case 1 to further reduce the feature dimension Kd). In this case, the matrix ΦKKd×Kn whose columns are composed of the feature vectors of anchor state-action pairs has rank Kd. We can add KnKd = 1 new row to ΦK to obtain ΦKKn×Kn such that ΦK has full rank Kn. Then we let the columns of ΦK=[ϕ(s,a)](s,a)K to be the new feature vectors of the anchor state-action pairs, and define the new feature vectors for all other state-action pairs (s,a)K by

$$\phi'(s,a)=\sum_{i:(s_i,a_i)\in\mathcal{K}}\lambda_i(s,a)\,\phi'(s_i,a_i).$$

We can check that the transition model $P$ is unchanged if we set the new function $\psi_{K_n}(s)=0$ for every $s\in\mathcal{S}$. It is also straightforward to check that Assumption 1 is satisfied.
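
Analogously, the Case-2 construction can be sketched with synthetic data (again, the variable names and the random padding row are ours): the anchor feature matrix is padded with one extra row so that it becomes square and full rank, the corresponding new function is set to zero, and the resulting kernel coincides with the original one.

```python
# Illustrative sketch of the Case-2 construction (K_n = K_d + 1) with synthetic data.
import numpy as np

rng = np.random.default_rng(2)
Kd, S, N = 3, 20, 50
Kn = Kd + 1

Phi_anchor = rng.random((Kd, Kn))            # features of the K_n anchor pairs (rank K_d)
Lam = rng.dirichlet(np.ones(Kn), size=N)     # lambda(s,a) for N state-action pairs
Phi = Phi_anchor @ Lam.T                     # phi(s,a) = sum_i lambda_i(s,a) phi(s_i,a_i)
Psi = rng.random((Kd, S))
P = Phi.T @ Psi                              # original (synthetic) transition model

# Pad the anchor features with one new row so that the padded matrix is K_n x K_n
# and full rank; any row achieving full rank would do, a random one suffices here.
Phi_anchor_new = np.vstack([Phi_anchor, rng.random(Kn)])
assert np.linalg.matrix_rank(Phi_anchor_new) == Kn

Phi_new = Phi_anchor_new @ Lam.T             # phi'(s,a) = sum_i lambda_i(s,a) phi'(s_i,a_i)
Psi_new = np.vstack([Psi, np.zeros((1, S))]) # the new psi_{K_n} is identically zero

assert np.allclose(Phi_new.T @ Psi_new, P)   # transition model is unchanged
```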

To conclude, when $K_d\neq K_n$, we can always construct a new set of feature mappings with dimension $K_n$ such that: (i) the feature dimension equals the number of anchor state-action pairs (they are both $K_n$); (ii) the transition model can still be linearly parameterized by this new set of feature mappings; and (iii) the anchor state-action pair assumption (Assumption 1) is satisfied with the original anchor state-action pairs.

Footnotes

1

Without loss of generality, one can always assume that the number of anchor state-action pairs equals the feature dimension $K$. Interested readers are referred to Appendix D for a detailed argument.

2
The difference between Algorithm 2 and Phased Parametric Q-Learning in [YW19] is that Algorithm 2 maintains and updates a Q-function estimate $Q_t$, while Phased Parametric Q-Learning parameterizes the Q-function as
$$Q_w(s,a) := r(s,a)+\gamma\,\phi(s,a)^{\top}w,$$
and then updates the parameter $w$.

References

  • [ABA18]. Azizzadenesheli Kamyar, Brunskill Emma, and Anandkumar Animashree. Efficient exploration through Bayesian deep Q-networks. In 2018 Information Theory and Applications Workshop (ITA), pages 1–9. IEEE, 2018.
  • [AGM12]. Arora Sanjeev, Ge Rong, and Moitra Ankur. Learning topic models – going beyond SVD. In 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science, pages 1–10. IEEE, 2012.
  • [AHKS20]. Agarwal Alekh, Henaff Mikael, Kakade Sham, and Sun Wen. PC-PG: Policy cover directed exploration for provable policy gradient learning. arXiv preprint arXiv:2007.08459, 2020.
  • [AKY20]. Agarwal Alekh, Kakade Sham, and Yang Lin F. Model-based reinforcement learning with a generative model is minimax optimal. In Conference on Learning Theory, pages 67–83. PMLR, 2020.
  • [AMGK11]. Gheshlaghi Azar Mohammad, Munos Remi, Ghavamzadeh M, and Kappen Hilbert J. Speedy Q-learning. 2011.
  • [AMK13]. Gheshlaghi Azar Mohammad, Munos Rémi, and Kappen Hilbert J. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 91(3):325–349, 2013.
  • [AOM17]. Gheshlaghi Azar Mohammad, Osband Ian, and Munos Rémi. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263–272. PMLR, 2017.
  • [B+00]. Bertsekas Dimitri P et al. Dynamic programming and optimal control: Vol. 1. Athena Scientific, Belmont, 2000.
  • [BD59]. Bellman Richard and Dreyfus Stuart. Functional approximations and dynamic programming. Mathematical Tables and Other Aids to Computation, pages 247–251, 1959.
  • [Bel52]. Bellman Richard. On the theory of dynamic programming. Proceedings of the National Academy of Sciences of the United States of America, 38(8):716, 1952.
  • [BT96]. Bertsekas Dimitri P and Tsitsiklis John N. Neuro-dynamic programming. Athena Scientific, 1996.
  • [CCF+20]. Chen Yuxin, Chi Yuejie, Fan Jianqing, Ma Cong, and Yan Yuling. Noisy matrix completion: Understanding statistical guarantees for convex relaxation via nonconvex optimization. SIAM Journal on Optimization, 30(4):3098–3121, 2020.
  • [CFMW19]. Chen Yuxin, Fan Jianqing, Ma Cong, and Wang Kaizheng. Spectral method and regularized MLE are both optimal for top-K ranking. Annals of Statistics, 47(4):2204, 2019.
  • [CFMY20]. Chen Yuxin, Fan Jianqing, Ma Cong, and Yan Yuling. Bridging convex and nonconvex optimization in robust PCA: Noise, outliers, and missing data. arXiv preprint arXiv:2001.05484, accepted to Annals of Statistics, 2020.
  • [CFWY21]. Chen Yuxin, Fan Jianqing, Wang Bingyan, and Yan Yuling. Convex and nonconvex optimization are both minimax-optimal for noisy blind deconvolution under random designs. Journal of the American Statistical Association, (just-accepted):1–27, 2021.
  • [CZD+19]. Chen Zaiwei, Zhang Sheng, Doan Thinh T, Maguluri Siva Theja, and Clarke John-Paul. Performance of Q-learning with linear function approximation: Stability and finite-time analysis. arXiv preprint arXiv:1905.11425, 2019.
  • [DB15]. Dann Christoph and Brunskill Emma. Sample complexity of episodic fixed-horizon reinforcement learning. Advances in Neural Information Processing Systems, 28:2818–2826, 2015.
  • [DKW19]. Duan Yaqi, Ke Zheng Tracy, and Wang Mengdi. State aggregation learning from Markov transition data. Advances in Neural Information Processing Systems, 32, 2019.
  • [DLB17]. Dann Christoph, Lattimore Tor, and Brunskill Emma. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 5717–5727, 2017.
  • [DLWZ19]. Du Simon S, Luo Yuping, Wang Ruosong, and Zhang Hanrui. Provably efficient Q-learning with function approximation via distribution shift error checking oracle. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 8060–8070, 2019.
  • [DS04]. Donoho David and Stodden Victoria. When does non-negative matrix factorization give a correct decomposition into parts? In 17th Annual Conference on Neural Information Processing Systems, NIPS 2003. Neural Information Processing Systems Foundation, 2004.
  • [EDMB03]. Even-Dar Eyal, Mansour Yishay, and Bartlett Peter. Learning rates for Q-learning. Journal of Machine Learning Research, 5(1), 2003.
  • [EK18]. El Karoui Noureddine. On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators. Probability Theory and Related Fields, 170(1):95–175, 2018.
  • [Fre75]. Freedman David A. On tail probabilities for martingales. The Annals of Probability, pages 100–118, 1975.
  • [HDL+21]. Hao Botao, Duan Yaqi, Lattimore Tor, Szepesvári Csaba, and Wang Mengdi. Sparse feature selection makes batch reinforcement learning more sample efficient. In International Conference on Machine Learning, pages 4063–4073. PMLR, 2021.
  • [HZG21]. He Jiafan, Zhou Dongruo, and Gu Quanquan. Logarithmic regret for reinforcement learning with linear function approximation. In International Conference on Machine Learning, pages 4171–4180. PMLR, 2021.
  • [JA18]. Jiang Nan and Agarwal Alekh. Open problem: The dependence of sample complexity lower bounds on planning horizon. In Conference on Learning Theory, pages 3395–3398. PMLR, 2018.
  • [JAZBJ18]. Jin Chi, Allen-Zhu Zeyuan, Bubeck Sebastien, and Jordan Michael I. Is Q-learning provably efficient? In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 4868–4878, 2018.
  • [JKA+17]. Jiang Nan, Krishnamurthy Akshay, Agarwal Alekh, Langford John, and Schapire Robert E. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, pages 1704–1713. PMLR, 2017.
  • [JYWJ20]. Jin Chi, Yang Zhuoran, Wang Zhaoran, and Jordan Michael I. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
  • [Kak03]. Kakade Sham Machandranath. On the sample complexity of reinforcement learning. PhD thesis, UCL (University College London), 2003.
  • [KS99]. Kearns Michael and Singh Satinder. Finite-sample convergence rates for Q-learning and indirect algorithms. Advances in Neural Information Processing Systems, pages 996–1002, 1999.
  • [KST+21]. Kiran B Ravi, Sobh Ibrahim, Talpaert Victor, Mannion Patrick, Al Sallab Ahmad A, Yogamani Senthil, and Pérez Patrick. Deep reinforcement learning for autonomous driving: A survey. IEEE Transactions on Intelligent Transportation Systems, 2021.
  • [LCC+21a]. Li Gen, Cai Changxiao, Chen Yuxin, Gu Yuantao, Wei Yuting, and Chi Yuejie. Is Q-learning minimax optimal? A tight sample complexity analysis. arXiv preprint arXiv:2102.06548, 2021.
  • [LCC+21b]. Li Gen, Chen Yuxin, Chi Yuejie, Gu Yuantao, and Wei Yuting. Sample-efficient reinforcement learning is feasible for linearly realizable MDPs with limited revisiting. arXiv preprint arXiv:2105.08024, 2021.
  • [LWC+20a]. Li Gen, Wei Yuting, Chi Yuejie, Gu Yuantao, and Chen Yuxin. Breaking the sample size barrier in model-based reinforcement learning with a generative model. Advances in Neural Information Processing Systems, 33, 2020.
  • [LWC+20b]. Li Gen, Wei Yuting, Chi Yuejie, Gu Yuantao, and Chen Yuxin. Sample complexity of asynchronous Q-learning: Sharper analysis and variance reduction. Advances in Neural Information Processing Systems, 2020.
  • [MJTS20]. Modi Aditya, Jiang Nan, Tewari Ambuj, and Singh Satinder. Sample complexity of reinforcement learning using linearly combined model ensembles. In International Conference on Artificial Intelligence and Statistics, pages 2010–2020. PMLR, 2020.
  • [MR07]. Melo Francisco S and Ribeiro M Isabel. Q-learning with linear function approximation. In International Conference on Computational Learning Theory, pages 308–322. Springer, 2007.
  • [MWCC18]. Ma Cong, Wang Kaizheng, Chi Yuejie, and Chen Yuxin. Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval and matrix completion. In International Conference on Machine Learning, pages 3345–3354. PMLR, 2018.
  • [PLT+08]. Parr Ronald, Li Lihong, Taylor Gavin, Painter-Wakefield Christopher, and Littman Michael L. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the 25th International Conference on Machine Learning, pages 752–759, 2008.
  • [Put14]. Puterman Martin L. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, 2014.
  • [PW20]. Pananjady Ashwin and Wainwright Martin J. Instance-dependent ℓ∞-bounds for policy evaluation in tabular reinforcement learning. IEEE Transactions on Information Theory, 67(1):566–585, 2020.
  • [RM51]. Robbins Herbert and Monro Sutton. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
  • [SB18]. Sutton Richard S and Barto Andrew G. Reinforcement learning: An introduction. MIT Press, 2018.
  • [SHM+16]. Silver David, Huang Aja, Maddison Chris J, Guez Arthur, Sifre Laurent, Van Den Driessche George, Schrittwieser Julian, Antonoglou Ioannis, Panneershelvam Veda, Lanctot Marc, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • [SJJ95]. Singh Satinder P, Jaakkola Tommi, and Jordan Michael I. Reinforcement learning with soft state aggregation. Advances in Neural Information Processing Systems 7, 7:361, 1995.
  • [SJK+19]. Sun Wen, Jiang Nan, Krishnamurthy Akshay, Agarwal Alekh, and Langford John. Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. In Conference on Learning Theory, pages 2898–2933. PMLR, 2019.
  • [SS20]. Shariff Roshan and Szepesvári Csaba. Efficient planning in large MDPs with weak linear function approximation. arXiv preprint arXiv:2007.06184, 2020.
  • [SSS+17]. Silver David, Schrittwieser Julian, Simonyan Karen, Antonoglou Ioannis, Huang Aja, Guez Arthur, Hubert Thomas, Baker Lucas, Lai Matthew, Bolton Adrian, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.
  • [SY94]. Singh Satinder P and Yee Richard C. An upper bound on the loss from approximate optimal-value functions. Machine Learning, 16(3):227–233, 1994.
  • [TV20]. Touati Ahmed and Vincent Pascal. Efficient learning in non-stationary linear Markov decision processes. arXiv preprint arXiv:2010.12870, 2020.
  • [Ver18]. Vershynin Roman. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press, 2018.
  • [Wai19a]. Wainwright Martin J. Stochastic approximation with cone-contractive operators: Sharp ℓ∞-bounds for Q-learning. arXiv preprint arXiv:1905.06265, 2019.
  • [Wai19b]. Wainwright Martin J. Variance-reduced Q-learning is minimax optimal. arXiv preprint arXiv:1906.04697, 2019.
  • [Wat89]. Watkins Christopher John Cornish Hellaby. Learning from delayed rewards. 1989.
  • [WD92]. Watkins Christopher JCH and Dayan Peter. Q-learning. Machine Learning, 8(3–4):279–292, 1992.
  • [WDYK20]. Wang Ruosong, Du Simon S, Yang Lin, and Kakade Sham. Is long horizon RL more difficult than short horizon RL? Advances in Neural Information Processing Systems, 33, 2020.
  • [WDYS20]. Wang Ruosong, Du Simon S, Yang Lin F, and Salakhutdinov Ruslan. On reward-free reinforcement learning with linear function approximation. arXiv preprint arXiv:2006.11274, 2020.
  • [WJLJ21]. Wei Chen-Yu, Jahromi Mehdi Jafarnia, Luo Haipeng, and Jain Rahul. Learning infinite-horizon average-reward MDPs with linear function approximation. In International Conference on Artificial Intelligence and Statistics, pages 3007–3015. PMLR, 2021.
  • [WSY20]. Wang Ruosong, Salakhutdinov Russ R, and Yang Lin. Reinforcement learning with general value function approximation: Provably efficient approach via bounded eluder dimension. Advances in Neural Information Processing Systems, 33, 2020.
  • [WVR17]. Wen Zheng and Van Roy Benjamin. Efficient reinforcement learning in deterministic systems with value function generalization. Mathematics of Operations Research, 42(3):762–782, 2017.
  • [XG20]. Xu Pan and Gu Quanquan. A finite-time analysis of Q-learning with neural network function approximation. In International Conference on Machine Learning, pages 10555–10565. PMLR, 2020.
  • [YW19]. Yang Lin and Wang Mengdi. Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning, pages 6995–7004. PMLR, 2019.
  • [YW20]. Yang Lin and Wang Mengdi. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. In International Conference on Machine Learning, pages 10746–10756. PMLR, 2020.
  • [ZBB+20]. Zanette Andrea, Brandfonbrener David, Brunskill Emma, Pirotta Matteo, and Lazaric Alessandro. Frequentist regret bounds for randomized least-squares value iteration. In International Conference on Artificial Intelligence and Statistics, pages 1954–1964. PMLR, 2020.
  • [ZHG21]. Zhou Dongruo, He Jiafan, and Gu Quanquan. Provably efficient reinforcement learning for discounted MDPs with feature mapping. In International Conference on Machine Learning, pages 12793–12802. PMLR, 2021.
  • [ZLKB19]. Zanette Andrea, Lazaric Alessandro, Kochenderfer Mykel J, and Brunskill Emma. Limiting extrapolation in linear approximate value iteration. Advances in Neural Information Processing Systems, 32:5615–5624, 2019.
  • [ZLKB20]. Zanette Andrea, Lazaric Alessandro, Kochenderfer Mykel, and Brunskill Emma. Learning near optimal policies with low inherent Bellman error. In International Conference on Machine Learning, pages 10978–10989. PMLR, 2020.
