Abstract
The curse of dimensionality is a widely known issue in reinforcement learning (RL). In the tabular setting where the state space S and the action space A are both finite, to obtain a nearly optimal policy with sampling access to a generative model, the minimax optimal sample complexity scales linearly with |S||A|, which can be prohibitively large when |S| or |A| is large. This paper considers a Markov decision process (MDP) that admits a set of state-action features, which can linearly express (or approximate) its probability transition kernel. We show that a model-based approach (resp. Q-learning) provably learns an ε-optimal policy (resp. Q-function) with high probability as soon as the sample size exceeds the order of K/((1 − γ)^3 ε^2) (resp. K/((1 − γ)^4 ε^2)), up to some logarithmic factor. Here K is the feature dimension and γ ∈ (0, 1) is the discount factor of the MDP. Both sample complexity bounds are provably tight, and our result for the model-based approach matches the minimax lower bound. Our results show that for an arbitrarily large-scale MDP, both the model-based approach and Q-learning are sample-efficient when K is relatively small, hence the title of this paper.
Keywords: model-based reinforcement learning, vanilla Q-learning, linear transition model, sample complexity, leave-one-out analysis
1. Introduction
Reinforcement learning (RL) studies the problem of learning and decision making in a Markov decision process (MDP). Recent years have seen exciting progress in applications of RL to real-world decision-making problems such as AlphaGo [SHM+16, SSS+17] and autonomous driving [KST+21]. Specifically, the goal of RL is to search for an optimal policy that maximizes the cumulative reward, based on sequential noisy data. There are two popular approaches to RL: model-based and model-free ones.
The model-based approaches start with formulating an empirical MDP by learning the probability transition model from the collected data samples, and then estimating the optimal policy / value function based on the empirical MDP.
The model-free approaches (e.g. Q-learning) learn the optimal policy or the optimal (action-)value function from samples. As its name suggests, model-free approaches do not attempt to learn the model explicitly.
Generally speaking, model-based approaches enjoy great flexibility: once the transition model has been learned, it can be applied to any other problem without touching the raw data samples. In comparison, model-free methods, due to their online nature, are usually memory-efficient and can interact with the environment and update the estimate on the fly.
This paper is devoted to investigating the sample efficiency of both model-based RL and Q-learning (arguably one of the most commonly adopted model-free RL algorithms). It is well known that MDPs suffer from the curse of dimensionality. For example, in the tabular setting where the state space S and the action space A are both finite, to obtain a near optimal policy or value function given sampling access to a generative model, the minimax optimal sample complexity scales linearly with |S||A| [AMK13, AKY20]. However, contemporary applications of RL often encounter environments with exceedingly large state and action spaces, whilst the data collection might be expensive or even high-stakes. This suggests a large gap between the theoretical findings and practical decision-making problems where |S| and |A| are large or even infinite.
To close the aforementioned theory-practice gap, one natural idea is to impose certain structural assumptions on the MDP. In this paper we follow the feature-based linear transition model studied in [YW19], where each state-action pair (s, a) admits a K-dimensional feature vector ϕ(s, a) that linearly expresses the transition dynamics P(· | s, a) through a set of unknown factors common to all (s, a). This model encompasses both the tabular case and the homogeneous model in which the state space can be partitioned into K equivalent classes. Assuming access to a generative model [Kak03, KS99], under this structural assumption, this paper aims to answer the following question:
How many samples are needed for model-based RL and Q-learning to learn an optimal policy under the feature-based linear transition model?
In what follows, we will show that the answer to this question scales linearly with the dimension K of the feature space and is independent of |S| and |A| under the feature-based linear transition model. With the aid of this structural assumption, model-based RL and Q-learning become significantly more sample-efficient than in the tabular setting.
Our contributions.
We focus our attention on an infinite-horizon MDP with discount factor γ ∈ (0, 1). We use ε-optimal policy to indicate a policy whose expected discounted cumulative rewards are ε-close to the optimal value of the MDP. Our contributions are two-fold:
- We demonstrate that model-based RL provably learns an ε-optimal policy by performing planning based on an empirical MDP constructed from a total number of Õ(K/((1 − γ)^3 ε^2)) samples, for all ε ∈ (0, (1 − γ)^{-1/2}]. Here Õ(·) hides logarithmic factors compared to the usual O(·) notation. To the best of our knowledge, this is the first theoretical guarantee for model-based RL under the feature-based linear transition model. This sample complexity bound matches the minimax limit established in [YW19] up to logarithmic factor.
- We prove that vanilla Q-learning yields an entrywise ε-accurate estimate of the optimal Q-function Q⋆ as soon as the sample size exceeds the order of Õ(K/((1 − γ)^4 ε^2)), for all ε ∈ (0, 1]. The dependency (1 − γ)^{-4} on the effective horizon matches the lower bound established in [LCC+21a] for vanilla Q-learning, and the bound improves upon the prior art [YW19, Theorem 2].
These results taken collectively show the minimax optimality of model-based RL and the sub-optimality of Q-learning in sample complexity.
2. Problem formulation
This paper focuses on tabular (finite) MDPs in the discounted infinite-horizon setting [B+00]. Here and throughout, Δ_d stands for the d-dimensional probability simplex and [N] ≔ {1, 2, ⋯, N} for any positive integer N.
Discounted infinite-horizon MDPs.
Denote a discounted infinite-horizon MDP by a tuple M = (S, A, P, r, γ), where S is a finite set of states, A is a finite set of actions, P : S × A → Δ(S) represents the probability transition kernel where P(s′ | s, a) denotes the probability of transiting from state s to state s′ when action a is taken, r : S × A → [0, 1] denotes the reward function where r(s, a) is the instantaneous reward received when taking action a while in state s, and γ ∈ (0, 1) is the discount factor.
Value function and Q-function.
Recall that the goal of RL is to learn a policy that maximizes the cumulative reward, which corresponds to value functions or Q-functions in the corresponding MDP. For a deterministic policy π : S → A and a starting state s ∈ S, we define the value function as

V^π(s) ≔ E[ ∑_{k=0}^∞ γ^k r(s_k, a_k) | s_0 = s ]

for all s ∈ S. Here, the trajectory is generated by a_k = π(s_k) and s_{k+1} ~ P(· | s_k, a_k) for every k ≥ 0. This function measures the expected discounted cumulative reward received on the trajectory {(s_k, a_k)}_{k≥0}, where the expectation is taken with respect to the randomness of the transitions s_{k+1} ~ P(· | s_k, a_k) along the trajectory. Since the immediate rewards lie in [0, 1], it is easy to derive that 0 ≤ V^π(s) ≤ (1 − γ)^{-1} for any policy π and state s. Accordingly, we define the Q-function for policy π as

Q^π(s, a) ≔ E[ ∑_{k=0}^∞ γ^k r(s_k, a_k) | s_0 = s, a_0 = a ]

for all (s, a) ∈ S × A. Here, the actions are chosen by the policy π except for the initial action a_0 = a (i.e. a_k = π(s_k) for all k ≥ 1). Similar to the value function, we can easily check that 0 ≤ Q^π(s, a) ≤ (1 − γ)^{-1} for any π and (s, a). To maximize the value function or Q-function, previous literature [BD59,SB18] establishes that there exists an optimal policy π⋆ which simultaneously maximizes V^π(s) (resp. Q^π(s, a)) for all s ∈ S (resp. (s, a) ∈ S × A). We define the optimal value function V⋆ and the optimal Q-function Q⋆ respectively as

V⋆(s) ≔ V^{π⋆}(s) = max_π V^π(s)  and  Q⋆(s, a) ≔ Q^{π⋆}(s, a) = max_π Q^π(s, a)

for any state s ∈ S and state-action pair (s, a) ∈ S × A.
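To make these definitions concrete, here is a minimal numerical sketch (the MDP instance below is synthetic and purely illustrative): policy evaluation via the linear system V^π = (I − γP_π)^{-1} r_π, together with the bound V^π ≤ (1 − γ)^{-1} noted above.

```python
# Illustrative toy MDP; all sizes and random draws are placeholders.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9

# Random transition kernel P[s, a] in the probability simplex; rewards in [0, 1].
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))

pi = np.zeros(S, dtype=int)          # an arbitrary deterministic policy

# Policy evaluation: V^pi solves V = r_pi + gamma * P_pi V.
P_pi = P[np.arange(S), pi]           # S x S transition matrix under pi
r_pi = r[np.arange(S), pi]
V_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

# Q^pi(s, a) = r(s, a) + gamma * sum_{s'} P(s'|s, a) V^pi(s')
Q_pi = r + gamma * P @ V_pi

# Rewards in [0, 1] imply both functions are bounded by 1 / (1 - gamma).
assert np.all((0 <= V_pi) & (V_pi <= 1 / (1 - gamma)))
assert np.all((0 <= Q_pi) & (Q_pi <= 1 / (1 - gamma)))
```

Note that V^π(s) = Q^π(s, π(s)) holds by construction, which the sketch can be used to verify.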
Linear transition model.
Given a set of K feature functions ϕ_1, …, ϕ_K : S × A → R, we define ϕ to be a feature mapping from S × A to R^K such that

ϕ(s, a) ≔ (ϕ_1(s, a), ϕ_2(s, a), ⋯, ϕ_K(s, a))^⊤.
Then we are ready to define the linear transition model [YW19] as follows.
Definition 1 (Linear transition model).
Given a discounted infinite-horizon MDP M = (S, A, P, r, γ) and a feature mapping ϕ : S × A → R^K, M admits the linear transition model if there exist some (unknown) functions ψ_1, …, ψ_K : S → R such that

P(s′ | s, a) = ∑_{k=1}^K ϕ_k(s, a) ψ_k(s′)   (2.1)

for every (s, a) ∈ S × A and s′ ∈ S.
Readers familiar with the linear MDP literature might immediately recognize that the above definition is the same as the structure imposed on the probability transition kernel P in the linear MDP model [YW19,JYWJ20,ZBB+20,HZG21,TV20,WDYS20,WJLJ21]. However, unlike the linear MDP model, which also requires the reward function r(s, a) to be linear in the feature mapping ϕ(s, a), here we do not impose any structural assumption on the reward.
Example 1 (Tabular MDP).
Each tabular MDP can be viewed as a linear transition model with feature mapping ϕ(s, a) = e_{(s,a)} (i.e. the vector with all entries equal to 0 except that the one corresponding to (s, a) equals 1) for all (s, a) ∈ S × A. To see this, we can check that Definition 1 is satisfied with K = |S||A| and ψ_{(s̃,ã)}(s′) = P(s′ | s̃, ã) for each (s̃, ã) ∈ S × A and s′ ∈ S. This example is a sanity check of Definition 1, which also shows that our results (Theorems 1 and 2) can recover previous results on tabular MDPs [AKY20,LCC+21a] by taking K = |S||A|.
Example 2 (Simplex Feature Space).
If all feature vectors fall in the probability simplex Δ^{K−1}, a linear transition model can be constructed by taking ψ_k(·) to be any probability measure over S for all k ∈ [K].
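The construction in Example 2 can be checked numerically. The following sketch (all sizes and random draws are illustrative) builds simplex-valued features and probability measures ψ_k, and verifies that (2.1) then yields a valid transition kernel.

```python
# Numerical check of Example 2 with synthetic, illustrative data.
import numpy as np

rng = np.random.default_rng(1)
S, A, K = 5, 3, 2

Phi = rng.random((S * A, K))
Phi /= Phi.sum(axis=1, keepdims=True)      # each phi(s, a) lies in the simplex

Psi = rng.random((K, S))
Psi /= Psi.sum(axis=1, keepdims=True)      # each psi_k is a distribution over S

P = Phi @ Psi                              # (2.1): P(s'|s,a) = sum_k phi_k(s,a) psi_k(s')

assert np.all(P >= 0)
assert np.allclose(P.sum(axis=1), 1.0)     # every row of P is a distribution
```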
A key observation is that the model size of the linear transition model with known feature mapping ϕ is K|S| (the number of coefficients ψ_k(s′) in (2.1)), which is still large when the state space is large. In contrast, it will be established later that to learn a near-optimal policy or Q-function, we only need a much smaller number of samples, which depends linearly on K and is independent of |S| and |A|.
Next, we introduce a critical assumption employed in prior literature [YW19,ZLKB19,SS20].
Assumption 1 (Anchor state-action pairs).
Assume there exists a set K of anchor state-action pairs with |K| = K such that for any (s, a) ∈ S × A, its corresponding feature vector ϕ(s, a) can be expressed as a convex combination of the feature vectors of the anchor state-action pairs {ϕ(s_i, a_i)}_{(s_i, a_i) ∈ K}:

ϕ(s, a) = ∑_{(s_i, a_i) ∈ K} λ_i(s, a) ϕ(s_i, a_i),  where λ_i(s, a) ≥ 0 and ∑_{(s_i, a_i) ∈ K} λ_i(s, a) = 1.   (2.2)

Further, we assume that the vectors in {ϕ(s_i, a_i)}_{(s_i, a_i) ∈ K} are linearly independent.
We pause to develop some intuition for this assumption using Examples 1 and 2. In Example 1, it is straightforward to check that tabular MDPs satisfy Assumption 1 with K = S × A. In terms of Example 2, without loss of generality we can assume that the subspace spanned by the features has full rank K (otherwise we can reduce the dimension of the feature space). Then we can also check that Example 2 satisfies Assumption 1 with any anchor set K such that the corresponding feature vectors are linearly independent. In fact, this sort of “anchor” notion appears widely in the literature: [AGM12] considers “anchor word” in topic modeling; [DS04] defines “separability” in their study of non-negative matrix factorization; [SJJ95] introduces “aggregate” in reinforcement learning; [DKW19] studies “anchor state” in soft state aggregation models. These concepts all bear some kind of resemblance to our definition of anchor state-action pairs here.
Throughout this paper, we assume that the feature mapping ϕ is known, which is a widely adopted assumption in previous literature [YW19,JYWJ20,ZHG21,HZG21,TV20,WDYS20,WJLJ21]. In practice, large scale RL usually makes use of representation learning to obtain the feature mapping ϕ. Furthermore, the learned representations can be selected to satisfy the anchor state-action pairs assumption by design.
A useful implication of Assumption 1 is that we can represent the transition kernel as

P(· | s, a) = ∑_{(s_i, a_i) ∈ K} λ_i(s, a) P(· | s_i, a_i).   (2.3)

This follows simply from substituting (2.2) into (2.1) (see (A.4) in Appendix A for a formal proof).
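A small numerical sanity check of (2.2) and (2.3) is given below, under the illustrative assumption that the K anchor features are the standard basis vectors of R^K, so every simplex-valued feature is automatically a convex combination of them.

```python
# Illustrative check of Assumption 1 and identity (2.3); instance is synthetic.
import numpy as np

rng = np.random.default_rng(2)
S, A, K = 5, 3, 2
n = S * A                                   # index state-action pairs by 0..n-1

Phi = rng.random((n, K))
Phi /= Phi.sum(axis=1, keepdims=True)       # simplex-valued features
anchors = np.arange(K)
Phi[anchors] = np.eye(K)                    # plant anchor features e_1, ..., e_K

Psi = rng.random((K, S))
Psi /= Psi.sum(axis=1, keepdims=True)
P = Phi @ Psi                               # linear transition model (2.1)

# Solve phi(s, a) = Phi_anchor^T lambda(s, a) for the convex weights; with
# linearly independent anchor features this is a K x K linear system.
Phi_anchor = Phi[anchors]                   # K x K, invertible
Lam = np.linalg.solve(Phi_anchor.T, Phi.T).T

assert np.all(Lam >= -1e-12) and np.allclose(Lam.sum(axis=1), 1.0)
# (2.3): P(.|s, a) = sum_i lambda_i(s, a) P(.|s_i, a_i)
assert np.allclose(P, Lam @ P[anchors])
```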
3. Model-based RL with a generative model
We start by studying model-based RL with a generative model in this section. We propose a model-based planning algorithm and show that it returns an ε-optimal policy with minimax optimal sample size.
3.1. Main results
A generative model and an empirical MDP.
We assume access to a generative model that provides us with independent samples from M. For each anchor state-action pair (s_i, a_i) ∈ K, we collect N independent samples s′_{i,1}, …, s′_{i,N} ~ P(· | s_i, a_i). This allows us to construct an empirical transition kernel P̂ where

P̂(s′ | s, a) ≔ ∑_{(s_i, a_i) ∈ K} λ_i(s, a) P̂(s′ | s_i, a_i),  with  P̂(s′ | s_i, a_i) ≔ (1/N) ∑_{j=1}^N 1{s′_{i,j} = s′},   (3.1)

for each (s, a) ∈ S × A and s′ ∈ S. Here, P̂(s′ | s_i, a_i) is an empirical estimate of P(s′ | s_i, a_i) and then (2.3) is employed. With P̂ in hand, we can construct an empirical MDP M̂ = (S, A, P̂, r, γ). Our goal here is to derive the sample complexity which guarantees that the optimal policy of M̂ is an ε-optimal policy for the true MDP M. The algorithm is summarized below.
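The estimation step (3.1) can be sketched as follows; the anchor rows and the weight vector λ used below are synthetic placeholders, not data from any real instance.

```python
# Sketch of the model-estimation step (3.1) with illustrative data: draw N
# next-state samples from each anchor pair, form empirical anchor rows, and
# extend to an arbitrary (s, a) through its convex weights lambda(s, a).
import numpy as np

rng = np.random.default_rng(3)
S, K, N = 6, 3, 5000

P_anchor = rng.random((K, S))
P_anchor /= P_anchor.sum(axis=1, keepdims=True)   # true anchor rows P(.|s_i, a_i)

# Empirical estimate of each anchor row from N i.i.d. generative-model samples.
P_anchor_hat = np.zeros((K, S))
for i in range(K):
    samples = rng.choice(S, size=N, p=P_anchor[i])
    P_anchor_hat[i] = np.bincount(samples, minlength=S) / N

# Extend to an arbitrary pair via its convex weights lambda(s, a), as in (3.1).
lam = np.array([0.2, 0.5, 0.3])                   # hypothetical weights
P_hat_row = lam @ P_anchor_hat
assert np.isclose(P_hat_row.sum(), 1.0)
```

Only the K anchor rows are ever estimated from data; every other row of P̂ is a deterministic mixture of them.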
Careful readers may note that in Algorithm 1, λ(s, a) is used in the construction of P̂, while λ(s, a) is not an input to the algorithm. This is because, given that K and ϕ are known, λ(s, a) can be calculated explicitly by solving (2.2). The following theorem provides theoretical guarantees for the output policy of the chosen optimization algorithm on the empirical MDP M̂.
Theorem 1.
Suppose that δ > 0 and ε ∈ (0, (1 − γ)^{-1/2}]. Let π̂ be the policy returned by Algorithm 1. Assume that

N ≥ (C log(K/((1 − γ)εδ))) / ((1 − γ)^3 ε^2)   (3.2)

for some sufficiently large constant C > 0. Then with probability exceeding 1 − δ,

Q⋆(s, a) − Q^{π̂}(s, a) ≤ ε + 4ε_opt/(1 − γ)   (3.3)

for every (s, a) ∈ S × A. Here ε_opt is the target algorithmic error level in Algorithm 1.
We first remark that the two terms on the right-hand side of (3.3) can be viewed as statistical error and algorithmic error, respectively. The first term ε denotes the statistical error coming from the deviation of the empirical MDP M̂ from the true MDP M. As the sample size N grows, ε can be driven towards 0. The other term 4ε_opt/(1 − γ) represents the algorithmic error, where ε_opt is the target accuracy level of the planning algorithm applied to M̂. Note that ε_opt can be made arbitrarily small if we run the planning algorithm (e.g. value iteration) for enough iterations. A few implications of this theorem are in order.
- Minimax-optimal sample complexity. Assume that ε_opt is made negligibly small, e.g. ε_opt = O((1 − γ)ε), to be discussed in the next point. Note that we draw N independent samples for each anchor state-action pair (s_i, a_i) ∈ K; therefore, the requirement (3.2) for finding an O(ε)-optimal policy translates into the following sample complexity requirement

KN = Õ(K/((1 − γ)^3 ε^2)).

This matches the minimax optimal lower bound (up to a logarithmic factor) established in [YW19, Theorem 1] for feature-based MDPs. In comparison, for tabular MDPs the minimax optimal sample complexity is Õ(|S||A|/((1 − γ)^3 ε^2)) [AMK13,AKY20]. Our sample complexity scales linearly with K instead of |S||A|, as desired. - Computational complexity. An advantage of Theorem 1 is that it accommodates any efficient planning algorithm applied to the empirical MDP M̂. Classical algorithms include Q-value iteration (QVI) and policy iteration (PI) [Put14]. For example, QVI achieves the target level ε_opt within on the order of (1 − γ)^{-1} log(1/((1 − γ)ε_opt)) iterations, and with the factored representation (2.3) each iteration takes time proportional to K|S||A|. To learn an O(ε)-optimal policy, which requires the sample complexity (3.2) and the target level ε_opt = O((1 − γ)ε), the overall running time is therefore nearly linear in K|S||A|/(1 − γ) up to logarithmic factors.
In comparison, for tabular MDPs the corresponding running time scales with |S|^2|A| per iteration [AKY20]. This suggests that under the feature-based linear transition model, the computational complexity is roughly |S|/K times lower than that for tabular MDPs (up to logarithmic factors), which is significantly more efficient when K is not too large. - Stability vis-à-vis model misspecification. A more general version of Theorem 1 (Theorem 3 in Appendix B) shows that when P only approximately (instead of exactly) admits the linear transition model, we can still achieve a meaningful result. Specifically, if there exists a linear transition kernel P_lin obeying max_{(s,a) ∈ S×A} ‖P(· | s, a) − P_lin(· | s, a)‖_1 ≤ ξ for some ξ ≥ 0, we can show that the policy π̂ returned by Algorithm 1 (with slight modification) satisfies

Q⋆(s, a) − Q^{π̂}(s, a) ≤ ε + 4ε_opt/(1 − γ) + O(ξ/(1 − γ)^2)

for every (s, a) ∈ S × A. This shows that the model-based method is stable vis-à-vis model misspecification. Interested readers are referred to Appendix B for more details.
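The planning step itself can be any standard solver. Below is a minimal QVI sketch on a synthetic empirical MDP; the iteration count comes from the usual γ-contraction argument, and all instance data is made up for illustration.

```python
# Q-value iteration on an (empirical) MDP, run long enough that the
# gamma-contraction guarantees accuracy eps_opt. Synthetic instance.
import numpy as np

rng = np.random.default_rng(4)
S, A, gamma, eps_opt = 5, 3, 0.9, 1e-6

P_hat = rng.random((S, A, S))
P_hat /= P_hat.sum(axis=2, keepdims=True)
r = rng.random((S, A))

# gamma^T / (1 - gamma) <= eps_opt suffices after T iterations of QVI.
T = int(np.ceil(np.log(1 / ((1 - gamma) * eps_opt)) / np.log(1 / gamma)))
Q = np.zeros((S, A))
for _ in range(T):
    Q = r + gamma * P_hat @ Q.max(axis=1)   # Bellman update on the empirical MDP

pi_hat = Q.argmax(axis=1)                   # greedy policy w.r.t. the final Q
```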
In Algorithm 1, the reward function r is assumed to be known. If the information of r is unavailable, an alternative is to assume that r is linear with respect to the feature mapping ϕ, i.e. r(s, a) = θ^⊤ϕ(s, a) for every (s, a) ∈ S × A, which is widely adopted in the linear MDP literature [HZG21,JYWJ20,WDYS20,WJLJ21]. Under this linear assumption, one can obtain θ by solving the following linear system of equations

θ^⊤ϕ(s_i, a_i) = r(s_i, a_i)  for all (s_i, a_i) ∈ K,   (3.4)

which can be constructed from the observed rewards r(s_i, a_i) at the anchor state-action pairs.
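Solving (3.4) is a plain K × K linear system. The sketch below uses synthetic anchor features and rewards purely to illustrate the recovery of θ.

```python
# Hedged sketch of (3.4): with linear rewards, the anchor rewards pin down
# theta through a K x K linear system. All data here is synthetic.
import numpy as np

rng = np.random.default_rng(5)
K = 3
theta_true = rng.random(K)

Phi_anchor = rng.random((K, K)) + np.eye(K)     # anchor features (invertible matrix)
r_anchor = Phi_anchor @ theta_true              # observed rewards r(s_i, a_i)

theta = np.linalg.solve(Phi_anchor, r_anchor)   # solve (3.4)
assert np.allclose(theta, theta_true)
```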
4. Model-free RL—vanilla Q Learning
In this section, we turn to studying one of the most popular model-free RL algorithms—Q-learning. We provide a tight sample complexity bound for vanilla Q-learning under the feature-based linear transition model, which shows its sample efficiency (the bound depends on K instead of |S| or |A|) and its sub-optimality in the dependency on the effective horizon.
4.1. Q-learning algorithm
The vanilla Q-learning algorithm maintains a Q-function estimate Q_t : S × A → R for all t ≥ 0, with initialization Q_0 obeying 0 ≤ Q_0(s, a) ≤ (1 − γ)^{-1} for every (s, a) ∈ S × A. Assume we have access to a generative model. In each iteration t ≥ 1, we collect an independent sample s_t(s_i, a_i) ~ P(· | s_i, a_i) for every anchor state-action pair (s_i, a_i) ∈ K and define the empirical transition kernel P_t to be

P_t(s′ | s, a) ≔ ∑_{(s_i, a_i) ∈ K} λ_i(s, a) 1{s_t(s_i, a_i) = s′}.

Then given the learning rate η_t ∈ (0, 1], the algorithm adopts the following update rule to update all entries of the Q-function estimate:

Q_t = (1 − η_t) Q_{t−1} + η_t T_t(Q_{t−1}).

Here, T_t is an empirical Bellman operator associated with the linear transition model M and the set K and is given by

T_t(Q)(s, a) ≔ r(s, a) + γ ∑_{s′ ∈ S} P_t(s′ | s, a) max_{a′ ∈ A} Q(s′, a′),

where (2.3) is used in the construction. Clearly, this newly defined operator is an unbiased estimate of the famous Bellman operator T [Bel52] defined as

T(Q)(s, a) ≔ r(s, a) + γ E_{s′ ~ P(· | s, a)}[ max_{a′ ∈ A} Q(s′, a′) ].

A critical property is that the Bellman operator T is γ-contractive with a unique fixed point, which is the optimal Q-function Q⋆ [Bel52]. To solve the fixed-point equation T(Q) = Q, Q-learning was then introduced by [WD92] based on the idea of stochastic approximation [RM51]. This procedure is precisely described in Algorithm 2.
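The update rule above can be sketched end to end. Everything below (instance sizes, the planted model, the linearly rescaled learning rate with a placeholder constant) is illustrative rather than the paper's exact Algorithm 2.

```python
# Vanilla Q-learning sketch: one fresh next-state sample per anchor pair per
# iteration feeds the empirical Bellman operator, extended via lambda(s, a).
import numpy as np

rng = np.random.default_rng(6)
S, A, K, gamma, T = 5, 3, 3, 0.8, 20000
n = S * A                                       # flat index: pair (s, a) -> s*A + a

# Plant a linear transition model with basis-vector anchor features.
Phi = rng.random((n, K)); Phi /= Phi.sum(axis=1, keepdims=True)
anchors = np.arange(K); Phi[anchors] = np.eye(K)
Psi = rng.random((K, S)); Psi /= Psi.sum(axis=1, keepdims=True)
P = Phi @ Psi                                   # true kernel, rows over (s, a)
Lam = Phi                                       # with e_k anchor features, lambda = phi
r = rng.random(n)

Q = np.zeros(n)
for t in range(1, T + 1):
    # Linearly rescaled learning rate with a placeholder constant (c2 = 1).
    eta = 1.0 / (1 + (1 - gamma) * t / np.log(T) ** 2)
    # One generative-model sample per anchor pair.
    s_next = np.array([rng.choice(S, p=P[i]) for i in anchors])
    V = Q.reshape(S, A).max(axis=1)             # current greedy value estimate
    # Empirical Bellman operator, extended to every (s, a) through Lam.
    T_hat = r + gamma * Lam @ V[s_next]
    Q = (1 - eta) * Q + eta * T_hat

V_T = Q.reshape(S, A).max(axis=1)
```

Note that only K samples are drawn per iteration, yet every entry of Q is updated, exactly as in the update rule above.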
4.2. Main results
We are now ready to provide our main result for vanilla Q-learning, assuming sampling access to a generative model.
Theorem 2.
Consider any δ ∈ (0, 1) and ε ∈ (0, 1]. Assume that for any 0 ≤ t ≤ T, the learning rates satisfy

1/(1 + c_1(1 − γ)T/log^2 T) ≤ η_t ≤ 1/(1 + c_2(1 − γ)t/log^2 T)   (4.1)

for some sufficiently small universal constants c_1 ≥ c_2 > 0. Suppose that the total number of iterations T exceeds

(C_3 log^4 T · log(KT/δ)) / ((1 − γ)^4 ε^2)   (4.2)

for some sufficiently large universal constant C_3 > 0. If the initialization obeys 0 ≤ Q_0(s, a) ≤ (1 − γ)^{-1} for any (s, a) ∈ S × A, then with probability exceeding 1 − δ, the output Q_T of Algorithm 2 satisfies

‖Q_T − Q⋆‖_∞ ≤ ε.   (4.3)

In addition, let π_T (resp. V_T) be the policy (resp. value function) induced by Q_T; then one has

‖V_T − V⋆‖_∞ ≤ ε  and  V⋆(s) − V^{π_T}(s) ≤ 2ε/(1 − γ)  for all s ∈ S.   (4.4)
This theorem provides theoretical guarantees on the performance of Algorithm 2. A few implications of this theorem are in order.
- Learning rate. The condition (4.1) accommodates two commonly adopted choices of learning rates: (i) linearly rescaled learning rates η_t = [1 + c_2(1 − γ)t/log^2 T]^{-1}, and (ii) iteration-invariant learning rates η_t ≡ [1 + c_1(1 − γ)T/log^2 T]^{-1}. Interested readers are referred to the discussions in [LCC+21a, Section 3.1] for more details on these two learning rate schemes.
- Tight sample complexity bound. Note that we draw K independent samples in each iteration; therefore, the iteration complexity (4.2) can be translated into the following sample complexity bound for Q-learning to achieve ε-accuracy:

TK = Õ(K/((1 − γ)^4 ε^2)).   (4.5)

As we will see shortly, this result improves upon the state-of-the-art sample complexity bound presented in [YW19, Theorem 2]. In addition, the dependency (1 − γ)^{-4} on the effective horizon matches the lower bound established in [LCC+21a, Theorem 2] for vanilla Q-learning using either learning rate scheme covered in the previous remark, suggesting that our sample complexity bound (4.5) is sharp.
- Stability vis-à-vis model misspecification. Just like the model-based approach, Q-learning is also stable vis-à-vis model misspecification when P approximately admits the linear transition model. We refer interested readers to Theorem 4 in Appendix B for more details.
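The two admissible schedules in (4.1) can be written out explicitly; since c_1 and c_2 are unspecified universal constants, the values below are placeholders.

```python
# The two learning-rate schedules accommodated by (4.1), with placeholder
# constants c1 >= c2 (their true values are unspecified universal constants).
import numpy as np

gamma, T = 0.9, 10000
c1, c2 = 1.0, 0.5
t = np.arange(T + 1)

eta_rescaled = 1.0 / (1 + c2 * (1 - gamma) * t / np.log(T) ** 2)   # linearly rescaled
eta_constant = np.full(T + 1, 1.0 / (1 + c1 * (1 - gamma) * T / np.log(T) ** 2))

assert np.all((0 < eta_rescaled) & (eta_rescaled <= 1))
# The iteration-invariant schedule sits at the lower end of the range (4.1).
assert np.all(eta_constant <= eta_rescaled + 1e-12)
```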
Comparison with [YW19].
We compare our result with the sample complexity bounds for Q-learning under the feature-based linear transition model in [YW19].
- We first compare our result with [YW19, Theorem 2], which is, to the best of our knowledge, the state-of-the-art theory for this problem. When there is no model misspecification, [YW19, Theorem 2] showed that in order for their Phased Parametric Q-learning (Algorithm 1 therein) to learn an ε-optimal policy, the sample size needs to be

Õ(K/((1 − γ)^7 ε^2)).
Note that (4.5) is the sample complexity required for an entrywise ε-accurate estimate of the optimal Q-function; thus a fair comparison requires deducing the sample complexity for learning an ε-optimal policy from (4.4), which is

Õ(K/((1 − γ)^6 ε^2)).
Hence, our sample complexity improves upon previous work by a factor at least on the order of (1 − γ)^{-1}. However, it is worth mentioning that [YW19, Theorem 2] is built upon the weaker conditions ∑_i λ_i(s, a) = 1 and ‖λ(s, a)‖_1 ≤ L for some L ≥ 1, which do not require λ_i(s, a) ≥ 0. Our result holds under Assumption 1, which requires ∑_i λ_i(s, a) = 1 and λ_i(s, a) ≥ 0. Under the current analysis framework, it is difficult to obtain tight sample complexity bounds without assuming λ_i(s, a) ≥ 0. - Besides vanilla Q-learning, [YW19] also proposed a new variant of Q-learning called Optimal Phased Parametric Q-Learning (Algorithm 2 therein), which is essentially Q-learning with variance reduction. [YW19, Theorem 3] showed that the sample complexity of this algorithm is

Õ(K/((1 − γ)^3 ε^2)),
which matches the minimax optimal lower bound (up to a logarithmic factor) established in [YW19, Theorem 1]. Careful readers might notice that this sample complexity bound is better than ours for vanilla Q-learning. We emphasize that, as elucidated in the second implication under Theorem 2, our result is already tight for vanilla Q-learning. This observation reveals that while the sample complexity of vanilla Q-learning is provably sub-optimal, variants of Q-learning can have better performance and achieve minimax optimal sample complexity.
We conclude this section by comparing model-based and model-free approaches. Theorem 1 shows that the sample complexity of the model-based approach is minimax optimal, whilst vanilla Q-learning, perhaps the most commonly adopted model-free method, is sub-optimal according to Theorem 2. However, this does not mean that model-based methods are better than model-free ones, since (i) some variants of Q-learning (see [YW19, Algorithm 2] for example) also enjoy minimax optimal sample complexity; and (ii) in many applications it might be unrealistic to estimate the model in advance.
5. A glimpse of our technical approaches
The establishment of Theorems 1 and 2 calls for a series of technical novelties in the proof. In what follows, we briefly highlight our key technical ideas and novelties.
For the model-based approach, we employ a “leave-one-out” analysis to decouple the complicated statistical dependency between the empirical probability transition model and the corresponding optimal policy. Specifically, [AKY20] proposed to construct a collection of auxiliary MDPs, where each one of them leaves out a single state s by setting s to be an absorbing state and keeping everything else untouched. We tailor this high-level idea to the needs of the linear transition model; the independence between the newly constructed MDP with absorbing state s and the data samples collected at state s then facilitates our analysis, as detailed in Lemma 1. This “leave-one-out” type of analysis has been utilized in studying numerous problems by a long line of work, such as [EK18,MWCC18,Wai19a,CFMW19,CCF+20,CFMY20,CFWY21], just to name a few.
To obtain a tighter sample complexity bound than the previous one in [YW19] for vanilla Q-learning, we invoke Freedman’s inequality [Fre75] for the concentration of an error term with martingale structure, as illustrated in Appendix C, whereas the classical tools used in analyzing Q-learning are Hoeffding’s inequality and Bernstein’s inequality [YW19]. The use of Freedman’s inequality helps us establish a recursive relation on ‖Q_t − Q⋆‖_∞, which consequently leads to the performance guarantee (4.3). It is worth mentioning that [LCC+21a] also studied vanilla Q-learning in the tabular MDP setting and adopted Freedman’s inequality, while we emphasize that it requires considerable effort and more delicate analyses to study the linear transition model and also allow for model misspecification in the current paper, as detailed in the appendices.
6. Additional related literature
To remedy the issue of prohibitively high sample complexity, there exists a substantial body of literature proposing and studying many structural assumptions and complexity notions under different settings. The current paper focuses on the linear transition model, which has been studied in MDPs by numerous previous works [YW19,JYWJ20,YW20,ZHG21,MJTS20,HDL+21,WDYS20,TV20,HZG21,WJLJ21]. Among them, [YW19] studied the linear transition model and provided tight sample complexity bounds for a new variant of Q-learning with the help of variance reduction. [JYWJ20] focused on linear MDPs and designed an algorithm called “Least-Squares Value Iteration with UCB” with both polynomial runtime and polynomial sample complexity without accessing a generative model. [WDYS20] extended the study of linear MDPs to the framework of reward-free reinforcement learning. [ZHG21] considered a different feature mapping called the linear kernel MDP and devised an algorithm with a polynomial regret bound without a generative model. Other popular structural assumptions include the following: [WVR17] studied fully deterministic transition dynamics; [JKA+17] introduced the Bellman rank and proposed an algorithm whose sample size depends polynomially on the Bellman rank to obtain a near-optimal policy in contextual decision processes; [DLWZ19] assumed that the value function has low variance compared to the mean for all deterministic policies; [MR07,PLT+08,ABA18,ZLKB20] used linear models to approximate the value function; [LCC+21b] assumed that the optimal Q-function can be linearly parameterized by the features.
Apart from the linear transition model, another notion adopted in this work is the generative model, whose role in discounted MDPs has been studied by extensive literature. The concept of a generative model was originally introduced by [KS99], and then widely adopted in numerous works, including [Kak03,AMK13,YW19,Wai19b,AKY20,PW20], to name a few. Specifically, it is assumed that a generative model of the studied MDP is available, which can be queried for every state-action pair and outputs the next state. Among previous works, [AMK13] proved that the minimax lower bound on the sample complexity to obtain an ε-optimal policy is on the order of |S||A|/((1 − γ)^3 ε^2). [AMK13] also showed that the model-based approach can output an ε-optimal value function with near-optimal sample complexity for ε ∈ (0, 1). Then [AKY20] made significant progress on the challenging problem of establishing the minimax optimal sample complexity for estimating an ε-optimal policy with the help of the “leave-one-out” analysis.
In addition, after being proposed in [Wat89], Q-learning has become the focus of a rich line of research [WD92, BT96, KS99, EDMB03, AMGK11, JAZBJ18, Wai19a, CZD+19, LWC+20b, XG20]. Among them, [CZD+19,LWC+20b,XG20] studied Q-learning in the presence of Markovian data, i.e. a single sample trajectory. In contrast, under the generative setting of Q-learning, where a fresh sample can be drawn from the simulator at each iteration, [Wai19b] analyzed a variant of Q-learning with the help of variance reduction, which was proved to enjoy minimax optimal sample complexity Õ(|S||A|/((1 − γ)^3 ε^2)). More recently, [LCC+21a] sharpened the lower bound for vanilla Q-learning in terms of its scaling with the effective horizon and proved a matching upper bound Õ(|S||A|/((1 − γ)^4 ε^2)).
7. Discussion
This paper studies the sample complexity of both model-based and model-free RL under a discounted infinite-horizon MDP with a feature-based linear transition model. We establish tight sample complexity bounds for both the model-based approach and Q-learning, which scale linearly with the feature dimension K instead of |S||A|, thereby considerably reducing the required sample size for large-scale MDPs when K is relatively small. Our results are sharp, and the sample complexity bound for the model-based approach matches the minimax lower bound. The current work suggests a couple of directions for future investigation, as discussed in detail below.
Extension to episodic MDPs. An interesting direction for future research is to study the linear transition model in episodic MDPs. The focus of this work is infinite-horizon discounted MDPs, and hopefully the analysis here can be extended to study episodic MDPs as well ([DB15,DLB17,AOM17,JA18,WDYK20,HZG21]).
Continuous state and action space. The state and action spaces in the current paper are still assumed to be finite, since the proof relies heavily on matrix operations. However, we expect that the results can be extended to accommodate continuous state and action spaces by employing a more delicate analysis.
Accommodating the entire range of ε. Since both value functions and Q-functions can take values in [0, (1 − γ)^{-1}], ideally our theory should cover all choices of ε ∈ (0, (1 − γ)^{-1}]. However, we require that ε ∈ (0, (1 − γ)^{-1/2}] in Theorem 1 and ε ∈ (0, 1] in Theorem 2. While most prior works like [AKY20,YW19] also impose these restrictions, a recent work [LWC+20a] proposed a perturbed model-based planning algorithm and proved minimax optimal guarantees for any ε ∈ (0, (1 − γ)^{-1}]. While their work only focused on model-based RL under tabular MDPs, an interesting future direction is to improve our theory to accommodate any ε ∈ (0, (1 − γ)^{-1}].
General function approximation. Another future direction is to extend the study to more general function approximation, starting from the linear structure covered in this paper. There exists a rich body of work proposing and studying different structures, such as linear value function approximation [MR07,PLT+08,ABA18,ZLKB20], linear MDPs with infinite-dimensional features [AHKS20], Eluder dimension [WSY20], Bellman rank [JKA+17] and Witness rank [SJK+19], etc. Therefore, it is promising to investigate these settings and improve sample efficiency.
Acknowledgements
B. Wang is supported in part by Gordon Y. S. Wu Fellowships in Engineering. Y. Yan is supported in part by ARO grant W911NF-20-1-0097 and NSF grant CCF-1907661. Part of this work was done while Y. Yan was visiting the Simons Institute for the Theory of Computing. J. Fan is supported in part by the ONR grant N00014-19-1-2120 and the NSF grants DMS-1662139, DMS-1712591, DMS-2052926, DMS-2053832, and the NIH grant 2R01-GM072611-15.
A. Notations
In this section we gather the notations that will be used throughout the appendix.
For any vectors u = [u_i]_{1≤i≤n} and v = [v_i]_{1≤i≤n} in R^n, let u ∘ v denote the Hadamard product of u and v. We slightly abuse notation and use √· and |·| to denote entrywise operations, i.e. for any vector u we denote √u ≔ [√u_i]_{1≤i≤n} and |u| ≔ [|u_i|]_{1≤i≤n}. Furthermore, the binary relations ≤ and ≥ are both defined in an entrywise manner, i.e. u ≤ v (resp. u ≥ v) means u_i ≤ v_i (resp. u_i ≥ v_i) for all 1 ≤ i ≤ n. For a collection of vectors V^{(1)}, …, V^{(m)} ∈ R^n, we define the max operator entrywise, i.e. max_j V^{(j)} ≔ [max_{1≤j≤m} V^{(j)}_i]_{1≤i≤n}.
For any matrix M, ∥M∥_1 is defined as the largest row-wise ℓ_1 norm of M, i.e. ∥M∥_1 ≔ max_i ∑_j |M_{ij}|. In addition, we define 1 to be the all-one vector and I the identity matrix. To express the probability transition kernel P in matrix form, we define P ∈ R^{|S||A|×|S|} to be the matrix whose (s, a)-th row P_{s,a} corresponds to P(· | s, a). In addition, we define P_π to be the probability transition matrix induced by policy π, i.e. P_π((s, a), (s′, a′)) = P(s′ | s, a) 1{a′ = π(s′)} for all state-action pairs (s, a) and (s′, a′). We define π_t to be the policy induced by Q_t, i.e. Q_t(s, π_t(s)) = max_a Q_t(s, a) for all s ∈ S. Furthermore, we denote the reward function r by the vector r ∈ R^{|S||A|}, i.e. the (s, a)-th element of r equals r(s, a). In the same manner, we define the vectors V^π, V⋆, V_t ∈ R^{|S|} and Q^π, Q⋆, Q_t ∈ R^{|S||A|} to represent V^π, V⋆, V_t, Q^π, Q⋆ and Q_t respectively. Using these notations, we can rewrite the Bellman equation as

Q^π = r + γPV^π.   (A.1)
Further, for any vector V ∈ R^{|S|}, let Var_{s,a}(V) be

Var_{s,a}(V) ≔ P_{s,a}(V ∘ V) − (P_{s,a}V)^2,   (A.2)

and define Var_P(V) ∈ R^{|S||A|} to be

Var_P(V) ≔ P(V ∘ V) − (PV) ∘ (PV),   (A.3)

where P_{s,a} is the (s, a)-th row of P.
Next, we reconsider Assumption 1. For any state-action pair (s, a), we define the vector λ(s, a) ≔ [λ_i(s, a)]_{1≤i≤K} (resp. ϕ(s, a)) and the matrix Λ ∈ R^{|S||A|×K} (resp. Φ ∈ R^{|S||A|×K}) whose (s, a)-th row corresponds to λ(s, a)^⊤ (resp. ϕ(s, a)^⊤). Define the vector ψ(s′) ≔ [ψ_k(s′)]_{1≤k≤K} and the matrix Ψ ∈ R^{K×|S|} whose s′-th column corresponds to ψ(s′). Further, let P_K (resp. Φ_K) be the submatrix of P (resp. Φ) formed by concatenating the rows indexed by the anchor pairs in K. Using these notations, we can express the relations in Definition 1 and Assumption 1 as P = ΦΨ, Φ = ΛΦ_K and P_K = Φ_KΨ. Note that Assumption 1 suggests that Φ_K is invertible. Taking these equations collectively yields

P = ΛΦ_KΨ = ΛP_K,   (A.4)
which is reminiscent of the anchor word condition in topic modelling [AGM12]. In addition, for each iteration t, we denote the collected samples as and define a matrix to be
Pt((s, a), s′) = 1{st(s, a) = s′} (A.5)
for any (s, a) ∈ K and s′ ∈ S. Further, we define P̃t ≔ ΛPt. Then it is clear that P̃t has nonnegative entries and unit ℓ1 norm for each row due to Assumption 1, i.e. ∥P̃t∥1 = 1.
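The anchor relation implied by Assumption 1 — every row of P is a convex combination of the K anchor rows, i.e. P = ΛPK — can be sanity-checked numerically. The following sketch is illustrative only (the sizes and the random model are assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

S, K, nSA = 5, 3, 8                        # states, anchors, state-action pairs (illustrative)
P_K = rng.dirichlet(np.ones(S), size=K)    # transition rows of the K anchor pairs
Lam = rng.dirichlet(np.ones(K), size=nSA)  # Lambda: each row is a convex weight lambda(s,a)
Lam[:K] = np.eye(K)                        # anchor pairs represent themselves

P = Lam @ P_K                              # the anchor relation: P = Lambda P_K

# Every row of P is a valid probability distribution ...
assert np.all(P >= 0) and np.allclose(P.sum(axis=1), 1.0)
# ... and the anchor rows of P reproduce P_K exactly
assert np.allclose(P[:K], P_K)
```

This is exactly why the model can be learned from the anchor rows alone: once PK is known, the mixing matrix Λ reconstructs all of P.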
B. Analysis of model-based RL (Proof of Theorem 1)
In this section, we provide the complete proof of Theorem 1. In fact, our proof strategy justifies a more general version of Theorem 1 that accounts for model misspecification, as stated below.
Theorem 3.
Suppose that δ > 0 and ε ∈ (0, (1 − γ)−1/2]. Assume that there exists a probability transition model obeying Definition 2.1 and Assumption 1 with feature vectors and anchor state-action pairs such that
for some ξ ≥ 0. Let π̂ be the policy returned by Algorithm 1. Assume that
(B.1) |
for some sufficiently large constant C > 0. Then with probability exceeding 1 − δ,
(B.2) |
for every state-action pair .
Theorem 3 subsumes Theorem 1 as a special case with ξ = 0. The remainder of this section is devoted to proving Theorem 3.
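Before diving into the proof, the mechanism behind Theorem 3 can be illustrated numerically: estimating only the K anchor rows of P and extending the estimate through Λ drives the model error down at the anchor sampling rate, independently of |S||A|. The sketch below is illustrative only; the random MDP, the sizes, and the sampling scheme are assumptions, not the paper's Algorithm 1:

```python
import numpy as np

rng = np.random.default_rng(2)

S, K, nSA = 5, 3, 8
P_K = rng.dirichlet(np.ones(S), size=K)    # true anchor transition rows
Lam = rng.dirichlet(np.ones(K), size=nSA)  # convex mixing weights lambda(s,a)
Lam[:K] = np.eye(K)
P = Lam @ P_K                              # true transition matrix

def estimate_P(N):
    """Draw N next-states per anchor pair, then extend via P_hat = Lambda P_hat_K."""
    P_hat_K = np.empty_like(P_K)
    for k in range(K):
        counts = rng.multinomial(N, P_K[k])
        P_hat_K[k] = counts / N
    return Lam @ P_hat_K

def l1_err(P_hat):
    # ||P_hat - P||_1: the largest row-wise l1 norm, matching the notation above
    return np.abs(P_hat - P).sum(axis=1).max()

err_small, err_big = l1_err(estimate_P(50)), l1_err(estimate_P(5000))
assert err_big < err_small        # more samples per anchor -> better model everywhere
assert err_big < 0.2
```

Note that only K rows are ever sampled; the error of every other row is inherited through the convex weights, which is the source of the K-dependent (rather than |S||A|-dependent) sample complexity.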
B.1. Proof of Theorem 3
The error can be decomposed as
(B.3) |
For a policy satisfying the condition in Theorem 1, it boils down to controlling the remaining two terms in (B.3).
To begin with, we can use (A.1) to further decompose as
(B.4) |
Here the last inequality is due to
where we use the fact that and .
Similarly, for the term in (B.3), we have
(B.5) |
As can be seen from (B.4) and (B.5), it suffices to bound the two remaining quantities, for which we have the following lemma.
Lemma 1.
With probability exceeding 1 − δ, one has
(B.6) |
(B.7) |
Proof.
See Appendix B.2. □
Applying (B.6) to (B.4) reveals that
(B.8) |
For the first term, one has
where the first inequality comes from the fact that Var[X + Y] ≤ 2Var[X] + 2Var[Y] for any random variables X and Y. It follows that
(B.9) |
where the second inequality utilizes [AMK13, Lemma 7].
Plugging (B.9) into (B.8) yields
Then we can rearrange terms to obtain
(B.10) |
as long as for some sufficiently large constant C > 0.
In a similar vein, we can use (B.5) and (B.7) to obtain that
(B.11) |
Finally, we can substitute (B.10) and (B.11) into (B.3) to achieve
This result implies that
as long as
for some sufficiently large constant C > 0.
B.2. Proof of Lemma 1
To prove this lemma, we invoke the idea of the s-absorbing MDP proposed in [AKY20]. For a state s and a scalar u, we define a new MDP Ms,u to be identical to M on all states other than s; on state s, Ms,u is absorbing, in the sense that Ps,u(s|s, a) = 1 and rs,u(s, a) = (1 − γ)u for all a ∈ A. More formally, we define Ps,u and rs,u as
To streamline notation, we will use Vπs,u and V⋆s,u to denote the value function of Ms,u under policy π and the optimal value function of Ms,u, respectively. Furthermore, we denote by M̂s,u the MDP whose probability transition kernel is identical to P̂ at all states except that state s is absorbing. Similarly as before, we use V̂⋆s,u to denote the optimal value function of M̂s,u. The construction of this collection of auxiliary MDPs facilitates our analysis by decoupling the statistical dependency between P̂s,a and V̂⋆.
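The absorbing construction can be checked numerically. The sketch below is an illustration only; it assumes the [AKY20] convention that the absorbing state carries reward (1 − γ)u, so that its optimal value is exactly u, and uses a small random MDP:

```python
import numpy as np

rng = np.random.default_rng(3)
S, A, gamma = 4, 2, 0.9

P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = P(. | s, a)
r = rng.uniform(0.0, 1.0, size=(S, A))

def V_star(P, r, iters=3000):
    """Optimal value function by value iteration (a gamma-contraction)."""
    V = np.zeros(S)
    for _ in range(iters):
        V = (r + gamma * P @ V).max(axis=1)
    return V

# Absorbing modification M_{s0,u}: self-loop at s0 with reward (1 - gamma) * u,
# so the discounted return from s0 telescopes to exactly u.
s0, u = 1, 7.0
P_abs, r_abs = P.copy(), r.copy()
P_abs[s0] = 0.0
P_abs[s0, :, s0] = 1.0                       # stay at s0 under every action
r_abs[s0] = (1 - gamma) * u

V_abs = V_star(P_abs, r_abs)
assert np.isclose(V_abs[s0], u, atol=1e-6)   # the absorbing state's value is pinned to u
```

Since V⋆s,u(s) = u is pinned deterministically, it carries no dependence on the samples used to build P̂s,a, which is the decoupling exploited in the argument above.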
To begin with, we can decompose the quantity of interest as
(B.12) |
(B.13) |
(B.14) |
where (i) makes use of the triangle inequality, and (ii) depends on λ(s, a) ≥ 0 and ∥λ(s, a)∥1 = 1. For each state s, the value of u will be selected from a set Us; the choice of Us will be specified later. Then for some fixed u ∈ Us and fixed state-action pair (s, a) ∈ K, due to the independence between P̂s,a and V⋆s,u, we can apply Bernstein’s inequality (cf. [Ver18, Theorem 2.8.4]) conditionally to reveal that with probability greater than 1 − δ/2,
(B.15) |
Invoking the union bound over all K anchor state-action pairs and all possible values of u ∈ Us demonstrates that with probability greater than 1 − δ/2,
(B.16) |
holds for all state-action pairs (s, a) ∈ K and all u ∈ Us. Here, VarPs,a(·) is defined in (A.3). Then we observe that
(B.17) |
where (i) is due to and (ii) holds since
(B.18) |
whose proof can be found in [AKY20, Lemma 8 and 9].
By substituting (B.16), (B.17) and (B.18) into (B.14), we arrive at
(B.19) |
Then it boils down to determining the sets Us. The coarse bounds in the following lemma provide guidance on the choice of Us.
Lemma 2.
For δ ∈ (0, 1), with probability exceeding 1 − δ/2 one has
(B.20) |
(B.21) |
Proof.
See Appendix B.3. □
This inspires us to choose Us to be the set consisting of equidistant points in [V⋆(s) − R(δ), V⋆(s) + R(δ)] with |Us| = ⌈1/(1 − γ)⌉² and
Since , Lemma 2 implies that with probability exceeding 1 − δ/2. Hence, we have
(B.22) |
Consequently, with probability exceeding 1 − δ, one has
where (i) follows from (B.19) and (ii) utilizes (B.22). This finishes the proof of the first inequality. The second inequality can be proved in a similar way and is omitted here for brevity.
B.3. Proof of Lemma 2
To begin with, one has
(B.23) |
where the first line uses and ; the last inequality comes from the facts that , ∥Λ∥1 = 1 and ∥V⋆∥∞ ≤ (1 − γ)−1. Then we turn to bounding the second quantity. In view of (3.1), Hoeffding’s inequality (cf. [Ver18, Theorem 2.2.6]) implies that for ,
Hence by the standard union bound argument we have
(B.24) |
with probability exceeding 1 − δ/2.
C. Analysis of Q-learning (Proof of Theorem 2)
In this section, we provide the complete proof of Theorem 2. In fact, we prove a more general version of Theorem 2 that takes model misspecification into consideration, as stated below.
Theorem 4.
Consider any δ ∈ (0, 1) and ε ∈ (0, 1]. Suppose that there exists a probability transition model obeying Definition 2.1 and Assumption 1 with feature vectors and anchor state-action pairs such that
for some ξ ≥ 0. Assume that the initialization obeys for any and for any 0 ≤ t ≤ T, the learning rates satisfy
(C.1) |
for some sufficiently small universal constants c1 ≥ c2 > 0. Suppose that the total number of iterations T exceeds
(C.2) |
for some sufficiently large universal constant C3 > 0. If there exists a linear probability transition model satisfying Assumption 1 with feature vectors such that , then with probability exceeding 1 − δ, the output QT of Algorithm 2 satisfies
(C.3) |
for some constant C4 > 0. In addition, let πT (resp. VT) be the policy (resp. value function) induced by QT; then one has
(C.4) |
Theorem 4 subsumes Theorem 2 as a special case with ξ = 0. The remainder of this section is devoted to proving Theorem 4.
C.1. Proof of Theorem 4
First we show that (C.4) can be easily obtained from (C.3). Since [SY94] gives rise to
∥VπT − V⋆∥∞ ≤ (2γ/(1 − γ))∥VT − V⋆∥∞,
we have
∥VπT − V⋆∥∞ ≤ (2γ/(1 − γ))∥QT − Q⋆∥∞
due to ∥VT − V⋆∥∞ ≤ ∥QT − Q⋆∥∞. Then (C.4) follows directly from (C.3).
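The [SY94]-type loss bound used above can be checked on a small random instance. This sketch is illustrative only (the MDP and the perturbation level are assumptions): it perturbs Q⋆, acts greedily with respect to the perturbation, and verifies that the value loss stays within 2γ(1 − γ)−1 times the Q-error:

```python
import numpy as np

rng = np.random.default_rng(4)
S, A, gamma = 5, 2, 0.8

P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = P(. | s, a)
r = rng.uniform(0.0, 1.0, size=(S, A))

def Q_star(iters=4000):
    """Optimal Q-function via Q-value iteration."""
    Q = np.zeros((S, A))
    for _ in range(iters):
        Q = r + gamma * P @ Q.max(axis=1)
    return Q

def V_pi(pi):
    """Exact policy evaluation: solve (I - gamma P_pi) V = r_pi."""
    P_pi = P[np.arange(S), pi]
    r_pi = r[np.arange(S), pi]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

Qs = Q_star()
Vs = Qs.max(axis=1)

Q_T = Qs + rng.uniform(-0.1, 0.1, size=(S, A))   # a perturbed Q (stand-in for Q_T)
pi_T = Q_T.argmax(axis=1)                        # greedy policy induced by Q_T
eps = np.abs(Q_T - Qs).max()                     # ||Q_T - Q*||_inf
loss = np.abs(V_pi(pi_T) - Vs).max()             # ||V^{pi_T} - V*||_inf
assert loss <= 2 * gamma / (1 - gamma) * eps + 1e-6
```

In practice the realized loss is typically far below the bound, since greedy policies only err at states where the Q-gap between actions is smaller than 2ε.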
Therefore, we are left to justify (C.3). To start with, we consider the update rule
By defining the error term Δt ≔ Qt − Q⋆, we can decompose Δt into
(C.5) |
Here in the penultimate equality, we make use of Q⋆ = r + γPV⋆; and the last equality comes from the matrix Pt defined in (A.5). It is straightforward to check that ΛPt is also a probability transition matrix. We denote P̃t ≔ ΛPt hereafter. The third term in the decomposition above can be upper and lower bounded by
and
Plugging these bounds into (C.5) yields
Repeatedly invoking these two recursive relations leads to
(C.6) |
(C.7) |
where
Here we adopt the same notations as [LCC+21a].
To begin with, we consider the upper bound (C.6). It can be further decomposed as
(C.8) |
where we define α ≔ C4(1 − γ)/log T for some constant C4 > 0. Next, we turn to bounding ωt, θt and νt respectively for any t satisfying with the stepsize choice (4.1).
Bounding ωt.
It is straightforward to bound
where the first equality comes from the fact that [LCC+21a, Equation (40)]; the second inequality utilizes ; the last line uses the facts that ∥Λ∥1 = 1, ∥V⋆∥∞ ≤ (1 − γ)−1 and .
Bounding θt.
By similar derivation as Step 1 in [LCC+21a, Appendix A.2], we have
(C.9) |
where (i) is due to the fact that and (ii) comes from [LCC+21a, Equation (39a)].
Bounding νt.
To control the second term, we apply Freedman’s inequality, stated below.
Lemma 3 (Freedman’s Inequality).
Consider a real-valued martingale {Yk : k = 0, 1, 2, ⋯} with difference sequence {Xk : k = 1, 2, 3, ⋯}. Assume that the difference sequence is uniformly bounded:
Let
Then for any given σ2 ≥ 0, one has
In addition, suppose that Wn ≤ σ2 holds deterministically. For any positive integer K ≥ 1, with probability at least 1 − δ one has
Proof.
See [LCC+21a, Theorem 4]. □
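To build intuition for how Lemma 3 is used, the following simulation (an illustration only; the ±1 martingale and the particular tail level are assumptions, not the exact constants of Lemma 3) checks that a Freedman-style deviation level of order √(σ² log(1/δ)) + R log(1/δ) is exceeded with empirical frequency at most δ:

```python
import numpy as np

rng = np.random.default_rng(5)
n, trials, delta = 200, 500, 0.05
R = 1.0

# Martingale differences: X_k = +-1 with prob 1/2 each
# (uniformly bounded by R, conditional mean zero)
X = rng.choice([-1.0, 1.0], size=(trials, n))
Y_n = X.sum(axis=1)                   # martingale value after n steps, per trial
sigma2 = n * 1.0                      # W_n = sum of conditional variances = n here

# A Freedman-style tail level: sqrt(2 sigma^2 log(1/delta)) + (2/3) R log(1/delta)
tau = np.sqrt(2 * sigma2 * np.log(1 / delta)) + (2 / 3) * R * np.log(1 / delta)
frac_exceed = np.mean(np.abs(Y_n) >= tau)
assert frac_exceed <= delta
```

The variance term dominates here (σ² = n), which mirrors the proof above: Freedman's inequality is sharper than Hoeffding's precisely when the accumulated conditional variance Wn is much smaller than the worst-case bound n R².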
To apply this inequality, we can express νt as
with
(C.10) |
-
In order to bound R in Lemma 3, one has
where the last inequality comes from [LCC+21a, Eqn (39b)] and the fact that .
-
Then regarding the variance term, we claim for the moment that
(C.11)
Then we have
(C.12)
where the second line comes from [LCC+21a, Eqns (39b), (40)]. A trivial upper bound for Wt is
which uses the fact that .
Then, we invoke Lemma 3 with and apply the union bound argument over to arrive at
(C.13) |
Hence if we define
then (C.9) and (C.13) imply that
(C.14) |
with probability exceeding 1 − δ for all 2t/3 ≤ k ≤ t, as long as . Therefore, plugging (C.14) into (C.8), we arrive at the recursive relationship
This recursion is expressed in a similar way to [LCC+21a, Eqn. (46)], so we can invoke a similar derivation to that in [LCC+21a, Appendix A.2] to obtain
(C.15) |
Then we turn to (C.7). Applying a similar argument, we can deduce that
(C.16) |
For any t satisfying , taking (C.15) and (C.16) collectively gives rise to
(C.17) |
Let
By taking supremum over on both sides of (C.17), we have
(C.18) |
It is straightforward to bound the base case. For k ≥ 1, we obtain from (C.18) that
(C.19) |
for . We analyze (C.19) under two different cases:
-
If there exists some integer k0 with , such that
then it is straightforward to check from (C.19) that
(C.20)
as long as T ≥ C3(1 − γ)−4 log4 T log(KT/δ) for some sufficiently large constant C3 > 0.
-
Otherwise, we have for all . This together with (C.19) suggests that
and therefore
for all . Let
Then we know from (C.18) that
By applying the above two inequalities recursively, we know that
where the last inequality holds as long as T ≥ C3 log4 T log(KT/δ)(1 − γ)−4 for some sufficiently large constant C3 > 0. Let for some properly chosen constant such that k0 is an integer between 1 and ; then we have
When T ≥ C3 log4 T log(KT/δ)(1 − γ)−4 for some sufficiently large constant C3 > 0, this implies that , which contradicts the assumption that for all .
Consequently, (C.20) must hold, and the definition of uk immediately leads to
Then for any ε ∈ (0, 1], one has
as long as
Hence, if the total number of iterations T satisfies
for some sufficiently large constant C3 > 0, (4.3) holds for Algorithm 2 with probability exceeding 1 − δ.
Finally, we are left to justify (C.11). Recalling the definition of xi (cf. (C.10)), one has
where the notation VarP(·) is defined in (A.2). Plugging this into the definition of Wt leads to
(C.21) |
Then we introduce a useful claim as follows. The proof is deferred to Appendix C.2.
Claim 1.
For any state-action pair and vector , one has
(C.22) |
By invoking this claim with V = Vi−1 and combining it with (C.21), one has
which is the desired result.
C.2. Proof of Claim 1
To simplify notations in this proof, we use and to denote λ(s, a), and V respectively. Then one has
where in the penultimate equality, we use the fact that
It follows that
where in (i), we exchange the indices j and j′.
D. Feature dimension and the number of anchor state-action pairs
The assumption that the feature dimension (denoted by Kd) and the number of anchor state-action pairs (denoted by Kn) are equal is actually non-essential. In what follows, we show that if Kd ≠ Kn, then we can modify the current feature mapping to obtain a new feature mapping that does not change the transition model P, such that the new feature dimension equals the number of anchor state-action pairs Kn.
To begin with, we recall from Definition 1 that there exist Kd unknown functions , such that
for every and . In addition, we also recall from Assumption 1 that there exists with such that for any ,
Case 1:
Kd > Kn. In this case, the anchor feature vectors are linearly independent. For ease of presentation and without loss of generality, we assume that Kd = Kn + 1. This indicates that the matrix whose columns are composed of the feature vectors of all state-action pairs has rank Kn and is hence not of full row rank. This suggests that there exist Kn linearly independent rows (without loss of generality, we assume they are the first Kn rows). We can remove the last row from Φ to obtain Φ′ such that Φ′ has full row rank. Then we show that we can actually use the columns of Φ′ as new feature mappings. To see why this is true, note that the last row can be represented as a linear combination of the first Kn rows, namely there must exist constants such that for any ,
Defining for k = 1, …, Kn, we have
which is linear with respect to the new Kn dimensional feature vectors. It is also straightforward to check that the new feature mapping satisfies Assumption 1 with the original anchor state-action pairs K.
Case 2:
Kd < Kn. For ease of presentation and without loss of generality, we assume that Kn = Kd + 1 and that the subspace spanned by the feature vectors of the anchor state-action pairs is non-degenerate, i.e. has rank Kd (otherwise we can use a similar method as in Case 1 to further reduce the feature dimension Kd). In this case, the matrix whose columns are composed of the feature vectors of the anchor state-action pairs has rank Kd. We can add Kn − Kd = 1 new row to this matrix to obtain a matrix of full rank Kn. Then we let the columns of the resulting matrix be the new feature vectors of the anchor state-action pairs, and define the new feature vectors for all other state-action pairs by
We can check that the transition model P is not changed if we let for every . It is also straightforward to check that Assumption 1 is satisfied.
To conclude, when Kd ≠ Kn, we can always construct a new set of feature mappings with dimension Kn such that: (i) the feature dimension equals the number of anchor state-action pairs (both are Kn); (ii) the transition model can still be linearly parameterized by this new set of feature mappings; and (iii) the anchor state-action pair assumption (Assumption 1) is satisfied with the original anchor state-action pairs.
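The Case 1 reduction can be illustrated numerically: when Kd > Kn, the convex weights λ(s, a) themselves serve as Kn-dimensional features that leave P unchanged. The sketch below is illustrative only; all sizes, the random model, and the placeholder anchor feature matrix are assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)
S, Kn, Kd, nSA = 5, 3, 4, 10                 # Kd > Kn: more feature dims than anchors (Case 1)

P_K = rng.dirichlet(np.ones(S), size=Kn)     # anchor transition rows
Lam = rng.dirichlet(np.ones(Kn), size=nSA)   # lambda(s,a): convex weights over anchors
Lam[:Kn] = np.eye(Kn)                        # anchors represent themselves
P = Lam @ P_K                                # true transition matrix

# Raw Kd-dimensional features only span a Kn-dimensional subspace, since every
# phi(s,a) is a convex combination of the Kn anchor features (hypothetical B).
B = rng.normal(size=(Kn, Kd))                # placeholder anchor feature vectors
Phi = Lam @ B                                # phi(s,a) = sum_k lambda_k(s,a) phi(anchor_k)
assert np.linalg.matrix_rank(Phi) == Kn

# New Kn-dimensional features: take phi'(s,a) = lambda(s,a) and psi'_k = P_K[k].
# The transition model is unchanged, and the anchors trivially satisfy Assumption 1
# (their new feature vectors are the standard basis vectors).
P_new = Lam @ P_K
assert np.allclose(P_new, P)
assert np.allclose(Lam[:Kn], np.eye(Kn))
```

This is the same observation as (i)–(iii) above: the weights λ(s, a) provide a linear parameterization of P of dimension exactly Kn, with the original anchor pairs intact.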
Footnotes
Without loss of generality, one can always assume that the number of anchor state-action pairs equals the feature dimension K; interested readers are referred to Appendix D for the detailed argument.
References
- [ABA18]. Azizzadenesheli Kamyar, Brunskill Emma, and Anandkumar Animashree. Efficient exploration through Bayesian deep Q-networks. In 2018 Information Theory and Applications Workshop (ITA), pages 1–9. IEEE, 2018.
- [AGM12]. Arora Sanjeev, Ge Rong, and Moitra Ankur. Learning topic models – going beyond SVD. In 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science, pages 1–10. IEEE, 2012.
- [AHKS20]. Agarwal Alekh, Henaff Mikael, Kakade Sham, and Sun Wen. PC-PG: Policy cover directed exploration for provable policy gradient learning. arXiv preprint arXiv:2007.08459, 2020.
- [AKY20]. Agarwal Alekh, Kakade Sham, and Yang Lin F. Model-based reinforcement learning with a generative model is minimax optimal. In Conference on Learning Theory, pages 67–83. PMLR, 2020.
- [AMGK11]. Gheshlaghi Azar Mohammad, Munos Rémi, Ghavamzadeh M, and Kappen Hilbert J. Speedy Q-learning. 2011.
- [AMK13]. Gheshlaghi Azar Mohammad, Munos Rémi, and Kappen Hilbert J. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 91(3):325–349, 2013.
- [AOM17]. Gheshlaghi Azar Mohammad, Osband Ian, and Munos Rémi. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263–272. PMLR, 2017.
- [B+00]. Bertsekas Dimitri P et al. Dynamic programming and optimal control: Vol. 1. Athena Scientific, Belmont, 2000.
- [BD59]. Bellman Richard and Dreyfus Stuart. Functional approximations and dynamic programming. Mathematical Tables and Other Aids to Computation, pages 247–251, 1959.
- [Bel52]. Bellman Richard. On the theory of dynamic programming. Proceedings of the National Academy of Sciences of the United States of America, 38(8):716, 1952.
- [BT96]. Bertsekas Dimitri P and Tsitsiklis John N. Neuro-dynamic programming. Athena Scientific, 1996.
- [CCF+20]. Chen Yuxin, Chi Yuejie, Fan Jianqing, Ma Cong, and Yan Yuling. Noisy matrix completion: Understanding statistical guarantees for convex relaxation via nonconvex optimization. SIAM Journal on Optimization, 30(4):3098–3121, 2020.
- [CFMW19]. Chen Yuxin, Fan Jianqing, Ma Cong, and Wang Kaizheng. Spectral method and regularized MLE are both optimal for top-K ranking. Annals of Statistics, 47(4):2204, 2019.
- [CFMY20]. Chen Yuxin, Fan Jianqing, Ma Cong, and Yan Yuling. Bridging convex and nonconvex optimization in robust PCA: Noise, outliers, and missing data. arXiv preprint arXiv:2001.05484, accepted to Annals of Statistics, 2020.
- [CFWY21]. Chen Yuxin, Fan Jianqing, Wang Bingyan, and Yan Yuling. Convex and nonconvex optimization are both minimax-optimal for noisy blind deconvolution under random designs. Journal of the American Statistical Association, (just-accepted):1–27, 2021.
- [CZD+19]. Chen Zaiwei, Zhang Sheng, Doan Thinh T, Maguluri Siva Theja, and Clarke John-Paul. Performance of Q-learning with linear function approximation: Stability and finite-time analysis. arXiv preprint arXiv:1905.11425, 2019.
- [DB15]. Dann Christoph and Brunskill Emma. Sample complexity of episodic fixed-horizon reinforcement learning. Advances in Neural Information Processing Systems, 28:2818–2826, 2015.
- [DKW19]. Duan Yaqi, Ke Zheng Tracy, and Wang Mengdi. State aggregation learning from Markov transition data. Advances in Neural Information Processing Systems, 32, 2019.
- [DLB17]. Dann Christoph, Lattimore Tor, and Brunskill Emma. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 5717–5727, 2017.
- [DLWZ19]. Du Simon S, Luo Yuping, Wang Ruosong, and Zhang Hanrui. Provably efficient Q-learning with function approximation via distribution shift error checking oracle. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 8060–8070, 2019.
- [DS04]. Donoho David and Stodden Victoria. When does non-negative matrix factorization give a correct decomposition into parts? In 17th Annual Conference on Neural Information Processing Systems, NIPS 2003. Neural Information Processing Systems Foundation, 2004.
- [EDMB03]. Even-Dar Eyal, Mansour Yishay, and Bartlett Peter. Learning rates for Q-learning. Journal of Machine Learning Research, 5(1), 2003.
- [EK18]. El Karoui Noureddine. On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators. Probability Theory and Related Fields, 170(1):95–175, 2018.
- [Fre75]. Freedman David A. On tail probabilities for martingales. The Annals of Probability, pages 100–118, 1975.
- [HDL+21]. Hao Botao, Duan Yaqi, Lattimore Tor, Szepesvári Csaba, and Wang Mengdi. Sparse feature selection makes batch reinforcement learning more sample efficient. In International Conference on Machine Learning, pages 4063–4073. PMLR, 2021.
- [HZG21]. He Jiafan, Zhou Dongruo, and Gu Quanquan. Logarithmic regret for reinforcement learning with linear function approximation. In International Conference on Machine Learning, pages 4171–4180. PMLR, 2021.
- [JA18]. Jiang Nan and Agarwal Alekh. Open problem: The dependence of sample complexity lower bounds on planning horizon. In Conference On Learning Theory, pages 3395–3398. PMLR, 2018.
- [JAZBJ18]. Jin Chi, Allen-Zhu Zeyuan, Bubeck Sebastien, and Jordan Michael I. Is Q-learning provably efficient? In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 4868–4878, 2018.
- [JKA+17]. Jiang Nan, Krishnamurthy Akshay, Agarwal Alekh, Langford John, and Schapire Robert E. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, pages 1704–1713. PMLR, 2017.
- [JYWJ20]. Jin Chi, Yang Zhuoran, Wang Zhaoran, and Jordan Michael I. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
- [Kak03]. Kakade Sham Machandranath. On the sample complexity of reinforcement learning. PhD thesis, UCL (University College London), 2003.
- [KS99]. Kearns Michael and Singh Satinder. Finite-sample convergence rates for Q-learning and indirect algorithms. Advances in Neural Information Processing Systems, pages 996–1002, 1999.
- [KST+21]. Kiran B Ravi, Sobh Ibrahim, Talpaert Victor, Mannion Patrick, Al Sallab Ahmad A, Yogamani Senthil, and Pérez Patrick. Deep reinforcement learning for autonomous driving: A survey. IEEE Transactions on Intelligent Transportation Systems, 2021.
- [LCC+21a]. Li Gen, Cai Changxiao, Chen Yuxin, Gu Yuantao, Wei Yuting, and Chi Yuejie. Is Q-learning minimax optimal? A tight sample complexity analysis. arXiv preprint arXiv:2102.06548, 2021.
- [LCC+21b]. Li Gen, Chen Yuxin, Chi Yuejie, Gu Yuantao, and Wei Yuting. Sample-efficient reinforcement learning is feasible for linearly realizable MDPs with limited revisiting. arXiv preprint arXiv:2105.08024, 2021.
- [LWC+20a]. Li Gen, Wei Yuting, Chi Yuejie, Gu Yuantao, and Chen Yuxin. Breaking the sample size barrier in model-based reinforcement learning with a generative model. Advances in Neural Information Processing Systems, 33, 2020.
- [LWC+20b]. Li Gen, Wei Yuting, Chi Yuejie, Gu Yuantao, and Chen Yuxin. Sample complexity of asynchronous Q-learning: Sharper analysis and variance reduction. Advances in Neural Information Processing Systems, 2020.
- [MJTS20]. Modi Aditya, Jiang Nan, Tewari Ambuj, and Singh Satinder. Sample complexity of reinforcement learning using linearly combined model ensembles. In International Conference on Artificial Intelligence and Statistics, pages 2010–2020. PMLR, 2020.
- [MR07]. Melo Francisco S and Ribeiro M Isabel. Q-learning with linear function approximation. In International Conference on Computational Learning Theory, pages 308–322. Springer, 2007.
- [MWCC18]. Ma Cong, Wang Kaizheng, Chi Yuejie, and Chen Yuxin. Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval and matrix completion. In International Conference on Machine Learning, pages 3345–3354. PMLR, 2018.
- [PLT+08]. Parr Ronald, Li Lihong, Taylor Gavin, Painter-Wakefield Christopher, and Littman Michael L. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the 25th International Conference on Machine Learning, pages 752–759, 2008.
- [Put14]. Puterman Martin L. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, 2014.
- [PW20]. Pananjady Ashwin and Wainwright Martin J. Instance-dependent ℓ∞-bounds for policy evaluation in tabular reinforcement learning. IEEE Transactions on Information Theory, 67(1):566–585, 2020.
- [RM51]. Robbins Herbert and Monro Sutton. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
- [SB18]. Sutton Richard S and Barto Andrew G. Reinforcement learning: An introduction. MIT Press, 2018.
- [SHM+16]. Silver David, Huang Aja, Maddison Chris J, Guez Arthur, Sifre Laurent, Van Den Driessche George, Schrittwieser Julian, Antonoglou Ioannis, Panneershelvam Veda, Lanctot Marc, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
- [SJJ95]. Singh Satinder P, Jaakkola Tommi, and Jordan Michael I. Reinforcement learning with soft state aggregation. Advances in Neural Information Processing Systems 7, 7:361, 1995.
- [SJK+19]. Sun Wen, Jiang Nan, Krishnamurthy Akshay, Agarwal Alekh, and Langford John. Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. In Conference on Learning Theory, pages 2898–2933. PMLR, 2019.
- [SS20]. Shariff Roshan and Szepesvári Csaba. Efficient planning in large MDPs with weak linear function approximation. arXiv preprint arXiv:2007.06184, 2020.
- [SSS+17]. Silver David, Schrittwieser Julian, Simonyan Karen, Antonoglou Ioannis, Huang Aja, Guez Arthur, Hubert Thomas, Baker Lucas, Lai Matthew, Bolton Adrian, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.
- [SY94]. Singh Satinder P and Yee Richard C. An upper bound on the loss from approximate optimal-value functions. Machine Learning, 16(3):227–233, 1994.
- [TV20]. Touati Ahmed and Vincent Pascal. Efficient learning in non-stationary linear Markov decision processes. arXiv preprint arXiv:2010.12870, 2020.
- [Ver18]. Vershynin Roman. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press, 2018.
- [Wai19a]. Wainwright Martin J. Stochastic approximation with cone-contractive operators: Sharp ℓ∞-bounds for Q-learning. arXiv preprint arXiv:1905.06265, 2019.
- [Wai19b]. Wainwright Martin J. Variance-reduced Q-learning is minimax optimal. arXiv preprint arXiv:1906.04697, 2019.
- [Wat89]. Watkins Christopher John Cornish Hellaby. Learning from delayed rewards. PhD thesis, University of Cambridge, 1989.
- [WD92]. Watkins Christopher JCH and Dayan Peter. Q-learning. Machine Learning, 8(3–4):279–292, 1992.
- [WDYK20]. Wang Ruosong, Du Simon S, Yang Lin, and Kakade Sham. Is long horizon RL more difficult than short horizon RL? Advances in Neural Information Processing Systems, 33, 2020.
- [WDYS20]. Wang Ruosong, Du Simon S, Yang Lin F, and Salakhutdinov Ruslan. On reward-free reinforcement learning with linear function approximation. arXiv preprint arXiv:2006.11274, 2020.
- [WJLJ21]. Wei Chen-Yu, Jahromi Mehdi Jafarnia, Luo Haipeng, and Jain Rahul. Learning infinite-horizon average-reward MDPs with linear function approximation. In International Conference on Artificial Intelligence and Statistics, pages 3007–3015. PMLR, 2021.
- [WSY20]. Wang Ruosong, Salakhutdinov Russ R, and Yang Lin. Reinforcement learning with general value function approximation: Provably efficient approach via bounded eluder dimension. Advances in Neural Information Processing Systems, 33, 2020.
- [WVR17]. Wen Zheng and Van Roy Benjamin. Efficient reinforcement learning in deterministic systems with value function generalization. Mathematics of Operations Research, 42(3):762–782, 2017.
- [XG20]. Xu Pan and Gu Quanquan. A finite-time analysis of Q-learning with neural network function approximation. In International Conference on Machine Learning, pages 10555–10565. PMLR, 2020.
- [YW19]. Yang Lin and Wang Mengdi. Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning, pages 6995–7004. PMLR, 2019.
- [YW20]. Yang Lin and Wang Mengdi. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. In International Conference on Machine Learning, pages 10746–10756. PMLR, 2020.
- [ZBB+20]. Zanette Andrea, Brandfonbrener David, Brunskill Emma, Pirotta Matteo, and Lazaric Alessandro. Frequentist regret bounds for randomized least-squares value iteration. In International Conference on Artificial Intelligence and Statistics, pages 1954–1964. PMLR, 2020.
- [ZHG21]. Zhou Dongruo, He Jiafan, and Gu Quanquan. Provably efficient reinforcement learning for discounted MDPs with feature mapping. In International Conference on Machine Learning, pages 12793–12802. PMLR, 2021.
- [ZLKB19]. Zanette Andrea, Lazaric Alessandro, Kochenderfer Mykel J, and Brunskill Emma. Limiting extrapolation in linear approximate value iteration. Advances in Neural Information Processing Systems, 32:5615–5624, 2019.
- [ZLKB20]. Zanette Andrea, Lazaric Alessandro, Kochenderfer Mykel, and Brunskill Emma. Learning near optimal policies with low inherent Bellman error. In International Conference on Machine Learning, pages 10978–10989. PMLR, 2020.