Skip to main content
NIHPA Author Manuscripts logoLink to NIHPA Author Manuscripts
. Author manuscript; available in PMC: 2022 Sep 26.
Published in final edited form as: Adv Neural Inf Process Syst. 2021 Dec;34:16671–16685.

Sample-Efficient Reinforcement Learning for Linearly-Parameterized MDPs with a Generative Model

Bingyan Wang †,*, Yuling Yan †,*, Jianqing Fan
PMCID: PMC9512142  NIHMSID: NIHMS1782585  PMID: 36168331

Abstract

The curse of dimensionality is a widely known issue in reinforcement learning (RL). In the tabular setting where the state space S and the action space A are both finite, to obtain a nearly optimal policy with sampling access to a generative model, the minimax optimal sample complexity scales linearly with |S|×|A|, which can be prohibitively large when S or A is large. This paper considers a Markov decision process (MDP) that admits a set of state-action features, which can linearly express (or approximate) its probability transition kernel. We show that a model-based approach (resp. Q-learning) provably learns an ε-optimal policy (resp. Q-function) with high probability as soon as the sample size exceeds the order of K(1γ)3ε2(resp.K(1γ)4ε2), up to some logarithmic factor. Here K is the feature dimension and γ ∈ (0, 1) is the discount factor of the MDP. Both sample complexity bounds are provably tight, and our result for the model-based approach matches the minimax lower bound. Our results show that for arbitrarily large-scale MDP, both the model-based approach and Q-learning are sample-efficient when K is relatively small, and hence the title of this paper.

Keywords: model-based reinforcement learning, vanilla Q-learning, linear transition model, sample complexity, leave-one-out analysis

1. Introduction

Reinforcement learning (RL) studies the problem of learning and decision making in a Markov decision process (MDP). Recent years have seen exciting progress in applications of RL in real world decision-making problems such as AlphaGo [SHM+16, SSS+17] and autonomous driving [KST+21]. Specifically, the goal of RL is to search for an optimal policy that maximizes the cumulative reward, based on sequential noisy data. There are two popular approaches to RL: model-based and model-free ones.

  • The model-based approaches start with formulating an empirical MDP by learning the probability transition model from the collected data samples, and then estimating the optimal policy / value function based on the empirical MDP.

  • The model-free approaches (e.g. Q-learning) learn the optimal policy or the optimal (action-)value function from samples. As its name suggests, model-free approaches do not attempt to learn the model explicitly.

Generally speaking, model-based approaches enjoy great flexibility since after the transition model is learned in the first place, it can then be applied to any other problems without touching the raw data samples. In comparison, model-free methods, due to its online nature, are usually memory-efficient and can interact with the environment and update the estimate on the fly.

This paper is devoted to investigating the sample efficiency of both model-based RL and Q-learning (arguably one of the most commonly adopted model-free RL algorithms). It is well known that MDPs suffer from the curse of dimensionality. For example, in the tabular setting where the state space S and the action space A are both finite, to obtain a near optimal policy or value function given sampling access to a generative model, the minimax optimal sample complexity scales linearly with |S|×|A| [AMK13, AKY20]. However contemporary applications of RL often encounters environments with exceedingly large state and action spaces, whilst the data collection might be expensive or even high-stake. This suggests a large gap between the theoretical findings and practical decision-making problems where |S| and |A| are large or even infinite.

To close the aforementioned theory-practice gap, one natural idea is to impose certain structural assumption on the MDP. In this paper we follow the feature-based linear transition model studied in [YW19], where each state-action pair (s,a)S×A admits a K dimensional feature vector ϕ(s,a)K that expresses the transition dynamics (s,a)=Ψϕ(s,a) for some unknown matrix Ψ|S|×K which is common for all (s, a). This model encompasses both the tabular case and the homogeneous model in which the state space can be partitioned into K equivalent classes. Assuming access to a generative model [Kak03, KS99], under this structural assumption, this paper aims to answer the following two questions:

How many samples are needed for model-based RL and Q-learning to learn an optimal policy under the feature-based linear transition model?

In what follows, we will show that the answer to this question scales linearly with the dimension of the feature space K and is independent of |S| and |A| under the feature-based linear transition model. With the aid of this structural assumption, model-based RL and Q-learning becomes significantly more sample-efficient than that in the tabular setting.

Our contributions.

We focus our attention on an infinite horizon MDP with discount factor γ ∈ (0, 1). We use ε-optimal policy to indicate the policy whose expected discounted cumulative rewards are ε close to the optimal value of the MDP. Our contributions are two-fold:

  • We demonstrate that model-based RL provably learns an ε-optimal policy by performing planning based on an empirical MDP constructed from a total number of
    O˜(K(1γ)3ε2)
    samples, for all ε ∈ (0, (1 − γ)−1/2]. Here O˜() hides logarithmic factors compared to the usual O(·) notation. To the best of our knowledge, this is the first theoretical guarantee for model-based RL under the feature-based linear transition model. This sample complexity bound matches the minimax limit established in [YW19] up to logarithmic factor.
  • We also show that Q-learning provably finds an entrywise ε-optimal Q-function using a total number of
    O˜(K(1γ)4ε2)
    samples, for all ε ∈ (0, 1]. This sample complexity upper bound improves the state-of-the-art result in [YW19] and the dependency on the effective horizon (1 − γ)−4 is sharp in view of [LCC+21a].

These results taken collectively show the minimax optimality of model-based RL and the sub-optimality of Q-learning in sample complexity.

2. Problem formulation

This paper focuses on tabular MDPs in the discounted infinite-horizon setting [B+00]. Here and throughout, Δd1{vd:i=1dvi=1,vi0,i[d]} stands for the d-dimensional probability simplex and [N] ≔ {1, 2, ⋯, N} for any N+.

Discounted infinite-horizon MDPs.

Denote a discounted infinite-horizon MDP by a tuple M=(S,A,P,r,γ), where S={1,,|S|} is a finite set of states, A={1,,|A|} is a finite set of actions, P : S×AΔ|S|1 represents the probability transition kernel where P(s′|s, a) denotes the probability of transiting from state s to state s′ when action a is taken, r : S×A[0,1] denotes the reward function where r(s, a) is the instantaneous reward received when taking action aA while in state sS, and γ ∈ (0, 1) is the discount factor.

Value function and Q-function.

Recall that the goal of RL is to learn a policy that maximizes the cumulative reward, which corresponds to value functions or Q-functions in the corresponding MDP. For a deterministic policy π:SA and a starting state sS, we define the value function as

Vπ(s)E[k=0γkr(sk,ak)s0=s]

for all sS. Here, the trajectory is generated by ak = π(sk) and sk+1 ~ P(sk+1|sk, ak) for every k ≥ 0. This function measures the expected discounted cumulative reward received on the trajectory {(sk, ak)}k≥0 and the expectation is taken with respect to the randomness of the transitions sk+1 ~ P(·|sk, ak) on the trajectory. Recall that the immediate rewards lie in [0, 1], it is easy to derive that 0Vπ(s)11γ for any policy π and state s. Accordingly, we define the Q-function for policy π as

Qπ(s,a)E[k=0γkr(sk,ak)s0=s,a0=a]

for all (s,a)S×A. Here, the actions are chosen by the policy π except for the initial state (i.e. ak = π(sk) for all k ≥ 1). Similar to the value function, we can easily check that 0Qπ(s,a)11γ for any π and (s, a). To maximize the value function or Q function, previous literature [BD59,SB18] establishes that there exists an optimal policy π⋆ which simultaneously maximizes Vπ(s) (resp. Qπ(s, a)) for all sS (resp. (s,a)S×A). We define the optimal value function V⋆ and optimal Q-function Q⋆ respectively as

V(s)maxπVπ(s)=Vπ(s),Q(s,a)maxπQπ(s,a)=Qπ(s,a)

for any state-action pair (s,a)S×A.

Linear transition model.

Given a set of K feature functions ϕ1,ϕ2,,ϕK:S×A, we define ϕ to be a feature mapping from S×A to K such that

ϕ(s,a)=[ϕ1(s,a),,ϕK(s,a)]K.

Then we are ready to define the linear transition model [YW19] as follows.

Definition 1 (Linear transition model).

Given a discounted infinite-horizon MDP M=(S,A,P,r,γ) and a feature mapping ϕ:S×AK, M admits the linear transition model if there exists some (unknown) functions ψ1,,ψK:S, such that

P(ss,a)=k=1Kϕk(s,a)ψk(s) (2.1)

for every (s,a)S×A and sS.

Readers familiar with linear MDP literatures might immediately recognize that the above definition is the same as the structure imposed on the probability transition kernel P in the linear MDP model [YW19,JYWJ20,ZBB+20,HZG21,TV20,WDYS20,WJLJ21]. However unlike linear MDP which also requires the reward function r(s, a) to be linear in the feature mapping ϕ(s, a), here we do not impose any structural assumption on the reward.

Example 1 (Tabular MDP).

Each tabular MDP can be viewed as a linear transition model with feature mapping ϕ(s,a)=e(s,a)|S|×|A| (i.e. the vector with all entries equal to 0 but the one corresponding to (s, a) equals to 1) for all (s,a)S×A. To see this, we can check that Definition 1 is satisfied with K=|S|×|A| and ψ(s,a)(s)=(ss,a) for each s, sS and aA. This example is a sanity check of Definition 1, which also shows that our results (Theorem 1 and 2) can recover previous results on tabular MDP [AKY20,LCC+21a] by taking K=|S|×|A|.

Example 2 (Simplex Feature Space).

If all feature vectors {ϕ(s,a)}(s,a)S×A fall in the probability simplex ΔK−1, a linear transition model can be constructed by taking ψk(·) to be any probability measure over S for all k ∈ [K].

A key observation is that the model size of linear transition model with known feature mapping ϕ is |S|K (the number of coefficients ψk (s′) in (2.1)), which is still large when the state space S is large. In contrast, it will be established later that to learn a near-optimal policy or Q-function, we only need a much smaller number of samples, which depends linearly on K and is independent of |S|.

Next, we introduce a critical assumption employed in prior literature [YW19,ZLKB19,SS20].

Assumption 1 (Anchor state-action pairs).

Assume there exists a set of anchor state-action pairs KS×A with |K|=K 1 such that for any (s,a)S×A, its corresponding feature vector can be expressed as a convex combination of the feature vectors of anchor state-action pairs {(s,a) : (s,a)K}:

ϕ(s,a)=i:(si,ai)Kλi(s,a)ϕ(si,ai)fori=1Kλi(s,a)=1andλi(s,a)0. (2.2)

Further, we assume that the vectors in {ϕ(s,a):(s,a)K} are linearly independent.

We pause to develop some intuition of this assumption using Examples 1 and 2. In Example 1, it is straightforward to check that tabular MDPs satisfies Assumption 1 with K=S×A. In terms of Example 2, without loss of generality we can assume that the subspace spanned by the features has full rank, i.e. span{ϕ(s,a) : (s,a)S×A}=K (otherwise we can reduce the dimension of feature space). Then we can also check that Example 2 satisfies Assumption 1 with arbitrary KS×A such that the vectors in {ϕ(s,a):(s,a)K} are linearly independent. In fact, this sort of “anchor” notion appears widely in the literature: [AGM12] considers “anchor word” in topic modeling; [DS04] defines “separability” in their study of non-negative matrix factorization; [SJJ95] introduces “aggregate” in reinforcement learning; [DKW19] studies “anchor state” in soft state aggregation models. These concepts all bear some kind of resemblance to our definition of anchor state-action pairs here.

Throughout this paper, we assume that the feature mapping ϕ is known, which is a widely adopted assumption in previous literature [YW19,JYWJ20,ZHG21,HZG21,TV20,WDYS20,WJLJ21]. In practice, large scale RL usually makes use of representation learning to obtain the feature mapping ϕ. Furthermore, the learned representations can be selected to satisfy the anchor state-action pairs assumption by design.

A useful implication of Assumption 1 is that we can represent the transition kernel as

P(s,a)=i:(si,ai)Kλi(s,a)P(si,ai), (2.3)

This follows simply from substituting (2.2) into (2.1) (see (A.4) in Appendix A for a formal proof).

3. Model-based RL with a generative model

We start with studying model-based RL with a generative model in this section. We propose a model-based planning algorithm and show that it returns an ε-optimal policy with minimax optimal sample size.

3.1. Main results

A generative model and an empirical MDP.

We assume access to a generative model that provides us with independent samples from M. For each anchor state-action pair (si,ai)K, we collect N independent samples si(j)~P(si,ai),j[N]. This allows us to construct an empirical transition kernel P^ where

P^(ss,a)=i=1Kλi(s,a)(1Nj=1N1{si(j)=s}), (3.1)

for each (s,a)S×A. Here, 1Nj=1N1{si(j)=s} is an empirical estimate of P(s′|si, ai) and then (2.3) is employed. With P^ in hand, we can construct an empirical MDP M^=(S,A,P^,r,γ). Our goal here is to derive the sample complexity which guarantees that the optimal policy of M^ is an ε-optimal policy for the true MDP M. The algorithm is summarized below.

3.1.

Careful readers may note that in Algorithm 1, {λ(s,a):(s,a)S×A} is used in the construction of P^, while {λ(s,a):(s,a)S×A} is not input into the algorithm. This is because given K and ϕ are known, {λ(s,a):(s,a)S×A} can be calculated explicitly. The following theorem provides theoretical guarantees for the output policy π^ of the chosen optimization algorithm on the empirical MDP M^.

Theorem 1.

Suppose that δ > 0 and ε ∈ (0, (1 − γ)−1/2]. Let π^ be the policy returned by Algorithm 1. Assume that

NClog(K/((1γ)δ))(1γ)3ε2 (3.2)

for some sufficiently large constant C > 0. Then with probability exceeding 1 − δ,

Q(s,a)Qπ^(s,a)ε+4εopt1γ, (3.3)

for every (s,a)S×A. Here εopt is the target algorithmic error level in Algorithm 1.

We first remark that the two terms on the right hand side of (3.3) can be viewed as statistical error and algorithmic error, respectively. The first term ε denotes the statistical error coming from the deviation of the empirical MDP M^ from the true MDP M. As the sample size N grows, ε could decrease towards 0. The other term 4εopt/(1 − γ) represents the algorithmic error where εopt is the target accuracy level of the planning algorithm applied to M^. Note that εopt can be arbitrarily small if we run the planning algorithm (e.g. value iteration) for enough iterations. A few implications of this theorem are in order.

  • Minimax-optimal sample complexity. Assume that εopt is made negligibly small, e.g. εopt = O((1 − γ)ε) to be discussed in the next point. Note that we draw N independent samples for each state-action pair (s,a)K, therefore the requirement (3.2) for finding an O(ε)-optimal policy translates into the following sample complexity requirement
    O˜(K(1γ)3ε2).
    This matches the minimax optimal lower bound (up to a logarithm factor) established in [YW19, Theorem 1] for feature-based MDP. In comparison, for tabular MDP the minimax optimal sample complexity is Ω˜((1γ)3ε2|SA|) [AMK13,AKY20]. Our sample complexity scales linearly with K instead of |S||A| for tabular MDP as desired.
  • Computational complexity. An advantage of Theorem 1 is that it incorporates the use of any efficient planning algorithm applied to the empirical MDP M^. Classical algorithms include Q-value iteration (QVI) or policy iteration (PI) [Put14]. For example, QVI achieves the target level εopt in O((1γ)1logεopt1) iterations, and each iteration takes time proportional to O(NK+|SA|K). To learn an O(ε)-optimal policy, which requires sample complexity (3.2) and the target level εopt = O((1 − γ)ε), the overall running time is
    O˜(|S||A|K1γ+K(1γ)4ε2).
    In comparison, for the tabular MDP the corresponding running time is O˜((1γ)4ε2|S||A|) [AKY20]. This suggests that under the feature-based linear transition model, the computational complexity is min{|S||A|/K,(1γ)3ε2/K} times lower than that for the tabular MDP (up to logarithm factors), which is significantly more efficient when K is not too large.
  • Stability vis-à-vis model misspecification. A more general version of Theorem 1 (Theorem 3 in Appendix B) shows that when P approximately (instead of exactly) admits the linear transition model, we can still achieve some meaningful result. Specifically, if there exists a linear transition kernel P˜ obeying max(s,a)S×AP˜(s,a)P(s,a)1ξ for some ξ ≥ 0, we can show that π^ returned by Algorithm 1 (with slight modification) satisfies
    Q(s,a)Qπ^(s,a)ε+4εopt1γ+22ξ(1γ)2,
    for every (s,a)S×A. This shows that the model-based method is stable vis-á-vis model misspecification. Interested readers are referred to Appendix B for more details.

In Algorithm 1, the reward function r is assumed to be known. If the information of r is unavailable, an alternative is to assume that r is linear with respect to the feature mapping ϕ, i.e. r(s, a) = θϕ(s, a) for every (s,a)S×A, which is widely adopted in linear MDP literature [HZG21,JYWJ20,WDYS20,WJLJ21]. Under this linear assumption, one can obtain θ by solving the following linear system of equations

r(s,a)=θϕ(s,a),(s,a)K, (3.4)

which can be constructed by the observed reward r(s, a) for all anchor state-action pairs.

4. Model-free RL—vanilla Q Learning

In this section, we turn to study one of the most popular model-free RL algorithms—Q-learning. We provide tight sample complexity bound for vanilla Q-learning under the feature-based linear transition model, which shows its sample-efficiency (depends on |K| instead of |S| or |A|) and sub-optimality in the dependency on the effective horizon.

4.1. Q-learning algorithm

The vanilla Q-learning algorithm maintains a Q-function estimate Qt:S×A for all t ≥ 0, with initialization Q0 obeying 0Q0(s,a)11γ for every (s,a)S×A. Assume we have access to a generative model. In each iteration t ≥ 1, we collect an independent sample st(s,a)~P(s,a) for every anchor state-action pair (s,a)K and define QK(t):K to be

QK(t)(s,a)maxaAQt(st,a),stst(s,a)~P(s,a).

Then given the learning rate ηt ∈ (0, 1], the algorithm adopts the following update rule to update all entries of the Q-function estimate

Qt=(1ηt)Qt1+ηtTK(t)(Qt1).

Here, TK(t) is an empirical Bellman operator associated with the linear transition model M and the set K and is given by

TK(t)(Q)(s,a)r(s,a)+γλ(s,a)QK(t),

where (2.3) is used in the construction. Clearly, this newly defined operator TK(t) is an unbiased estimate of the famous Bellman operator T [Bel52] defined as

(s,a)S×A : T(Q)(s,a)r(s,a)+γEs~P(s,a)[maxaAQ(s,a)].

A critical property is that the Bellman operator T is contractive with a unique fixed point which is the optimal Q-function Q⋆ [Bel52]. To solve the fixed-point equation T(Q)=Q, Q-learning was then introduced by [WD92] based on the idea of stochastic approximation [RM51]. This procedure is precisely described in Algorithm 2.

4.

4.2. Main results

We are now ready to provide our main result for vanilla Q-learning, assuming sampling access to a generative model.

Theorem 2.

Consider any δ ∈ (0, 1) and ε ∈ (0, 1]. Assume that for any 0 ≤ tT, the learning rates satisfy

11+c1(1γ)Tlog2Tηt11+c2(1γ)tlog2T (4.1)

for some sufficiently small universal constants c1c2 > 0. Suppose that the total number of iterations T exceeds

TC3log(KT/δ)log4T(1γ)4ε2 (4.2)

for some sufficiently large universal constant C3 > 0. If the initialization obeys 0Q0(s,a)11γ for any (s,a)S×A, then with probability exceeding 1 − δ, the output QT of Algorithm 2 satisfies

max(s,a)S×A|QT(s,a)Q(s,a)|ε. (4.3)

In addition, let πT (resp. VT) to be the policy (resp. value function) induced by QT, then one has

maxsS|VπT(s)V(s)|2γε1γ. (4.4)

This theorem provides theoretical guarantees on the performance of Algorithm 2. A few implications of this theorem are in order.

  • Learning rate. The condition (4.1) accommodates two commonly adopted choice of learning rates: (i) linearly rescaled learning rates ηt = [1 + c2(1 − γ)t/log2T]−1, and (ii) iteration-invariant learning rates ηt ≡ [1 + c1(1 − γ)T/log2T]. Interested readers are referred to the discussions in [LCC+21a, Section 3.1] for more details on these two learning rate schemes.

  • Tight sample complexity bound. Note that we draw K independent samples in each iteration, therefore the iteration complexity (4.2) can be translated into the sample complexity bound TK in order for Q-learning to achieve ε-accuracy:
    O˜(K(1γ)4ε2). (4.5)
    As we will see shortly, this result improves the state-of-the-art sample complexity bound presented in [YW19, Theorem 2]. In addition, the dependency on the effective horizon (1 − γ)−4 matches the lower bound established in [LCC+21a, Theorem 2] for vanilla Q-learning using either learning rate scheme covered in the previous remark, suggesting that our sample complexity bound (4.5) is sharp.
  • Stability vis-à-vis model misspecification. Just like the model-based approach, we can also show that Q-learning is also stable vis-á-vis model misspecification when P approximately admits the linear transition model. We refer interested readers to Theorem 4 in Appendix B for more details.

Comparison with [YW19].

We compare our result with the sample complexity bounds for Q-learning under the feature-based linear transition model in [YW19].

  • We first compare our result with [YW19, Theorem 2], which is, to the best of our knowledge, the state-of-the-art theory for this problem. When there is no model misspecification, [YW19, Theorem 2] showed that in order for their Phased Parametric Q-learning2 (Algorithm 1 therein) to learn an ε-optimal policy, the sample size needs to be
    O˜(K(1γ)7ε2).
    Note that (4.5) is the sample complexity required for entrywise ε-accurate estimate of the optimal Q-function, thus a fair comparison requires to deduce the sample complexity for learning an ε-optimal policy from (4.4), which is
    O˜(K(1γ)6ε2).
    Hence, our sample complexity improves upon previous work by a factor at least on the order of (1 − γ)−1. However it is worth mentioning that [YW19, Theorem 2] is built upon weaker conditions i=1Kλi(s,a)=1 and i=1K|λi(s,a)|L for some L ≥ 1, which does not require λi(s, a) ≥ 0. Our result holds under Assumption 1, which requires i=1Kλi(s,a)=1 and λi(s, a) ≥ 0. Under the current analysis framework, it is difficult to obtain tight sample complexity bounds without assuming λi(s, a) ≥ 0.
  • Besides vanilla Q-learning, [YW19] also proposed a new variant of Q-learning called Optimal Phased Parametric Q-Learning (Algorithm 2 therein), which is essentially Q-learning with variance reduction. [YW19, Theorem 3] showed that the sample complexity for this algorithm is
    O˜(K(1γ)3ε2),
    which matches minimax optimal lower bound (up to a logarithm factor) established in [YW19, Theorem 1]. Careful reader might notice that this sample complexity bound is better than ours for vanilla Q-learning. We emphasize that as elucidated in the second implication under Theorem 2, our result is already tight for vanilla Q-learning. This observation reveals that while the sample complexity for vanilla Q-learning is provably sub-optimal, the variants of Q-learning can have better performance and achieve minimax optimal sample complexity.

We conclude this section by comparing model-based and model-free approaches. Theorem 1 shows that the sample complexity of the model-based approach is minimax optimal, whilst vanilla Q-learning, perhaps the most commonly adopted model-free method, is sub-optimal according to Theorem 2. However this does not mean that model-based method is better than model-free ones since (i) some variants of Q-learning (see [YW19, Algorithm 2] for example) also has minimax optimal sample complexity; and (ii) in many applications it might be unrealistic to estimate the model in advance.

5. A glimpse of our technical approaches

The establishment of Theorems 1 and 2 calls for a series of technical novelties in the proof. In what follows, we briefly highlight our key technical ideas and novelties.

  • For the model-based approach, we employ “leave-one-out” analysis to decouple the complicated statistical dependency between the empirical probability transition model P^ and the corresponding optimal policy. Specifically, [AKY20] proposed to construct a collection of auxiliary MDPs where each one of them leaves out a single state s by setting s to be an absorbing state and keeping everything else untouched. We tailor this high level idea to the needs of linear transition model, then the independence between the newly constructed MDP with absorbing state s and data samples collected at state s will facilitate our analysis, as detailed in Lemma 1. This “leave-one-out” type of analysis has been utilized in studying numerous problems by a long line of work, such as [EK18,MWCC18,Wai19a,CFMW19,CCF+20,CFMY20,CFWY21], just to name a few.

  • To obtain tighter sample complexity bound than the previous one O˜(K(1γ)7ε2) in [YW19] for vanilla Q-learning, we invoke Freedman’s inequality [Fre75] for the concentration of an error term with martingale structure as illustrated in Appendix C, while the classical ones used in analyzing Q-learning are Hoeffding’s inequality and Bernstein’s inequality [YW19]. The use of Freedman’s inequality helps us establish a recursive relation on {QtQ}t=0T, which consequently leads to the performance guarantee (4.3). It is worth mentioning that [LCC+21a] also studied vanilla Q-learning in the tabular MDP setting and adopted Freedman’s inequality, while we emphasize that it requires a lot of efforts and more delicate analyses in order to study linear transition model and also allow for model misspecification in the current paper, as detailed in the appendices.

6. Additional related literature

To remedy the issue of prohibitively high sample complexity, there exists a substantial body of literature proposing and studying many structural assumptions and complexity notions under different settings. This current paper focuses on linear transition model which is studied in MDP by numerous previous works [YW19,JYWJ20,YW20,ZHG21,MJTS20,HDL+21,WDYS20,TV20,HZG21,WJLJ21]. Among them, [YW19] studied linear transition model and provided tight sample complexity bounds for a new variant of Q-learning with the help of variance reduction. [JYWJ20] focused on linear MDP and designed an algorithm called “Least-Squares Value Iteration with UCB” with both polynomial runtime and polynomial sample complexity without accessing generative model. [WDYS20] extended the study of linear MDP to the framework of reward-free reinforcement learning. [ZHG21] considered a different feature mapping called linear kernel MDP and devised an algorithm with polynomial regret bound without generative model. Other popular structure assumptions include: [WVR17] studied fully deterministic transition dynamics; [JKA+17] introduced Bellman rank and proposed an algorithm which needs sample size polynomially dependent on Bellman rank to obtain a near-optimal policy in contextual decision processes; [DLWZ19] assumed that the value function has low variance compared to the mean for all deterministic policy; [MR07,PLT+08,ABA18,ZLKB20] used linear model to approximate the value function; [LCC+21b] assumed that the optimal Q-function can be linearly-parameterized by the features.

Apart from the linear transition model, another notion adopted in this work is the generative model, whose role in discounted MDP has been studied by extensive literature. The concept of generative model was originally introduced by [KS99], and then widely adopted in numerous works, including [Kak03,AMK13, YW19,Wai19b,AKY20,PW20], to name a few. Specifically, it is assumed that a generative model of the studied MDP is available and can be queried for every state-action pair and output the next state. Among previous works, [AMK13] proved that the minimax lower bound on the sample complexity to obtain an ε-optimal policy was Ω˜(|S||A|(1γ)3ε2). [AMK13] also showed that model-based approach can output an ε-optimal value function with near-optimal sample complexity for ε ∈ (0, 1). Then [AKY20] made significant progress on the challenging problem of establishing minimax optimal sample complexity in estimating an ε-optimal policy with the help of “leave-one-out” analysis.

In addition, after being proposed in [Wat89], Q-learning has become the focus of a rich line of research [WD92, BT96, KS99, EDMB03, AMGK11, JAZBJ18, Wai19a, CZD+19, LWC+20b, XG20]. Among them, [CZD+19,LWC+20b,XG20] studied Q-learning in the presence of Markovian data, i.e. a single sample trajectory. In contrast, under the generative setting of Q-learning where a fresh sample can be drawn from the simulator at each iteration, [Wai19b] analyzed a variant of Q-learning with the help of variance reduction, which was proved to enjoy minimax optimal sample complexity O˜(|SA|(1γ)3ε2). Then more recently, [LCC+21a] improved the lower bound of the vanilla Q-learning algorithm in terms of its scaling with 11γ and proved a matching upper bound O˜(|S||A|(1γ)4ε2).

7. Discussion

This paper studies sample complexity of both model-based and model-free RL under a discounted infinite-horizon MDP with feature-based linear transition model. We establish tight sample complexity bounds for both model-based approaches and Q-learning, which scale linearly with the feature dimension K instead of |S|×|A|, thus considerably reduce the required sample size for large-scale MDPs when K is relatively small. Our results are sharp, and the sample complexity bound for the model-based approach matches the minimax lower bound. The current work suggests a couple of directions for future investigation, as discussed in detail below.

  • Extension to episodic MDPs. An interesting direction for future research is to study linear transition model in episodic MDP. This focus of this work is infinite-horizon discounted MDPs, and hopefully the analysis here can be extended to study the episodic MDP as well ([DB15,DLB17,AOM17,JA18,WDYK20,HZG21]).

  • Continuous state and action space. The state and action spaces in this current paper are still assumed to be finite, since the proof relies heavily on the matrix operations. However, we expect that the results can be extended to accommodate continuous state and action space by employing more complicated analysis.

  • Accommodating entire range of ε. Since both value functions and Q-functions can take value in [0, (1 − γ)−1], ideally our theory should cover all choices of ε ∈ (0, (1 − γ)−1]. However we require that ε ∈ (0, (1 − γ)−1/2] in Theorem 1 and ε ∈ (0, 1] in Theorem 2. While most of the prior works like [AKY20,YW19] also impose these restrictions, a recent work [LWC+20a] proposed a perturbed model-based planning algorithm and proved minimax optimal guarantees for any ε ∈ (0, (1 − γ)−1]. While their work only focused on model-based RL under tabular MDP, an interesting future direction is to improve our theory to accommodate any ε ∈ (0, (1 − γ)−1].

  • General function approximation. Another future direction is to extend the study to more general function approximation starting from linear structure covered in this paper. There exists a rich body of work proposing and studying different structures, such as linear value function approximation [MR07,PLT+08,ABA18,ZLKB20], linear MDPs with infinite dimensional features [AHKS20], Eluder dimension [WSY20], Bellman rank [JKA+17] and Witness rank [SJK+19], etc. Therefore, it is hopeful to investigate these settings and improve the sample efficiency.

Acknowledgements

B. Wang is supported in part by Gordon Y. S. Wu Fellowships in Engineering. Y. Yan is supported in part by ARO grant W911NF-20-1-0097 and NSF grant CCF-1907661. Part of this work was done while Y. Yan was visiting the Simons Institute for the Theory of Computing. J. Fan is supported in part by the ONR grant N00014-19-1-2120 and the NSF grants DMS-1662139, DMS-1712591, DMS-2052926, DMS-2053832, and the NIH grant 2R01-GM072611-15.

APP1

A. Notations

In this section we gather the notations that will be used throughout the appendix.

For any vectors u=[ui]i=1nn and v=[ui]i=1nn, let uv=[uivi]i=1n denote the Hadamard product of u and v. We slightly abuse notations to use and | · | to define entry-wise operation, i.e. for any vector v=[vi]i=1n denote v[vi]i=1n and |v|[|vi|]i=1n. Furthermore, the binary notations ≤ and ≥ are both defined in entry-wise manner, i.e. uv (resp. uv) means uivi (resp. uivi) for all 1 ≤ in. For a collection of vectors v1,,vmn with vi=[vi,j]j=1nn, we define the max operator to be max1imvi[max1imvi,j]j=1n.

For any matrix Mm×n, ∥M1 is defined as the largest row-wise 1 norm of M, i.e. M1maxij|Mi,j|. In addition, we define 1 to be a vector with all the entries being 1, and I be the identity matrix. To express the probability transition function P in matrix form, we define the matrix P|S||A|×|S| to be a matrix whose (s, a)-th row Ps,a corresponds to P(·|s, a). In addition, we define Pπ to be the probability transition matrix induced by policy π, i.e. P(s,a),(s,a)π=Ps,a(s)1π(s)=a for all state-action pairs (s, a) and (s, a′). We define πt to be the policy induced by Qt, i.e. Qt(s, πt(s)) = maxa Qt(s, a) for all sS. Furthermore, we denote the reward function r by vector r|S||A|, i.e. the (s, a)-th element of r equals r(s, a). In the same manner, we define Vπ|S|, V|S|, Vt|S|, Qπ|S||A|, Q|SA| and Qt|S||A| to represent Vπ, V, Vt, Qπ, Q and Qt respectively. By using these notations, we can rewrite the Bellman equation as

Qπ=r+γPVπ=r+γPπQπ. (A.1)

Further, for any vector V|S|, let VarP(V)|S||A| be

VarP(V)P(VV)(PV)(PV), (A.2)

and define VarPs,a(V) to be

VarPs,a(V)Ps,a(VV)(Ps,aV)2, (A.3)

where Ps,a is the (s, a)-th row of P.

Next, we reconsider Assumption 1. For any state-action pair (s, a), we define vector λ(s,a)K(resp.ϕ(s,a)K) with λ(s,a)=[λi(s,a)]i=1K(resp.ϕ(s,a)=[ϕi(s,a)]i=1K) and matrix Λ|S||A|×K(resp. Φ|S||A|×K) whose (s, a)-th row corresponds to λ(s, a) (resp. ϕ(s, a)). Define vector ψ(s,a)K with ψ(s,a)=[ψi(s,a)]i=1K and matrix ΨK×|S| whose (s, a)-th column corresponds to ψ(s, a). Further, let PKK×|S|(resp. ΦKK×K) to be a submatrix of P (resp. Φ) formed by concatenating the rows {Ps,a,(s,a)K}(resp.{Φs,a,(s,a)K}). By using the previous notations, we can express the relations in Definition 1 and Assumption 1 as PK=ΦKΨ, P = ΦΨ and Φ=ΛΦK. Note that Assumption 1 suggests ΦK is invertible. Taking these equations collectively yields

P=ΦΨ=ΦΦK1PK=ΛΦKΦK1PK=ΛPK, (A.4)

which is reminiscent of the anchor word condition in topic modelling [AGM12]. In addition, for each iteration t, we denote the collected samples as {st(s,a)}(s,a)K and define a matrix P^K(t){0,1}K×|S| to be

P^K(t)((s,a),s){1,if s=st(s,a)0,otherwise (A.5)

for any (s,a)K  and sS. Further, we define P^t=ΛP^K(t). Then it is obvious to see that  P^t has nonnegative entries and unit 1 norm for each row due to Assumption 1, i.e. P^t1=1.

B. Analysis of model-based RL (Proof of Theorem 1)

In this section, we will provide complete proof for Theorem 1. As a matter of fact, our proof strategy here justifies a more general version of Theorem 1 that accounts for model misspecification, as stated below.

Theorem 3.

Suppose that δ > 0 and ε ∈ (0, (1 − γ)−1/2]. Assume that there exists a probability transition model P˜ obeying Definition 2.1 and Assumption 1 with feature vectors {ϕ(s,a)}(s,a)S×AK and anchor state-action pairs K such that

P˜P1ξ

for some ξ ≥ 0. Let π^ be the policy returned by Algorithm 1. Assume that

NClog(K/((1γ)δ))(1γ)3ε2 (B.1)

for some sufficiently large constant C > 0. Then with probability exceeding 1 − δ,

Q(s,a)Qπ^(s,a)ε+4εopt1γ+22ξ(1γ)2, (B.2)

for every state-action pair (s,a)S×A.

Theorem 3 subsumes Theorem 1 as a special case with ξ = 0. The remainder of this section is devoted to proving Theorem 3.

B.1. Proof of Theorem 3

The err Qπ^Q can be decomposed as

Qπ^Q=Qπ^Q^π^+Q^π^Q^+Q^QQπ^Q^π^+Q^π^Q^+Q^πQ(Qπ^Q^π^+Q^π^Q^+Q^πQ)1. (B.3)

For policy π^ satisfying the condition in Theorem 1, we have Q^π^Q^εopt. It boils down to control Qπ^Q^π^ and Q^πQ.

To begin with, we can use (A.1) to further decompose Qπ^Q^π^ as

Qπ^Q^π^=(IγPπ^)1r(IγP^π^)1r=(IγPπ^)1[(IγP^π^)(IγPπ^)]Q^π^=γ(IγPπ^)1(PP^)V^π^γ(IγPπ^)1(PP^)V^+γ(IγPπ^)1(PP^)(V^π^V^)γ(IγPπ^)1|(PP^)V^|+2γεopt1γ. (B.4)

Here the last inequality is due to

γ(IγPπ^)1(PP^)(V^π^V^)γ(IγPπ^)11(PP^)(V^π^V^)γ(IγPπ^)11(P1+P^1)V^π^V^2γεopt1γ,

where we use the fact that (IγPπ^)111/(1γ) and P1=P^1=1.

Similarly, for the term Q^πQ in (B.3), we have

Q^πQ=γ(IγPπ)1(PP^)V^πγ(IγPπ)1|(PP^)V^π|. (B.5)

As can be seen from (B.4) and (B.5), it boils down to bound |(PP^)V^| and |(PP^)V^π|. We have the following lemma.

Lemma 1.

With probability exceeding 1 − δ, one has

|(PP^)s,aV^|10ξ1γ+42log(4K/δ)N+4log(8K/((1γ)δ))(1γ)N+4log(8K/((1γ)δ))NVarPs,a(V^), (B.6)
|(PP^)s,aV^π|10ξ1γ+42log(4K/δ)N+4log(8K/((1γ)δ))(1γ)N+4log(8K/((1γ)δ))NVarPs,a(V^π). (B.7)
Proof.

See Appendix B.2. □

Applying (B.6) to (B.4) reveals that

Qπ^Q^π^4log(8K/((1γ)δ))Nγ(IγPπ^)1VarPs,a(V^)+γ1γ[42log(4K/δ)N+4log(8K/((1γ)δ))(1γ)N]+10γξ(1γ)2+2γεopt1γ. (B.8)

For the first term, one has

VarPs,a(V^)VarPs,a(Vπ^)+VarPs,a(Vπ^V^π^)+VarPs,a(V^π^V^)VarPs,a(Vπ^)+Vπ^V^π^+εoptVarPs,a(Vπ^)+Qπ^Q^π^+εopt,

where the first inequality comes from the fact that Var(X+Y)Var(X)+Var(Y) for any random variables X and Y. It follows that

γ(IγPπ^)1VarPs,a(V^)γ(IγPπ^)1VarPa,a(Vπ^)+γ1γ(Qπ^Q^π^+εopt)γ2(1γ)3+γ1γ(Qπ^Q^π^+εopt), (B.9)

where the second inequality utilizes [AMK13, Lemma 7].

Plugging (B.9) into (B.8) yields

Qπ^Q^π^4log(8K/((1γ)δ))N[γ2(1γ)3+γ1γ(Qπ^Q^π^+εopt)]+γ1γ[42log(4K/δ)N+4log(8K/((1γ)δ))(1γ)N]+10γξ(1γ)2+2γεopt1γ.

Then we can rearrange terms to obtain

Qπ^Q^π^10γlog(8K/((1γ)δ))N(1γ)3+11γξ(1γ)2+3γεopt1γ (B.10)

as long as NC˜log(8K/((1γ)δ))/(1γ)2 for some sufficiently large constant C˜>0.

In a similar vein, we can use (B.5) and (B.7) to obtain that

Q^πQ10γlog(8K/((1γ)δ))N(1γ)3+11γξ(1γ)2. (B.11)

Finally, we can substitute (B.10) and (B.11) into (B.3) to achieve

Qπ^Q(20γlog(8K/((1γ)δ))N(1γ)3+22γξ(1γ)2+3γεopt1γ+εopt)1.

This result implies that

Qπ^Q(ε+22ξ(1γ)2+4εopt1γ)1,

as long as

NClog(8K/((1γ)δ))(1γ)3ε2,

for some sufficiently large constant C > 0.

B.2. Proof of Lemma 1

To prove this theorem, we invoke the idea of s-absorbing MDP proposed by [AKY20]. For a state sS and a scalar u, we define a new MDP Ms,u to be identical to M on all the other states except s; on state s, Ms,u is absorbing such that PMs,u(ss,a)=1 and rMs,u(s,a)=(1γ)u for all aA. More formally, we define PMu,s and rMu,s as

PMs,u(ss,a)=1,rMs,u(s,a)=(1γ)u,for all aA,PMs,u(s,a)=P(s,a),rMs,u(s,a)=r(s,a),for all ss and aA.

To streamline notations, we will use Vs,uπ|S| and Vs,u|S| to denote the value function of Ms,u under policy π and the optimal value function of Ms,u respectively. Furthermore, we denote by M^s,u the MDP whose probability transition kernel is identical to P^ at all states except that state s is absorbing. Similar as before, we use V^s,u|S| to denote the optimal value function under M^s,u. The construction of this collection of auxiliary MDPs will facilitate our analysis by decoupling the statistical dependency between P^ and π^.

To begin with, we can decompose the quantity of interest as

|(PP^)s,aV^|=|(PP^)s,a(V^V^s,u+V^s,u)||(PP^)s,aV^s,u|+|(PP^)s,a(V^V^s,u)|(i)|(PP˜)s,aV^s,u|+|λ(s,a)(P˜KPK)V^s,u|+|λ(s,a)(PKP^K)V^s,u|+(Ps,a1+P^s,a1)V^V^s,u (B.12)
(PP˜)s,a1V^s,u+λ(s,a)1(P˜KPK)V^s,u+λ(s,a)1(PKP^K)V^s,u+2V^V^s,u (B.13)
 (ii) 2ξ1γ+max(s,a)K|(PP^)s,aV^s,u|+2V^V^s,u, (B.14)

where (i) makes use of P˜s,a=λ(s,a) P˜K and P^s,a=λ(s,a)P^K; (ii) depends on PP˜1ξ, ∥λ(s, a)∥1 = 1 and V^s,u(1γ)1. For each state s, the value of u will be selected from a set Us. The choice of Us will be specified later. Then for some fixed u in Us and fixed state-action pair (s,a)K, due to the independence between P^s,a. and V^s,u, we can apply Bernstein’s inequality (cf. [Ver18, Theorem 2.8.4]) conditional on V^s,u to reveal that with probability greater than 1 − δ/2,

|(PP^)s,aV^s,u|2log(4/δ)NVarPs,a(V^s,u)+2log(4/δ)3(1γ)N. (B.15)

Invoking the union bound over all the K state-action pairs of K and all the possible values of u in Us demonstrate that with probability greater than 1 − δ/2,

|(PP^)s,aV^s,u|2log(4K|Us|/δ)NVarPs,a(V^s,u)+2log(4K|Us|/δ)3(1γ)N, (B.16)

holds for all state-action pair (s,a)K and all uUs. Here, VarPs,a() is defined in (A.3). Then we observe that

VarPs,a(V^s,u)VarPs,a(V^V^s,u)+VarPs,a(V^)V^V^s,u+VarPs,a(V^)|V^(s)u|+VarPs,a(V^), (B.17)

where (i) is due to VarPs,a(V1+V2)VarPs,a(V1)+VarPs,a(V2) and (ii) holds since

V^V^s,u=V^s,V^(s)V^s,u|V^(s)u|, (B.18)

whose proof can be found in [AKY20, Lemma 8 and 9].

By substituting (B.16), (B.17) and (B.18) into (B.14), we arrive at

|(PP^)s,aV^|2ξ1γ+|V^(s)u|(2+2log(4K|Us|/δ)N)+2log(4K|Us|/δ)NVarPs,a(V^)+2log(4K|Us|/δ)3(1γ)N. (B.19)

Then it boils down to determining Us. The coarse bounds of Q^π and Q^ in the following lemma provide a guidance on the choice of Us.

Lemma 2.

For δ ∈ (0, 1), with probability exceeding 1 − δ/2 one has

QQ^πγ1γlog(4K/δ)2N(1γ)2+2γξ(1γ)2, (B.20)
QQ^γ1γlog(4K/δ)2N(1γ)2+2γξ(1γ)2. (B.21)
Proof.

See Appendix B.3. □

This inspires us to choose Us to be the set consisting of equidistant points in [V (s) − R(δ), V (s)+R(δ)] with |Us| = ⌈1/(1 – γ)⌉2 and

R(δ)γ1γlog(4K/δ)2N(1γ)2+2γξ(1γ)2.

Since VV^QQ^, Lemma 2 implies that V^(s)[V(s)R(δ),V(s)+R(δ)] with probability over 1 − δ/2. Hence, we have

minuUs|V^(s)u|2R(δ)|Us|+12γ2log(4K/δ)N+4γξ. (B.22)

Consequently, with probability exceeding 1 − δ, one has

|(PP^)s,aV^| (i) 2ξ1γ+minuUs|V^(s)u|(2+2log(4K|Us|/δ)N)+2log(4K|Us|/δ)NVarPs,a(V^)+2log(4K|Us|/δ)3(1γ)N (ii) 2ξ1γ+(2γ2log(4K/δ)N+4γξ)(2+4log(8K/((1γ)δ))N)+4log(8K/((1γ)δ))NVarPs,a(V^)+2log(8K/((1γ)δ))3(1γ)N10ξ1γ+42log(4K/δ)N+4log(8K/((1γ)δ))(1γ)N+4log(8K/((1γ)δ))NVarPs,a(V^),

where (i) follows from (B.19) and (ii) utilizes (B.22). This finishes the proof for the first inequality. The second inequality can be proved in a similar way and is omitted here for brevity.

B.3. Proof of Lemma 2

To begin with, one has

(P^P)VΛ(P^KPK)V+Λ(PKP˜K)V+(P˜P)VΛ1(P^KPK)V+Λ1(PKP˜K)V+P˜P1V(P^KPK)V+2ξ1γ, (B.23)

where the first line uses P^=ΛP^K and P˜=ΛP˜K; the last inequality comes from the facts that P˜P1ξ,, ∥Λ1 = 1 and ∥V ≤ (1 − γ)−1. Then we turn to bound (P^KPK)V. In view of (3.1), Hoeffding’s inequality (cf. [Ver18, Theorem 2.2.6]) implies that for (s,a)K,

(|(P^P)s,aV|t)2exp(2t2V2/N).

Hence by the standard union bound argument we have

(P^KPK)VV2log(4K/δ)2Nlog(4K/δ)2N(1γ)2, (B.24)

with probability over 1 – δ/2.

  1. Now we are ready to bound QπQ^π. One has
    QπQ^π=(IγPπ)1r(IγP^π)1r=(IγP^π)1((IγP^π)(IγPπ))Qπ=γ(IγP^π)1(PπP^π)Qπ=γ(IγP^π)1(PP^)Vπ,
    where the first equality makes use of (A.1). Then we take (B.23) and (B.24) collectively to achieve
    γ(IγP^π)1(PP^)Vγi=0γi(P^π)i(PP^)Vγi=0γi(P^π)i1(PP^)Vγ1γlog(4K/δ)2N(1γ)2+2γξ(1γ)2,

    where the last line comes from the fact that for all i1,(P^π)i is a probability transition matrix so that (P^π)i1=1. This justifies the first inequality (B.20).

  2. In terms of the second one, [AKY20, Section A.4] implies that
    QQ^γ1γ(PP^)V.
    Substitution of (B.23) and (B.24) into the above inequality yields
    QQ^γ1γlog(4K/δ)2N(1γ)2+2γξ(1γ)2.

C. Analysis of Q-learning (Proof of Theorem 2)

In this section, we will provide complete proof for Theorem 2. We actually prove a more general version of Theorem 2 that takes model misspecification into consideration, as stated below.

Theorem 4.

Consider any δ ∈ (0, 1) and ε ∈ (0, 1]. Suppose that there exists a probability transition model P˜ obeying Definition 2.1 and Assumption 1 with feature vectors {ϕ(s,a)}(s,a)S×AK and anchor state-action pairs K such that

P˜P1ξ

for some ξ ≥ 0. Assume that the initialization obeys 0Q0(s,a)11γ for any (s,a)S×A and for any 0 ≤ tT, the learning rates satisfy

11+c1(1γ)Tlog2Tηt11+c2(1γ)tlog2T, (C.1)

for some sufficiently small universal constants c1c2 > 0. Suppose that the total number of iterations T exceeds

TC3log(KT/δ)log4T(1γ)4ε2, (C.2)

for some sufficiently large universal constant C3 > 0. If there exists a linear probability transition model P˜ satisfying Assumption 1 with feature vectors {ϕ(s,a)}(s,a)S×A such that P˜P1ξ, then with probability exceeding 1 − δ, the output QT of Algorithm 2 satisfies

max(s,a)S×A|QT(s,a)Q(s,a)|ε+6γξ(1γ)2, (C.3)

for some constant C4 > 0. In addition, let πT (resp. VT) to be the policy (resp. value function) induced by QT, then one has

maxsS|VπT(s)V(s)|2γ1γ(ε+6γξ(1γ)2). (C.4)

Theorem 4 subsumes Theorem 2 as a special case with ξ = 0. The remainder of this section is devoted to proving Theorem 4.

C.1. Proof of Theorem 4

First we show that (C.4) can be easily obtained from (C.3). Since [SY94] gives rise to

VπTV2γVTV1γ,

we have

VπTV2γQTQ1γ,

due to ∥VTV ≤ ∥QTQ. Then (C.4) follows directly from (C.3).

Therefore, we are left to justify (C.3). To start with, we consider the update rule

Qt=(1ηt)Qt1+ηt(r+γP^tVt1).

By defining the error term ΔtQtQ, we can decompose Δt into

Δt=(1ηt)Qt1+ηt(r+γP^tVt1)Q=(1ηt)(Qt1Q)+ηt(r+γP^tVt1Q)=(1ηt)(Qt1Q)+γηt(P^tVt1PV)=(1ηt)Δt1+γηtΛ(P^K(t)PK)Vt1+γηtΛPK(Vt1V)+γηt(ΛPKP)V. (C.5)

Here in the penultimate equality, we make use of Q = r + γPV; and the last equality comes from P^t=ΛP^K(t) which is defined in (A.5). It is straightforward to check that ΛPK is also a probability transition matrix. We denote by P¯=ΛPK hereafter. The third term in the decomposition above can be upper and lower bounded by

P¯(Vt1V)=P¯πt1Qt1P¯πQP¯πt1Qt1P¯πt1Q=P¯πt1Δt1,

and

P¯(Vt1V)=P¯πt1Qt1P¯πQP¯πQt1P¯πQ=P¯πΔt1.

Plugging these bounds into (C.5) yields

Δt(1ηt)Δt1+γηtΛ(P^K(t)PK)Vt1+γηtP¯πt1Δt1+γηt(ΛPKP)V,Δt(1ηt)Δt1+γηtΛ(P^K(t)PK)Vt1+γηtP¯πΔt1+γηt(ΛPKP)V.

Repeatedly invoking these two recursive relations leads to

Δtη0(t)Δ0+i=1tηi(t)γ(P¯πt1Δt1+Λ(P^K(t)PK)Vt1+(ΛPKP)V), (C.6)
Δtη0(t)Δ0+i=1tηi(t)γ(P¯πΔt1+Λ(P^K(t)PK)Vt1+(ΛPKP)V), (C.7)

where

ηi(t){j=1t(1ηj), if i=0,ηij=i+1t(1ηj), if 0<i<t,ηt, if i=t.

Here we adopt the same notations as [LCC+21a].

To begin with, we consider the upper bound (C.6). It can be further decomposed as

Δtη0(t)Δ0+i=1(1α)tηi(t)γ(P¯πt1Δt1+Λ(P^K(t)PK)Vt1)θt+i=(1α)t+1tηi(t)γΛ(P^K(t)PK)Vi1νt+i=1tηi(t)γ(ΛPKP)Vωt+i=(1α)t+1tηi(t)γP¯πt1Δi1, (C.8)

where we define αC4(1 − γ)/log T for some constant C4 > 0. Next, we turn to bound θt and νt respectively for any t satisfying Tc2log11γtT with stepsize choice (4.1).

Bounding ωt.

It is straightforward to bound

ωt=(i)γ(ΛPKP)V (ii) γ(Λ1(PKP˜K)V+(P˜P)V) (iii)  2γξ1γ,

where the first equality comes from the fact that i=1tηi(t)=1 [LCC+21a, Equation (40)]; the second inequality utilizes P˜=ΛP˜K; the last line uses the facts that ∥Λ1 = 1, ∥V ≤ (1 − γ)−1 and P˜KPK1P˜P1ξ.

Bounding θt.

By similar derivation as Step 1 in [LCC+21a, Appendix A.2], we have

θtη0(t)Δ0+tmax1i(1α)tηi(t)max1i(1α)t(P¯πt1Δi1+ΛP^K(t)Vi1+ΛPKVi1) (i) η0(t)Δ0+tmax1i(1α)tηi(t)max1i(1α)t(Δi1+2Vi1)(ii)12T211γ+12T2t31γ2(1γ)T, (C.9)

where (i) is due to the fact that P¯πt11=ΛP^K(t)1=ΛPK1=1 and (ii) comes from [LCC+21a, Equation (39a)].

Bounding νt.

To control the second term, we apply the following Freedman’s inequality.

Lemma 3 (Freedman’s Inequality).

Consider a real-valued martingale {Yk : k = 0, 1, 2, ⋯} with difference sequence {Xk : k = 1, 2, 3, ⋯}. Assume that the difference sequence is uniformly bounded:

|Xk|R  and E[Xk{Xj}j=1k1]=0  for all k1.

Let

Snk=1nXi,Tnk=1nVar{Xk{Xj}j=1k1}.

Then for any given σ2 ≥ 0, one has

(|Sn|τ and Tnσ2)2exp(τ2/2σ2+Rτ/3).

In addition, suppose that Wnσ2 holds deterministically. For any positive integer K ≥ 1, with probability at least 1 − δ one has

|Sn|8max{Tn,σ22K}log2Kδ+43Rlog2Kδ.
Proof.

See [LCC+21a, Theorem 4]. □

To apply this inequality, we can express νt as

νti=(1α)t+1txi,

with

xiηi(t)γΛ(P^K(t)PK)Vi1,  and E[xiVi1,,V0]=0. (C.10)
  1. In order to calculate bound R in Lemma 3, one has
    Bmax(1α)t<ttximax(1α)t<ttηi(t)Λ(P^K(t)PK)Vi1max(1α)t<ttηi(t)(ΛP^K(t)1+ΛPK1)Vi1max(1α)t<ttηi(t)21γ4log4T(1γ)2T,

    where the last inequality comes from [LCC+21a, Eqn (39b)] and the fact that Vi111γ.

  2. Then regarding the variance term, we claim for the moment that
    Wti=(1α)t+1tdiag(Var(xiVi1,,V0))γ2i=(1α)t+1t(ηi(t))2VarP¯(Vi1). (C.11)
    Then we have
    Wtmax(1α)titηi(t)(i=(1α)t+1tηi(t))max(1α)ti<tVarP¯(Vi)2log4T(1γ)Tmax(1α)ti<tVarP¯(Vi), (C.12)
    where the second line comes from [LCC+21a, Eqns (39b), (40)]. A trivial upper bound for Wt is
    |Wt|2log4T(1γ)T1(1γ)21=2log4T(1γ)3T1,

    which uses the fact that VarP(Vi)Vi21/(1γ)2.

Then, we invoke Lemma 3 with K=2log211γ and apply the union bound argument over K to arrive at

|νt|8(Wt+σ22K1)log8KTlog11γδ+43Blog8KTlog11γδ18(Wt+2log4T(1γ)T1)log8KTδ+43Blog8KTlog11γδ132log4T(1γ)Tlog8KTδ(max(1α)ti<tVarΛPK(Vi)+1)+12log4T(1γ)2Tlog8KTδ1. (C.13)

Hence if we define

φt64log4TlogKTδ(1γ)T(maxt2itVarP¯(Vi)+1),

then (C.9) and (C.13) implies that

|θt|+|νt|+|ωt|φt+2γξ1γ1, (C.14)

with probability over 1 − δ for all 2t/3 ≤ kt, as long as Tlog4TlogKTδ/(1γ)3. Therefore, plugging (C.14) into (C.8), we arrive at the recursive relationship

Δtφt+2γξ1γ1+i=(1α)k+1kηi(k)γP¯πi1Δi1=φt+2γξ1γ1+i=(1α)kk1ηi(k)γP¯πi1Δi.

This recursion is expressed in a similar way as [LCC+21a, Eqn. (46)] so we can invoke similar derivation in [LCC+21a, Appendix A.2] to obtain that

Δt30log4TlogKTδ(1γ)4T(1+maxt2i<tΔi)1+2γξ(1γ)21. (C.15)

Then we turn to (C.7). Applying a similar argument, we can deduce that

Δt30log4TlogKTδ(1γ)4T(1+maxt2i<tΔi)12γξ(1γ)21. (C.16)

For any t satisfying Tc2log11γtT, taking (C.15) and (C.16) collectively gives rise to

Δt30log4TlogKTδ(1γ)4T(1+maxt2i<tΔi)+2γξ(1γ)2. (C.17)

Let

ukmax{Δt:2kTc2log11γtT}.

By taking supremum over t{2kT/(c2log11γ),,T} on both sides of (C.17), we have

uk30log4TlogKTδ(1γ)4T(1+uk1)+2γξ(1γ)21klog(c2log11γ). (C.18)

It is straightforward to bound u011γ. For k ≥ 1, it is straightforward to obtain from (C.18) that

uk3max{30log4TlogKTδ(1γ)4T,30log4TlogKTδ(1γ)4Tuk1,2γξ(1γ)2}, (C.19)

for 1klog(c2log11γ). We analyze (C.19) under two different cases:

  1. If there exists some integer k0 with 1k0<log(c2log11γ), such that
    uk0max{1,6γξ(1γ)2},
    then it is straightforward to check from (C.19) that
    uk0+13max{30log4TlogKTδ(1γ)4T,2γξ(1γ)2} (C.20)

    as long as TC3(1 − γ)−4 log4 T log(KT/δ) for some sufficiently large constant C3 > 0.

  2. Otherwise we have uk>max{1,6γξ(1γ)2} for all 1k<log(c2log11γ). This together with (C.19) suggests that
    max{1,6γξ(1γ)2}<3max{30log4TlogKTδ(1γ)4T,30log4TlogKTδ(1γ)4Tuk1,2γξ(1γ)2},
    and therefore
    max{30log4TlogKTδ(1γ)4T,30log4TlogKTδ(1γ)4Tuk1,2γξ(1γ)2}=30log4TlogKTδ(1γ)4Tuk1
    for all 1klog(c2log11γ). Let
    vk90log4TlogKTδ(1γ)4Tuk1.
    Then we know from (C.18) that
    ukvk1klog(c2log11γ).
    By applying the above two inequalities recursively, we know that
    ukvk=(8100log4TlogKTδ(1γ)4T)1/2uk11/2(8100log4TlogKTδ(1γ)4T)1/2vk11/2(8100log4TlogKTδ(1γ)4T)1/2+1/4uk21/4(8100log4TlogKTδ(1γ)4T)1/2+1/4vk21/4(8100log4TlogKTδ(1γ)4T)11/2ku01/2k8100log4TlogKTδ(1γ)4T(11γ)1/2k,
    where the last inequality holds as long as TC3 log4 T log(KT/δ)(1 − γ)−4 for some sufficiently large constant C3 > 0. Let k0=c˜loglog11γ for some properly chosen constant c˜>0 such that k0 is an integer between 1 and log(c2log11γ), we have
    uk08100log4TlogKTδ(1γ)4T(11γ)1/2k0=O(log4TlogKTδ(1γ)4T).

    When TC3 log4 T log(KT/δ)(1 − γ)−4 for some sufficiently large constant C3 > 0, this implies that uk01, which contradicts with the preassumption that uk>max{1,6γξ(1γ)2} for all 1kc2log11γ.

Consequently, (C.20) must hold true and then the definition of uk immediately leads to

ΔT90log4TlogKTδ(1γ)4T+6γξ(1γ)2.

Then for any ε ∈ (0, 1], one has

ΔTε+6γξ(1γ)2,

as long as

90log4TlogKTδ(1γ)4Tε.

Hence, if the total number of iterations T satisfies

TC3log4TlogKTδ(1γ)4ε2

for some sufficiently large constant C3 > 0, (4.3) would hold for Algorithm 1 with probability over 1 − δ.

Finally, we are left to justify (C.11). Recall the definition of xi (cf. (C.10)), one has

diag(Var(xiVi1,,V0))=γ2(ηi(t))2diag(Var(Λ(P^K(t)PK)Vi1Vi1))=γ2(ηi(t))2diag(ΛVar((P^K(i)PK)Vi1Vi1)Λ)=γ2(ηi(t))2{λ(s,a)2VarPK(Vi1)}s,a,

where the notation VarPK(Vi1) is defined in (A.2). Plugging this into the definition of Wt leads to

Wt=γ2i=(1α)t+1t(ηi(t))2{λ(s,a)2VarPK(Vi1)}s,a=γ2i=(1α)t+1t(ηi(t))2{λ(s,a)2(PK(Vi1Vi1)(PKVi1)(PKVi1))}s,a. (C.21)

Then we introduce a useful claim as follows. The proof is deferred to Appendix C.2.

Claim 1.

For any state-action pair (s,a)S×A and vector V|S|, one has

λ(s,a)2(PK(VV)(PKV)(PKV))λ(s,a)PK(VV)(λ(s,a)PKV)(λ(s,a)PKV). (C.22)

By invoking this claim with V = Vi−1 and taking collectively with (C.21), one has

Wtγ2i=(1β)t+1t(ηi(t))2{λ(s,a)PK(Vi1Vi1)(λ(s,a)PKVi1)(λ(s,a)PKVi1)}s,a=γ2i=(1β)t+1t(ηi(t))2[ΛPK(Vi1Vi1)(ΛPKVi1)(ΛPKVi1)]=γ2i=(1β)t+1t(ηi(t))2VarP¯(Vi1),

which is the desired result.

C.2. Proof of Claim 1

To simplify notations in this proof, we use [λi]i=1K,[Pi,j]1iK,1j|S| and [Vi]i=1|S| to denote λ(s, a), PK and V respectively. Then one has

λ(s,a)PK(VV)(λ(s,a)PKV)(λ(s,a)PKV)λ(s,a)2(PK(VV)(PKV)(PKV))=i=1Kj=1|S|λiPi,jVj2(i=1Kj=1|S|λiPi,jVj)2i=1Kj=1|S|λi2Pi,jVj2+i=1Kλi2(j=1|S|Pi,jVj)2=i=1Kj=1|S|λiPi,jVj[(1λi)Vjiij=1|S|λiPi,jVj].=i=1Kj=1|S|λiPi,jVj[(i=1Kj=1|S|λiPi,jλi)Vjiij=1|S|λiPi,jVj]=i=1Kj=1|S|iij=1|S|λiPi,jVjλiPi,j(VjVj)

where in the penultimate equality, we use the fact that

i=1Kj=1|S|λiPi,j=λ(s,a)PK1=1.

It follows that

λ(s,a)PK(VV)(λ(s,a)PKV)(λ(s,a)PKV)λ(s,a)2(PK(VV)(PKV)(PKV))=i=1K1i<ij=1|S|j=1|S|[λiPi,jVjλiPi,j(VjVj)+λiPi,jVjλiPi,j(VjVj)]=i=1K1i<iλiλi[j=1|S|j=1|S|Pi,jVjPi,j(VjVj)+j=1|S|j=1|S|Pi,jVjPi,j(VjVj)]= (i) i=1K1i<iλiλi[j=1|S|j=1|S|Pi,jVjPi,j(VjVj)+j=1|S|j=1|S|Pi,jVjPi,j(VjVj)]=i=1K1i<iλiλi[j=1|S|j=1|S|Pi,jPi,j(VjVj)2]0,

where in (i), we exchange the indices j and j′.

D. Feature dimension and the number of anchor state-action pairs

The assumption that the feature dimension (denoted by Kd) and the number of anchor state-action pairs (denoted by Kn) are equal is actually non-essential. In what follows, we will show that if KdKn, then we can modify the current feature mapping ϕ:S×AKd to achieve a new feature mapping ϕ:S×AKn that does not change the transition model P. By doing so, the new feature dimension Kn equals to the number of anchor state-action pairs.

To begin with, we recall from Definition 1 that there exists Kd unknown functions ψ1,,ψKd:S, such that

P(ss,a)=k=1Kdϕk(s,a)ψk(s),

for every (s,a)S×A and sS. In addition, we also recall from Assumption 1 that there exists KS×A with |K|=Kn such that for any (s,a)S×A,

ϕ(s,a)=i:(si,ai)Kλi(s,a)ϕ(si,ai)Kd  for i=1Knλi(s,a)=1  and λi(s,a)0.

Case 1:

Kd > Kn. In this case, the vectors in {ϕ(s,a):(s,a)K} are linearly independent. For ease of presentation and without loss of generality, we assume that Kd = Kn + 1. This indicates that the matrix ΦKd×(|S||A|) whose columns are composed of the feature vectors of all state-action pairs has rank Kn and is hence not full row rank. This suggests that there exists Kn linearly independent rows (without loss of generality, we assume they are the first Kn rows). We can remove the last row from Φ to obtain ΦΦ1:Kn,:Kn×(|S||A|) such that Φ′ is full row rank. Then we show that we can actually use the columns of Φ′ as new feature mappings. To see why this is true, note that the last row ΦKn+1,: can be represented as a linear combination of the first Kn rows, namely there must exist constants {ck}k=1Kn such that for any (s,a)S×A,

ϕKn+1(s,a)=k=1Knckϕk(s,a).

Define ψk=ψk+ckψKn+1 for k = 1, …, Kn, we have

P(ss,a)=k=1Kdϕk(s,a)ψk(s)=ϕKn+1(s,a)ψKn+1(s)+k=1Knϕk(s,a)ψk(s)=k=1Knϕk(s,a)[ψk(s)+ckψKn+1(s)]=k=1Knϕk(s,a)ψk(s),

which is linear with respect to the new Kn dimensional feature vectors. It is also straightforward to check that the new feature mapping satisfies Assumption 1 with the original anchor state-action pairs K.

Case 2:

Kd < Kn. For ease of presentation and without loss of generality, we assume that Kn = Kd + 1 and that the subspace spanned by the feature vectors of anchor state-action pairs is non-degenerate, i.e., has rank Kd (otherwise we can use similar method as in Case 1 to further reduce the feature dimension Kd). In this case, the matrix ΦKKd×Kn whose columns are composed of the feature vectors of anchor state-action pairs has rank Kd. We can add KnKd = 1 new row to ΦK to obtain ΦKKn×Kn such that ΦK has full rank Kn. Then we let the columns of ΦK=[ϕ(s,a)](s,a)K to be the new feature vectors of the anchor state-action pairs, and define the new feature vectors for all other state-action pairs (s,a)K by

$$\phi'(s,a)=\sum_{i:(s_i,a_i)\in\mathcal{K}}\lambda_i(s,a)\,\phi'(s_i,a_i).$$

We can check that the transition model $P$ is unchanged if we set the new function $\psi_{K_n}(s)=0$ for every $s\in\mathcal{S}$. It is also straightforward to check that Assumption 1 is satisfied.
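
Analogously, the Case-2 construction can be sketched with synthetic data (again, the variable names and the random padding row are ours): the anchor feature matrix is padded with one extra row so that it becomes square and full rank, the corresponding new function is set to zero, and the resulting kernel coincides with the original one.

```python
# Illustrative sketch of the Case-2 construction (K_n = K_d + 1) with synthetic data.
import numpy as np

rng = np.random.default_rng(2)
Kd, S, N = 3, 20, 50
Kn = Kd + 1

Phi_anchor = rng.random((Kd, Kn))            # features of the K_n anchor pairs (rank K_d)
Lam = rng.dirichlet(np.ones(Kn), size=N)     # lambda(s,a) for N state-action pairs
Phi = Phi_anchor @ Lam.T                     # phi(s,a) = sum_i lambda_i(s,a) phi(s_i,a_i)
Psi = rng.random((Kd, S))
P = Phi.T @ Psi                              # original (synthetic) transition model

# Pad the anchor features with one new row so that the padded matrix is K_n x K_n
# and full rank; any row achieving full rank would do, a random one suffices here.
Phi_anchor_new = np.vstack([Phi_anchor, rng.random(Kn)])
assert np.linalg.matrix_rank(Phi_anchor_new) == Kn

Phi_new = Phi_anchor_new @ Lam.T             # phi'(s,a) = sum_i lambda_i(s,a) phi'(s_i,a_i)
Psi_new = np.vstack([Psi, np.zeros((1, S))]) # the new psi_{K_n} is identically zero

assert np.allclose(Phi_new.T @ Psi_new, P)   # transition model is unchanged
```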

To conclude, when $K_d\neq K_n$, we can always construct a new set of feature mappings with dimension $K_n$ such that: (i) the feature dimension equals the number of anchor state-action pairs (they are both $K_n$); (ii) the transition model can still be linearly parameterized by this new set of feature mappings; and (iii) the anchor state-action pair assumption (Assumption 1) is satisfied with the original anchor state-action pairs.

Footnotes

1

Without loss of generality, one can always assume that the number of anchor state-action pairs equals the feature dimension $K$. Interested readers are referred to Appendix D for a detailed argument.

2
The difference between Algorithm 2 and Phased Parametric Q-Learning in [YW19] is that Algorithm 2 maintains and updates a Q-function estimate $Q_t$, while Phased Parametric Q-Learning parameterizes the Q-function as
$$Q_w(s,a) := r(s,a)+\gamma\,\phi(s,a)^{\top}w,$$
and then updates the parameter $w$.

References

  • [ABA18]. Azizzadenesheli Kamyar, Brunskill Emma, and Anandkumar Animashree. Efficient exploration through Bayesian deep Q-networks. In 2018 Information Theory and Applications Workshop (ITA), pages 1–9. IEEE, 2018.
  • [AGM12]. Arora Sanjeev, Ge Rong, and Moitra Ankur. Learning topic models – going beyond SVD. In 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science, pages 1–10. IEEE, 2012.
  • [AHKS20]. Agarwal Alekh, Henaff Mikael, Kakade Sham, and Sun Wen. PC-PG: Policy cover directed exploration for provable policy gradient learning. arXiv preprint arXiv:2007.08459, 2020.
  • [AKY20]. Agarwal Alekh, Kakade Sham, and Yang Lin F. Model-based reinforcement learning with a generative model is minimax optimal. In Conference on Learning Theory, pages 67–83. PMLR, 2020.
  • [AMGK11]. Gheshlaghi Azar Mohammad, Munos Remi, Ghavamzadeh M, and Kappen Hilbert J. Speedy Q-learning. 2011.
  • [AMK13]. Gheshlaghi Azar Mohammad, Munos Rémi, and Kappen Hilbert J. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 91(3):325–349, 2013.
  • [AOM17]. Gheshlaghi Azar Mohammad, Osband Ian, and Munos Rémi. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263–272. PMLR, 2017.
  • [B+00]. Bertsekas Dimitri P et al. Dynamic programming and optimal control: Vol. 1. Athena Scientific, Belmont, 2000.
  • [BD59]. Bellman Richard and Dreyfus Stuart. Functional approximations and dynamic programming. Mathematical Tables and Other Aids to Computation, pages 247–251, 1959.
  • [Bel52]. Bellman Richard. On the theory of dynamic programming. Proceedings of the National Academy of Sciences of the United States of America, 38(8):716, 1952.
  • [BT96]. Bertsekas Dimitri P and Tsitsiklis John N. Neuro-dynamic programming. Athena Scientific, 1996.
  • [CCF+20]. Chen Yuxin, Chi Yuejie, Fan Jianqing, Ma Cong, and Yan Yuling. Noisy matrix completion: Understanding statistical guarantees for convex relaxation via nonconvex optimization. SIAM Journal on Optimization, 30(4):3098–3121, 2020.
  • [CFMW19]. Chen Yuxin, Fan Jianqing, Ma Cong, and Wang Kaizheng. Spectral method and regularized MLE are both optimal for top-K ranking. Annals of Statistics, 47(4):2204, 2019.
  • [CFMY20]. Chen Yuxin, Fan Jianqing, Ma Cong, and Yan Yuling. Bridging convex and nonconvex optimization in robust PCA: Noise, outliers, and missing data. arXiv preprint arXiv:2001.05484, accepted to Annals of Statistics, 2020.
  • [CFWY21]. Chen Yuxin, Fan Jianqing, Wang Bingyan, and Yan Yuling. Convex and nonconvex optimization are both minimax-optimal for noisy blind deconvolution under random designs. Journal of the American Statistical Association, (just-accepted):1–27, 2021.
  • [CZD+19]. Chen Zaiwei, Zhang Sheng, Doan Thinh T, Maguluri Siva Theja, and Clarke John-Paul. Performance of Q-learning with linear function approximation: Stability and finite-time analysis. arXiv preprint arXiv:1905.11425, 2019.
  • [DB15]. Dann Christoph and Brunskill Emma. Sample complexity of episodic fixed-horizon reinforcement learning. Advances in Neural Information Processing Systems, 28:2818–2826, 2015.
  • [DKW19]. Duan Yaqi, Ke Zheng Tracy, and Wang Mengdi. State aggregation learning from Markov transition data. Advances in Neural Information Processing Systems, 32, 2019.
  • [DLB17]. Dann Christoph, Lattimore Tor, and Brunskill Emma. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 5717–5727, 2017.
  • [DLWZ19]. Du Simon S, Luo Yuping, Wang Ruosong, and Zhang Hanrui. Provably efficient Q-learning with function approximation via distribution shift error checking oracle. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 8060–8070, 2019.
  • [DS04]. Donoho David and Stodden Victoria. When does non-negative matrix factorization give a correct decomposition into parts? In 17th Annual Conference on Neural Information Processing Systems, NIPS 2003. Neural Information Processing Systems Foundation, 2004.
  • [EDMB03]. Even-Dar Eyal, Mansour Yishay, and Bartlett Peter. Learning rates for Q-learning. Journal of Machine Learning Research, 5(1), 2003.
  • [EK18]. El Karoui Noureddine. On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators. Probability Theory and Related Fields, 170(1):95–175, 2018.
  • [Fre75]. Freedman David A. On tail probabilities for martingales. The Annals of Probability, pages 100–118, 1975.
  • [HDL+21]. Hao Botao, Duan Yaqi, Lattimore Tor, Szepesvári Csaba, and Wang Mengdi. Sparse feature selection makes batch reinforcement learning more sample efficient. In International Conference on Machine Learning, pages 4063–4073. PMLR, 2021.
  • [HZG21]. He Jiafan, Zhou Dongruo, and Gu Quanquan. Logarithmic regret for reinforcement learning with linear function approximation. In International Conference on Machine Learning, pages 4171–4180. PMLR, 2021.
  • [JA18]. Jiang Nan and Agarwal Alekh. Open problem: The dependence of sample complexity lower bounds on planning horizon. In Conference on Learning Theory, pages 3395–3398. PMLR, 2018.
  • [JAZBJ18]. Jin Chi, Allen-Zhu Zeyuan, Bubeck Sebastien, and Jordan Michael I. Is Q-learning provably efficient? In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 4868–4878, 2018.
  • [JKA+17]. Jiang Nan, Krishnamurthy Akshay, Agarwal Alekh, Langford John, and Schapire Robert E. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, pages 1704–1713. PMLR, 2017.
  • [JYWJ20]. Jin Chi, Yang Zhuoran, Wang Zhaoran, and Jordan Michael I. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
  • [Kak03]. Kakade Sham Machandranath. On the sample complexity of reinforcement learning. PhD thesis, UCL (University College London), 2003.
  • [KS99]. Kearns Michael and Singh Satinder. Finite-sample convergence rates for Q-learning and indirect algorithms. Advances in Neural Information Processing Systems, pages 996–1002, 1999.
  • [KST+21]. Kiran B Ravi, Sobh Ibrahim, Talpaert Victor, Mannion Patrick, Al Sallab Ahmad A, Yogamani Senthil, and Pérez Patrick. Deep reinforcement learning for autonomous driving: A survey. IEEE Transactions on Intelligent Transportation Systems, 2021.
  • [LCC+21a]. Li Gen, Cai Changxiao, Chen Yuxin, Gu Yuantao, Wei Yuting, and Chi Yuejie. Is Q-learning minimax optimal? A tight sample complexity analysis. arXiv preprint arXiv:2102.06548, 2021.
  • [LCC+21b]. Li Gen, Chen Yuxin, Chi Yuejie, Gu Yuantao, and Wei Yuting. Sample-efficient reinforcement learning is feasible for linearly realizable MDPs with limited revisiting. arXiv preprint arXiv:2105.08024, 2021.
  • [LWC+20a]. Li Gen, Wei Yuting, Chi Yuejie, Gu Yuantao, and Chen Yuxin. Breaking the sample size barrier in model-based reinforcement learning with a generative model. Advances in Neural Information Processing Systems, 33, 2020.
  • [LWC+20b]. Li Gen, Wei Yuting, Chi Yuejie, Gu Yuantao, and Chen Yuxin. Sample complexity of asynchronous Q-learning: Sharper analysis and variance reduction. Advances in Neural Information Processing Systems, 2020.
  • [MJTS20]. Modi Aditya, Jiang Nan, Tewari Ambuj, and Singh Satinder. Sample complexity of reinforcement learning using linearly combined model ensembles. In International Conference on Artificial Intelligence and Statistics, pages 2010–2020. PMLR, 2020.
  • [MR07]. Melo Francisco S and Ribeiro M Isabel. Q-learning with linear function approximation. In International Conference on Computational Learning Theory, pages 308–322. Springer, 2007.
  • [MWCC18]. Ma Cong, Wang Kaizheng, Chi Yuejie, and Chen Yuxin. Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval and matrix completion. In International Conference on Machine Learning, pages 3345–3354. PMLR, 2018.
  • [PLT+08]. Parr Ronald, Li Lihong, Taylor Gavin, Painter-Wakefield Christopher, and Littman Michael L. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the 25th International Conference on Machine Learning, pages 752–759, 2008.
  • [Put14]. Puterman Martin L. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, 2014.
  • [PW20]. Pananjady Ashwin and Wainwright Martin J. Instance-dependent ℓ∞-bounds for policy evaluation in tabular reinforcement learning. IEEE Transactions on Information Theory, 67(1):566–585, 2020.
  • [RM51]. Robbins Herbert and Monro Sutton. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
  • [SB18]. Sutton Richard S and Barto Andrew G. Reinforcement learning: An introduction. MIT Press, 2018.
  • [SHM+16]. Silver David, Huang Aja, Maddison Chris J, Guez Arthur, Sifre Laurent, Van Den Driessche George, Schrittwieser Julian, Antonoglou Ioannis, Panneershelvam Veda, Lanctot Marc, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • [SJJ95]. Singh Satinder P, Jaakkola Tommi, and Jordan Michael I. Reinforcement learning with soft state aggregation. Advances in Neural Information Processing Systems 7, 7:361, 1995.
  • [SJK+19]. Sun Wen, Jiang Nan, Krishnamurthy Akshay, Agarwal Alekh, and Langford John. Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. In Conference on Learning Theory, pages 2898–2933. PMLR, 2019.
  • [SS20]. Shariff Roshan and Szepesvári Csaba. Efficient planning in large MDPs with weak linear function approximation. arXiv preprint arXiv:2007.06184, 2020.
  • [SSS+17]. Silver David, Schrittwieser Julian, Simonyan Karen, Antonoglou Ioannis, Huang Aja, Guez Arthur, Hubert Thomas, Baker Lucas, Lai Matthew, Bolton Adrian, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.
  • [SY94]. Singh Satinder P and Yee Richard C. An upper bound on the loss from approximate optimal-value functions. Machine Learning, 16(3):227–233, 1994.
  • [TV20]. Touati Ahmed and Vincent Pascal. Efficient learning in non-stationary linear Markov decision processes. arXiv preprint arXiv:2010.12870, 2020.
  • [Ver18]. Vershynin Roman. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press, 2018.
  • [Wai19a]. Wainwright Martin J. Stochastic approximation with cone-contractive operators: Sharp ℓ∞-bounds for Q-learning. arXiv preprint arXiv:1905.06265, 2019.
  • [Wai19b]. Wainwright Martin J. Variance-reduced Q-learning is minimax optimal. arXiv preprint arXiv:1906.04697, 2019.
  • [Wat89]. Watkins Christopher John Cornish Hellaby. Learning from delayed rewards. 1989.
  • [WD92]. Watkins Christopher JCH and Dayan Peter. Q-learning. Machine Learning, 8(3–4):279–292, 1992.
  • [WDYK20]. Wang Ruosong, Du Simon S, Yang Lin, and Kakade Sham. Is long horizon RL more difficult than short horizon RL? Advances in Neural Information Processing Systems, 33, 2020.
  • [WDYS20]. Wang Ruosong, Du Simon S, Yang Lin F, and Salakhutdinov Ruslan. On reward-free reinforcement learning with linear function approximation. arXiv preprint arXiv:2006.11274, 2020.
  • [WJLJ21]. Wei Chen-Yu, Jahromi Mehdi Jafarnia, Luo Haipeng, and Jain Rahul. Learning infinite-horizon average-reward MDPs with linear function approximation. In International Conference on Artificial Intelligence and Statistics, pages 3007–3015. PMLR, 2021.
  • [WSY20]. Wang Ruosong, Salakhutdinov Russ R, and Yang Lin. Reinforcement learning with general value function approximation: Provably efficient approach via bounded eluder dimension. Advances in Neural Information Processing Systems, 33, 2020.
  • [WVR17]. Wen Zheng and Van Roy Benjamin. Efficient reinforcement learning in deterministic systems with value function generalization. Mathematics of Operations Research, 42(3):762–782, 2017.
  • [XG20]. Xu Pan and Gu Quanquan. A finite-time analysis of Q-learning with neural network function approximation. In International Conference on Machine Learning, pages 10555–10565. PMLR, 2020.
  • [YW19]. Yang Lin and Wang Mengdi. Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning, pages 6995–7004. PMLR, 2019.
  • [YW20]. Yang Lin and Wang Mengdi. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. In International Conference on Machine Learning, pages 10746–10756. PMLR, 2020.
  • [ZBB+20]. Zanette Andrea, Brandfonbrener David, Brunskill Emma, Pirotta Matteo, and Lazaric Alessandro. Frequentist regret bounds for randomized least-squares value iteration. In International Conference on Artificial Intelligence and Statistics, pages 1954–1964. PMLR, 2020.
  • [ZHG21]. Zhou Dongruo, He Jiafan, and Gu Quanquan. Provably efficient reinforcement learning for discounted MDPs with feature mapping. In International Conference on Machine Learning, pages 12793–12802. PMLR, 2021.
  • [ZLKB19]. Zanette Andrea, Lazaric Alessandro, Kochenderfer Mykel J, and Brunskill Emma. Limiting extrapolation in linear approximate value iteration. Advances in Neural Information Processing Systems, 32:5615–5624, 2019.
  • [ZLKB20]. Zanette Andrea, Lazaric Alessandro, Kochenderfer Mykel, and Brunskill Emma. Learning near optimal policies with low inherent Bellman error. In International Conference on Machine Learning, pages 10978–10989. PMLR, 2020.
