Learn Quasi-Stationary Distributions of Finite State Markov Chain

Zhiqiang Cai; Ling Lin; Xiang Zhou

doi:10.3390/e24010133

Entropy. 2022 Jan 17;24(1):133. doi: 10.3390/e24010133



Editors: Michael Dellnitz, Carsten Hartmann, Feliks Nüske

PMCID: PMC8774945 PMID: 35052159

Abstract

We propose a reinforcement learning (RL) approach to compute the expression of quasi-stationary distribution. Based on the fixed-point formulation of quasi-stationary distribution, we minimize the KL-divergence of two Markovian path distributions induced by candidate distribution and true target distribution. To solve this challenging minimization problem by gradient descent, we apply a reinforcement learning technique by introducing the reward and value functions. We derive the corresponding policy gradient theorem and design an actor-critic algorithm to learn the optimal solution and the value function. The numerical examples of finite state Markov chain are tested to demonstrate the new method.

Keywords: quasi-stationary distribution, reinforcement learning, KL-divergence, actor-critic algorithm

1. Introduction

Quasi-stationary distribution (QSD) is the long time statistical behavior of a stochastic process that will be surely killed when this process is conditioned to survive [1]. This concept has been widely used in applications, such as in biology and ecology [2,3], chemical kinetics [4,5], epidemics [6,7,8], medicine [9] and neuroscience [10,11]. Many works for rare events in meta-stable systems also focus on this quasi-stationary distribution [12,13]. In addition, some new Monte Carlo sampling methods, for instance, the Quasi-stationary Monte Carlo method [14,15], also arise by using QSD instead of the true stationary distribution.

We are interested in the numerical computation of QSD and focus on the finite state Markov chain in this paper. Mathematically, the quasi-stationary distribution can be solved as the principal left eigenvector of a sub-Markovian transition matrix. Thus, traditional numerical algebra methods can be applied to solve the quasi-stationary distribution in finite state space, for example, the power method [16], the multi-grid method [17] and Arnoldi’s algorithm [18]. These eigenvector methods can produce a stochastic vector for QSD instead of generating samples of QSD.
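As an illustration of the eigenvector viewpoint above, the following is a minimal sketch of the power method applied to a sub-Markovian matrix. The function name `qsd_power_method`, the tolerance, and the 3-state matrix `K_demo` are hypothetical choices made for this example and do not come from the paper.

```python
import numpy as np

def qsd_power_method(K, tol=1e-12, max_iter=10_000):
    """Approximate the QSD as the normalized principal left eigenvector of K."""
    K = np.asarray(K, dtype=float)
    alpha = np.full(K.shape[0], 1.0 / K.shape[0])   # uniform initial guess
    for _ in range(max_iter):
        new = alpha @ K            # one step of the sub-Markovian dynamics
        new /= new.sum()           # renormalize, i.e. condition on survival
        if np.abs(new - alpha).max() < tol:
            return new
        alpha = new
    return alpha

# Hypothetical 3-state sub-Markovian matrix (every row sums to less than one).
K_demo = np.array([[0.5, 0.3, 0.1],
                   [0.2, 0.5, 0.2],
                   [0.1, 0.3, 0.4]])
alpha_demo = qsd_power_method(K_demo)
```

The result is a probability vector, not a set of samples, which is the distinction drawn in the paragraph above.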

In search of efficient algorithms for large state spaces, stochastic approaches are favored, either for sampling the QSD or for computing the expression of the QSD, and these methods can be applied or extended easily to continuous state space. A popular approach for sampling the quasi-stationary distribution is the Fleming–Viot stochastic method [19]. The Fleming–Viot method first simulates N particles independently. When any one of the particles falls into the absorbing state and is killed, a new particle is uniformly selected from the remaining $N - 1$ surviving particles to replace the dead one, and the simulation continues. When time and N tend to infinity, the particles’ empirical distribution converges to the quasi-stationary distribution.

In [20,21,22], the authors proposed to recursively update the expression of QSD at each iteration based on the empirical distribution of a single-particle simulation. It is shown in [21] that the convergence rate can be $O (n^{- 1 / 2})$ , where n is the iteration number. This method is later improved in [23,24] by applying the stochastic approximation method [25] and the Polyak–Ruppert averaging technique [26]. These improved algorithms have a choice of flexible step size but require a projection operator onto probability simplex, which carries some extra computational overhead increasing with the number of states. Ref. [15] extended the algorithm to the diffusion process.

In this paper, we focus on how to compute the expression of the quasi-stationary distribution, which is denoted by $α (x)$ on a metric space $E$. If $E$ is finite, $α$ is a probability vector, and if $E$ is a domain in $R^{d}$, then $α$ is a probability density function on $E$. We assume $α$ can be numerically represented in parametric form $α_{θ}$ with $θ \in Θ$. This family $\{α_{θ}\}$ can be in tabular form or any neural network. Then, the problem of finding the QSD $α$ becomes answering the question of how to compute the optimal parameter $θ$ in $Θ$. We call this problem the learning problem for QSD. In addition, we want to directly learn QSD and not use the distribution family $\{α_{θ}\}$ to fit the simulated samples generated by other traditional simulation methods.

Our minimization problem for QSD is similar to variational inference (VI) [27], which minimizes an objective functional measuring the distance between the target and candidate distributions. However, unlike the mainstream VI methods such as the evidence lower bound (ELBO) technique [28] or particle-based [29] and flow-based methods [30], our approach is based on recent important progress in reinforcement learning (RL) [31], particularly the policy gradient method and actor-critic algorithm. We first regard the learning process of the quasi-stationary distribution as the interaction with the environment, which is constructed by the property of QSD. Reinforcement learning has recently shown tremendous advancements and remarkable successes in applications (e.g., [32,33,34]). The RL framework provides an innovative and powerful modeling and computation approach for many scientific computing problems.

The essential question is how to formulate the QSD problem as an RL problem. Firstly, for the sub-Markovian kernel K of a Markov process, we can define a Markovian kernel $K_{α}$ on $E$ (see Definition 1) and then QSD is defined by the equation $α = α K_{α}$, which states that $α$, taken as the initial distribution, coincides with the distribution after one step of $K_{α}$. Secondly, we consider an optimal $α$ (in our parametric family of distribution) to minimize the Kullback–Leibler divergence (i.e., relative entropy) of two path distributions, denoted by $P$ and $Q$, associated with two Markovian kernels $K_{α}$ and $K_{β}$ where $β : = α K_{α}$. Thirdly, inspired by the recent work [35] of using RL for rare events sampling problems, we transform the minimization of KL divergence between $P$ and $Q$ into the maximization of a time-averaged reward function and define the corresponding value function $V (x)$ at each state x. This completes our modeling of RL for the quasi-stationary distribution problem. Lastly, we derive the policy gradient theorem (Theorem 1) to compute the gradient of the averaged reward with respect to $θ$, which drives the learning dynamics. This is known as the “actor” part. The “critic” part is to learn the value function V in its parametric form $V_{ψ}$. The actor-critic algorithm uses stochastic gradient descent to train the parameter $θ$ for the action $α_{θ}$ and the parameter $ψ$ for the value function $V_{ψ}$ (see Algorithm 1).

Our contribution is that we first devise a method to transform the QSD problem into the RL problem. Similar to [35], our paper also uses the KL-divergence to define the RL problem. However, our paper fully adapts the unique property of QSD that is a fixed point problem $α = α K_{α}$ to define the RL problem.

Our learning method allows flexible parametrization of the distributions and uses the stochastic gradient method to train the optimal distribution. The optimization is easy to implement and scales up to large state spaces. The numerical examples we tested show that our methods converge faster than other existing methods [22,23].

Finally, we remark that our method works very well for QSD of the strict sub-Markovian kernel K but is not applicable to compute the invariant distribution when K is Markovian. This is because we transform the problem into the variational problem between two Markovian kernels $K_{α}$ and $K_{β}$ (where $β = α K_{α}$). Note that $K_{α} (x, y) = K (x, y) + (1 - K (x, E)) α (y)$ (Definition 1), and our method is based on the fact that $α = β$ if and only if $K_{α} = K_{β}$. If K is a Markovian kernel, then $K_{α} \equiv K$ for any $α$, and our method cannot work. Thus, $K (x, E)$ has to be strictly less than 1 for some $x \in E$.

This paper is organized as follows. Section 2 is a short review of the quasi-stationary distribution and some basic simulation methods of QSD. In Section 3, we first formulate the reinforcement learning problem by KL-divergence and derive the policy gradient theorem (Theorem 1). Using the above formulation, we then develop the actor-critic algorithm to estimate the quasi-stationary distribution. In Section 4, the efficiency of our algorithms is illustrated by four examples compared with the simulation methods in [24].

Algorithm 1: (ac-$α$ method) Actor-critic algorithm for quasi-stationary distribution

[Algorithm listing: provided as a graphic in the original article; not reproduced here.]

2. Problem Setup and Review

2.1. Quasi-Stationary Distribution

We start with an abstract setting. Let $E$ be a finite state space equipped with the Borel $σ$-field $B (E)$, and let $P (E)$ be the space of probabilities over $E$. A sub-Markovian kernel on $E$ is defined as a map $K : E \times B (E) \mapsto [0, 1]$ such that for all $x \in E$, $A \mapsto K (x, A)$ is a nonzero measure with $K (x, E) \leq 1$ and for all $A \in B (E)$, $x \mapsto K (x, A)$ is measurable. In particular, if $K (x, E) = 1$ for all $x \in E$, then K is called a Markovian kernel. Throughout the paper, we assume that K is strictly sub-Markovian, i.e., $K (x, E) < 1$ for some x.

Let $X_{t}$ be a Markov chain with values in $E \cup \{\partial\}$ where $\partial \notin E$ denotes an absorbing state. We define the extinction time

τ : = inf \{t > 0 : X_{t} = \partial\} .

We define the quasi-stationary distribution (QSD) $α$ as the long time limit of the conditional distribution, provided that there exists a probability distribution $ν$ on $E$ such that the following limit exists:

α (A) : = lim_{t \to \infty} P_{ν} (X_{t} \in A ∣ τ > t), A \in B (E) .

(1)

where $P_{ν}$ refers to the probability distribution of $X_{t}$ associated with the initial distribution $ν$ on $E$ . Such a conditional distribution well describes the behavior of the process before extinction, and it is easy to see that $α$ satisfies the following fixed point problem:

P_{α} (X_{t} \in A ∣ τ > t) = α (A)

(2)

where $P_{α}$ refers to the probability distribution of $X_{t}$ associated with the initial distribution $α$ on $E$. Equation (2) is equivalent to the following stationary condition:

α = \frac{α K}{α K 1}, or α (y) = \frac{\sum_{x} α (x) K (x, y)}{\sum_{x} α (x) K (x, E)}

(3)

where $α$ is a row vector and $1$ denotes the column vector with all entries being one and

K (x, E) = \sum_{x^{'} \in E} K (x, x^{'}) .

For any sub-Markovian kernel K, we can associate K with a Markovian kernel $\tilde{K}$ on $E \cup \{\partial\}$ defined by the following:

\begin{cases} \tilde{K} (x, A) = K (x, A), \\ \tilde{K} (x, \{\partial\}) = 1 - K (x, E), \\ \tilde{K} (\partial, \{\partial\}) = 1, \end{cases}

for all $x \in E, A \in B (E)$. The kernel $\tilde{K}$ can be understood as the Markovian transition kernel of the Markov chain $(X_{t})$ on $E \cup \{\partial\}$ whose transitions in $E$ are specified by K, but which is “killed” forever once it leaves $E$.

In this paper, we assume $E$ is a finite state space and the process in consideration has a unique QSD. Assume that K is irreducible, then existence and uniqueness of the quasi-stationary distribution can be obtained by the Perron–Frobenius theorem [36].

An important Markovian kernel is the following $K_{α}$ , which is defined on $E$ only and has a “regenerative probability” $α$ .

Definition 1.

For any given $α \in P (E)$ and a sub-Markovian kernel K on $E$ , we define $K_{α}$ , a Markovian kernel on $E$ , as follows:

$K_{α} (x, A) : = K (x, A) + (1 - K (x, E)) α (A)$ (4)

for all $x \in E$ and $A \in B (E)$ .

$K_{α}$ is a Markovian kernel because $K_{α} (x, E) = 1$ . It is easy to sample $X_{t + 1} \sim K_{α} (X_{t}, \cdot)$ from any state $X_{t} \in E$ : run the transition as normal by using $\tilde{K}$ to have a next state denoted by Y, then $X_{t + 1} = Y$ if $Y \in E$ ; otherwise, sample $X_{t + 1}$ from $α$ .
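The construction in Definition 1 and the sampling recipe just described can be sketched as follows for a finite state space stored as a matrix. The helper names `regenerative_kernel` and `sample_step`, the 0-based state labels, and the demo inputs are assumptions made for this illustration.

```python
import numpy as np

def regenerative_kernel(K, alpha):
    """K_alpha(x, y) = K(x, y) + (1 - K(x, E)) * alpha(y), as in Equation (4)."""
    K = np.asarray(K, dtype=float)
    kill = 1.0 - K.sum(axis=1)              # killing probability 1 - K(x, E)
    return K + np.outer(kill, alpha)

def sample_step(K, alpha, x, rng):
    """Sample X_{t+1} ~ K_alpha(x, .): run the killed chain, regenerate from alpha if killed."""
    n = K.shape[0]
    kill = 1.0 - K[x].sum()
    # Index n plays the role of the absorbing state in this sketch.
    y = rng.choice(n + 1, p=np.append(K[x], kill))
    return y if y < n else rng.choice(n, p=alpha)

# Hypothetical inputs for illustration only.
rng = np.random.default_rng(0)
K_demo = np.array([[0.5, 0.3, 0.1],
                   [0.2, 0.5, 0.2],
                   [0.1, 0.3, 0.4]])
alpha_demo = np.array([0.2, 0.3, 0.5])
K_alpha = regenerative_kernel(K_demo, alpha_demo)
x_next = sample_step(K_demo, alpha_demo, 0, rng)
```

Each row of `K_alpha` sums to one, which is the statement $K_{α} (x, E) = 1$.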

We know that $α$ is the quasi-stationary distribution of K if and only if it is the stationary distribution of $K_{α}$:

α = α K_{α} .

(5)

It is easy to observe that $α = β$ if and only if $K_{α} = K_{β}$ for any two distributions $α$ and $β$ . Moreover, for every $α^{'}$ , $K_{α^{'}}$ has a unique invariant probability denoted by $Γ (α^{'})$ . Then, $α^{'} \mapsto Γ (α^{'})$ is continuous in $P (E)$ (i.e., for the topology of weak convergence), and there exists $α \in P (E)$ such that $α = Γ (α)$ or, equivalently, $α$ is a QSD for K.

2.2. Review of Simulation Methods for Quasi-Stationary Distribution

According to the above subsection, the QSD $α$ satisfies the fixed point problem as follows:

α = Γ (α),

(6)

where $Γ (α)$ is the stationary distribution of $K_{α}$ on $E$ . In general, (6) can be solved recursively by $α_{n + 1} \leftarrow Γ (α_{n})$ .

The Fleming–Viot (FV) method [19] evolves N particles independently of each other as a Markov process associated with the transition kernel $K_{α}$ until one succeeds in jumping to the absorbing state ∂. At that time, this killed particle is immediately reset to $E$ as an initial state uniformly chosen from one of the remaining $N - 1$ particles. The QSD $α$ is approximated by the empirical distribution of the N particles in total, and these particles can be regarded as samples from the quasi-stationary distribution $α$, as in the MCMC method.
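A discrete-time particle sketch in the spirit of the Fleming–Viot description above is given below. It is a simplification and not the reference algorithm of [19]: all particles move synchronously, several particles may be killed in the same step, and each killed particle is reset to the position of a uniformly chosen survivor of that step. The function name and default parameters are hypothetical.

```python
import numpy as np

def fleming_viot(K, n_particles=500, n_steps=200, seed=0):
    """Empirical distribution of interacting particles after n_steps (simplified FV scheme)."""
    rng = np.random.default_rng(seed)
    K = np.asarray(K, dtype=float)
    n = K.shape[0]
    # Cumulative transition probabilities; the last column is the absorbing state.
    cum = np.cumsum(np.column_stack([K, 1.0 - K.sum(axis=1)]), axis=1)
    states = rng.integers(n, size=n_particles)
    for _ in range(n_steps):
        u = rng.random(n_particles)
        # Inverse-CDF sampling of the next state; the value n means "killed".
        new = np.minimum((u[:, None] > cum[states]).sum(axis=1), n)
        dead = new == n
        alive_idx = np.flatnonzero(~dead)
        if alive_idx.size == 0:
            continue                      # all killed at once: redo the step from the old states
        # Reset each killed particle to the state of a uniformly chosen survivor.
        new[dead] = new[rng.choice(alive_idx, size=int(dead.sum()))]
        states = new
    return np.bincount(states, minlength=n) / n_particles

# Demo on a 3-state kernel with constant killing probability 0.1 (uniform QSD).
fv_estimate = fleming_viot(np.full((3, 3), 0.3))
```

Averaging the empirical distribution over the last steps, instead of using only the final configuration, would reduce the variance of the estimate.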

Ref. [37] proposed a simulation method that uses only one particle at each iteration to update $α$. At iteration n, given an $α_{n} \in P (E)$, one can run a discrete-time Markov chain $X^{(n + 1)}$ as normal on $E \cup \{\partial\}$ with initial $X_{0}^{(n + 1)} \sim α_{n}$; then, $α_{n + 1}$ is computed as the following weighted average of empirical distributions:

α_{n + 1} (x) : = α_{n} (x) + \frac{1}{n + 1} \frac{\sum_{k = 0}^{τ^{(n + 1)} - 1} (I (X_{k}^{(n + 1)} = x ∣ X_{0}^{(n + 1)} \sim α_{n}) - α_{n} (x))}{\frac{1}{n + 1} \sum_{j = 1}^{n + 1} τ^{(j)}}

(7)

where $n \geq 0$, I is the indicator function, and $τ^{(j)} = \min \{k \geq 0 ∣ X_{k}^{(j)} = \partial\}$ is the first extinction time for the process $X^{(j)}$. This iterative scheme has a convergence rate of $O (\frac{1}{\sqrt{n}})$.
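The single-particle update (7) can be sketched as below. The code uses the algebraically equivalent convex-combination form $α_{n + 1} = (1 - w) α_{n} + w \cdot \text{visits} / τ^{(n + 1)}$ with $w = τ^{(n + 1)} / \sum_{j \leq n + 1} τ^{(j)}$, which is a rewriting chosen here for numerical safety. The function name, the iteration count and the demo kernel are assumptions for illustration.

```python
import numpy as np

def vanilla_qsd(K, n_iter=500, seed=0):
    """Single-particle scheme (7): update alpha after each excursion until killing."""
    rng = np.random.default_rng(seed)
    K = np.asarray(K, dtype=float)
    n = K.shape[0]
    P = np.column_stack([K, 1.0 - K.sum(axis=1)])   # column n is the absorbing state
    alpha = np.full(n, 1.0 / n)
    total_tau = 0.0
    for _ in range(n_iter):
        x = rng.choice(n, p=alpha)                   # X_0 ~ alpha_n
        visits = np.zeros(n)
        while x < n:                                 # run the chain until extinction
            visits[x] += 1
            x = rng.choice(n + 1, p=P[x])
        tau = visits.sum()                           # extinction time of this excursion
        total_tau += tau
        w = tau / total_tau
        # Equation (7) written as a convex combination, which keeps alpha nonnegative.
        alpha = (1.0 - w) * alpha + w * visits / tau
        alpha /= alpha.sum()                         # guard against round-off drift
    return alpha

# Demo on a 3-state kernel with constant killing probability 0.1 (uniform QSD).
vanilla_estimate = vanilla_qsd(np.full((3, 3), 0.3))
```

Because only the running sum of exit times is stored, the sketch keeps one scalar in addition to the vector `alpha`.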

In [23,24], the above method is extended to the stochastic approximations framework:

α_{n + 1} (x) = Θ_{H} [α_{n} + ϵ_{n} \sum_{k = 0}^{τ^{(n + 1)} - 1} (I (X_{k}^{(n + 1)} = x | X_{0}^{(n + 1)} \sim α_{n}) - α_{n} (x))]

(8)

where $Θ_{H}$ denotes the $L_{2}$ projection into the probability simplex, and $ϵ_{n}$ is the step size satisfying $\sum ϵ_{n} = \infty$ and $\sum ϵ_{n}^{2} < \infty$ . Specifically, if $ϵ_{n} = O (\frac{1}{n^{r}})$ for $0.5 < r < 1$ , under a sufficient condition, they have $\sqrt{n^{r}} (α_{n} - α) \overset{d}{\to} N (0, V)$ for some matrix V [23,24]. If the Polyak–Ruppert averaging technique is applied to generate the following:

ν_{n} : = \frac{1}{n} \sum_{k = 1}^{n} α_{k},

(9)

then the convergence rate of $ν_{n} \to α$ becomes $\frac{1}{\sqrt{n}}$ [23,24].

The simulation schemes (7) and (8) need to sample the initial states according to $α_{n}$ and to add the empirical distribution and $α_{n}$ pointwise at each x. Thus, they are suitable for finite state space where $α$ is a probability vector saved in the tabular form. In (8), there is no need to record all exit times $τ^{(j)}, j = 1, \dots, n$, but the additional projection operation in (8) is computationally expensive since the cost is $O (m \log m)$ where $m = | E |$ [38,39].
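To make the $O (m \log m)$ cost concrete, here is a sketch of one widely used sort-based procedure for the Euclidean projection onto the probability simplex. Whether this is exactly the routine used in [23,24] is an assumption; the function name is hypothetical.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {p : p >= 0, sum(p) = 1}; the sort dominates the cost."""
    v = np.asarray(v, dtype=float)
    u = np.sort(v)[::-1]                        # entries in descending order
    css = np.cumsum(u) - 1.0
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]    # last index with a positive shifted entry
    shift = css[rho] / (rho + 1)
    return np.maximum(v - shift, 0.0)

p_demo = project_to_simplex([0.5, 0.8, -0.2])   # -> [0.35, 0.65, 0.0]
```

A vector that already lies in the simplex is returned unchanged, so the projection only acts when the stochastic update in (8) leaves the simplex.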

3. Learn Quasi-Stationary Distribution

We focus on the computation of the expression of the quasi-stationary distribution. In particular, when this distribution is parametrized in a certain manner by $θ$, we can extend the tabular form for finite-state Markov chain to any flexible form, even neural networks for a probability density function in $R^{d}$. However, we do not pursue this representation and expressivity issue here and restrict our discussion to finite state space only to illustrate our main idea first. In finite state space, $α (x)$ for $x \in E = \{1, \dots, m\}$ can be simply described as a softmax function with $m - 1$ parameters $θ_{i}$: $α (i) \propto e^{θ_{i}}$ for $1 \leq i \leq m$, where $θ_{m} = 0$ is fixed. This introduces no representation error. For the generalization to continuous space $E$ in jump and diffusion processes or even for a huge finite state space, a good representation of $α_{θ} (x)$ is important in practice.

In this section, we shall formulate our QSD problem in terms of reinforcement learning (RL) so that the problem of seeking optimal parameters becomes a policy optimization problem. We derive the policy gradient theorem to construct a gradient descent method for the optimal parameter. We then show a method for designing actor-critic algorithms based on stochastic optimization.

3.1. Formulation of RL and Policy Gradient Theorem

Before introducing the RL method of our QSD problem, we develop a general formulation by introducing the KL-divergence between two path distributions.

Let $P_{θ}$ and $Q_{θ}$ be two families of Markovian kernels on $E$ in parametric forms with the same set of parameters $θ \in Θ$ . Assume both $P_{θ}$ and $Q_{θ}$ are ergodic for any $θ$ . Let $T > 0$ and denote a path up to time T by $ω_{0}^{T} = (X_{0}, X_{1}, \dots, X_{T}) \in E^{T + 1}$ . Define the path distributions under the Markov chain kernel $P_{θ}$ and $Q_{θ}$ , respectively.

P_{θ} (ω_{0}^{T}) : = \prod_{t = 1}^{T} P_{θ} (X_{t} ∣ X_{t - 1}), Q_{θ} (ω_{0}^{T}) : = \prod_{t = 1}^{T} Q_{θ} (X_{t} ∣ X_{t - 1}) .

(10)

Define the KL divergence from $P_{θ}$ to $Q_{θ}$ on $E^{T + 1}$ :

D_{K L} (P_{θ} ∣ Q_{θ}) : = \sum_{ω_{0}^{T}} P_{θ} (ω_{0}^{T}) ln \frac{P_{θ} (ω_{0}^{T})}{Q_{θ} (ω_{0}^{T})} = - E_{P_{θ}} \sum_{t = 1}^{T} R_{θ} (X_{t - 1}, X_{t}),

(11)

where the expectation $E_{P_{θ}}$ is for the path $(X_{0}, X_{1}, \dots, X_{T})$ generated by the transition kernel $P_{θ}$ , and the following is called the (one-step) reward.

R_{θ} (X_{t - 1}, X_{t}) : = - ln \frac{P_{θ} (X_{t} ∣ X_{t - 1})}{Q_{θ} (X_{t} ∣ X_{t - 1})} .

(12)

Define the average reward $r (θ)$ as the time averaged negative KL divergence in the limit of $T \to \infty$ .

r (θ) : = - \lim_{T \to \infty} \frac{1}{T} D_{K L} (P_{θ} ∣ Q_{θ}) = \lim_{T \to \infty} \frac{1}{T} E_{P_{θ}} \sum_{t = 1}^{T} R_{θ} (X_{t - 1}, X_{t}) .

(13)

Due to ergodicity of $P_{θ}$, $r (θ) = \sum_{x_{0}, x_{1}} R_{θ} (x_{0}, x_{1}) P_{θ} (x_{1} | x_{0}) μ_{θ} (x_{0})$, where $μ_{θ}$ is the invariant measure of $P_{θ}$; in particular, $r (θ)$ is independent of the initial state $X_{0}$. Since the KL divergence is nonnegative, $r (θ) \leq 0$ for any $θ$.

Property 1.

The following are equivalent:

1. $r (θ)$ reaches its maximal value 0 at $θ^{*}$;

2. $P_{θ^{*}} = Q_{θ^{*}}$ in $P (E^{T + 1})$ for any $T > 0$;

3. $P_{θ^{*}} = Q_{θ^{*}}$;

4. $R_{θ^{*}} \equiv 0$.

Proof.

We only need to show $(1) ⟹ (3)$ . It is easy to see that

$r (θ) = - \sum_{x_{0}} D_{K L} (P_{θ} (\cdot | x_{0}) | Q_{θ} (\cdot | x_{0})) μ_{θ} (x_{0}) .$

If $r (θ) = 0$ , since $μ_{θ} > 0$ , then

$D_{K L} (P_{θ} (\cdot | x_{0}) | Q_{θ} (\cdot | x_{0})) = 0 \forall x_{0} .$

Thus, we have $P_{θ} = Q_{θ}$ . □

The above property establishes the relationship between the RL problem and QSD problem.

We show our theoretic main result below as the foundation of our algorithm to be developed later. This theorem can be regarded as one type of the policy gradient theorem for the policy gradient method in reinforcement learning [31].

Define the following value function ([31] Chapter 13).

V (x) : = lim_{T \to \infty} \sum_{t = 1}^{T} E_{P_{θ}} [R_{θ} (X_{t - 1}, X_{t}) - r (θ) ∣ X_{0} = x] .

(14)

Certainly, V also depends on $θ$ , although we do not write $θ$ explicitly.

Theorem 1 (policy gradient theorem).

We have the following two properties:

1. At any $θ$, for any $x \in E$, the following Bellman-type equation holds for the value function V and the average reward $r (θ)$:
$V (x) = E_{Y \sim P_{θ} (\cdot ∣ x)} [V (Y) + R_{θ} (x, Y) - r (θ)] .$ (15)

2. The gradient of the average reward $r (θ)$ is the following:
$\begin{aligned} \nabla_{θ} r (θ) = {} & E [\nabla_{θ} \ln Q_{θ} (Y ∣ X)] \\ & + E [(V (Y) - V (X) + R_{θ} (X, Y) - r (θ)) \nabla_{θ} \ln P_{θ} (Y ∣ X)], \end{aligned}$ (16)

where expectations are for the joint distribution $(X, Y) \sim μ_{θ} (x) P_{θ} (y ∣ x)$ where $μ_{θ}$ is the stationary measure of $P_{θ}$.

Proof.

We shall prove the Bellman equation first and then we use the Bellman equation to derive the gradient of the average reward $r (θ)$ . For any $x_{0} \in E$ , by writing $ω_{0}^{T} = (x_{0}, \dots, x_{T})$ and defining

$Δ R_{θ} (ω_{0}^{T}) = \sum_{t = 1}^{T} (R (x_{t - 1}, x_{t}) - r (θ)),$

we have the following:

$\begin{aligned} V (x_{0}) & = \lim_{T \to \infty} E_{P_{θ}} [Δ R_{θ} (ω_{0}^{T}) ∣ X_{0} = x_{0}] \\ & = \lim_{T \to \infty} \sum_{x_{2}, \dots, x_{T}} \sum_{x_{1}} ((\prod_{t = 2}^{T} P_{θ} (x_{t} ∣ x_{t - 1})) P_{θ} (x_{1} ∣ x_{0}) Δ R_{θ} (ω_{0}^{T})) \\ & = \lim_{T \to \infty} \sum_{x_{1}} (P_{θ} (x_{1} ∣ x_{0}) \sum_{x_{2}, \dots, x_{T}} (\prod_{t = 2}^{T} P_{θ} (x_{t} ∣ x_{t - 1}) [Δ R_{θ} (ω_{1}^{T}) + Δ R_{θ} (ω_{0}^{1})])) \\ & = \sum_{x_{1}} (P_{θ} (x_{1} ∣ x_{0}) (\lim_{T \to \infty} [\sum_{x_{2}, \dots, x_{T}} \prod_{t = 2}^{T} P_{θ} (x_{t} ∣ x_{t - 1}) Δ R_{θ} (ω_{1}^{T})] + Δ R_{θ} (ω_{0}^{1}))) \\ & = \sum_{x_{1}} P_{θ} (x_{1} ∣ x_{0}) [V (x_{1}) + R_{θ} (x_{0}, x_{1})] - r (θ), \end{aligned}$ (17)

which proves (15); in other words, we have the following.

$r (θ) = E_{Y \sim P_{θ} (\cdot ∣ x)} [V (Y) + R_{θ} (x, Y) - V (x)], \forall x \in E .$

Next, we compute the gradient of $r (θ)$ . By trivial equality of the following:

$\sum_{x_{1}} P_{θ} (x_{1} ∣ x_{0}) \nabla_{θ} ln P_{θ} (x_{1} ∣ x_{0}) = \nabla_{θ} \sum_{x_{1}} P_{θ} (x_{1} ∣ x_{0}) = 0,$ (18)

and the definition (12), we can write the gradient of $r (θ)$ as follows.

$\begin{aligned} \nabla_{θ} r (θ) = {} & \sum_{y} \nabla_{θ} P_{θ} (y ∣ x) [V (y) + R_{θ} (x, y) - V (x)] \\ & + \sum_{y} P_{θ} (y ∣ x) [\nabla_{θ} V (y) - \nabla_{θ} V (x) + \nabla_{θ} \ln Q_{θ} (y ∣ x)] . \end{aligned}$

We here keep the term $V (x)$ in the first line, even though it has no contribution here (in fact, to add any constant to $V (x)$ is also fine). Since this equation holds for all states x on the right-hand side, we take the expectation with respect to $μ_{θ}$ , the stationary distribution of $P_{θ}$ . Thus, we have the following.

$\begin{aligned} \nabla_{θ} r (θ) = {} & \sum_{x, y} μ_{θ} (x) \nabla_{θ} P_{θ} (y ∣ x) [V (y) + R_{θ} (x, y) - V (x)] \\ & + \sum_{x, y} μ_{θ} (x) P_{θ} (y ∣ x) [\nabla_{θ} V (y) - \nabla_{θ} V (x) + \nabla_{θ} \ln Q_{θ} (y ∣ x)] \\ = {} & \sum_{x, y} μ_{θ} (x) \nabla_{θ} P_{θ} (y ∣ x) [V (y) + R_{θ} (x, y) - V (x)] \\ & + \sum_{y} μ_{θ} (y) \nabla_{θ} V (y) - \sum_{x} μ_{θ} (x) \nabla_{θ} V (x) + \sum_{x, y} μ_{θ} (x) P_{θ} (y ∣ x) \nabla_{θ} \ln Q_{θ} (y ∣ x) \\ = {} & \sum_{x, y} μ_{θ} (x) P_{θ} (y ∣ x) [V (y) + R_{θ} (x, y) - V (x)] \nabla_{θ} \ln P_{θ} (y ∣ x) \\ & + \sum_{x, y} μ_{θ} (x) P_{θ} (y ∣ x) \nabla_{θ} \ln Q_{θ} (y ∣ x) . \end{aligned}$

Here, the second equality uses the stationarity $\sum_{x} μ_{θ} (x) P_{θ} (y ∣ x) = μ_{θ} (y)$, so that the two terms involving $\nabla_{θ} V$ cancel. In fact, we can subtract any constant b (independent of x and y) inside the square bracket of the last line without changing the equality, due to the following fact similar to (18): $\sum_{x, y} μ_{θ} (x) \nabla_{θ} P_{θ} (y ∣ x) = \sum_{x} μ_{θ} (x) \nabla_{θ} \sum_{y} P_{θ} (y ∣ x) = 0$. Equation (16) is the special case $b = r (θ)$. □

Remark 1.

As shown in the proof, (16) holds if $r (θ)$ at the right-hand side is replaced by any constant number b. $b = r (θ)$ is a good choice to reduce the variance since $r (θ)$ can be regarded as the expectation of $R_{θ}$ .

Remark 2.

If $P_{θ} = Q_{θ}$ , then the first term of (16) vanishes due to (18) and the second term of (16) vanishes due to (15).

Remark 3.

The name of “policy” here refers to the role of θ as the policy for decision makers to improve reward $r (θ)$ .
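The quantities of this subsection can be checked numerically on a small example. The sketch below computes the stationary measure $μ_{θ}$, the reward (12), the average reward $r (θ)$ and the value function (14) for two fixed transition matrices, and then verifies the Bellman-type equation (15). Solving for V through the matrix $I - P + 1 μ^{\top}$ is a standard linear-algebra device chosen here, not a step taken from the paper; the matrices `P_demo` and `Q_demo` are hypothetical.

```python
import numpy as np

def stationary(P):
    """Stationary distribution of an ergodic transition matrix P."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])     # mu P = mu and sum(mu) = 1
    b = np.append(np.zeros(n), 1.0)
    return np.linalg.lstsq(A, b, rcond=None)[0]

def reward_and_value(P, Q):
    mu = stationary(P)
    R = -np.log(P / Q)                     # one-step reward, Equation (12)
    Rbar = (P * R).sum(axis=1)             # expected one-step reward from each state
    r = float(mu @ Rbar)                   # average reward r(theta)
    n = P.shape[0]
    # V = sum_{t >= 0} P^t (Rbar - r), obtained by one linear solve.
    V = np.linalg.solve(np.eye(n) - P + np.outer(np.ones(n), mu), Rbar - r)
    return mu, R, r, V

# Two hypothetical ergodic kernels with strictly positive entries.
P_demo = np.array([[0.5, 0.3, 0.2], [0.2, 0.6, 0.2], [0.3, 0.3, 0.4]])
Q_demo = np.array([[0.4, 0.4, 0.2], [0.3, 0.5, 0.2], [0.2, 0.3, 0.5]])
mu, R, r, V = reward_and_value(P_demo, Q_demo)
bellman_residual = V - (P_demo @ V + (P_demo * R).sum(axis=1) - r)   # should vanish by (15)
```

Since `P_demo` differs from `Q_demo`, the computed `r` is strictly negative, in line with Property 1.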

3.2. Learn QSD

Now, we discuss how to connect QSD with the results in the previous subsection. In view of Equation (5), we introduce $β : = α K_{α}$ as the one-step distribution if starting from the initial $α$ ; in other words, we have the following.

β (y) : = \sum_{x \in E} α (x) K_{α} (x, y), \forall y

(19)

By (5), $α$ is a QSD if and only if $β = α$ . However, we do not directly compare these two distributions $α$ and $β$ . Instead, we consider their Markovian kernels induced by (4): $K_{α}$ and $K_{β}$ . Our approach is to consider KL divergence similar to (11) between two kernels $K_{α}$ and $K_{β}$ since $α = β$ if and only if $K_{α} = K_{β}$ . In this manner, one can view $K_{α}$ and $K_{β}$ (note $β = α K_{α}$ ) as two transition matrices $P_{θ}$ and $Q_{θ}$ in the previous section, in which the parameter $θ$ here is in fact the distribution $α$ .

To have a further representation of the distribution $α$ , which is a (probability mass) function on $E$ , we propose a parametrized family for $α$ in the form $α_{θ}$ where $θ$ is a generic parameter. In the simplest case, $α_{θ}$ takes the so-called soft-max form $α_{θ} (i) = \frac{e^{θ_{i}}}{\sum_{j \geq 1} e^{θ_{j}}}$ if $E = {1, \dots, N}$ for $θ = (θ_{1}, \dots, θ_{N - 1}, θ_{N} \equiv 0) .$ This parametrization represents $α$ without any approximation error for finite state space and the effective space of $θ$ is just $R^{N - 1}$ . For certain problems, particularly with large state space, if one has some prior knowledge about the structure of the function $α$ on $E$ , one might propose other parametric forms of $α_{θ}$ with the dimension of $θ$ less than the cardinality $| E |$ to improve the efficiency, although the extra representation error in this manner has to be introduced.

For any given $α_{θ} \in P (E)$, the corresponding Markovian kernel $K_{α_{θ}}$ is then defined in (4) and $β_{θ} = α_{θ} K_{α_{θ}}$ is defined by (19). $K_{β_{θ}}$ is likewise defined by (4). To use the formulation in Section 3.1, we choose $P_{θ} = K_{α_{θ}}$ and $Q_{θ} = K_{β_{θ}}$. Define the objective function as before:

r (θ) : = - \lim_{T \to \infty} \frac{1}{T} D_{K L} (P_{θ} ∣ Q_{θ}) = \lim_{T \to \infty} \frac{1}{T} E_{P_{θ}} \sum_{t = 1}^{T} R_{θ} (X_{t - 1}, X_{t}),

where the one-step reward is as follows.

R_{θ} (x, y) = - ln \frac{K_{α_{θ}} (x, y)}{K_{β_{θ}} (x, y)} .

The value function $V (x)$ is defined similarly. Theorem 1 now provides the expression of the following gradient:

\begin{aligned} \nabla_{θ} r (θ) = E [ & (R_{θ} (X, Y) - r (θ) + V (Y) - V (X)) \nabla_{θ} \ln K_{α_{θ}} (X, Y) \\ & + \nabla_{θ} \ln K_{β_{θ}} (X, Y)] \end{aligned}

(20)

where $(X, Y) \sim μ_{θ} (x) K_{α_{θ}} (x, y)$ and where $μ_{θ}$ is the stationary measure of $K_{α_{θ}}$ .
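For a finite state space, the objective $r (θ)$ of this subsection can be evaluated exactly, which is useful as a sanity check. The sketch below builds $α_{θ}$, $K_{α_{θ}}$, $β_{θ}$ and $K_{β_{θ}}$ and returns $r (θ)$. The function names, the 0-based indexing and the test kernel (three states with constant killing probability 0.1, whose QSD is uniform) are assumptions made for this example.

```python
import numpy as np

def softmax_alpha(theta):
    """alpha_theta in soft-max form with the last logit fixed to zero."""
    z = np.append(theta, 0.0)
    e = np.exp(z - z.max())
    return e / e.sum()

def regenerative_kernel(K, dist):
    """K_dist(x, y) = K(x, y) + (1 - K(x, E)) * dist(y), Equation (4)."""
    return K + np.outer(1.0 - K.sum(axis=1), dist)

def average_reward(K, theta):
    """Exact r(theta) for P = K_alpha and Q = K_beta with beta = alpha K_alpha."""
    K = np.asarray(K, dtype=float)
    n = K.shape[0]
    alpha = softmax_alpha(theta)
    P = regenerative_kernel(K, alpha)
    beta = alpha @ P                                 # Equation (19)
    Q = regenerative_kernel(K, beta)
    # Stationary measure mu of P from mu P = mu, sum(mu) = 1.
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    mu = np.linalg.lstsq(A, np.append(np.zeros(n), 1.0), rcond=None)[0]
    R = -np.log(P / Q)                               # one-step reward
    return float(mu @ (P * R).sum(axis=1))

K_demo = np.full((3, 3), 0.3)                        # killing probability 0.1 from every state
r_at_qsd = average_reward(K_demo, np.zeros(2))       # theta = 0 gives the uniform alpha
r_elsewhere = average_reward(K_demo, np.array([1.0, -0.5]))
```

At the uniform distribution the reward is zero up to round-off, and it is strictly negative at the other parameter value, consistent with Property 1.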

The optimal $θ^{*}$ for the QSD $α_{θ}$ maximizes $r (θ)$, and it can be computed by the gradient ascent iteration:

θ_{t + 1} = θ_{t} + η_{t}^{θ} \nabla_{θ} r (θ_{t}) .

(21)

where $η_{t}^{θ} > 0$ is the step size. In practice, the stochastic gradient is applied:

\nabla_{θ} r (θ_{t}) \approx \nabla_{θ} ln K_{α_{θ}} (X_{t}, X_{t + 1}) \times δ (X_{t}, X_{t + 1}) + \nabla_{θ} ln K_{β_{θ}} (X_{t}, X_{t + 1})

where $X_{t}, X_{t + 1}$ are sampled based on the Markovian kernel $K_{α_{θ}}$ (see Algorithm 1) and the differential temporal-difference (TD) error $δ_{t}$ is as follows.

δ_{t} = δ (X_{t}, X_{t + 1}) = R_{θ} (X_{t}, X_{t + 1}) - r (θ_{t}) + V (X_{t + 1}) - V (X_{t}) .

(22)

Next, we need to address a remaining issue, which is the question of how to compute the value function V and $r (θ_{t})$ in the TD error (22). In addition, we also need to show the details of computing $\nabla_{θ} \ln K_{α_{θ}}$ and $\nabla_{θ} \ln K_{β_{θ}}$.

3.3. Actor-Critic Algorithm

With the stochastic gradient method (21), we can obtain optimal policy $θ^{*}$ . We refer to (21) as the learning dynamics for the policy, and it is generally known as actor. To calculate the value function V appearing in $\nabla r (θ)$ , we need to have a new learning dynamic, which is called critic. Then, the overall policy-gradient method is termed as the actor-critic method.

We start with the Bellman Equation (15) for the value function and consider the mean-square-error loss as follows:

MSE [V] = \frac{1}{2} \sum_{x} ν (x) {(\sum_{y} K_{α_{θ}} (x, y) [V (y) + R_{θ} (x, y) - r (θ)] - V (x))}^{2}

where $ν$ is any distribution supported on $E$. $MSE [V] = 0$ if and only if V satisfies the Bellman Equation (15), i.e., V is the value function. To learn V, we introduce a function approximation for the value function, $V_{ψ}$, with the parameter $ψ$, and minimize the following:

MSE (ψ) = \frac{1}{2} \sum_{x} ν (x) {(\sum_{y} K_{α_{θ}} (x, y) [V (y) + R_{θ} (x, y) - r (θ)] - V_{ψ} (x))}^{2}

by the semi-gradient method ([31], Chapter 9).

\begin{matrix} \nabla_{ψ} MSE (ψ) & = - \sum_{x, y} ν (x) K_{α_{θ}} (x, y) [V (y) + R_{θ} (x, y) - r (θ) - V_{ψ} (x)] \nabla_{ψ} V_{ψ} (x) \\ \approx - \sum_{x, y} ν (x) K_{α_{θ}} (x, y) [V_{ψ} (y) + R_{θ} (x, y) - r (θ) - V_{ψ} (x)] \nabla_{ψ} V_{ψ} (x) \end{matrix}

Here, the term $V (y)$ is frozen first and then approximated by $V_{ψ}$ since it could be treated as a prior guess of the value function for the future state.

Then, for the gradient descent iteration $ψ_{t + 1} = ψ_{t} - η_{t}^{ψ} \nabla_{ψ} MSE (ψ_{t})$ where $η_{t}^{ψ}$ is the step size, we can have the following stochastic gradient iteration:

ψ_{t + 1} = ψ_{t} + η_{t}^{ψ} δ (X_{t}, X_{t + 1}) \nabla_{ψ} V_{ψ_{t}} (X_{t})

(23)

where the differential temporal-difference (TD) error $δ$ is the one defined in (22), with V replaced by its approximation $V_{ψ_{t}}$:

δ_{t} = δ (X_{t}, X_{t + 1}) = R_{θ_{t}} (X_{t}, X_{t + 1}) - r (θ_{t}) + V_{ψ_{t}} (X_{t + 1}) - V_{ψ_{t}} (X_{t}) .

Here, for the sake of simplicity, $(X_{t}, X_{t + 1})$ are the same samples as in the actor method for $θ_{t}$ . This means that distribution $ν$ above is chosen as $μ$ used for the gradient $\nabla_{θ} r (θ)$ .

Next, we consider the calculation of the reward $r (θ)$ by the following Bellman Equation (15).

\sum_{x} μ (x) \sum_{y} K_{α_{θ}} (x, y) (R_{θ} (x, y) - r (θ) + V (y) - V (x)) = 0

Let $r_{t}$ be the estimate of the reward $r (θ_{t})$ at time t. We can update our estimate of the reward every time a transition occurs as follows:

r_{t + 1} = r_{t} + η_{t}^{r} \times δ_{t}

(24)

where $δ_{t}$ is the TD error as before, with $r (θ_{t})$ replaced by its estimate $r_{t}$:

δ_{t} = δ (X_{t}, X_{t + 1}) = R_{θ_{t}} (X_{t}, X_{t + 1}) - r_{t} + V_{ψ_{t}} (X_{t + 1}) - V_{ψ_{t}} (X_{t}) .

In conclusion, (21), (23) and (24) together constitute the actor-critic algorithm, which is summarized in Algorithm 1. We remark that Algorithm 1 can be easily adapted to use the mini-batch gradient method where several copies of $(X_{t}, X_{t + 1})$ are sampled, and the average is used to update the parameters. The stationary distribution $μ_{θ}$ of $K_{α_{θ}}$ is sampled by running the corresponding Markov chain for several steps with “warm start”: the initial state for $θ_{t + 1}$ is set as the final state generated from the previous iteration at $θ_{t}$. The length of this “burn-in” period can be set as just one step in practice for efficiency.
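The updates (21), (23) and (24) can be assembled into the following minimal sketch for the soft-max parametrization and a tabular value function. Several details are assumptions rather than the paper's specification: the order of the three updates, the constant step sizes, the one-step warm start, and the gradient formulas, which are derived here directly from (4) and (19). All function names are hypothetical, step sizes are untuned, and no convergence rate is claimed for this sketch.

```python
import numpy as np

def kernels_and_grads(K, theta):
    """Return alpha, P = K_alpha, Q = K_beta and the gradients of ln P, ln Q w.r.t. theta."""
    K = np.asarray(K, dtype=float)
    c = 1.0 - K.sum(axis=1)                               # killing probabilities 1 - K(x, E)
    z = np.append(theta, 0.0)
    alpha = np.exp(z - z.max())
    alpha /= alpha.sum()
    J = (np.diag(alpha) - np.outer(alpha, alpha))[:-1]    # J[i, y] = d alpha(y) / d theta_i
    P = K + np.outer(c, alpha)
    beta = alpha @ P                                      # beta = alpha K + (alpha . c) alpha
    Q = K + np.outer(c, beta)
    Jb = J @ K + np.outer(J @ c, alpha) + (alpha @ c) * J # d beta(y) / d theta_i
    gP = c[None, :, None] * J[:, None, :] / P             # gP[i, x, y] = d ln P(x, y) / d theta_i
    gQ = c[None, :, None] * Jb[:, None, :] / Q
    return alpha, P, Q, gP, gQ

def actor_critic_qsd(K, theta0, n_steps=5000, lr_theta=0.5, lr_psi=0.1, lr_r=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = np.asarray(K).shape[0]
    theta, psi, r, x = np.array(theta0, dtype=float), np.zeros(n), 0.0, 0
    for _ in range(n_steps):
        _, P, Q, gP, gQ = kernels_and_grads(K, theta)
        y = rng.choice(n, p=P[x])                         # X_{t+1} ~ K_alpha(X_t, .)
        reward = -np.log(P[x, y] / Q[x, y])
        delta = reward - r + psi[y] - psi[x]              # TD error with estimates r_t, V_psi
        r += lr_r * delta                                 # average-reward update (24)
        psi[x] += lr_psi * delta                          # critic update (23), tabular V_psi
        theta += lr_theta * (delta * gP[:, x, y] + gQ[:, x, y])   # actor update (21)
        x = y                                             # warm start with a one-step burn-in
    return kernels_and_grads(K, theta)[0]

# Demo: 3-state kernel with killing probability 0.1, started away from the uniform QSD.
alpha_hat = actor_critic_qsd(np.full((3, 3), 0.3), theta0=[1.0, -0.5])
```

The sketch assumes all entries of K are positive, so that every logarithm and division is well defined; a kernel with zero entries would need the gradients to be evaluated only at the sampled pair.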

Remark 4.

Finally, we remark on the computation of $\nabla_{θ} ln K_{α_{θ}}$ and $\nabla_{θ} ln K_{β_{θ}}$ in Algorithm 1. The details are shown in Appendix A. We comment that the main computational cost is the function $K (x, E)$ , which has to be pre-computed and stored. If the problem has some special structure, the function could be approximated in parametric form. Another special case is our second example where $K (x, E) = 0 \forall x \in {2, 3, \dots, N}$ .

4. Numerical Experiment

In this section, we present two examples to demonstrate Algorithm 1. We refer to the algorithms (7), (8) and (9) in Section 2.2, which were used in [23,24], as the Vanilla Algorithm, the Projection Algorithm and the Polyak Averaging Algorithm, respectively. Let 0 be the absorbing state and let $E = {1, \dots, N}$ be the set of non-absorbing states; the Markov transition matrix on ${0, \dots, N}$ is denoted by the following:

\tilde{K} = [\begin{matrix} 1 & 0 \\ * & K \end{matrix}],

where K is an N-by-N sub-Markovian matrix. For Algorithm 1, distribution $α_{θ}$ on $E$ is always parameterized as follows:

α_{θ} = \frac{1}{e^{θ_{1}} + \dots + e^{θ_{N - 1}} + 1} [e^{θ_{1}}, \dots, e^{θ_{N - 1}}, 1],

and the value function $V_{ψ} (x)$ is represented in tabular form for simplicity:

V_{ψ} = [ψ_{1}, \dots, ψ_{N}]

where $ψ \in R^{N}$ .
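The parameterization of $α_{θ}$ above is a softmax in which the last logit is fixed to zero. The following sketch shows one way to evaluate it together with its Jacobian $\partial α_{θ} (j) / \partial θ_{i} = α_{θ} (j) (1_{i = j} - α_{θ} (i))$ . The function names `alpha_from_theta` and `alpha_jacobian` are hypothetical, and the shift by the largest logit is an implementation choice for numerical stability that does not change the result.

```python
import numpy as np


def alpha_from_theta(theta):
    """Map theta in R^{N-1} to a distribution on N states; the last logit is 0."""
    logits = np.append(np.asarray(theta, dtype=float), 0.0)
    logits -= logits.max()            # softmax is invariant under a common shift
    w = np.exp(logits)
    return w / w.sum()


def alpha_jacobian(theta):
    """J[i, j] = d alpha_j / d theta_i, with shape (N - 1, N)."""
    a = alpha_from_theta(theta)
    n = a.size
    return (np.eye(n)[: n - 1] - a[: n - 1, None]) * a[None, :]


theta0 = np.array([-1.0, 1.0])        # initial value used for the first example
alpha0 = alpha_from_theta(theta0)
J0 = alpha_jacobian(theta0)
print(alpha0, alpha0.sum())
```

Fixing one logit removes the redundancy of the softmax, so $θ \in R^{N - 1}$ determines $α_{θ}$ uniquely.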

4.1. Loopy Markov Chain

We test a toy example of the three-state loopy Markov chain, which was considered in [23,24]. The transition probability matrix for the four states ${0, 1, 2, 3}$ is as follows.

\tilde{K} = [\begin{matrix} 1 & 0 & 0 & 0 \\ ϵ & \frac{1 - ϵ}{3} & \frac{1 - ϵ}{3} & \frac{1 - ϵ}{3} \\ ϵ & \frac{1 - ϵ}{3} & \frac{1 - ϵ}{3} & \frac{1 - ϵ}{3} \\ ϵ & \frac{1 - ϵ}{3} & \frac{1 - ϵ}{3} & \frac{1 - ϵ}{3} \end{matrix}], ϵ \in (0, 1) .

The state 0 is the absorbing state ∂ and $E = {1, 2, 3}$ . K is the sub-matrix of $\tilde{K}$ corresponding to the states ${1, 2, 3}$ . With the probability $ϵ$ , the process exits $E$ directly from state 1, 2 or 3. The true quasi-stationary distribution of this example is the uniform distribution for any $ϵ$ .
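The claim that the QSD is uniform can be checked numerically through the eigenvector characterization mentioned in the Introduction (the QSD is the principal left eigenvector of the sub-Markovian matrix). The snippet below is a small sanity check under that characterization, not part of the proposed algorithm; the helper name `qsd_by_eigenvector` is an assumption.

```python
import numpy as np


def qsd_by_eigenvector(K):
    """Principal left eigenvector of a sub-Markovian K, normalized to sum to 1."""
    eigvals, eigvecs = np.linalg.eig(K.T)    # right eigenvectors of K^T
    k = np.argmax(eigvals.real)              # index of the principal eigenvalue
    v = eigvecs[:, k].real
    return v / v.sum(), eigvals[k].real


eps = 0.1
K = np.full((3, 3), (1.0 - eps) / 3.0)       # loopy chain restricted to E = {1, 2, 3}
qsd, lam = qsd_by_eigenvector(K)
print(qsd, lam)
```

For this matrix every row of $K$ sums to $1 - ϵ$ , so the uniform vector is a left eigenvector with eigenvalue $1 - ϵ$ for any $ϵ \in (0, 1)$ .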

In order to show the advantage of our algorithm, we consider two cases: (1) $ϵ = 0.1$ and (2) $ϵ = 0.9$ . For a larger $ϵ$ , the original Markov chain exits $E$ very easily; thus, each iteration takes less time, but the convergence of the Vanilla Algorithm is slower.

In order to quantify the accuracy of the learned quasi-stationary distribution, we compute the $L_{2}$ norm of the error between the learned quasi-stationary distribution and the true values.

In Figure 1, we compute the QSD when $ϵ = 0.1$ . We set the initial values $θ_{0} = [- 1, 1], ψ_{0} = [0, 0, 0], r_{0} = 0$ , the learning rates $η_{n}^{θ} = max {1 / n^{0.1}, 0.2}, η_{n}^{ψ} = 0.0001, η_{n}^{r} = 0.0001$ and the batch size to 4. The step size for the Projection Algorithm is $ϵ_{n} = n^{- 0.99}$ . Figure 2 is for the case when $ϵ = 0.9$ . We set the initial values $θ_{0} = [4, - 2], ψ_{0} = [0, 0, 0], r_{0} = 0$ , the learning rates $η_{n}^{θ} = 0.04, η_{n}^{ψ} = 0.0001, η_{n}^{r} = 0.0001$ and the batch size to 32. The step size for the Projection Algorithm is again $ϵ_{n} = n^{- 0.99}$ .

Figure 1. The loopy Markov chain example with $ϵ = 0.1$ . The figure shows the log–log plots of the $L_{2}$ -norm error of the Vanilla Algorithm (a), Projection Algorithm (b), Polyak Averaging Algorithm (c) and our actor-critic algorithm (d). The iteration for the actor-critic algorithm is defined as one step of gradient descent (“t” in Algorithm 1).

Figure 2. The loopy Markov chain example with $ϵ = 0.9$ . The figure shows the log–log plots of the $L_{2}$ -norm error of the Vanilla Algorithm (a), Projection Algorithm (b), Polyak Averaging Algorithm (c) and our actor-critic algorithm (d).

4.2. M/M/1/N Queue with Finite Capacity and Absorption

Our second example is an M/M/1 queue with finite queue capacity. The 0 state has been set as an absorbing state. The transition probability matrix on ${0, \dots, N}$ takes the following form:

\tilde{K} = [\begin{matrix} 1 & 0 & 0 & 0 & 0 & \dots & 0 & 0 \\ μ_{1} & 0 & λ_{1} & 0 & 0 & \dots & 0 & 0 \\ 0 & μ_{2} & 0 & λ_{2} & 0 & \dots & 0 & 0 \\ 0 & 0 & μ_{3} & 0 & λ_{3} & \dots & 0 & 0 \\ ⋮ & ⋮ & ⋮ & ⋮ & ⋮ & ⋱ & ⋮ & ⋮ \\ 0 & 0 & 0 & 0 & 0 & \dots & 0 & λ_{N - 1} \\ 0 & 0 & 0 & 0 & 0 & \dots & 1 & 0 \end{matrix}]

where $λ_{i} = \frac{ρ_{i}}{ρ_{i} + 1}$ , $μ_{i} = \frac{1}{ρ_{i} + 1}$ , $i \in {1, 2, \dots, N - 1}$ . $ρ_{i} > 1$ means a higher chance to jump to the right than to the left. A larger $ρ_{i}$ gives a smaller probability of exiting $E$ . Note that $K (x, E) = 1$ for $x \in {2, \dots, N}$ . Thus, $K_{α} (x, y) = K (x, y)$ for any $α$ if $x \neq 1$ and $K_{α} (1, y) = K (1, y) + μ_{1} α (y) = \{\begin{matrix} μ_{1} α (1) & y = 1, \\ λ_{1} + μ_{1} α (2) & y = 2, \\ μ_{1} α (y) & 3 \leq y \leq N . \end{matrix}$ Then, $R_{θ} (x, y) = - ln \frac{K_{α_{θ}} (x, y)}{K_{β_{θ}} (x, y)} = 0$ if $x \neq 1$ and by (20), the gradient is simplified as follows:

\nabla_{θ} r (θ) = E_{Y} [(R_{θ} (1, Y) - r (θ) + V (Y) - V (1)) \nabla_{θ} ln K_{α_{θ}} (1, Y) + \nabla_{θ} ln K_{β_{θ}} (1, Y)]

where Y follows distribution $K_{α} (1, \cdot)$ .
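To make the structure of this example concrete, the sketch below assembles the sub-Markovian matrix $K$ of the queue and the kernel $K_{α} (x, y) = K (x, y) + (1 - K (x, E)) α (y)$ for a small $N$ . It is an illustrative construction under the transition matrix displayed above; the function names `queue_K` and `K_alpha`, the 0-based array indexing and the choice $N = 6$ are assumptions of the example.

```python
import numpy as np


def queue_K(rho):
    """N-by-N sub-Markovian matrix on E = {1, ..., N}; rho has length N - 1."""
    rho = np.asarray(rho, dtype=float)
    N = rho.size + 1
    lam, mu = rho / (rho + 1.0), 1.0 / (rho + 1.0)
    K = np.zeros((N, N))
    for i in range(1, N):                 # states i = 1, ..., N - 1 (1-based)
        K[i - 1, i] = lam[i - 1]          # jump to the right: i -> i + 1
        if i >= 2:
            K[i - 1, i - 2] = mu[i - 1]   # jump to the left: i -> i - 1
    K[N - 1, N - 2] = 1.0                 # state N always moves to N - 1
    return K


def K_alpha(K, alpha):
    """K_alpha(x, y) = K(x, y) + (1 - K(x, E)) * alpha(y)."""
    exit_prob = 1.0 - K.sum(axis=1)       # nonzero only at state 1 in this example
    return K + exit_prob[:, None] * alpha[None, :]


N = 6
K = queue_K(np.full(N - 1, 1.25))         # constant rho_i = 1.25
alpha = np.full(N, 1.0 / N)               # an arbitrary candidate distribution on E
Ka = K_alpha(K, alpha)
print(Ka.sum(axis=1))                     # K_alpha is a Markov kernel on E
```

Only the first row of `Ka` differs from `K`, which is why the gradient above involves transitions out of state 1 only.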

We consider two cases: (1) a constant $ρ_{i} = 1.25$ and (2) a state-dependent $ρ_{i} = 2 - \frac{3}{2 N - 4} (i - 1)$ . Note that $ρ_{i} = 1$ gives an equal probability of jumping to the left and to the right. Thus, in case (1), there is a boundary layer at the rightmost end and in case (2), we expect to see a peak of the QSD near $i \approx 2 N / 3$ . Figure 3 shows the true QSD in both cases. We set $N = 500$ .
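As a quick check of the expected peak location in case (2), one can solve $ρ_{i} = 1$ for $i$ and substitute $N = 500$ :

2 - \frac{3}{2 N - 4} (i - 1) = 1 \iff i = 1 + \frac{2 N - 4}{3} , \quad \text{so for } N = 500 : \; i = 1 + \frac{996}{3} = 333 \approx \frac{2 N}{3} .

For $i < 333$ we have $ρ_{i} > 1$ (drift to the right) and for $i > 333$ we have $ρ_{i} < 1$ (drift to the left), which is consistent with the QSD concentrating around this index.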

Figure 3. The QSD for the M/M/1/500 queue with $ρ_{i} \equiv 1.25$ (**left**) and $ρ_{i} = 2 - \frac{3}{2 N - 4} (i - 1)$ (**right**).

In Figure 4, we consider the case when $ρ_{i} = 1.25$ and compute the $L_{2}$ errors. We set the initial values $θ_{0}^{i} = - 35 + \frac{35}{498} (i - 1)$ for $i \in {1, 2, \dots, 498}$ and $θ_{0}^{499} = 3$ , $ψ_{0} = [0, 0, \dots, 0]$ , $r_{0} = 0$ , the learning rates $η_{n}^{θ} = 0.0003, η_{n}^{ψ} = 0.0001, η_{n}^{r} = 0.0001$ and the batch size to 64. The step size for the Projection Algorithm is $ϵ_{n} = n^{- 0.95}$ . Figure 5 plots the errors for the state-dependent $ρ_{i} = 2 - \frac{3}{2 N - 4} (i - 1)$ . We set the initial values $θ_{0}^{i} = 8 + \frac{35}{250} (i - 1)$ for $i \in {1, 2, \dots, 250}$ , $θ_{0}^{251} = 44$ , $θ_{0}^{i} = 43$ for $i \in {252, \dots, 305}$ , $θ_{0}^{306} = 48$ , $θ_{0}^{307} = 42$ and $θ_{0}^{i} = 43 - \frac{38}{293} (i - 1)$ for $i \in {308, 309, \dots, 499}$ , $ψ_{0} = [0, 0, \dots, 0], r_{0} = 0$ , the learning rates $η_{n}^{θ} = 0.0002$ , $η_{n}^{ψ} = 0.0001, η_{n}^{r} = 0.0001$ and the batch size to 128. The step size for the Projection Algorithm is $ϵ_{n} = n^{- 0.95}$ . Both figures demonstrate that the actor-critic algorithm performs quite well on this example.

Figure 4. The M/M/1/500 queue with $ρ_{i} = 1.25$ . The figure shows the log–log plots of the $L_{2}$ -norm error of the Vanilla Algorithm (a), Projection Algorithm (b), Polyak Averaging Algorithm (c) and our actor-critic algorithm (d).

Figure 5. The M/M/1/500 queue with $ρ_{i} = 2 - \frac{3}{2 N - 4} (i - 1)$ . The figure shows the log–log plots of the $L_{2}$ -norm error of the Vanilla Algorithm (a), Projection Algorithm (b), Polyak Averaging Algorithm (c) and our actor-critic algorithm (d).

In Table 1, we compare the CPU time that each algorithm needs to reach an accuracy of $2 \times 10^{- 1}$ for the M/M/1/500 queue. We find that our algorithm takes less time on this example.

Table 1.

The CPU time that each algorithm needs to reach an accuracy of $2 \times 10^{- 1}$ for the M/M/1/500 queue.

Algorithm	Vanilla	Projection	Polyak Averaging	ac_ $α$
Time (s)	1038.3279	429.6304	505.2299	186.9280
Time (s)	753.9503	259.0671	268.5476	251.5370


5. Summary and Conclusions

In this paper, we propose a reinforcement learning (RL) method for computing the quasi-stationary distribution (QSD) of discrete-time finite-state Markov chains. By minimizing the KL-divergence of two Markovian path distributions induced by the candidate distribution and the true target distribution, we introduce a formulation in terms of RL and derive the corresponding policy gradient theorem. We devise an actor-critic algorithm to learn the QSD in its parameterized form $α_{θ}$ . This RL formulation can benefit from developments in RL methods and optimization theory. We illustrated our actor-critic method on two numerical examples by using a simple tabular parametrization and gradient descent optimization. We observed that the advantage of our method is more prominent for large-scale problems.

We only demonstrate the preliminary mechanism of the idea here, and there is much room left for improving the efficiency and for extensions in future work. The generalization from the finite-state Markov chain considered here to the jump Markov process and the diffusion case is under consideration. More importantly, for very large or high-dimensional state spaces, modern function approximation methods such as kernel methods or neural networks should be used for the distribution $α_{θ}$ and the value function $V_{ψ}$ . The recent tremendous advancement of optimization techniques for policy gradient in reinforcement learning could also contribute much to improving the efficiency of our current formulation.

Acknowledgments

L.L. acknowledges the support of NSFC 11871486. X.Z. acknowledges the support of Hong Kong RGC GRF 11305318.

Appendix A

In this appendix, we discuss the computation of the gradients $\nabla_{θ} ln K_{α_{θ}}$ and $\nabla_{θ} ln K_{β_{θ}}$ . Note that $\nabla_{θ} α_{θ}$ is straightforward to compute since we model $α$ in its parametrized form with parameter $θ$ . By definition (4), we have the following:

\nabla_{θ} ln K_{α_{θ}} (X_{t}, X_{t + 1}) = \frac{1 - K (X_{t}, E)}{K_{α_{θ}} (X_{t}, X_{t + 1})} \nabla_{θ} α_{θ} (X_{t + 1})

and the following as well:

\nabla_{θ} ln K_{β_{θ}} (X_{t}, X_{t + 1}) = \frac{1 - K (X_{t}, E)}{K_{β_{θ}} (X_{t}, X_{t + 1})} \nabla_{θ} β_{θ} (X_{t + 1}) ,

where $K_{β} (x, y) = K (x, y) + (1 - K (x, E)) β (y)$ . The vector $K (x, E)$ for any x can be pre-computed and saved in tabular form.
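The first formula can be sketched and checked against finite differences as follows. This is a hedged illustration that assumes the softmax parameterization of $α_{θ}$ from Section 4 and uses the loopy chain as test data; the names `softmax_last_zero` and `grad_log_K_alpha` and the 0-based state indices are hypothetical.

```python
import numpy as np


def softmax_last_zero(theta):
    """alpha_theta on N states from theta in R^{N-1}, with the last logit fixed to 0."""
    logits = np.append(np.asarray(theta, dtype=float), 0.0)
    w = np.exp(logits - logits.max())
    return w / w.sum()


def grad_log_K_alpha(K, theta, x, y):
    """(1 - K(x, E)) / K_alpha(x, y) * grad_theta alpha_theta(y); x, y are 0-based."""
    a = softmax_last_zero(theta)
    exit_prob = 1.0 - K[x].sum()                   # 1 - K(x, E); can be tabulated once
    K_alpha_xy = K[x, y] + exit_prob * a[y]
    n = a.size
    grad_alpha_y = a[y] * (np.eye(n)[: n - 1, y] - a[: n - 1])
    return exit_prob / K_alpha_xy * grad_alpha_y


eps = 0.1
K = np.full((3, 3), (1.0 - eps) / 3.0)             # loopy chain on E = {1, 2, 3}
theta = np.array([-1.0, 1.0])
g = grad_log_K_alpha(K, theta, x=0, y=1)
print(g)
```

The same routine applies to $K_{β_{θ}}$ once $α_{θ}$ and its gradient are replaced by $β_{θ}$ and $\nabla_{θ} β_{θ}$ .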

By (19), the one-step distribution $β$ is computed below.

β (X_{t + 1}) = \sum_{x} α (x) [K (x, X_{t + 1}) + (1 - K (x, E)) α (X_{t + 1})] \approx \frac{1}{n} \sum_{i = 1}^{n} [K (Z_{i}, X_{t + 1}) + (1 - K (Z_{i}, E)) α (X_{t + 1})]

Here, the samples $Z_{i} \sim α$ could be approximated by stationary distribution $μ$ ; thus, one may simply use the known sample $X_{t}$ to replace $Z_{i}$ with $n = 1$ .
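For a small state space, the exact sum and the single-sample estimate of $β$ can be compared directly. The sketch below is only illustrative: the function names `beta_exact` and `beta_one_sample` are assumptions, states are 0-based, and the loopy chain of Section 4.1 serves as test data.

```python
import numpy as np


def beta_exact(K, alpha):
    """beta(y) = sum_x alpha(x) [K(x, y) + (1 - K(x, E)) alpha(y)] for all y."""
    exit_prob = 1.0 - K.sum(axis=1)
    return alpha @ K + (alpha @ exit_prob) * alpha


def beta_one_sample(K, alpha, z, y):
    """Estimate of beta(y) from a single state z (the case n = 1)."""
    return K[z, y] + (1.0 - K[z].sum()) * alpha[y]


eps = 0.1
K = np.full((3, 3), (1.0 - eps) / 3.0)     # loopy chain on E = {1, 2, 3}
uniform = np.full(3, 1.0 / 3.0)
print(beta_exact(K, uniform))              # the uniform QSD is mapped to itself
```

Averaging `beta_one_sample` over $z \sim α$ recovers `beta_exact`; replacing $Z_{i}$ by the current sample $X_{t}$ , as suggested above, trades this exactness for lower cost.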

To find $\nabla_{θ} β_{θ}$ , we use stochastic approximation again.

\begin{matrix} \nabla_{θ} β_{θ} (X_{t + 1}) & = \sum_{x} \nabla_{θ} α_{θ} (x) [K (x, X_{t + 1}) + (1 - K (x, E)) α_{θ} (X_{t + 1})] + [\sum_{x} α_{θ} (x) (1 - K (x, E))] \nabla_{θ} α_{θ} (X_{t + 1}) \\ & \approx \frac{1}{n} \sum_{i = 1}^{n} \{\nabla_{θ} ln α_{θ} (Z_{i}) [K (Z_{i}, X_{t + 1}) + (1 - K (Z_{i}, E)) α_{θ} (X_{t + 1})] + (1 - K (Z_{i}, E)) \nabla_{θ} α_{θ} (X_{t + 1})\} . \end{matrix}

Author Contributions

Conceptualization, Z.C. and L.L.; Investigation, Z.C.; Methodology, Z.C. and X.Z.; Software, Z.C.; Supervision, X.Z.; Writing—original draft, Z.C.; Writing—review & editing, L.L. and X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Government of Hong Kong, Grant Number 11305318; NSFC Grant Number 11871486.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflict of interest.

Footnotes

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Collet P., Martínez S., Martín J.S. Quasi-Stationary Distributions: Markov Chains, Diffusions and Dynamical Systems. Springer Science & Business Media; Cham, Switzerland: 2012.
2. Buckley F., Pollett P. Analytical methods for a stochastic mainland–island metapopulation model. Ecol. Model. 2010;221:2526–2530. doi: 10.1016/j.ecolmodel.2010.02.017.
3. Lambert A. Population dynamics and random genealogies. Stoch. Model. 2008;24:45–163. doi: 10.1080/15326340802437728.
4. De Oliveira M.M., Dickman R. Quasi-stationary distributions for models of heterogeneous catalysis. Phys. Stat. Mech. Appl. 2004;343:525–542. doi: 10.1016/j.physa.2004.06.155.
5. Dykman M.I., Horita T., Ross J. Statistical distribution and stochastic resonance in a periodically driven chemical system. J. Chem. Phys. 1995;103:966–972. doi: 10.1063/1.469796.
6. Artalejo J.R., Economou A., Lopez-Herrero M.J. Stochastic epidemic models with random environment: Quasi-stationarity, extinction and final size. J. Math. Biol. 2013;67:799–831. doi: 10.1007/s00285-012-0570-5.
7. Clancy D., Mendy S.T. Approximating the quasi-stationary distribution of the sis model for endemic infection. Methodol. Comput. Appl. Probab. 2011;13:603–618. doi: 10.1007/s11009-010-9177-8.
8. Sani A., Kroese D., Pollett P. Stochastic models for the spread of hiv in a mobile heterosexual population. Math. Biosci. 2007;208:98–124. doi: 10.1016/j.mbs.2006.09.024.
9. Chan D.C., Pollett P.K., Weinstein M.C. Quantitative risk stratification in markov chains with limiting conditional distributions. Med. Decis. Mak. 2009;29:532–540. doi: 10.1177/0272989X08330121.
10. Berglund N., Landon D. Mixed-mode oscillations and interspike interval statistics in the stochastic fitzhugh–nagumo model. Nonlinearity. 2012;25:2303. doi: 10.1088/0951-7715/25/8/2303.
11. Landon D. Ph.D. Thesis. Université d’Orléans; Orléans, France: 2012. Perturbation et Excitabilité Dans des Modeles Stochastiques de Transmission de l’Influx Nerveux.
12. Gesù G.D., Lelièvre T., Peutrec D.L., Nectoux B. Jump markov models and transition state theory: The quasi-stationary distribution approach. Faraday Discuss. 2017;195:469–495. doi: 10.1039/C6FD00120C.
13. Lelièvre T., Nier F. Low temperature asymptotics for quasistationary distributions in a bounded domain. Anal. PDE. 2015;8:561–628. doi: 10.2140/apde.2015.8.561.
14. Pollock M., Fearnhead P., Johansen A.M., Roberts G.O. The scalable langevin exact algorithm: Bayesian inference for big data. arXiv. 2016. arXiv:1609.03436.
15. Wang A.Q., Roberts G.O., Steinsaltz D. An approximation scheme for quasi-stationary distributions of killed diffusions. Stoch. Process. Appl. 2020;130:3193–3219. doi: 10.1016/j.spa.2019.09.010.
16. Watkins D.S. Fundamentals of Matrix Computations. Volume 64. John Wiley & Sons; Hoboken, NJ, USA: 2004.
17. Bebbington M. Parallel implementation of an aggregation/disaggregation method for evaluating quasi-stationary behavior in continuous-time markov chains. Parallel Comput. 1997;23:1545–1559. doi: 10.1016/S0167-8191(97)89286-1.
18. Pollett P., Stewart D. An efficient procedure for computing quasi-stationary distributions of markov chains by sparse transition structure. Adv. Appl. Probab. 1994;26:68–79. doi: 10.2307/1427580.
19. Martinez S., Martin J.S. Quasi-stationary distributions for a brownian motion with drift and associated limit laws. J. Appl. Probab. 1994;31:911–920. doi: 10.2307/3215316.
20. Aldous D., Flannery B., Palacios J.L. Two applications of urn processes the fringe analysis of search trees and the simulation of quasi-stationary distributions of markov chains. Probab. Eng. Inform. Sci. 1988;2:293–307. doi: 10.1017/S026996480000084X.
21. Benaïm M., Cloez B. A stochastic approximation approach to quasi-stationary distributions on finite spaces. Electron. Commun. Probab. 2015;20:1–13. doi: 10.1214/ECP.v20-3956.
22. De Oliveira M.M., Dickman R. How to simulate the quasistationary state. Phys. Rev. E. 2005;71:016129. doi: 10.1103/PhysRevE.71.016129.
23. Blanchet J., Glynn P., Zheng S. Analysis of a stochastic approximation algorithm for computing quasi-stationary distributions. Adv. Appl. Probab. 2016;48:792–811. doi: 10.1017/apr.2016.28.
24. Zheng S. Ph.D. Thesis. Columbia University; New York, NY, USA: 2014. Stochastic Approximation Algorithms in the Estimation of Quasi-Stationary Distribution of Finite and General State Space Markov Chains.
25. Kushner H., Yin G.G. Stochastic Approximation and Recursive Algorithms and Applications. Volume 35. Springer Science & Business Media; Cham, Switzerland: 2003.
26. Polyak B.T., Juditsky A.B. Acceleration of stochastic approximation by averaging. SIAM J. Control. Optim. 1992;30:838–855. doi: 10.1137/0330046.
27. Blei D.M., Kucukelbir A., McAuliffe J.D. Variational inference: A review for statisticians. J. Am. Stat. Assoc. 2017;112:859–877. doi: 10.1080/01621459.2017.1285773.
28. Jordan M.I., Ghahramani Z., Jaakkola T.S., Saul L.K. An Introduction to Variational Methods for Graphical Models. Mach. Learn. 1999;37:183–233. doi: 10.1023/A:1007665907178.
29. Liu Q., Wang D. Stein variational gradient descent: A general purpose bayesian inference algorithm. In: Lee D., Sugiyama M., Luxburg U., Guyon I., Garnett R., editors. Advances in Neural Information Processing Systems. Volume 29. Curran Associates, Inc.; Red Hook, NY, USA: 2016.
30. Rezende D., Mohamed S. Variational inference with normalizing flows. In: Bach F., Blei D., editors. Proceedings of the 32nd International Conference on Machine Learning; Lille, France. 7–9 July 2015; pp. 1530–1538.
31. Sutton R.S., Barto A.G. Reinforcement Learning: An Introduction. MIT Press; Cambridge, MA, USA: 2018.
32. Mnih V., Kavukcuoglu K., Silver D., Graves A., Antonoglou I., Wierstra D., Riedmiller M. Playing atari with deep reinforcement learning. arXiv. 2013. arXiv:1312.5602.
33. Popova M., Isayev O., Tropsha A. Deep reinforcement learning for de novo drug design. Sci. Adv. 2018;4:eaap7885. doi: 10.1126/sciadv.aap7885.
34. Silver D., Hubert T., Schrittwieser J., Antonoglou I., Lai M., Guez A., Lanctot M., Sifre L., Kumaran D., Graepel T., et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science. 2018;362:1140–1144. doi: 10.1126/science.aar6404.
35. Rose D.C., Mair J.F., Garrahan J.P. A reinforcement learning approach to rare trajectory sampling. New J. Phys. 2021;23:013013. doi: 10.1088/1367-2630/abd7bd.
36. Méléard S., Villemonais D. Quasi-stationary distributions and population processes. Probab. Surv. 2012;9:340–410. doi: 10.1214/11-PS191.
37. Blanchet J., Glynn P., Zheng S. Empirical analysis of a stochastic approximation approach for computing quasi-stationary distributions. In: Schütze O., Coello C.A.C., Tantar A.-A., Tantar E., Bouvry P., Moral P.D., Legrand P., editors. EVOLVE - A Bridge between Probability, Set Oriented Numerics, and Evolutionary Computation II. Springer; Berlin/Heidelberg, Germany: 2013. pp. 19–37.
38. Boyd S.P., Vandenberghe L. Convex Optimization. Cambridge University Press; Cambridge, UK: 2004.
39. Wang W., Carreira-Perpinán M.A. Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application. arXiv. 2013. arXiv:1309.1541.

