Entropy. 2022 Jan 17;24(1):133. doi: 10.3390/e24010133

Learn Quasi-Stationary Distributions of Finite State Markov Chain

Zhiqiang Cai 1,*, Ling Lin 2, Xiang Zhou 1,3
Editors: Michael Dellnitz, Carsten Hartmann, Feliks Nüske
PMCID: PMC8774945  PMID: 35052159

Abstract

We propose a reinforcement learning (RL) approach to compute the expression of quasi-stationary distribution. Based on the fixed-point formulation of quasi-stationary distribution, we minimize the KL-divergence of two Markovian path distributions induced by candidate distribution and true target distribution. To solve this challenging minimization problem by gradient descent, we apply a reinforcement learning technique by introducing the reward and value functions. We derive the corresponding policy gradient theorem and design an actor-critic algorithm to learn the optimal solution and the value function. The numerical examples of finite state Markov chain are tested to demonstrate the new method.

Keywords: quasi-stationary distribution, reinforcement learning, KL-divergence, actor-critic algorithm

1. Introduction

The quasi-stationary distribution (QSD) describes the long-time statistical behavior of a stochastic process that will surely be killed, when this process is conditioned to survive [1]. This concept has been widely used in applications, such as in biology and ecology [2,3], chemical kinetics [4,5], epidemics [6,7,8], medicine [9] and neuroscience [10,11]. Many works on rare events in metastable systems also focus on this quasi-stationary distribution [12,13]. In addition, some new Monte Carlo sampling methods, for instance, the Quasi-stationary Monte Carlo method [14,15], also arise by using the QSD instead of the true stationary distribution.

We are interested in the numerical computation of QSD and focus on the finite state Markov chain in this paper. Mathematically, the quasi-stationary distribution can be solved as the principal left eigenvector of a sub-Markovian transition matrix. Thus, traditional numerical algebra methods can be applied to solve the quasi-stationary distribution in finite state space, for example, the power method [16], the multi-grid method [17] and Arnoldi’s algorithm [18]. These eigenvector methods can produce a stochastic vector for QSD instead of generating samples of QSD.
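As a small illustration of the eigenvector viewpoint, the following sketch applies a normalized power iteration to a sub-Markovian matrix. The 2-by-2 matrix `K` and the function name `qsd_power_method` are illustrative assumptions and are not taken from the paper or from [16].

```python
def qsd_power_method(K, iters=1000, tol=1e-13):
    """Approximate the principal left eigenvector of a sub-Markovian matrix K.

    K is a list of rows; the returned vector is normalized to sum to one.
    """
    m = len(K)
    alpha = [1.0 / m] * m
    for _ in range(iters):
        # Left multiplication: (alpha K)(y) = sum_x alpha(x) K(x, y).
        new = [sum(alpha[x] * K[x][y] for x in range(m)) for y in range(m)]
        survival = sum(new)  # alpha K 1, the one-step survival probability
        new = [v / survival for v in new]
        if max(abs(a - b) for a, b in zip(alpha, new)) < tol:
            return new
        alpha = new
    return alpha


# Hypothetical two-state kernel: each row sums to 0.8, so 0.2 is the killing probability.
K = [[0.5, 0.3], [0.2, 0.6]]
alpha = qsd_power_method(K)
```

For this particular matrix the eigenvalues are 0.8 and 0.3, and the normalized left eigenvector for 0.8 is (0.4, 0.6), which is what the iteration approaches.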

In search of efficient algorithms for large state spaces, stochastic approaches are favored, either for sampling the QSD or for computing the expression of the QSD, and these methods can be applied or extended easily to continuous state spaces. A popular approach for sampling the quasi-stationary distribution is the Fleming–Viot stochastic method [19]. The Fleming–Viot method first simulates $N$ particles independently. When any one of the particles falls into the absorbing state and is killed, a new particle is uniformly selected from the remaining $N-1$ surviving particles to replace the dead one, and the simulation continues. When time and $N$ tend to infinity, the particles’ empirical distribution converges to the quasi-stationary distribution.

In [20,21,22], the authors proposed to recursively update the expression of the QSD at each iteration based on the empirical distribution of a single-particle simulation. It is shown in [21] that the convergence rate can be $O(n^{-1/2})$, where $n$ is the iteration number. This method was later improved in [23,24] by applying the stochastic approximation method [25] and the Polyak–Ruppert averaging technique [26]. These improved algorithms allow a flexible choice of step size but require a projection operator onto the probability simplex, which carries extra computational overhead that increases with the number of states. Ref. [15] extended the algorithm to the diffusion process.

In this paper, we focus on how to compute the expression of the quasi-stationary distribution, which is denoted by $\alpha(x)$ on a metric space $E$. If $E$ is finite, $\alpha$ is a probability vector, and if $E$ is a domain in $\mathbb{R}^d$, then $\alpha$ is a probability density function on $E$. We assume $\alpha$ can be numerically represented in the parametric form $\alpha_\theta$ with $\theta\in\Theta$. This family $\{\alpha_\theta\}$ can be in tabular form or any neural network. Then, the problem of finding the QSD $\alpha$ becomes the question of how to compute the optimal parameter $\theta$ in $\Theta$. We call this problem the learning problem for the QSD. In addition, we want to learn the QSD directly rather than use the distribution family $\{\alpha_\theta\}$ to fit simulated samples generated by other traditional simulation methods.

Our minimization problem for the QSD is similar to variational inference (VI) [27], which minimizes an objective functional measuring the distance between the target and candidate distributions. However, unlike the mainstream VI methods such as the evidence lower bound (ELBO) technique [28] or particle-based [29] and flow-based [30] methods, our approach is based on recent important progress in reinforcement learning (RL) [31], particularly the policy gradient method and the actor-critic algorithm. We first regard the learning process of the quasi-stationary distribution as the interaction with an environment, which is constructed from the property of the QSD. Reinforcement learning has recently shown tremendous advancements and remarkable successes in applications (e.g., [32,33,34]). The RL framework provides an innovative and powerful modeling and computation approach for many scientific computing problems.

The essential question is how to formulate the QSD problem as an RL problem. Firstly, for the sub-Markovian kernel $K$ of a Markov process, we can define a Markovian kernel $K_\alpha$ on $E$ (see Definition 1), and then the QSD is defined by the equation $\alpha=\alpha K_\alpha$, which says that $\alpha$ is both the initial distribution and the distribution after one step. Secondly, we consider an optimal $\alpha$ (in our parametric family of distributions) to minimize the Kullback–Leibler divergence (i.e., relative entropy) of two path distributions, denoted by $\mathbb{P}$ and $\mathbb{Q}$, associated with two Markovian kernels $K_\alpha$ and $K_\beta$ where $\beta:=\alpha K_\alpha$. Thirdly, inspired by the recent work [35] on using RL for rare events sampling problems, we transform the minimization of the KL divergence between $\mathbb{P}$ and $\mathbb{Q}$ into the maximization of a time-averaged reward function and define the corresponding value function $V(x)$ at each state $x$. This completes our modeling of RL for the quasi-stationary distribution problem. Lastly, we derive the policy gradient theorem (Theorem 1) to compute the gradient of the averaged reward with respect to $\theta$, which drives the learning dynamics. This is known as the “actor” part. The “critic” part is to learn the value function $V$ in its parametric form $V_\psi$. The actor-critic algorithm uses stochastic gradient descent to train the parameter $\theta$ for the action $\alpha_\theta$ and the parameter $\psi$ for the value function $V_\psi$ (see Algorithm 1).

Our contribution is that we are the first to devise a method that transforms the QSD problem into an RL problem. Similar to [35], our paper also uses the KL-divergence to define the RL problem. However, our paper fully exploits the unique property of the QSD, namely the fixed-point problem $\alpha=\alpha K_\alpha$, to define the RL problem.

Our learning method allows a flexible parametrization of the distributions and uses the stochastic gradient method to train the optimal distribution. The optimization is easy to implement and can scale up to large state spaces. The numerical examples we tested show that our method converges faster than other existing methods [22,23].

Finally, we remark that our method works very well for the QSD of a strictly sub-Markovian kernel $K$ but is not applicable to computing the invariant distribution when $K$ is Markovian. This is because we transform the problem into a variational problem between two Markovian kernels $K_\alpha$ and $K_\beta$ (where $\beta=\alpha K_\alpha$). Note that $K_\alpha(x,y)=K(x,y)+(1-K(x,E))\alpha(y)$ (Definition 1), and our method is based on the fact that $\alpha=\beta$ if and only if $K_\alpha=K_\beta$. If $K$ is a Markovian kernel, then $K_\alpha\equiv K$ for any $\alpha$, and our method cannot work. Thus, $K(x,E)$ has to be strictly less than 1 for some $x\in E$.

This paper is organized as follows. Section 2 is a short review of the quasi-stationary distribution and some basic simulation methods for the QSD. In Section 3, we first formulate the reinforcement learning problem by KL-divergence and derive the policy gradient theorem (Theorem 1). Using this formulation, we then develop the actor-critic algorithm to estimate the quasi-stationary distribution. In Section 4, the efficiency of our algorithm is illustrated by two examples, each in two parameter regimes, compared with the simulation methods in [24].

Algorithm 1: (ac-α method) Actor-critic algorithm for the quasi-stationary distribution $\alpha_\theta$.

2. Problem Setup and Review

2.1. Quasi-Stationary Distribution

We start with an abstract setting. Let $E$ be a finite state space equipped with the Borel $\sigma$-field $\mathcal{B}(E)$, and let $\mathcal{P}(E)$ be the space of probability distributions over $E$. A sub-Markovian kernel on $E$ is defined as a map $K:E\times\mathcal{B}(E)\to[0,1]$ such that for all $x\in E$, $A\mapsto K(x,A)$ is a nonzero measure with $K(x,E)\le 1$, and for all $A\in\mathcal{B}(E)$, $x\mapsto K(x,A)$ is measurable. In particular, if $K(x,E)=1$ for all $x\in E$, then $K$ is called a Markovian kernel. Throughout the paper, we assume that $K$ is strictly sub-Markovian, i.e., $K(x,E)<1$ for some $x$.

Let $X_t$ be a Markov chain with values in $E\cup\{\partial\}$, where $\partial\notin E$ denotes an absorbing state. We define the extinction time

$$\tau:=\inf\{t>0: X_t=\partial\}.$$

We define the quasi-stationary distribution (QSD) $\alpha$ as the long-time limit of the conditional distribution, if there exists a probability distribution $\nu$ on $E$ such that the following holds:

$$\alpha(A):=\lim_{t\to\infty}P_\nu\bigl(X_t\in A\mid\tau>t\bigr),\quad A\in\mathcal{B}(E), \tag{1}$$

where $P_\nu$ refers to the probability distribution of $X_t$ associated with the initial distribution $\nu$ on $E$. Such a conditional distribution describes well the behavior of the process before extinction, and it is easy to see that $\alpha$ satisfies the following fixed-point problem:

$$P_\alpha\bigl(X_t\in A\mid\tau>t\bigr)=\alpha(A), \tag{2}$$

where $P_\alpha$ refers to the probability distribution of $X_t$ associated with the initial distribution $\alpha$ on $E$. Equation (2) is equivalent to the following stationarity condition:

$$\alpha=\frac{\alpha K}{\alpha K\mathbf{1}},\quad\text{or}\quad\alpha(y)=\frac{\sum_x\alpha(x)K(x,y)}{\sum_x\alpha(x)K(x,E)}, \tag{3}$$

where $\alpha$ is a row vector, $\mathbf{1}$ denotes the column vector with all entries equal to one, and

$$K(x,E)=\sum_{x'\in E}K(x,x').$$

For any sub-Markovian kernel $K$, we can associate $K$ with a Markovian kernel $\tilde K$ on $E\cup\{\partial\}$ defined by the following:

$$\tilde K(x,A)=K(x,A),\qquad\tilde K(x,\{\partial\})=1-K(x,E),\qquad\tilde K(\partial,\{\partial\})=1,$$

for all $x\in E$, $A\in\mathcal{B}(E)$. The kernel $\tilde K$ can be understood as the Markovian transition kernel of the Markov chain $(X_t)$ on $E\cup\{\partial\}$ whose transitions in $E$ are specified by $K$, but which is “killed” forever once it leaves $E$.

In this paper, we assume $E$ is a finite state space and the process under consideration has a unique QSD. If $K$ is irreducible, then the existence and uniqueness of the quasi-stationary distribution follow from the Perron–Frobenius theorem [36].

An important Markovian kernel is the following $K_\alpha$, which is defined on $E$ only and has a “regenerative probability” $\alpha$.

Definition 1.

For any given $\alpha\in\mathcal{P}(E)$ and a sub-Markovian kernel $K$ on $E$, we define $K_\alpha$, a Markovian kernel on $E$, as follows:

$$K_\alpha(x,A):=K(x,A)+\bigl(1-K(x,E)\bigr)\alpha(A) \tag{4}$$

for all $x\in E$ and $A\in\mathcal{B}(E)$.

$K_\alpha$ is a Markovian kernel because $K_\alpha(x,E)=1$. It is easy to sample $X_{t+1}\sim K_\alpha(X_t,\cdot)$ from any state $X_t\in E$: run the transition as normal by using $\tilde K$ to obtain a next state denoted by $Y$; then $X_{t+1}=Y$ if $Y\in E$; otherwise, sample $X_{t+1}$ from $\alpha$.
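The sampling rule just described can be sketched as follows, assuming that the states are labelled 0, …, m−1 and that $K$ is stored as a list of rows. The helper names `draw`, `sample_k_alpha` and `k_alpha_matrix` are hypothetical.

```python
import random


def draw(weights, rng):
    """Inverse-CDF draw of an index; returns None if the draw falls outside the total mass."""
    u, acc = rng.random(), 0.0
    for i, w in enumerate(weights):
        acc += w
        if u < acc:
            return i
    return None


def sample_k_alpha(K, alpha, x, rng):
    """One transition X_{t+1} ~ K_alpha(x, .)."""
    y = draw(K[x], rng)        # move by K; None means the chain was killed
    while y is None:
        y = draw(alpha, rng)   # regenerate from alpha (the retry guards against rounding)
    return y


def k_alpha_matrix(K, alpha):
    """Dense matrix K_alpha(x, y) = K(x, y) + (1 - K(x, E)) * alpha(y)."""
    m = len(K)
    return [[K[x][y] + (1.0 - sum(K[x])) * alpha[y] for y in range(m)]
            for x in range(m)]
```

The sampler never needs the dense matrix; `k_alpha_matrix` is included only to make the row-sum property $K_\alpha(x,E)=1$ easy to inspect on small examples.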

We know that $\alpha$ is the quasi-stationary distribution of $K$ if and only if it is the stationary distribution of $K_\alpha$:

$$\alpha=\alpha K_\alpha. \tag{5}$$

It is easy to observe that $\alpha=\beta$ if and only if $K_\alpha=K_\beta$ for any two distributions $\alpha$ and $\beta$. Moreover, for every $\alpha$, $K_\alpha$ has a unique invariant probability denoted by $\Gamma(\alpha)$. Then, $\alpha\mapsto\Gamma(\alpha)$ is continuous in $\mathcal{P}(E)$ (i.e., for the topology of weak convergence), and there exists $\alpha\in\mathcal{P}(E)$ such that $\alpha=\Gamma(\alpha)$ or, equivalently, $\alpha$ is a QSD for $K$.

2.2. Review of Simulation Methods for Quasi-Stationary Distribution

According to the above subsection, the QSD $\alpha$ satisfies the fixed-point problem as follows:

$$\alpha=\Gamma(\alpha), \tag{6}$$

where $\Gamma(\alpha)$ is the stationary distribution of $K_\alpha$ on $E$. In general, (6) can be solved recursively by $\alpha_{n+1}\leftarrow\Gamma(\alpha_n)$.

The Fleming–Viot (FV) method [19] evolves $N$ particles independently of each other as a Markov process associated with the transition kernel $\tilde K$ until one of them jumps to the absorbing state $\partial$. At that time, this killed particle is immediately reset to $E$, with its new state uniformly chosen from those of the remaining $N-1$ particles. The QSD $\alpha$ is approximated by the empirical distribution of the $N$ particles in total, and these particles can be regarded as samples from the quasi-stationary distribution $\alpha$, as in MCMC methods.
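A minimal discrete-time sketch of the Fleming–Viot idea is given below. It makes two simplifying assumptions that are not taken from [19]: all particles move synchronously, and a killed particle copies the state of a particle that survived the same step. The function names are hypothetical.

```python
import random


def step_or_kill(K, x, rng):
    """Move from x using the sub-Markovian row K[x]; return None if the particle is killed."""
    u, acc = rng.random(), 0.0
    for y, p in enumerate(K[x]):
        acc += p
        if u < acc:
            return y
    return None


def fleming_viot(K, n_particles=200, n_steps=500, seed=0):
    """Time-averaged empirical distribution of a synchronous Fleming-Viot particle system."""
    rng = random.Random(seed)
    m = len(K)
    states = [int(rng.random() * m) for _ in range(n_particles)]
    counts = [0] * m
    for _ in range(n_steps):
        moved = [step_or_kill(K, x, rng) for x in states]
        survivors = [y for y in moved if y is not None]
        if not survivors:
            continue  # degenerate case (assumption): everyone died, keep the old states
        # Each killed particle is reset to the state of a uniformly chosen survivor.
        states = [y if y is not None else survivors[int(rng.random() * len(survivors))]
                  for y in moved]
        for y in states:
            counts[y] += 1
    total = sum(counts)
    return [c / total for c in counts]


# Kernel with constant entries 0.3: the killing probability is 0.1 from every state.
K = [[0.3, 0.3, 0.3] for _ in range(3)]
estimate = fleming_viot(K)
```

Because every row of this `K` is the same, the QSD is uniform, and the estimate should be close to (1/3, 1/3, 1/3) up to Monte Carlo error.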

Ref. [37] proposed a simulation method that uses only one particle at each iteration to update $\alpha$. At iteration $n$, given an $\alpha_n\in\mathcal{P}(E)$, one can run a discrete-time Markov chain $X^{(n+1)}$ as normal on $E$ with initial $X_0^{(n+1)}\sim\alpha_n$; then, $\alpha_{n+1}$ is computed as the following weighted average of empirical distributions:

$$\alpha_{n+1}(x):=\alpha_n(x)+\frac{1}{n+1}\,\frac{\sum_{k=0}^{\tau^{(n+1)}-1}\Bigl(I\bigl\{X_k^{(n+1)}=x\mid X_0^{(n+1)}\sim\alpha_n\bigr\}-\alpha_n(x)\Bigr)}{\frac{1}{n+1}\sum_{j=1}^{n+1}\tau^{(j)}}, \tag{7}$$

where $n\ge 0$, $I$ is the indicator function, and $\tau^{(j)}=\min\{k\ge 0: X_k^{(j)}=\partial\}$ is the first extinction time of the process $X^{(j)}$. This iterative scheme has a convergence rate of $O(1/\sqrt{n})$.

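In [23,24], the above method is extended to the stochastic approximation framework:

$$\alpha_{n+1}(x)=\Theta_H\Bigl[\alpha_n+\epsilon_n\sum_{k=0}^{\tau^{(n+1)}-1}\Bigl(I\bigl\{X_k^{(n+1)}=x\mid X_0^{(n+1)}\sim\alpha_n\bigr\}-\alpha_n(x)\Bigr)\Bigr], \tag{8}$$

where $\Theta_H$ denotes the $L_2$ projection onto the probability simplex, and $\epsilon_n$ is the step size satisfying $\sum_n\epsilon_n=\infty$ and $\sum_n\epsilon_n^2<\infty$. Specifically, if $\epsilon_n=O(1/n^r)$ for $0.5<r<1$, then under a sufficient condition, $\sqrt{n^r}\,(\alpha_n-\alpha)\xrightarrow{d}N(0,V)$ for some matrix $V$ [23,24]. If the Polyak–Ruppert averaging technique is applied to generate the following:

$$\nu_n:=\frac{1}{n}\sum_{k=1}^{n}\alpha_k, \tag{9}$$

then the convergence rate of $\nu_n\to\alpha$ becomes $1/\sqrt{n}$ [23,24].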

The simulation schemes (7) and (8) need to sample the initial states according to $\alpha_n$ and to add the empirical distribution and $\alpha_n$ pointwise at each $x$. Thus, they are suitable for a finite state space where $\alpha$ is a probability vector stored in tabular form. In (8), there is no need to record all exit times $\tau^{(j)}$, $j=1,\ldots,n$, but the additional projection operation in (8) is computationally expensive since the cost is $O(m\log m)$ where $m=|E|$ [38,39].
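To make the $O(m\log m)$ cost concrete, here is a sketch of a standard sort-based Euclidean projection onto the probability simplex. It is a commonly used algorithm whose cost is dominated by the sort; whether it coincides in detail with the variants used in [38,39] is not checked here, and the function name is an assumption.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {p : p >= 0, sum(p) = 1}.

    Sort-based algorithm: find the largest j such that
    u_j - (u_1 + ... + u_j - 1) / j > 0 for the sorted values u_1 >= u_2 >= ...,
    then shift all entries by the corresponding threshold and clip at zero.
    """
    u = sorted(v, reverse=True)      # the O(m log m) step
    cumsum, tau = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumsum += uj
        t = (cumsum - 1.0) / j
        if uj - t > 0:
            tau = t                  # keeps the threshold of the largest valid j
    return [max(x - tau, 0.0) for x in v]


p = project_to_simplex([1.2, 0.4, -0.3])
```

For the input above the projection is (0.9, 0.1, 0.0): the negative entry is clipped and the remaining mass is shifted uniformly.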

3. Learn Quasi-Stationary Distribution

We focus on the computation of the expression of the quasi-stationary distribution. In particular, when this distribution is parametrized in a certain manner by $\theta$, we can extend the tabular form for a finite-state Markov chain to any flexible form, even to neural networks for a probability density function in $\mathbb{R}^d$. However, we do not pursue this representation and expressivity issue here and restrict our discussion to finite state spaces only, in order to illustrate our main idea first. In a finite state space, $\alpha(x)$ for $x\in E=\{1,\ldots,m\}$ can simply be described as a softmax function with $m-1$ parameters $\theta_i$: $\alpha(i)\propto e^{\theta_i}$, $1\le i\le m-1$ ($\theta_m=0$). This introduces no representation error. For the generalization to a continuous space $E$ in jump and diffusion processes, or even for a huge finite state space, a good representation of $\alpha_\theta(x)$ is important in practice.

In this section, we shall formulate our QSD problem in terms of reinforcement learning (RL) so that the problem of seeking optimal parameters becomes a policy optimization problem. We derive the policy gradient theorem to construct a gradient descent method for the optimal parameter. We then show a method for designing actor-critic algorithms based on stochastic optimization.

3.1. Formulation of RL and Policy Gradient Theorem

Before introducing the RL method of our QSD problem, we develop a general formulation by introducing the KL-divergence between two path distributions.

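Let $P_\theta$ and $Q_\theta$ be two families of Markovian kernels on $E$ in parametric forms with the same set of parameters $\theta\in\Theta$. Assume both $P_\theta$ and $Q_\theta$ are ergodic for any $\theta$. Let $T>0$ and denote a path up to time $T$ by $\omega_0^T=(X_0,X_1,\ldots,X_T)\in E^{T+1}$. Define the path distributions $\mathbb{P}_\theta$ and $\mathbb{Q}_\theta$ under the Markov chain kernels $P_\theta$ and $Q_\theta$, respectively:

$$\mathbb{P}_\theta(\omega_0^T):=\prod_{t=1}^{T}P_\theta(X_t\mid X_{t-1}),\qquad\mathbb{Q}_\theta(\omega_0^T):=\prod_{t=1}^{T}Q_\theta(X_t\mid X_{t-1}). \tag{10}$$

Define the KL divergence from $\mathbb{P}_\theta$ to $\mathbb{Q}_\theta$ on $E^{T+1}$:

$$D_{KL}(\mathbb{P}_\theta\,\|\,\mathbb{Q}_\theta):=\sum_{\omega_0^T}\mathbb{P}_\theta(\omega_0^T)\ln\frac{\mathbb{P}_\theta(\omega_0^T)}{\mathbb{Q}_\theta(\omega_0^T)}=-\mathbb{E}_{P_\theta}\Bigl[\sum_{t=1}^{T}R_\theta(X_{t-1},X_t)\Bigr], \tag{11}$$

where the expectation $\mathbb{E}_{P_\theta}$ is for the path $(X_0,X_1,\ldots,X_T)$ generated by the transition kernel $P_\theta$, and the following is called the (one-step) reward:

$$R_\theta(X_{t-1},X_t):=-\ln\frac{P_\theta(X_t\mid X_{t-1})}{Q_\theta(X_t\mid X_{t-1})}. \tag{12}$$

Define the average reward $r(\theta)$ as the time-averaged negative KL divergence in the limit $T\to\infty$:

$$r(\theta):=-\lim_{T\to\infty}\frac{1}{T}D_{KL}(\mathbb{P}_\theta\,\|\,\mathbb{Q}_\theta)=\lim_{T\to\infty}\frac{1}{T}\mathbb{E}_{P_\theta}\Bigl[\sum_{t=1}^{T}R_\theta(X_{t-1},X_t)\Bigr]. \tag{13}$$

Due to the ergodicity of $P_\theta$, $r(\theta)=\sum_{x_0,x_1}R_\theta(x_0,x_1)P_\theta(x_1\mid x_0)\mu_\theta(x_0)$, where $\mu_\theta$ is the invariant measure of $P_\theta$; hence $r(\theta)$ is independent of the initial state $X_0$. Obviously, $r(\theta)\le 0$ for any $\theta$.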

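Property 1.

The following are equivalent:

  • 1. $r(\theta)$ reaches its maximal value 0 at $\theta^*$;

  • 2. the path distributions satisfy $\mathbb{P}_{\theta^*}=\mathbb{Q}_{\theta^*}$ in $\mathcal{P}(E^{T+1})$ for any $T>0$;

  • 3. the kernels satisfy $P_{\theta^*}=Q_{\theta^*}$;

  • 4. $R_{\theta^*}\equiv 0$.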

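Proof.

We only need to show (1) $\Rightarrow$ (3). It is easy to see that

$$r(\theta)=-\sum_{x_0}D_{KL}\bigl(P_\theta(\cdot\mid x_0)\,\|\,Q_\theta(\cdot\mid x_0)\bigr)\mu_\theta(x_0).$$

If $r(\theta)=0$, since $\mu_\theta>0$, then

$$D_{KL}\bigl(P_\theta(\cdot\mid x_0)\,\|\,Q_\theta(\cdot\mid x_0)\bigr)=0\quad\forall x_0.$$

Thus, we have $P_\theta=Q_\theta$. □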

The above property establishes the relationship between the RL problem and QSD problem.

We present our main theoretical result below; it is the foundation of the algorithm developed later. This theorem can be regarded as one type of policy gradient theorem for the policy gradient method in reinforcement learning [31].

Define the following value function ([31] Chapter 13):

$$V(x):=\lim_{T\to\infty}\sum_{t=1}^{T}\mathbb{E}_{P_\theta}\bigl[R_\theta(X_{t-1},X_t)-r(\theta)\mid X_0=x\bigr]. \tag{14}$$

Certainly, V also depends on θ, although we do not write θ explicitly.

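Theorem 1 (policy gradient theorem).

We have the following two properties:

  • 1. At any $\theta$, for any $x\in E$, the following Bellman-type equation holds for the value function $V$ and the average reward $r(\theta)$:
    $$V(x)=\mathbb{E}_{Y\sim P_\theta(\cdot\mid x)}\bigl[V(Y)+R_\theta(x,Y)-r(\theta)\bigr]. \tag{15}$$
  • 2. The gradient of the average reward $r(\theta)$ is the following:
    $$\nabla_\theta r(\theta)=\mathbb{E}\bigl[\nabla_\theta\ln Q_\theta(Y\mid X)\bigr]+\mathbb{E}\Bigl[\bigl(V(Y)-V(X)+R_\theta(X,Y)-r(\theta)\bigr)\nabla_\theta\ln P_\theta(Y\mid X)\Bigr], \tag{16}$$

where the expectations are for the joint distribution $(X,Y)\sim\mu_\theta(x)P_\theta(y\mid x)$, and $\mu_\theta$ is the stationary measure of $P_\theta$.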

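Proof.

We shall prove the Bellman equation first and then use the Bellman equation to derive the gradient of the average reward $r(\theta)$. For any $x_0\in E$, by writing $\omega_0^T=(x_0,\ldots,x_T)$ and defining

$$\Delta R_\theta(\omega_0^T)=\sum_{t=1}^{T}\bigl(R_\theta(x_{t-1},x_t)-r(\theta)\bigr),$$

we have the following:

$$\begin{aligned}
V(x_0)&=\lim_{T\to\infty}\mathbb{E}_{P_\theta}\bigl[\Delta R_\theta(\omega_0^T)\mid X_0=x_0\bigr]\\
&=\lim_{T\to\infty}\sum_{x_2,\ldots,x_T}\sum_{x_1}\prod_{t=2}^{T}P_\theta(x_t\mid x_{t-1})\,P_\theta(x_1\mid x_0)\,\Delta R_\theta(\omega_0^T)\\
&=\lim_{T\to\infty}\sum_{x_1}P_\theta(x_1\mid x_0)\sum_{x_2,\ldots,x_T}\prod_{t=2}^{T}P_\theta(x_t\mid x_{t-1})\bigl[\Delta R_\theta(\omega_1^T)+\Delta R_\theta(\omega_0^1)\bigr]\\
&=\sum_{x_1}P_\theta(x_1\mid x_0)\lim_{T\to\infty}\sum_{x_2,\ldots,x_T}\prod_{t=2}^{T}P_\theta(x_t\mid x_{t-1})\bigl[\Delta R_\theta(\omega_1^T)+\Delta R_\theta(\omega_0^1)\bigr]\\
&=\sum_{x_1}P_\theta(x_1\mid x_0)\bigl[V(x_1)+R_\theta(x_0,x_1)-r(\theta)\bigr],
\end{aligned}\tag{17}$$

which proves (15); in other words, we have the following:

$$r(\theta)=\mathbb{E}_{Y\sim P_\theta(\cdot\mid x)}\bigl[V(Y)+R_\theta(x,Y)-V(x)\bigr],\quad\forall x\in E.$$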

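Next, we compute the gradient of $r(\theta)$. By the trivial equality

$$\sum_{x_1}P_\theta(x_1\mid x_0)\nabla_\theta\ln P_\theta(x_1\mid x_0)=\nabla_\theta\sum_{x_1}P_\theta(x_1\mid x_0)=0, \tag{18}$$

and the definition (12), we can write the gradient of $r(\theta)$ as follows:

$$\nabla_\theta r(\theta)=\sum_y\nabla_\theta P_\theta(y\mid x)\bigl[V(y)+R_\theta(x,y)-V(x)\bigr]+\sum_yP_\theta(y\mid x)\bigl[\nabla_\theta V(y)-\nabla_\theta V(x)+\nabla_\theta\ln Q_\theta(y\mid x)\bigr].$$

We keep the term $V(x)$ in the first sum, even though it makes no contribution there (in fact, adding any constant to $V(x)$ is also fine). Since this equation holds for every state $x$ on the right-hand side, we take the expectation with respect to $\mu_\theta$, the stationary distribution of $P_\theta$. Thus, we have the following:

$$\begin{aligned}
\nabla_\theta r(\theta)&=\sum_{x,y}\mu_\theta(x)\nabla_\theta P_\theta(y\mid x)\bigl[V(y)+R_\theta(x,y)-V(x)\bigr]+\sum_{x,y}\mu_\theta(x)P_\theta(y\mid x)\bigl[\nabla_\theta V(y)-\nabla_\theta V(x)+\nabla_\theta\ln Q_\theta(y\mid x)\bigr]\\
&=\sum_{x,y}\mu_\theta(x)\nabla_\theta P_\theta(y\mid x)\bigl[V(y)+R_\theta(x,y)-V(x)\bigr]+\sum_y\mu_\theta(y)\nabla_\theta V(y)-\sum_x\mu_\theta(x)\nabla_\theta V(x)+\sum_{x,y}\mu_\theta(x)P_\theta(y\mid x)\nabla_\theta\ln Q_\theta(y\mid x)\\
&=\sum_{x,y}\mu_\theta(x)P_\theta(y\mid x)\bigl[V(y)+R_\theta(x,y)-V(x)\bigr]\nabla_\theta\ln P_\theta(y\mid x)+\sum_{x,y}\mu_\theta(x)P_\theta(y\mid x)\nabla_\theta\ln Q_\theta(y\mid x).
\end{aligned}$$

In fact, we can add any constant $b$ (independent of $x$ and $y$) inside the square bracket of the last line without changing the equality, due to the following fact similar to (18): $\sum_{x,y}\mu_\theta(x)\nabla_\theta P_\theta(y\mid x)=\sum_x\mu_\theta(x)\nabla_\theta\sum_yP_\theta(y\mid x)=0$. Equation (16) is the special case $b=-r(\theta)$. □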

Remark 1.

As shown in the proof, (16) holds if r(θ) at the right-hand side is replaced by any constant number b. b=r(θ) is a good choice to reduce the variance since r(θ) can be regarded as the expectation of Rθ.

Remark 2.

If Pθ=Qθ, then the first term of (16) vanishes due to (18) and the second term of (16) vanishes due to (15).

Remark 3.

The name of “policy” here refers to the role of θ as the policy for decision makers to improve reward r(θ).

3.2. Learn QSD

Now, we discuss how to connect the QSD with the results in the previous subsection. In view of Equation (5), we introduce $\beta:=\alpha K_\alpha$ as the one-step distribution when starting from the initial distribution $\alpha$; in other words, we have the following:

$$\beta(y):=\sum_{x\in E}\alpha(x)K_\alpha(x,y),\quad\forall y. \tag{19}$$

By (5), $\alpha$ is a QSD if and only if $\beta=\alpha$. However, we do not directly compare these two distributions $\alpha$ and $\beta$. Instead, we consider the Markovian kernels they induce through (4): $K_\alpha$ and $K_\beta$. Our approach is to consider the KL divergence, similar to (11), between the two kernels $K_\alpha$ and $K_\beta$, since $\alpha=\beta$ if and only if $K_\alpha=K_\beta$. In this manner, one can view $K_\alpha$ and $K_\beta$ (note $\beta=\alpha K_\alpha$) as the two transition matrices $P_\theta$ and $Q_\theta$ of the previous section, in which the parameter $\theta$ is in fact the distribution $\alpha$.
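The role of the two kernels can be illustrated numerically. The sketch below evaluates the average reward, that is, minus the stationary average of the one-step KL divergence between $K_\alpha$ and $K_\beta$, for a small hypothetical matrix `K`; it vanishes at the QSD and is negative elsewhere. All helper names are assumptions.

```python
import math


def k_alpha_matrix(K, alpha):
    """K_alpha(x, y) = K(x, y) + (1 - K(x, E)) * alpha(y)."""
    m = len(K)
    return [[K[x][y] + (1.0 - sum(K[x])) * alpha[y] for y in range(m)]
            for x in range(m)]


def stationary(P, iters=500):
    """Stationary distribution of an ergodic stochastic matrix by power iteration."""
    m = len(P)
    mu = [1.0 / m] * m
    for _ in range(iters):
        mu = [sum(mu[x] * P[x][y] for x in range(m)) for y in range(m)]
    return mu


def average_reward(K, alpha):
    """r = -sum_x mu(x) KL(K_alpha(x, .) || K_beta(x, .)), with beta = alpha K_alpha."""
    m = len(K)
    P = k_alpha_matrix(K, alpha)
    beta = [sum(alpha[x] * P[x][y] for x in range(m)) for y in range(m)]
    Q = k_alpha_matrix(K, beta)
    mu = stationary(P)
    return -sum(mu[x] * P[x][y] * math.log(P[x][y] / Q[x][y])
                for x in range(m) for y in range(m) if P[x][y] > 0.0)


K = [[0.5, 0.3], [0.2, 0.6]]               # hypothetical kernel; its QSD is (0.4, 0.6)
r_at_qsd = average_reward(K, [0.4, 0.6])   # approximately 0
r_off_qsd = average_reward(K, [0.9, 0.1])  # strictly negative
```

For this example the off-QSD value is roughly −0.0055, which also shows how flat the objective can be near the optimum.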

To have a further representation of the distribution $\alpha$, which is a (probability mass) function on $E$, we propose a parametrized family for $\alpha$ in the form $\alpha_\theta$ where $\theta$ is a generic parameter. In the simplest case, $\alpha_\theta$ takes the so-called softmax form $\alpha_\theta(i)=\frac{e^{\theta_i}}{\sum_{j=1}^{N}e^{\theta_j}}$ if $E=\{1,\ldots,N\}$, for $\theta=(\theta_1,\ldots,\theta_{N-1},\theta_N\equiv 0)$. This parametrization represents $\alpha$ without any approximation error for a finite state space, and the effective space of $\theta$ is just $\mathbb{R}^{N-1}$. For certain problems, particularly with a large state space, if one has some prior knowledge about the structure of the function $\alpha$ on $E$, one might propose other parametric forms of $\alpha_\theta$ with the dimension of $\theta$ less than the cardinality $|E|$ to improve the efficiency, although this introduces an extra representation error.
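The softmax family and its Jacobian with respect to the $N-1$ free parameters can be written in a few lines. This is an illustrative sketch; the function names are assumptions, and the Jacobian formula is the standard one for a softmax with the last logit fixed at zero.

```python
import math


def softmax_alpha(theta):
    """alpha_theta on N states from N-1 free parameters; the last logit is fixed at 0."""
    logits = list(theta) + [0.0]
    c = max(logits)                      # subtract the max for numerical stability
    w = [math.exp(z - c) for z in logits]
    s = sum(w)
    return [v / s for v in w]


def grad_alpha(theta):
    """Jacobian J[i][k] = d alpha(i) / d theta_k = alpha(i) * (1{i == k} - alpha(k))."""
    a = softmax_alpha(theta)
    n = len(a)
    return [[a[i] * ((1.0 if i == k else 0.0) - a[k]) for k in range(n - 1)]
            for i in range(n)]


uniform = softmax_alpha([0.0, 0.0])   # three states, all logits equal
```

With all free parameters equal to zero the family returns the uniform distribution, which is the true QSD in the loopy example of Section 4.1.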

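For any given $\alpha_\theta\in\mathcal{P}(E)$, the corresponding Markovian kernel $K_{\alpha_\theta}$ is defined in (4) and $\beta_\theta=\alpha_\theta K_{\alpha_\theta}$ is defined by (19). $K_{\beta_\theta}$ is likewise defined by (4). To use the formulation in Section 3.1, we choose $P_\theta=K_{\alpha_\theta}$ and $Q_\theta=K_{\beta_\theta}$. Define the objective function as before:

$$r(\theta):=-\lim_{T\to\infty}\frac{1}{T}D_{KL}(\mathbb{P}_\theta\,\|\,\mathbb{Q}_\theta)=\lim_{T\to\infty}\frac{1}{T}\mathbb{E}_{P_\theta}\Bigl[\sum_{t=1}^{T}R_\theta(X_{t-1},X_t)\Bigr],$$

where the reward is

$$R_\theta(x,y)=-\ln\frac{K_{\alpha_\theta}(x,y)}{K_{\beta_\theta}(x,y)}.$$

The value function $V(x)$ is defined similarly. Theorem 1 now provides the expression of the following gradient:

$$\nabla_\theta r(\theta)=\mathbb{E}\Bigl[\bigl(R_\theta(X,Y)-r(\theta)+V(Y)-V(X)\bigr)\nabla_\theta\ln K_{\alpha_\theta}(X,Y)+\nabla_\theta\ln K_{\beta_\theta}(X,Y)\Bigr], \tag{20}$$

where $(X,Y)\sim\mu_\theta(x)K_{\alpha_\theta}(x,y)$ and $\mu_\theta$ is the stationary measure of $K_{\alpha_\theta}$.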
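As a sanity check of the gradient expression, the following sketch compares the right-hand side of (20) with a finite-difference derivative of $r(\theta)$ on a two-state example with a scalar parameter. The matrix `K`, the sigmoid parametrization and all function names are assumptions made for this illustration; the log-derivatives of the kernels are themselves taken by central differences to keep the code short.

```python
import math

K = [[0.5, 0.3], [0.2, 0.6]]            # hypothetical sub-Markovian matrix
KILL = [1.0 - sum(row) for row in K]    # 1 - K(x, E)


def kernels(theta):
    """P = K_alpha and Q = K_beta for alpha = (s, 1 - s), s = sigmoid(theta)."""
    s = 1.0 / (1.0 + math.exp(-theta))
    alpha = [s, 1.0 - s]
    P = [[K[x][y] + KILL[x] * alpha[y] for y in range(2)] for x in range(2)]
    beta = [sum(alpha[x] * P[x][y] for x in range(2)) for y in range(2)]
    Q = [[K[x][y] + KILL[x] * beta[y] for y in range(2)] for x in range(2)]
    return P, Q


def solve(theta, iters=300):
    """Return P, mu, R, r and the value function V at theta."""
    P, Q = kernels(theta)
    mu = [0.5, 0.5]
    for _ in range(iters):               # stationary distribution of P
        mu = [sum(mu[x] * P[x][y] for x in range(2)) for y in range(2)]
    R = [[-math.log(P[x][y] / Q[x][y]) for y in range(2)] for x in range(2)]
    r = sum(mu[x] * P[x][y] * R[x][y] for x in range(2) for y in range(2))
    c = [sum(P[x][y] * R[x][y] for y in range(2)) - r for x in range(2)]
    V = [0.0, 0.0]
    for _ in range(iters):               # V = sum_{t >= 0} P^t c, truncated
        V = [c[x] + sum(P[x][y] * V[y] for y in range(2)) for x in range(2)]
    return P, mu, R, r, V


def gradient_by_formula(theta, h=1e-6):
    """Right-hand side of the policy gradient expression."""
    P, mu, R, r, V = solve(theta)
    (Pp, Qp), (Pm, Qm) = kernels(theta + h), kernels(theta - h)
    g = 0.0
    for x in range(2):
        for y in range(2):
            dlnP = (math.log(Pp[x][y]) - math.log(Pm[x][y])) / (2 * h)
            dlnQ = (math.log(Qp[x][y]) - math.log(Qm[x][y])) / (2 * h)
            td = R[x][y] - r + V[y] - V[x]
            g += mu[x] * P[x][y] * (td * dlnP + dlnQ)
    return g


def gradient_by_difference(theta, h=1e-5):
    """Central difference of the average reward r(theta) itself."""
    return (solve(theta + h)[3] - solve(theta - h)[3]) / (2 * h)
```

Calling `gradient_by_formula(0.3)` and `gradient_by_difference(0.3)` should give matching values up to discretization error; exact sums over the two states replace the sampling used in the stochastic algorithm.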

The optimal $\theta^*$ for the QSD $\alpha_\theta$ maximizes $r(\theta)$, and it can be found by the gradient ascent iteration:

$$\theta_{t+1}=\theta_t+\eta_t^\theta\nabla_\theta r(\theta_t), \tag{21}$$

where $\eta_t^\theta>0$ is the step size. In practice, the stochastic gradient is applied:

$$\nabla_\theta r(\theta_t)\approx\nabla_\theta\ln K_{\alpha_\theta}(X_t,X_{t+1})\times\delta(X_t,X_{t+1})+\nabla_\theta\ln K_{\beta_\theta}(X_t,X_{t+1}),$$

where $X_t,X_{t+1}$ are sampled based on the Markovian kernel $K_{\alpha_\theta}$ (see Algorithm 1) and the temporal difference (TD) error $\delta_t$ is as follows:

$$\delta_t=\delta(X_t,X_{t+1})=R_\theta(X_t,X_{t+1})-r(\theta_t)+V(X_{t+1})-V(X_t). \tag{22}$$

Next, we need to address a remaining issue: how to compute the value function $V$ and $r(\theta_t)$ in the TD error (22). In addition, we also need to show the details of computing $\nabla_\theta\ln K_{\alpha_\theta}$ and $\nabla_\theta\ln K_{\beta_\theta}$.

3.3. Actor-Critic Algorithm

With the stochastic gradient method (21), we can obtain the optimal policy $\theta^*$. We refer to (21) as the learning dynamics for the policy, generally known as the actor. To calculate the value function $V$ appearing in the gradient $\nabla_\theta r(\theta)$, we need a second learning dynamic, which is called the critic. The overall policy-gradient method is then termed the actor-critic method.

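We start with the Bellman Equation (15) for the value function and consider the mean-square-error loss as follows:

$$\mathrm{MSE}[V]=\frac{1}{2}\sum_x\nu(x)\Bigl(\sum_yK_{\alpha_\theta}(x,y)\bigl[V(y)+R_\theta(x,y)-r(\theta)-V(x)\bigr]\Bigr)^2,$$

where $\nu$ is any distribution supported on $E$. $\mathrm{MSE}[V]=0$ if and only if $V$ satisfies the Bellman Equation (15), i.e., $V$ is the value function. To learn $V$, we introduce a function approximation $V_\psi$ for the value function, with the parameter $\psi$, and minimize the following:

$$\mathrm{MSE}(\psi)=\frac{1}{2}\sum_x\nu(x)\Bigl(\sum_yK_{\alpha_\theta}(x,y)\bigl[V(y)+R_\theta(x,y)-r(\theta)-V_\psi(x)\bigr]\Bigr)^2$$

by the semi-gradient method ([31], Chapter 9):

$$\begin{aligned}
\nabla_\psi\mathrm{MSE}(\psi)&=-\sum_{x,y}\nu(x)K_{\alpha_\theta}(x,y)\bigl[V(y)+R_\theta(x,y)-r(\theta)-V_\psi(x)\bigr]\nabla_\psi V_\psi(x)\\
&\approx-\sum_{x,y}\nu(x)K_{\alpha_\theta}(x,y)\bigl[V_\psi(y)+R_\theta(x,y)-r(\theta)-V_\psi(x)\bigr]\nabla_\psi V_\psi(x).
\end{aligned}$$

Here, the term $V(y)$ is frozen first and then approximated by $V_\psi$, since it can be treated as a prior guess of the value function for the future state.

Then, for the gradient descent iteration $\psi_{t+1}=\psi_t-\eta_t^\psi\nabla_\psi\mathrm{MSE}(\psi_t)$, where $\eta_t^\psi$ is the step size, we have the following stochastic gradient iteration:

$$\psi_{t+1}=\psi_t+\eta_t^\psi\,\delta(X_t,X_{t+1})\,\nabla_\psi V_{\psi_t}(X_t), \tag{23}$$

where the temporal difference (TD) error $\delta$ is defined as in (22), with $V$ replaced by $V_{\psi_t}$:

$$\delta_t=\delta(X_t,X_{t+1})=R_{\theta_t}(X_t,X_{t+1})-r(\theta_t)+V_{\psi_t}(X_{t+1})-V_{\psi_t}(X_t).$$

Here, for the sake of simplicity, $(X_t,X_{t+1})$ are the same samples as in the actor method for $\theta_t$. This means that the distribution $\nu$ above is chosen as the distribution $\mu$ used for the gradient $\nabla_\theta r(\theta)$.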

Next, we consider the calculation of the reward $r(\theta)$ from the Bellman Equation (15), which implies the following:

$$\sum_x\mu(x)\sum_yK_{\alpha_\theta}(x,y)\bigl(R_\theta(x,y)-r(\theta)+V(y)-V(x)\bigr)=0.$$

Let $r_t$ be the estimate of the reward $r(\theta_t)$ at time $t$. We can update our estimate of the reward every time a transition occurs as follows:

$$r_{t+1}=r_t+\eta_t^r\,\delta_t, \tag{24}$$

where $\delta_t$ is the TD error as before, with $r(\theta_t)$ replaced by $r_t$:

$$\delta_t=\delta(X_t,X_{t+1})=R_{\theta_t}(X_t,X_{t+1})-r_t+V_{\psi_t}(X_{t+1})-V_{\psi_t}(X_t).$$

In conclusion, (21), (23) and (24) together constitute the actor-critic algorithm, which is summarized in Algorithm 1. We remark that Algorithm 1 can be easily adapted to use the mini-batch gradient method, where several copies of $(X_t,X_{t+1})$ are sampled and the average is used to update the parameters. The stationary distribution $\mu_\theta$ of $K_{\alpha_\theta}$ is sampled by running the corresponding Markov chain for several steps with a “warm start”: the initial state for $\theta_{t+1}$ is set as the final state generated from the previous iteration at $\theta_t$. The length of this “burn-in” period can be set to just one step in practice for efficiency.
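The three updates (21), (23) and (24) can be put together in a short tabular sketch. This is not the authors' implementation of Algorithm 1: it assumes a softmax $\alpha_\theta$ with the last logit fixed at zero, a tabular $V_\psi$, one sample per step instead of a mini-batch, constant step sizes, and exact sums over $E$ for $\beta_\theta$ and its gradient rather than a sample-based approximation. All names and default values are hypothetical.

```python
import math
import random


def softmax_alpha(theta):
    """alpha_theta with the last logit fixed at 0."""
    logits = list(theta) + [0.0]
    c = max(logits)
    w = [math.exp(z - c) for z in logits]
    s = sum(w)
    return [v / s for v in w]


def actor_critic_qsd(K, theta0, n_iter=20000, lr_theta=0.05, lr_psi=0.01,
                     lr_r=0.01, seed=0):
    """Tabular actor-critic sketch; returns the learned distribution alpha_theta."""
    rng = random.Random(seed)
    m = len(K)
    kill = [1.0 - sum(row) for row in K]              # 1 - K(x, E)
    theta, psi, r_bar, x = list(theta0), [0.0] * m, 0.0, 0
    for _ in range(n_iter):
        alpha = softmax_alpha(theta)
        # J[i][k] = d alpha(i) / d theta_k for the softmax family.
        J = [[alpha[i] * ((1.0 if i == k else 0.0) - alpha[k]) for k in range(m - 1)]
             for i in range(m)]
        P = [[K[a][b] + kill[a] * alpha[b] for b in range(m)] for a in range(m)]
        beta = [sum(alpha[a] * P[a][b] for a in range(m)) for b in range(m)]
        mass = sum(alpha[a] * kill[a] for a in range(m))
        # Sample X_{t+1} ~ K_alpha(X_t, .) by inverse CDF (warm start: x carries over).
        u, acc, y = rng.random(), 0.0, m - 1
        for b in range(m):
            acc += P[x][b]
            if u < acc:
                y = b
                break
        p_xy = P[x][y]
        q_xy = K[x][y] + kill[x] * beta[y]            # K_beta(x, y)
        delta = -math.log(p_xy / q_xy) - r_bar + psi[y] - psi[x]   # TD error
        for k in range(m - 1):
            d_beta = sum(J[a][k] * P[a][y] for a in range(m)) + mass * J[y][k]
            grad = delta * kill[x] * J[y][k] / p_xy + kill[x] * d_beta / q_xy
            theta[k] += lr_theta * grad               # actor update
        psi[x] += lr_psi * delta                      # critic update (tabular V)
        r_bar += lr_r * delta                         # average-reward update
        x = y
    return softmax_alpha(theta)


# Kernel with constant entries 0.3 (killing probability 0.1 from every state).
K = [[0.3, 0.3, 0.3] for _ in range(3)]
alpha_hat = actor_critic_qsd(K, [1.0, 1.0], n_iter=2000)
```

The single-sample gradient is noisy relative to its mean, so convergence with this sketch can be slow; averaging over a mini-batch of transitions, as the text suggests, is the natural remedy.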

Remark 4.

Finally, we remark on the computation of $\nabla_\theta\ln K_{\alpha_\theta}$ and $\nabla_\theta\ln K_{\beta_\theta}$ in Algorithm 1. The details are shown in Appendix A. We comment that the main computational cost lies in the function $K(x,E)$, which has to be pre-computed and stored. If the problem has some special structure, the function could be approximated in parametric form. Another special case is our second example, where $K(x,E)=1$ for all $x\in\{2,3,\ldots,N\}$.

4. Numerical Experiment

In this section, we present two examples to demonstrate Algorithm 1. We refer to the algorithms (7), (8) and (9) in Section 2.2, used in [23,24], as the Vanilla Algorithm, the Projection Algorithm and the Polyak Averaging Algorithm, respectively. Let 0 be the absorbing state and let $E=\{1,\ldots,N\}$ be the non-absorbing states; the Markov transition matrix on $\{0,\ldots,N\}$ is denoted by the following:

$$\tilde K=\begin{pmatrix}1&0\\ *&K\end{pmatrix},$$

where $K$ is an $N$-by-$N$ sub-Markovian matrix. For Algorithm 1, the distribution $\alpha_\theta$ on $E$ is always parameterized as follows:

$$\alpha_\theta=\frac{1}{e^{\theta_1}+\cdots+e^{\theta_{N-1}}+1}\bigl[e^{\theta_1},\ldots,e^{\theta_{N-1}},1\bigr],$$

and the value function $V_\psi(x)$ is represented in tabular form for simplicity:

$$V_\psi=[\psi_1,\ldots,\psi_N],$$

where $\psi\in\mathbb{R}^N$.

4.1. Loopy Markov Chain

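We test a toy example of the three-state loopy Markov chain, which was considered in [23,24]. The transition probability matrix for the four states $\{0,1,2,3\}$ is as follows:

$$\tilde K=\begin{pmatrix}1&0&0&0\\ \epsilon&\frac{1-\epsilon}{3}&\frac{1-\epsilon}{3}&\frac{1-\epsilon}{3}\\ \epsilon&\frac{1-\epsilon}{3}&\frac{1-\epsilon}{3}&\frac{1-\epsilon}{3}\\ \epsilon&\frac{1-\epsilon}{3}&\frac{1-\epsilon}{3}&\frac{1-\epsilon}{3}\end{pmatrix},\quad\epsilon\in(0,1).$$

The state 0 is the absorbing state and $E=\{1,2,3\}$. $K$ is the sub-matrix of $\tilde K$ corresponding to the states $\{1,2,3\}$. With probability $\epsilon$, the process exits $E$ directly from state 1, 2 or 3. The true quasi-stationary distribution of this example is the uniform distribution for any $\epsilon$.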
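The claim that the uniform distribution is the QSD for every $\epsilon$ can be checked directly against the fixed-point condition (3). The short sketch below does this for the two values of $\epsilon$ used in the experiments; the function names are assumptions.

```python
def loopy_kernel(eps):
    """Sub-Markovian part K of the loopy chain on E = {1, 2, 3}."""
    p = (1.0 - eps) / 3.0
    return [[p, p, p] for _ in range(3)]


def qsd_residual(K, alpha):
    """Max-norm residual of the fixed-point condition alpha = alpha K / (alpha K 1)."""
    m = len(K)
    aK = [sum(alpha[x] * K[x][y] for x in range(m)) for y in range(m)]
    survival = sum(aK)
    return max(abs(alpha[y] - aK[y] / survival) for y in range(m))


uniform = [1.0 / 3.0] * 3
residuals = {eps: qsd_residual(loopy_kernel(eps), uniform) for eps in (0.1, 0.9)}
```

Both residuals vanish up to rounding, whereas a non-uniform vector such as (0.5, 0.3, 0.2) leaves a residual of order 0.1 because one step of $K$ maps any distribution to a multiple of the uniform one.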

In order to show the advantage of our algorithm, we consider two cases: (1) $\epsilon=0.1$ and (2) $\epsilon=0.9$. For a larger $\epsilon$, the original Markov chain exits very easily; thus, each iteration takes less time, but the convergence rate of the Vanilla Algorithm is slower.

In order to quantify the accuracy of the learned quasi-stationary distribution, we compute the $L_2$ norm of the error between the learned quasi-stationary distribution and the true values.

In Figure 1, we compute the QSD when $\epsilon=0.1$. We set the initial values $\theta_0=[1,1]$, $\psi_0=[0,0,0]$, $r_0=0$, the learning rates $\eta_n^\theta=\max\{1/n^{0.1},0.2\}$, $\eta_n^\psi=0.0001$, $\eta_n^r=0.0001$, and the batch size is 4. The step size for the Projection Algorithm is $\epsilon_n=n^{-0.99}$. Figure 2 is for the case when $\epsilon=0.9$. We set the initial values $\theta_0=[4,2]$, $\psi_0=[0,0,0]$, $r_0=0$, the learning rates $\eta_n^\theta=0.04$, $\eta_n^\psi=0.0001$, $\eta_n^r=0.0001$, and the batch size is 32. The step size for the Projection Algorithm is $\epsilon_n=n^{-0.99}$.

Figure 1. The loopy Markov chain example with $\epsilon=0.1$. The figure shows the log–log plots of the $L_2$-norm error of the Vanilla Algorithm (a), Projection Algorithm (b), Polyak Averaging Algorithm (c) and our actor-critic algorithm (d). The iteration for the actor-critic algorithm is defined as one step of gradient descent (“t” in Algorithm 1).

Figure 2. The loopy Markov chain example with $\epsilon=0.9$. The figure shows the log–log plots of the $L_2$-norm error of the Vanilla Algorithm (a), Projection Algorithm (b), Polyak Averaging Algorithm (c) and our actor-critic algorithm (d).

4.2. M/M/1/N Queue with Finite Capacity and Absorption

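Our second example is an M/M/1 queue with finite queue capacity. The state 0 is set as an absorbing state. The transition probability matrix on $\{0,\ldots,N\}$ takes the following form:

$$\tilde K=\begin{pmatrix}
1&0&0&0&\cdots&0&0&0\\
\mu_1&0&\lambda_1&0&\cdots&0&0&0\\
0&\mu_2&0&\lambda_2&\cdots&0&0&0\\
\vdots&&\ddots&\ddots&\ddots&&&\vdots\\
0&0&0&0&\cdots&\mu_{N-1}&0&\lambda_{N-1}\\
0&0&0&0&\cdots&0&1&0
\end{pmatrix},$$

where $\lambda_i=\frac{\rho_i}{\rho_i+1}$, $\mu_i=\frac{1}{\rho_i+1}$, $i\in\{1,2,\ldots,N-1\}$. $\rho_i>1$ means a higher chance of jumping to the right than to the left. A larger $\rho_i$ gives a smaller probability of exiting $E$. Note that $K(x,E)=1$ for $x\in\{2,\ldots,N\}$. Thus, $K_\alpha(x,y)=K(x,y)$ for any $\alpha$ if $x\neq 1$, and

$$K_\alpha(1,y)=K(1,y)+\mu_1\alpha(y)=\begin{cases}\lambda_1+\mu_1\alpha(2),&y=2,\\ \mu_1\alpha(y),&y\neq 2.\end{cases}$$

Then, $R_\theta(x,y)=-\ln\frac{K_{\alpha_\theta}(x,y)}{K_{\beta_\theta}(x,y)}=0$ if $x\neq 1$, and by (20), the gradient is simplified as follows:

$$\nabla_\theta r(\theta)=\mathbb{E}_Y\Bigl[\bigl(R_\theta(1,Y)-r(\theta)+V(Y)-V(1)\bigr)\nabla_\theta\ln K_{\alpha_\theta}(1,Y)+\nabla_\theta\ln K_{\beta_\theta}(1,Y)\Bigr],$$

where $Y$ follows the distribution $K_\alpha(1,\cdot)$.

We consider two cases: (1) a constant $\rho_i=1.25$ and (2) a state-dependent $\rho_i=2-\frac{3}{2N-4}(i-1)$. Note that $\rho_i=1$ gives an equal probability of jumping to the left and to the right. Thus, in case (1), there is a boundary layer at the rightmost end, and in case (2), we expect to see a peak of the QSD near $i\approx 2N/3$. Figure 3 shows the true QSD in both cases. We set $N=500$.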
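The structure of this example is easy to reproduce. The sketch below builds the sub-Markovian matrix $K$ on $E=\{1,\ldots,N\}$ for a given $\rho_i$ and computes the killing probabilities $1-K(x,E)$, which are nonzero only at state 1. The function names and the small value of `N` are assumptions for illustration; the state-dependent ratio follows case (2) as read from the text.

```python
def rho_linear(i, N):
    """State-dependent ratio rho_i = 2 - 3 (i - 1) / (2N - 4), as in case (2)."""
    return 2.0 - 3.0 * (i - 1) / (2.0 * N - 4.0)


def mm1n_kernel(N, rho):
    """Sub-Markovian matrix K on E = {1, ..., N}; list index i-1 stores state i."""
    K = [[0.0] * N for _ in range(N)]
    for i in range(1, N):                 # states 1, ..., N-1
        r = rho(i)
        lam, mu = r / (r + 1.0), 1.0 / (r + 1.0)
        K[i - 1][i] = lam                 # jump i -> i+1
        if i >= 2:
            K[i - 1][i - 2] = mu          # jump i -> i-1; from state 1 this mass is killed
    K[N - 1][N - 2] = 1.0                 # state N always steps back to N-1
    return K


N = 12                                    # small N for illustration; the text uses N = 500
K = mm1n_kernel(N, lambda i: rho_linear(i, N))
kill = [1.0 - sum(row) for row in K]      # equals mu_1 at state 1 and 0 elsewhere
```

With this ratio, $\rho_1=2$ and $\rho_{N-1}=0.5$, so the drift points to the right near the origin and to the left near the capacity, which is consistent with an interior peak of the QSD.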

Figure 3. The QSD for the M/M/1/500 queue with $\rho_i\equiv 1.25$ (left) and $\rho_i=2-\frac{3}{2N-4}(i-1)$ (right).

In Figure 4, we consider the case when $\rho_i=1.25$ and compute the $L_2$ errors. We set the initial value θ0i=35+35498(i1) for $i\in\{1,2,\ldots,498\}$ and θ0499=3, $\psi_0=[0,0,\ldots,0]$, $r_0=0$, the learning rates $\eta_n^\theta=0.0003$, $\eta_n^\psi=0.0001$, $\eta_n^r=0.0001$, and the batch size is 64. The step size for the Projection Algorithm is $\epsilon_n=n^{-0.95}$. Figure 5 plots the errors for the state-dependent $\rho_i=2-\frac{3}{2N-4}(i-1)$. We set the initial value θ0i=8+35250(i1) for $i\in\{1,2,\ldots,250\}$, θ0251=44, θ0i=43 for $i\in\{252,\ldots,305\}$, θ0306=48, θ0307=42 and θ0i=4338293(i1) for $i\in\{308,309,\ldots,499\}$, $\psi_0=[0,0,\ldots,0]$, $r_0=0$, the learning rates $\eta_n^\theta=0.0002$, $\eta_n^\psi=0.0001$, $\eta_n^r=0.0001$, and the batch size is 128. The step size for the Projection Algorithm is $\epsilon_n=n^{-0.95}$. Both figures demonstrate that the actor-critic algorithm performs quite well on this example.

Figure 4. The M/M/1/500 queue with $\rho_i=1.25$. The figure shows the log–log plots of the $L_2$-norm error of the Vanilla Algorithm (a), Projection Algorithm (b), Polyak Averaging Algorithm (c) and our actor-critic algorithm (d).

Figure 5. The M/M/1/500 queue with $\rho_i=2-\frac{3}{2N-4}(i-1)$. The figure shows the log–log plots of the $L_2$-norm error of the Vanilla Algorithm (a), Projection Algorithm (b), Polyak Averaging Algorithm (c) and our actor-critic algorithm (d).

In Table 1, we compare the CPU time of each algorithm on the M/M/1/500 queue when they reach an accuracy of $2\times10^{-1}$. We found that our algorithm takes less time on this example.

Table 1. The CPU time of each algorithm on the M/M/1/500 queue when they reach an accuracy of $2\times10^{-1}$.

Algorithm    Vanilla      Projection   Polyak Averaging   ac_α
Time (s)     1038.3279    429.6304     505.2299           186.9280
Time (s)     753.9503     259.0671     268.5476           251.5370

5. Summary and Conclusions

In this paper, we propose a reinforcement learning (RL) method for the quasi-stationary distribution (QSD) of discrete-time finite-state Markov chains. By minimizing the KL-divergence of two Markovian path distributions induced by the candidate distribution and the true target distribution, we introduce a formulation in terms of RL and derive the corresponding policy gradient theorem. We devise an actor-critic algorithm to learn the QSD in its parameterized form $\alpha_\theta$. This RL formulation can benefit from developments in RL methods and optimization theory. We illustrated our actor-critic method on two numerical examples by using simple tabular parametrization and gradient descent optimization. We observed that the advantage of our method is more pronounced for large-scale problems.

We only demonstrate the preliminary mechanism of the idea here, and there is much room for improving the efficiency and for extensions in future work. The generalization from the finite-state Markov chain considered here to the jump Markov process and the diffusion case is under consideration. More importantly, for very large or high-dimensional state spaces, modern function approximation methods such as kernel methods or neural networks should be used for the distribution $\alpha_\theta$ and the value function $V_\psi$. The recent tremendous advancement of optimization techniques for policy gradient in reinforcement learning could also contribute much to improving the efficiency of our current formulation.

Acknowledgments

L.L. acknowledges the support of NSFC 11871486. X.Z. acknowledges the support of Hong Kong RGC GRF 11305318.

Appendix A

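In this appendix, we discuss the computation of the gradients $\nabla_\theta \ln K_{\alpha_\theta}$ and $\nabla_\theta \ln K_{\beta_\theta}$. Note that $\nabla_\theta \alpha_\theta$ is straightforward since we model $\alpha$ in its parametrized form $\alpha_\theta$. By definition (4), we have the following:

$$\nabla_\theta \ln K_{\alpha_\theta}(X_t, X_{t+1}) = \frac{1 - K(X_t, E)}{K_{\alpha_\theta}(X_t, X_{t+1})}\, \nabla_\theta \alpha_\theta(X_{t+1})$$

and the following as well:

$$\nabla_\theta \ln K_{\beta_\theta}(X_t, X_{t+1}) = \frac{1 - K(X_t, E)}{K_{\beta_\theta}(X_t, X_{t+1})}\, \nabla_\theta \beta_\theta(X_{t+1}),$$

where $K_\beta(x, y) = K(x, y) + (1 - K(x, E))\,\beta(y)$. The vector $K(x, E)$ for any $x$ can be pre-computed and saved in tabular form.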
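The gradient formula for $\ln K_{\alpha_\theta}$ can be checked numerically. The following is a minimal sketch, assuming a tabular softmax parametrization $\alpha_\theta = \mathrm{softmax}(\theta)$ (an assumption made here for illustration) and a hypothetical 3-state sub-Markovian kernel; the function names are illustrative and not taken from the paper.

```python
import numpy as np

def softmax(theta):
    # Assumed tabular parametrization: alpha_theta = softmax(theta).
    z = np.exp(theta - theta.max())
    return z / z.sum()

def softmax_jacobian(alpha):
    # J[y, k] = d alpha(y) / d theta_k for alpha = softmax(theta).
    return np.diag(alpha) - np.outer(alpha, alpha)

def grad_log_K_alpha(K, theta, x, y):
    """Gradient w.r.t. theta of ln K_alpha(x, y), where
    K_alpha(x, y) = K(x, y) + (1 - K(x, E)) * alpha(y)."""
    alpha = softmax(theta)
    survive = K.sum(axis=1)                      # K(x, E); can be pre-computed
    K_alpha_xy = K[x, y] + (1.0 - survive[x]) * alpha[y]
    grad_alpha_y = softmax_jacobian(alpha)[y]    # nabla_theta alpha(y)
    return (1.0 - survive[x]) / K_alpha_xy * grad_alpha_y

# Hypothetical sub-Markovian kernel (row sums <= 1) and parameter vector.
K = np.array([[0.5, 0.3, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.5]])
theta = np.array([0.1, -0.4, 0.7])

g = grad_log_K_alpha(K, theta, x=0, y=2)
print(g)
```

Under the softmax assumption the components of the gradient sum to zero, because shifting every entry of $\theta$ by the same constant leaves $\alpha_\theta$ unchanged.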

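By (19), the one-step distribution $\beta$ is computed below.

$$\beta(X_{t+1}) = \sum_x \alpha(x)\left[K(x, X_{t+1}) + (1 - K(x, E))\,\alpha(X_{t+1})\right] \approx \frac{1}{n}\sum_{i=1}^{n}\left[K(Z_i, X_{t+1}) + (1 - K(Z_i, E))\,\alpha(X_{t+1})\right].$$

Here, the samples $Z_i \sim \alpha$ could be approximated by the stationary distribution $\mu$; thus, one may simply use the known sample $X_t$ to replace $Z_i$ with $n = 1$.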
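To make the two forms of $\beta$ concrete, the sketch below compares the exact sum over states with the sample average, and also shows the $n = 1$ shortcut. The kernel, the candidate distribution and the function names are hypothetical choices for illustration only.

```python
import numpy as np

def one_step_beta(K, alpha):
    """Exact beta(y) = sum_x alpha(x) * [K(x, y) + (1 - K(x, E)) * alpha(y)]."""
    kill = 1.0 - K.sum(axis=1)               # 1 - K(x, E): killing probability
    return alpha @ K + (alpha @ kill) * alpha

def one_step_beta_from_samples(K, alpha, samples):
    """Average of K(Z_i, y) + (1 - K(Z_i, E)) * alpha(y) over the given Z_i."""
    kill = 1.0 - K.sum(axis=1)
    Z = np.asarray(samples, dtype=int)
    return K[Z].mean(axis=0) + kill[Z].mean() * alpha

# Hypothetical 3-state sub-Markovian kernel (row sums <= 1) and candidate alpha.
K = np.array([[0.5, 0.3, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.5]])
alpha = np.array([0.2, 0.5, 0.3])

beta_exact = one_step_beta(K, alpha)

rng = np.random.default_rng(0)
Z = rng.choice(len(alpha), size=200_000, p=alpha)    # Z_i ~ alpha
beta_mc = one_step_beta_from_samples(K, alpha, Z)

# n = 1 variant: a single known state (state 1 stands in for X_t here).
beta_one = one_step_beta_from_samples(K, alpha, [1])

print(beta_exact, beta_mc, beta_one)
```

Each estimate is itself a probability vector, since every row of $K_\alpha$ sums to one; the $n = 1$ variant trades accuracy for the cost of a single table lookup.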

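To find $\nabla_\theta \beta_\theta$, we use stochastic approximation again.

$$\begin{aligned}
\nabla_\theta \beta_\theta(X_{t+1}) &= \sum_x \nabla_\theta \alpha_\theta(x)\left[K(x, X_{t+1}) + (1 - K(x, E))\,\alpha_\theta(X_{t+1})\right] + \sum_x \alpha_\theta(x)\,(1 - K(x, E))\,\nabla_\theta \alpha_\theta(X_{t+1}) \\
&\approx \frac{1}{n}\sum_{i=1}^{n}\left\{\nabla_\theta \ln \alpha_\theta(Z_i)\left[K(Z_i, X_{t+1}) + (1 - K(Z_i, E))\,\alpha_\theta(X_{t+1})\right] + (1 - K(Z_i, E))\,\nabla_\theta \alpha_\theta(X_{t+1})\right\}.
\end{aligned}$$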
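The product-rule expression for $\nabla_\theta \beta_\theta$ and its sample-based approximation can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: $\alpha_\theta$ is taken to be a tabular softmax, for which $\nabla_\theta \ln \alpha_\theta(z) = e_z - \alpha_\theta$, and the kernel and function names are hypothetical.

```python
import numpy as np

def softmax(theta):
    # Assumed tabular parametrization: alpha_theta = softmax(theta).
    z = np.exp(theta - theta.max())
    return z / z.sum()

def grad_beta_exact(K, theta):
    """Jacobian G[y, k] = d beta_theta(y) / d theta_k via the product rule."""
    alpha = softmax(theta)
    J = np.diag(alpha) - np.outer(alpha, alpha)   # J[x, k] = d alpha(x) / d theta_k
    kill = 1.0 - K.sum(axis=1)                    # 1 - K(x, E)
    K_alpha = K + np.outer(kill, alpha)           # K_alpha(x, y)
    # sum_x grad alpha(x) K_alpha(x, y) + sum_x alpha(x) kill(x) grad alpha(y)
    return K_alpha.T @ J + (alpha @ kill) * J

def grad_beta_from_samples(K, theta, samples):
    """Sample average using the score nabla_theta ln alpha(Z_i), Z_i ~ alpha."""
    alpha = softmax(theta)
    J = np.diag(alpha) - np.outer(alpha, alpha)
    kill = 1.0 - K.sum(axis=1)
    K_alpha = K + np.outer(kill, alpha)
    Z = np.asarray(samples, dtype=int)
    score = np.eye(len(alpha))[Z] - alpha         # score[i, k] for softmax
    term1 = K_alpha[Z].T @ score / len(Z)         # (1/n) sum_i score_i * K_alpha(Z_i, y)
    term2 = kill[Z].mean() * J                    # (1/n) sum_i kill(Z_i) * grad alpha(y)
    return term1 + term2

# Hypothetical sub-Markovian kernel and parameter vector.
K = np.array([[0.5, 0.3, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.5]])
theta = np.array([0.1, -0.4, 0.7])

G_exact = grad_beta_exact(K, theta)

rng = np.random.default_rng(0)
Z = rng.choice(3, size=200_000, p=softmax(theta))
G_mc = grad_beta_from_samples(K, theta, Z)

print(np.abs(G_mc - G_exact).max())
```

Row $y$ of the Jacobian corresponds to $\nabla_\theta \beta_\theta(y)$, so in the algorithm only the row for $y = X_{t+1}$ would be needed at each step.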

Author Contributions

Conceptualization, Z.C. and L.L.; Investigation, Z.C.; Methodology, Z.C. and X.Z.; Software, Z.C.; Supervision, X.Z.; Writing—original draft, Z.C.; Writing—review & editing, L.L. and X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Government of Hong Kong, Grant Number 11305318; NSFC Grant Number 11871486.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflict of interest.

Footnotes

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Collet P., Martínez S., Martín J.S. Quasi-Stationary Distributions: Markov Chains, Diffusions and Dynamical Systems. Springer Science & Business Media; Cham, Switzerland: 2012.
  • 2. Buckley F., Pollett P. Analytical methods for a stochastic mainland–island metapopulation model. Ecol. Model. 2010;221:2526–2530. doi: 10.1016/j.ecolmodel.2010.02.017.
  • 3. Lambert A. Population dynamics and random genealogies. Stoch. Model. 2008;24:45–163. doi: 10.1080/15326340802437728.
  • 4. De Oliveira M.M., Dickman R. Quasi-stationary distributions for models of heterogeneous catalysis. Phys. A Stat. Mech. Appl. 2004;343:525–542. doi: 10.1016/j.physa.2004.06.155.
  • 5. Dykman M.I., Horita T., Ross J. Statistical distribution and stochastic resonance in a periodically driven chemical system. J. Chem. Phys. 1995;103:966–972. doi: 10.1063/1.469796.
  • 6. Artalejo J.R., Economou A., Lopez-Herrero M.J. Stochastic epidemic models with random environment: Quasi-stationarity, extinction and final size. J. Math. Biol. 2013;67:799–831. doi: 10.1007/s00285-012-0570-5.
  • 7. Clancy D., Mendy S.T. Approximating the quasi-stationary distribution of the SIS model for endemic infection. Methodol. Comput. Appl. Probab. 2011;13:603–618. doi: 10.1007/s11009-010-9177-8.
  • 8. Sani A., Kroese D., Pollett P. Stochastic models for the spread of HIV in a mobile heterosexual population. Math. Biosci. 2007;208:98–124. doi: 10.1016/j.mbs.2006.09.024.
  • 9. Chan D.C., Pollett P.K., Weinstein M.C. Quantitative risk stratification in Markov chains with limiting conditional distributions. Med. Decis. Mak. 2009;29:532–540. doi: 10.1177/0272989X08330121.
  • 10. Berglund N., Landon D. Mixed-mode oscillations and interspike interval statistics in the stochastic FitzHugh–Nagumo model. Nonlinearity. 2012;25:2303. doi: 10.1088/0951-7715/25/8/2303.
  • 11. Landon D. Ph.D. Thesis. Université d’Orléans; Orléans, France: 2012. Perturbation et Excitabilité Dans des Modeles Stochastiques de Transmission de l’Influx Nerveux.
  • 12. Gesù G.D., Lelièvre T., Peutrec D.L., Nectoux B. Jump Markov models and transition state theory: The quasi-stationary distribution approach. Faraday Discuss. 2017;195:469–495. doi: 10.1039/C6FD00120C.
  • 13. Lelièvre T., Nier F. Low temperature asymptotics for quasistationary distributions in a bounded domain. Anal. PDE. 2015;8:561–628. doi: 10.2140/apde.2015.8.561.
  • 14. Pollock M., Fearnhead P., Johansen A.M., Roberts G.O. The scalable Langevin exact algorithm: Bayesian inference for big data. arXiv. 2016; arXiv:1609.03436.
  • 15. Wang A.Q., Roberts G.O., Steinsaltz D. An approximation scheme for quasi-stationary distributions of killed diffusions. Stoch. Process. Appl. 2020;130:3193–3219. doi: 10.1016/j.spa.2019.09.010.
  • 16. Watkins D.S. Fundamentals of Matrix Computations. Volume 64. John Wiley & Sons; Hoboken, NJ, USA: 2004.
  • 17. Bebbington M. Parallel implementation of an aggregation/disaggregation method for evaluating quasi-stationary behavior in continuous-time Markov chains. Parallel Comput. 1997;23:1545–1559. doi: 10.1016/S0167-8191(97)89286-1.
  • 18. Pollett P., Stewart D. An efficient procedure for computing quasi-stationary distributions of Markov chains by sparse transition structure. Adv. Appl. Probab. 1994;26:68–79. doi: 10.2307/1427580.
  • 19. Martinez S., Martin J.S. Quasi-stationary distributions for a Brownian motion with drift and associated limit laws. J. Appl. Probab. 1994;31:911–920. doi: 10.2307/3215316.
  • 20. Aldous D., Flannery B., Palacios J.L. Two applications of urn processes: the fringe analysis of search trees and the simulation of quasi-stationary distributions of Markov chains. Probab. Eng. Inform. Sci. 1988;2:293–307. doi: 10.1017/S026996480000084X.
  • 21. Benaïm M., Cloez B. A stochastic approximation approach to quasi-stationary distributions on finite spaces. Electron. Commun. Probab. 2015;20:1–13. doi: 10.1214/ECP.v20-3956.
  • 22. De Oliveira M.M., Dickman R. How to simulate the quasistationary state. Phys. Rev. E. 2005;71:016129. doi: 10.1103/PhysRevE.71.016129.
  • 23. Blanchet J., Glynn P., Zheng S. Analysis of a stochastic approximation algorithm for computing quasi-stationary distributions. Adv. Appl. Probab. 2016;48:792–811. doi: 10.1017/apr.2016.28.
  • 24. Zheng S. Ph.D. Thesis. Columbia University; New York, NY, USA: 2014. Stochastic Approximation Algorithms in the Estimation of Quasi-Stationary Distribution of Finite and General State Space Markov Chains.
  • 25. Kushner H., Yin G.G. Stochastic Approximation and Recursive Algorithms and Applications. Volume 35. Springer Science & Business Media; Cham, Switzerland: 2003.
  • 26. Polyak B.T., Juditsky A.B. Acceleration of stochastic approximation by averaging. SIAM J. Control. Optim. 1992;30:838–855. doi: 10.1137/0330046.
  • 27. Blei D.M., Kucukelbir A., McAuliffe J.D. Variational inference: A review for statisticians. J. Am. Stat. Assoc. 2017;112:859–877. doi: 10.1080/01621459.2017.1285773.
  • 28. Jordan M.I., Ghahramani Z., Jaakkola T.S., Saul L.K. An Introduction to Variational Methods for Graphical Models. Mach. Learn. 1999;37:183–233. doi: 10.1023/A:1007665907178.
  • 29. Liu Q., Wang D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In: Lee D., Sugiyama M., Luxburg U., Guyon I., Garnett R., editors. Advances in Neural Information Processing Systems. Volume 29. Curran Associates, Inc.; Red Hook, NY, USA: 2016.
  • 30. Rezende D., Mohamed S. Variational inference with normalizing flows. In: Bach F., Blei D., editors. Proceedings of the 32nd International Conference on Machine Learning; Lille, France. 7–9 July 2015; pp. 1530–1538.
  • 31. Sutton R.S., Barto A.G. Reinforcement Learning: An Introduction. MIT Press; Cambridge, MA, USA: 2018.
  • 32. Mnih V., Kavukcuoglu K., Silver D., Graves A., Antonoglou I., Wierstra D., Riedmiller M. Playing Atari with deep reinforcement learning. arXiv. 2013; arXiv:1312.5602.
  • 33. Popova M., Isayev O., Tropsha A. Deep reinforcement learning for de novo drug design. Sci. Adv. 2018;4:eaap7885. doi: 10.1126/sciadv.aap7885.
  • 34. Silver D., Hubert T., Schrittwieser J., Antonoglou I., Lai M., Guez A., Lanctot M., Sifre L., Kumaran D., Graepel T., et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science. 2018;362:1140–1144. doi: 10.1126/science.aar6404.
  • 35. Rose D.C., Mair J.F., Garrahan J.P. A reinforcement learning approach to rare trajectory sampling. New J. Phys. 2021;23:013013. doi: 10.1088/1367-2630/abd7bd.
  • 36. Méléard S., Villemonais D. Quasi-stationary distributions and population processes. Probab. Surv. 2012;9:340–410. doi: 10.1214/11-PS191.
  • 37. Blanchet J., Glynn P., Zheng S. Empirical analysis of a stochastic approximation approach for computing quasi-stationary distributions. In: Schütze O., Coello C.A.C., Tantar A.-A., Tantar E., Bouvry P., Moral P.D., Legrand P., editors. EVOLVE - A Bridge between Probability, Set Oriented Numerics, and Evolutionary Computation II. Springer; Berlin/Heidelberg, Germany: 2013. pp. 19–37.
  • 38. Boyd S., Vandenberghe L. Convex Optimization. Cambridge University Press; Cambridge, UK: 2004.
  • 39. Wang W., Carreira-Perpinán M.A. Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application. arXiv. 2013; arXiv:1309.1541.



Articles from Entropy are provided here courtesy of Multidisciplinary Digital Publishing Institute (MDPI)
