Author manuscript; available in PMC: 2025 May 1.
Published in final edited form as: IEEE Trans Artif Intell. 2024 Dec 12;6(5):1217–1232. doi: 10.1109/tai.2024.3515939

Deep Reinforcement Learning Data Collection for Bayesian Inference of Hidden Markov Models

Mohammad Alali 1, Mahdi Imani 1
PMCID: PMC12045110  NIHMSID: NIHMS2050407  PMID: 40313356

Abstract

Hidden Markov Models (HMMs) are a powerful class of dynamical models for representing complex systems that are partially observed through sensory data. Existing data collection methods for HMMs, typically based on active learning or heuristic approaches, face challenges in terms of efficiency in stochastic domains with costly data. This paper introduces a Bayesian lookahead data collection method for inferring HMMs with finite state and parameter spaces. The method optimizes data collection under uncertainty using a belief state that captures the joint distribution of system states and models. Unlike traditional approaches that prioritize short-term gains, this policy accounts for the long-term impact of data collection decisions to improve inference performance over time. We develop a deep reinforcement learning policy that approximates the optimal Bayesian solution by simulating system trajectories offline. This pre-trained policy can be executed in real-time, dynamically adapting to new conditions as data is collected. The proposed framework supports a wide range of inference objectives, including point-based, distribution-based, and causal inference. Experimental results across three distinct systems demonstrate significant improvements in inference accuracy and robustness, showcasing the effectiveness of the approach in uncertain and data-limited environments.

Index Terms—: Hidden Markov Models, Inference, Reinforcement Learning, Causality

I. Introduction

Hidden Markov Models (HMMs) are widely recognized statistical models employed in numerous domains, including network security, ecology, transportation, image processing, sentiment analysis, autonomous driving, health monitoring, and disaster evaluation [1]–[8]. Inference of HMMs is crucial for understanding and predicting sequential data patterns, offering insights into underlying processes masked by noisy or incomplete observations.

Several inference techniques have been developed for modeling and constructing HMMs. A stochastic variational inference algorithm for learning the parameters of HMMs is introduced in [9], where its efficacy is demonstrated on synthetic experiments and a large genomics dataset. Wiuf et al. [10] suggest a novel reaction network scheme, the Baum-Welch reaction network, which effectively learns parameters for HMMs and demonstrates that its expectation and maximization steps converge exponentially fast. An online expectation-maximization algorithm is proposed in [11] for enabling realtime parameter estimation in HMMs. Markov Chain Monte Carlo (MCMC) methods [12], [13] are also a popular class of methods used for Bayesian inference of HMMs. Moreover, combining approximate Bayesian computation with MCMC techniques is proposed in [14] to improve computational efficiency, especially when dealing with intractable likelihoods and poor prior knowledge about model parameters. The existing inference methods can be categorized broadly into point-based, such as a maximum likelihood or maximum a posteriori inference, and distribution-based, such as Bayesian inference.

The accuracy of the inference methods is limited to the capacity in which the available data represent the underlying complex mechanism of systems. One of the main challenges in the inference of HMMs is the non-identifiability issue [15], which refers to cases where different sets of model parameters can produce the observed data. This can lead to difficulties in model interpretation and inference. The non-identifiability issue often happens in systems under normal conditions or tightly controlled systems, where available data (though large) only represent a small portion of the state space. Therefore, collecting additional data to properly learn the underlying mechanism of systems is essential. Data collection can be achieved by influencing the system state transitions, such as through excitation or perturbation into the system states, intelligently selecting the sensors to monitor the system, or any other actions that may lead to obtaining different information from the systems.

Several data collection approaches have been developed to enhance the inference of HMMs and overcome the non-identifiability issue [16], [17]. For instance, a method for inferring human behaviors in human-robot collaboration tasks is proposed in [18] by collecting and analyzing data such as gestures, eye movements, and head motions, utilizing a Bayesian learning approach within a Partially Observed Markov Decision Process (POMDP) framework. Also, a Bayesian method for learning POMDP observation parameters in robot interaction management systems is introduced in [19], allowing gradual adjustment of the robot’s behavior based on user interactions and an oracle’s optimal policy. These approaches, however, are application-specific and are not applicable to general HMMs. Sensor selection or scheduling approaches are another class of data collection techniques for HMMs [20]–[24]. However, these methods can only optimize the allocation of monitoring resources rather than changing the system transitions to collect informative data. Also, the majority of these methods are developed to enhance state estimation performance and rely on the availability of full knowledge of HMMs. Experimental design techniques are the most relevant approaches for data collection for inference of HMMs. This class of methods aims to sequentially and greedily collect data to reduce model uncertainty (e.g., entropy or related measures) [25]–[27]. For instance in [28], expected information gain is utilized to optimize data selection in neurobiological research, aiming to improve inference with fewer data and potentially enhance the design of clinical trials. Also, a framework for active learning in latent variable models is proposed in [29], leveraging entropy measures to reduce data requirements for fitting mixtures of linear regressions and HMMs with generalized linear model observations. Despite the relative success of these techniques, they become inefficient in domains with a huge amount of uncertainty in state, measurement, and system models, as well as domains with limited data collection resources. This comes from the incapability of these methods to account for the long-term impacts of decisions on system uncertainty and inference accuracy.

In recent years, Reinforcement Learning (RL) techniques have been extensively applied to the inference and control of complex dynamical systems [30], [31]. For example, a fuzzy-model-based approach has been proposed in [32] to optimize nonlinear Markov jump singularly perturbed systems using RL. In [33], an inference method is developed that incorporates expert knowledge into the inference process for HMMs. The expert is modeled as a sub-optimal RL agent, and the method aims to extract its knowledge for inference rather than collecting new data. Furthermore, an RL-based data collection method for offline meta-RL is presented in [34]. This method addresses the identifiability issue and sample-efficient interactions across domains modeled by a Markov Decision Process (MDP). However, this approach focuses on task adaptation in offline meta-RL and does not address the challenge of identifying unknown model parameters. Moreover, in some complex systems, such as cyber-physical and biological systems, non-Markovian behavior can emerge, where the future state depends not only on the current state but also on past states. Recent studies have proposed fractional-order derivatives and master equations to capture these long-range dependencies [35]–[37]. In such cases, data collection strategies may need to distinguish whether Markovian or non-Markovian dynamics are more appropriate for modeling these systems.

This paper proposes a deep reinforcement learning Bayesian data collection policy for inferring unknown HMMs. A Bayesian representation of unknown HMMs is provided through a belief state, which keeps track of the joint distribution of the possible system states and models. The optimal Bayesian data collection process is formulated in belief space, where the belief transition propagates the system and data uncertainty. The offline training is achieved through simulating belief trajectories for pre-training a deep reinforcement learning policy without the need for real data or prior knowledge of the system model. The offline-trained Bayesian policy is executable in real-time, according to the latest data collected from the real systems. Some of the key features of the proposed policy are as follows:

  • Approximating the optimal Bayesian lookahead data collection policy, which accounts for uncertainty in the system and data and considers the long-term impact of decisions on the HMM’s inference performance.

  • Learnable through offline trajectories synthetically generated from the uncertain model; the trained policy is executable in real-time during data collection from the real system and dynamically adapts to the system as new data is collected.

  • Applicability to a wide range of inference criteria, including point-based criteria such as a maximum likelihood or maximum a posteriori, distribution-based criteria, and causal inference.

Unlike existing data collection methods, such as active learning [28] and maximum a posteriori approaches [38], which prioritize short-term or greedy objectives, our approach formulates data collection as a lookahead policy optimization problem. This allows us to consider the future impact of data collection actions, leading to more informed decisions that improve inference accuracy over time. Additionally, our method’s ability to operate offline, using synthetic belief trajectories during training, significantly reduces the amount of real-time data required, making the learning process more efficient and scalable across different domains.

The effectiveness of the proposed data collection approach is evaluated across three distinct systems. These include a general HMM with unknown state and measurement processes, a gene regulatory network characterized by an unknown state process, and a communication network featuring an unknown measurement process.

Fig. 1 illustrates a schematic diagram for the proposed data collection method. The blue arrows denote the already collected data, and the red dashed arrows indicate the future undecided decisions during data collection. The less visible network model in the early steps expresses the uncertainty about the network parameters. Given the already available data, the proposed data collection policy sequentially selects actions that can generate the most valuable data for quick and accurate inference.

Fig. 1:

The schematic diagram of the proposed data collection method. The proposed method takes into account the data collected in the past and makes decisions about the next data to collect to ensure the quickest and most accurate inference.

The rest of the paper is organized as follows: Section II contains a background regarding finite-state HMMs and discusses the inference of unknown HMMs, followed by the formulation of data collection in HMMs. Further, Section III presents the optimal Bayesian lookahead data collection policy for unknown HMMs. The proposed deep reinforcement learning Bayesian data collection policy is explained in detail in Section IV. Section V includes the numerical results for three different HMM models, demonstrating the efficacy of the proposed method. Finally, Section VI consists of the conclusion and a discussion of future work.

II. Background and Preliminaries

A. Finite-State Hidden Markov Models

HMMs characterize the dynamical behavior of systems observed through noisy time-series data. The schematic diagram of HMMs with external inputs is shown in Fig. 2. The time-varying behavior of systems is represented through the state process, where $x_k$ denotes the system state at time step $k$. This paper assumes the state takes its value from a finite set $\mathcal{X}=\{x^1,\ldots,x^n\}$. The states are observed indirectly through observations, where $y_k$ denotes the measurement at time step $k$, taking values in an arbitrary space. The following two processes represent a finite-state HMM:

$$x_k \sim p(\,\cdot \mid x_{k-1}, a_{k-1}, \theta) \quad \text{(state process)}, \qquad y_k \sim p(\,\cdot \mid x_k, a_{k-1}, \theta) \quad \text{(measurement process)} \tag{1}$$

for $k=1,2,\ldots$; where $p(\cdot)$ denotes a probability density or probability mass function, $a_{k-1}\in\mathcal{A}$ is the external input to the system at time $k-1$, and $\theta$ represents the unknown elements of the two processes. The state and measurement processes are stochastic; the state process represents the system dynamics, whereas the measurement process characterizes the data generated from sensors/technology. The parameter $\theta$ takes its value from a finite parameter space $\Theta=\{\theta^1,\ldots,\theta^r\}$, where each $\theta\in\Theta$ represents a possible HMM. Therefore, the terms parameter and model are used interchangeably throughout this paper.

Fig. 2:

The schematic diagram of HMMs with external inputs.

B. Formulation of Optimal Inference for HMMs

1). Inference of Unknown HMMs:

The inference in HMMs consists of estimating the unknown parameters of the model in (1), represented by $\theta$, according to the available data. Let $\{a_0, y_1, a_1, y_2, \ldots, a_{k-1}, y_k\}$ be the available time-series data, where $a_{i-1}$ is the input or action at time step $i-1$, which has led to the measurement $y_i$ at time step $i$. Depending on the application, the inference objective can be expressed in terms of point-based or distribution-based criteria, as described below.

The point-based inference aims to find a single best estimate of the unknown parameters. Two well-known point-based criteria are maximum likelihood (ML) and maximum a posteriori (MAP), which can be expressed as [39], [40]:

$$\hat{\theta}_k^{\mathrm{ML}} = \arg\max_{\theta\in\Theta} p(y_{1:k} \mid a_{0:k-1}, \theta), \qquad \hat{\theta}_k^{\mathrm{MAP}} = \arg\max_{\theta\in\Theta} p(\theta \mid y_{1:k}, a_{0:k-1}), \tag{2}$$

where $\hat{\theta}_k^{\mathrm{ML}}$ and $\hat{\theta}_k^{\mathrm{MAP}}$ represent the parameters with the highest likelihood and posterior probability, respectively. Note that MAP and ML become the same under a uniform prior distribution of the unknown parameters.

The confidence of the MAP inference $\hat{\theta}_k^{\mathrm{MAP}}$ in (2) can be assessed using the probability of the MAP model as:

$$C_k = p\big(\hat{\theta}_k^{\mathrm{MAP}} \mid y_{1:k}, a_{0:k-1}\big) = \max_{\theta\in\Theta} p(\theta \mid y_{1:k}, a_{0:k-1}), \tag{3}$$

where $1/r \le C_k \le 1$ is the highest posterior probability among the models. The closer $C_k$ is to 1, the higher the confidence of the MAP inference. A confidence rate of $1/r$ corresponds to the poorest performance of the MAP inference. Note that $C_k = 1/r$ is a special case where all the models have the same posterior probability given the available data; thus, the true model is not distinguishable from the others.

For distribution-based inference, the objective is to find the posterior distribution of the possible system models, as opposed to a model with the highest likelihood or posterior probability. Given the available data up to time step k, this can be expressed as [39]:

$$p_k = p(\Theta \mid y_{1:k}, a_{0:k-1}) = \big[\,p(\theta^1 \mid y_{1:k}, a_{0:k-1}), \ldots, p(\theta^r \mid y_{1:k}, a_{0:k-1})\,\big]^T. \tag{4}$$

Similar to the expected confidence in (3) for point-based inference, the confidence of the distribution-based inference in (4) can be assessed using the following entropy measure [41]:

$$D_k = -\,\mathbb{H}\big[p(\Theta \mid y_{1:k}, a_{0:k-1})\big] = \sum_{i=1}^{r} p_k(i)\,\log p_k(i), \tag{5}$$

where $\mathbb{H}[\cdot]$ represents the Shannon entropy and $\log(1/r) \le D_k \le 0$. Values of $D_k$ close to zero represent cases with posterior distributions peaked over a single model and, consequently, the most confident inference. On the other hand, values close to $D_k=\log(1/r)$ indicate the poorest performance of distribution-based inference.
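To make the two measures concrete, the following is a minimal NumPy sketch (not the authors' implementation) that evaluates the MAP confidence $C_k$ in (3) and the negative-entropy measure $D_k$ in (5) from a model posterior vector $p_k$:

```python
import numpy as np

def map_confidence(posterior):
    """MAP confidence C_k in (3): the largest model posterior probability."""
    return float(np.max(posterior))

def negative_entropy(posterior, eps=1e-12):
    """Distribution-based measure D_k in (5): negative Shannon entropy of the
    model posterior, bounded between log(1/r) and 0."""
    p = np.clip(posterior, eps, 1.0)
    return float(np.sum(p * np.log(p)))

# A uniform posterior over r = 4 models gives the least confident values.
p_uniform = np.full(4, 0.25)
print(map_confidence(p_uniform))    # 0.25 = 1/r
print(negative_entropy(p_uniform))  # log(1/4) ≈ -1.386
```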

2). Formulation of Data Collection in HMMs:

Data can be costly, in terms of time and/or money, in many practical domains, and it is often challenging or impossible to rely on only the available data for accurate inference of complex systems. This is due to the fact that the possible models in Θ often cannot be easily distinguished from each other through data collected from the system in normal conditions. Therefore, rather than collecting more data, targeted data should be collected to ensure the desired inference objective. Data collection in dynamic systems is challenging, as targeted actions should be made to intelligently change the system dynamics and generate data that reveal the true system model.

Depending on the inference criteria, the expected accuracy of the inference process can be assessed through the measures defined for the point-based and distribution-based inference in (3) and (5). Thus, given the huge cost of data collection, the goal is to select the best sequence of actions that leads to measurements under which the expected accuracy/confidence of inference is maximized. This can be expressed as:

$$a_{0:k-1}^{*} = \arg\max_{a_{0:k-1}\in\mathcal{A}^{k}} E\big[\,C_k\ \text{or}\ D_k \mid a_{0:k-1}\,\big], \tag{6}$$

where the expectation is with respect to stochasticity in the unobserved measurements. Finding the optimal sequence of actions in (6) without full knowledge of the system model or access to large real data is impossible. Meanwhile, the batch selection of the action sequence makes it unsuitable for real-time and sequential data collection, as the latest data are not utilized to change the selections. Methods such as active learning [42] are developed to sequentially choose actions by accounting for possible changes in inference measures in a single step. While these methods make the computation of the expectation in (6) tractable, they often provide non-optimal solutions for data collection due to their greedy nature and often require huge computations in real-time for their selection process. In the next paragraphs, our proposed sequential and lookahead data collection approach to overcome these challenges is described.

III. Optimal Bayesian Lookahead Data Collection

A. Bayesian Modeling of Decision-Making in Unknown HMMs

Let $\{a_{0:k-1}, y_{1:k}\}$ be the available data collected up to time step $k$, where $a_{0:k-1}$ and $y_{1:k}$ represent the sequence of selected actions and observed measurements, respectively. Let $p_k(\theta)$ also be the posterior probability of the model $\theta\in\Theta$, given the collected data up to time $k$. We represent the information carried by the sequence of data and the posterior probability of models in a single vector called the belief state. The belief state captures the joint probability distribution of the possible system states and system models using an $n\times r$ vector. The belief state at time $k$ can be represented as:

$$b_k = p(x_k = \mathcal{X}, \Theta \mid y_{1:k}, a_{0:k-1}) = \begin{bmatrix} b_k^{\theta^1} \\ \vdots \\ b_k^{\theta^r} \end{bmatrix}, \tag{7}$$

where

$$b_k^{\theta^l} = \begin{bmatrix} p(x_k = x^1, \theta^l \mid y_{1:k}, a_{0:k-1}) \\ \vdots \\ p(x_k = x^n, \theta^l \mid y_{1:k}, a_{0:k-1}) \end{bmatrix}. \tag{8}$$

The belief state is a sufficient statistic that represents the history of information in a compact form for unknown HMMs. The initial belief $b_0 = p(x_0 = \mathcal{X}, \Theta)$ represents the initial joint probability of the system states and unknown parameters, which can be set as uniform if no prior information is available.

In this part, we show that the belief transitions follow a Markov process and provide a recursive procedure to update the belief state as a new action is taken and a new measurement is obtained. Let $b_0, b_1, \ldots, b_k$ be the sequence of belief states, with $b_k = p(\mathcal{X}, \Theta \mid a_{0:k-1}, y_{1:k})$ being the latest belief. The belief state after taking an action $a_k$ and observing a new measurement $y_{k+1}$ can be updated as:

$$b_{k+1} = p(x_{k+1} = \mathcal{X}, \Theta \mid b_0, \ldots, b_k, a_k, y_{k+1}) = \frac{p(x_{k+1} = \mathcal{X}, \Theta, y_{k+1} \mid b_0, \ldots, b_k, a_k)}{p(y_{k+1} \mid b_0, \ldots, b_k, a_k)} \propto p(y_{k+1} \mid x_{k+1} = \mathcal{X}, a_k, \Theta)\; p(x_{k+1} = \mathcal{X}, \Theta \mid b_0, \ldots, b_k, a_k), \tag{9}$$

where the numerator of the second line is expanded in the last line. The first term of the last line of (9) can be computed using the measurement process in (1). The second term can be computed for any $x^i\in\mathcal{X}$ and $\theta\in\Theta$ as:

$$p(x_{k+1} = x^i, \theta \mid b_0, \ldots, b_k, a_k) = \sum_{j=1}^{n} p(x_{k+1} = x^i, x_k = x^j, \theta \mid b_0, \ldots, b_k, a_k) = \sum_{j=1}^{n} p(x_{k+1} = x^i \mid x_k = x^j, a_k, \theta)\, p(x_k = x^j, \theta \mid b_0, \ldots, b_k) = \sum_{j=1}^{n} p(x_{k+1} = x^i \mid x_k = x^j, a_k, \theta)\, b_k^{\theta}(j), \tag{10}$$

where $b_k^{\theta}(j)$ denotes the $j$th element of $b_k^{\theta}$, and $p(x_k = x^j, \theta \mid b_0, \ldots, b_k)$ in the third line is equal to $b_k^{\theta}(j)$ according to (8). Replacing (10) into (9) leads to the following expression for the numerator of (9):

$$b_{k+1}^{\theta}(i) \propto p(y_{k+1} \mid x_{k+1} = x^i, a_k, \theta) \sum_{j=1}^{n} p(x_{k+1} = x^i \mid x_k = x^j, a_k, \theta)\, b_k^{\theta}(j), \tag{11}$$

for $i=1,\ldots,n$ and $\theta\in\Theta$. Equations (9) and (11) enable a recursive update of the belief state and reflect the Markovian transition of the belief states.
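As an illustration, the recursion in (9)–(11) amounts to a prediction step per model followed by a measurement update and a joint normalization. The sketch below is a minimal NumPy version under assumed data structures (a per-model transition matrix `trans[theta][a]` and an observation-likelihood callable `obs_lik`); it is not the authors' code:

```python
import numpy as np

def belief_update(b, a, y, trans, obs_lik):
    """Recursive belief update of (9)-(11), written per model.

    b       : dict mapping each model theta -> length-n vector b_k^theta
    a       : action taken at time k
    y       : measurement observed at time k+1
    trans   : trans[theta][a] is an n x n matrix with entry (i, j) equal to
              p(x_{k+1} = x^i | x_k = x^j, a, theta)
    obs_lik : obs_lik(y, a, theta) returns the length-n vector of
              p(y | x_{k+1} = x^i, a, theta)
    """
    unnormalized = {}
    for theta, b_theta in b.items():
        predicted = trans[theta][a] @ b_theta                   # prediction step in (10)
        unnormalized[theta] = obs_lik(y, a, theta) * predicted  # measurement update in (11)
    norm = sum(v.sum() for v in unnormalized.values())          # denominator of (9)
    return {theta: v / norm for theta, v in unnormalized.items()}
```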

B. MDP Formulation of Data Collection

This section provides the belief space and belief transition formulation within the context of the Markov Decision Process (MDP) for the data collection procedure. The belief state takes its value in a simplex of size $n\times r$, $\Delta_{n\times r}$, denoted by $\mathcal{B}$ and referred to as the belief space. If $b$ is the current belief state, upon taking action $a$, the next belief state before observing the next measurement is stochastic, with the transition represented by:

$$b' \sim p(\,\cdot \mid b, a). \tag{12}$$

The belief transition in (12) represents the probability of the next belief state $b'$ after taking action $a$ from the current belief state $b$. Therefore, the stochasticity in the next belief state corresponds to probable data that could be collected/observed in the next time step given all the available information reflected in the current belief state. Letting $b=[b^{\theta^1},\ldots,b^{\theta^r}]^T$ be the current belief state, the belief state after taking action $a$ and observing a new measurement $y$ can be written as $b_{y,a}=[b_{y,a}^{\theta^1},\ldots,b_{y,a}^{\theta^r}]^T$, where

$$b_{y,a}^{\theta}(i) = \frac{p(y \mid x^i, a, \theta)\, \sum_{j=1}^{n} p(x^i \mid x^j, a, \theta)\, b^{\theta}(j)}{\sum_{x\in\mathcal{X}} \sum_{\theta'\in\Theta} p(y \mid x, a, \theta')\, \sum_{j=1}^{n} p(x \mid x^j, a, \theta')\, b^{\theta'}(j)}, \tag{13}$$

for $i=1,\ldots,n$ and any $\theta\in\Theta$, according to (9) and (11). The belief transition probabilities in (12) can be discrete or continuous distributions, depending on the size of the measurement space. Given that the measurement takes its value in a continuous space, the belief transition can be expressed as:

$$p(b' \mid b, a) = \int_{\mathcal{Y}} p(y \mid b, a)\; \mathbb{I}\big[b' = b_{y,a}\big]\, dy, \tag{14}$$

where $\mathcal{Y}$ represents the space of all potential measurements and $\mathbb{I}[b' = b_{y,a}]$ is an indicator function that returns 1 if $b' = b_{y,a}$ and 0 otherwise. Notice that for multivariate measurements, the single integral can be replaced by multiple ones, and in scenarios with discrete/finite measurement outcomes, the integral in (14) should be replaced with a summation. Equation (14) shows that when an action is chosen for data collection at any particular belief state, multiple potential subsequent belief states exist before observing the next data. It is important to recognize that the potential subsequent belief states will differ depending on the chosen actions and collected data. This enables us to propagate uncertainty in the belief and evaluate which action choices could result in future beliefs that align better with our data collection goals.

We formulate the gain during data collection as a reward function in the belief space. The reward function can be expressed as $R(b_k, a_k, b_{k+1})$, which measures the immediate gain in the inference accuracy after taking action $a_k$ in belief state $b_k$ and moving to the next belief state $b_{k+1}$. Various reward functions can be defined given the inference and data collection objectives. If the goal is enhancing the confidence rate of point-based inference in (2), the reward function can be formulated using (3) as:

$$R(b_k, a_k, b_{k+1}) = C_{k+1} - C_k = \max_{\theta\in\Theta} p_{k+1}(\theta) - \max_{\theta\in\Theta} p_k(\theta) = \max_{\theta\in\Theta} \big\|b_{k+1}^{\theta}\big\|_1 - \max_{\theta\in\Theta} \big\|b_k^{\theta}\big\|_1, \tag{15}$$

where $\|\cdot\|_1$ is the L1 norm, and $p_{k+1}$ and $p_k$ express the model probabilities at time steps $k+1$ and $k$, computable using the belief states at time steps $k+1$ and $k$, respectively. A positive reward characterizes a single-step increase in the maximum posterior probability of the models, and a reward of zero indicates that the maximum posterior probability is unchanged after observing the next measurement as a result of the last taken action.

For the distribution-based inference in (4), the reward function can be formulated using (5) as:

$$R(b_k, a_k, b_{k+1}) = D_{k+1} - D_k = -\mathbb{H}[p_{k+1}] + \mathbb{H}[p_k] = \sum_{\theta\in\Theta} \big\|b_{k+1}^{\theta}\big\|_1 \log \big\|b_{k+1}^{\theta}\big\|_1 - \sum_{\theta\in\Theta} \big\|b_k^{\theta}\big\|_1 \log \big\|b_k^{\theta}\big\|_1, \tag{16}$$

where positive values represent a reduction of uncertainty in the posterior distribution of models after the latest action. In the rest of the paper, (15) and (16) are referred to as the point-based reward and the entropy reward, respectively. Note that more complicated and time-dependent reward functions can be used instead of (15) and (16). For instance, the reward function could include the cost of actions in domains with different action costs and the causality of inference objectives in domains with specific data collection goals.
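Both rewards depend on the belief only through the model marginals $\|b^{\theta}\|_1$. The following is a minimal sketch of the point-based reward (15) and the entropy reward (16), assuming the same dict-of-vectors belief representation as in the earlier belief-update sketch (not the authors' code):

```python
import numpy as np

def model_posterior(b):
    """Model marginals p_k(theta) = ||b_k^theta||_1 from a dict-of-vectors belief."""
    return np.array([b_theta.sum() for b_theta in b.values()])

def point_based_reward(b_k, b_k1):
    """Point-based reward (15): one-step gain in the maximum model posterior."""
    return float(model_posterior(b_k1).max() - model_posterior(b_k).max())

def entropy_reward(b_k, b_k1, eps=1e-12):
    """Entropy reward (16): one-step reduction of entropy in the model posterior."""
    p0 = np.clip(model_posterior(b_k), eps, 1.0)
    p1 = np.clip(model_posterior(b_k1), eps, 1.0)
    return float(np.sum(p1 * np.log(p1)) - np.sum(p0 * np.log(p0)))
```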

C. Optimal Stationary Bayesian Data Collection

The belief state provides the MDP representation of the unknown HMMs. The belief transition allows for the propagation of state, measurement, and model uncertainty, given all the available information. The belief state is updated stochastically according to the belief transition upon taking any action, and a reward is observed according to the next belief state. We define a deterministic stationary data collection policy $\mu: \mathcal{B} \to \mathcal{A}$, which associates an action with any given belief state. The optimal Bayesian lookahead data collection policy $\mu^*$ can be defined as:

$$\mu^{*}(b) = \arg\max_{\mu} E\Big[\sum_{t=0}^{\infty} \gamma^{t} R(b_t, a_t, b_{t+1}) \,\Big|\, b_0 = b,\; a_{0:\infty} \sim \mu\Big], \tag{17}$$

for all $b\in\mathcal{B}$; where $0<\gamma<1$ is a discount factor indicating the importance of early rewards compared to future ones. The optimal Bayesian lookahead policy $\mu^*(b)$ prescribes the best action that could be taken at the belief state $b$ to maximize the long-term performance of inference (i.e., achieving the highest expected accumulated reward). The term "Bayesian" comes from the expectation in (17), which is with respect to the future belief states generated from the belief transition. In other words, all probable next belief states propagating the uncertainty in state, measurement, and models from the current belief state $b$ are taken into account in finding the optimal Bayesian data collection policy $\mu^*(b)$.

The exact computation of the optimal Bayesian policy in (17) is not possible, primarily due to the continuity of the belief space $\mathcal{B}$. As previously noted, the belief space forms an $(n\times r)$-simplex $\Delta_{n\times r}$, preventing the use of dynamic programming approaches, such as value iteration or policy iteration [43], for computing the optimal policy. In the subsequent section, the proposed deep reinforcement learning approach is introduced to approximate the optimal data collection policy.

IV. Deep Reinforcement Learning Bayesian Data Collection

The recent advances in deep reinforcement learning (RL) have enabled learning on much larger scales than before. In this paper, we develop a deep RL Bayesian data collection policy to approximate the solution to the optimization problem in (17). To approximate the expectation in (17), the proposed deep RL algorithm uses simulated samples from the belief transition, which propagates the system’s uncertainty without requiring data collection from real systems. The trained offline policy is executable for real-time and sequential data collection. In this section, the matrix-form implementation of the belief state is first described, followed by the proposed deep RL approach for learning the Bayesian data collection policy.

A. Matrix-form Implementation of Belief State

Let $x^1,\ldots,x^n$ and $\theta^1,\ldots,\theta^r$ be an arbitrary enumeration of the possible system states and models, respectively. We define the transition matrix of size $n\times n$ associated with the model $\theta\in\Theta$ under action $a$ as:

$$\big[M_{\theta}(a)\big]_{ij} = p(x_{k+1} = x^i \mid x_k = x^j, a_k = a, \theta), \tag{18}$$

for $i,j=1,\ldots,n$. The observation process can be expressed using the update vector $T_{\theta}(y,a)$ as:

$$\big[T_{\theta}(y,a)\big]_{i} = p(y_{k+1} = y \mid x_{k+1} = x^i, a_k = a, \theta), \tag{19}$$

for $i=1,\ldots,n$. Letting $b=[b^{\theta^1},\ldots,b^{\theta^r}]^T$ be the current belief state, the next belief state $b_{y,a}$ after taking action $a$ and observing $y$ can be computed according to (13). This can be expressed in terms of the transition matrices and update vectors in (18) and (19) as:

$$b_{y,a} = \frac{\begin{bmatrix} T_{\theta^1}(y,a) \odot \big(M_{\theta^1}(a)\, b^{\theta^1}\big) \\ \vdots \\ T_{\theta^r}(y,a) \odot \big(M_{\theta^r}(a)\, b^{\theta^r}\big) \end{bmatrix}}{\sum_{m=1}^{r} \big\| T_{\theta^m}(y,a) \odot \big(M_{\theta^m}(a)\, b^{\theta^m}\big) \big\|_1}, \tag{20}$$

where $\odot$ denotes the Hadamard product (i.e., the componentwise multiplication of two vectors), and $\|\cdot\|_1$ denotes the absolute sum of a vector's elements. The belief transition probability in (14) can also be expressed as:

$$p(b' \mid b, a) = \int_{\mathcal{Y}} \Big( \sum_{m=1}^{r} \big\| T_{\theta^m}(y,a) \odot \big(M_{\theta^m}(a)\, b^{\theta^m}\big) \big\|_1 \Big)\, \mathbb{I}\big[b' = b_{y,a}\big]\, dy. \tag{21}$$
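The matrix-form update (20) maps directly to a few lines of linear algebra. Below is a minimal sketch, assuming the belief is stored as an $n\times r$ matrix whose columns are the $b^{\theta}$ vectors and that the transition matrices of (18) are precomputed; it is not the authors' implementation:

```python
import numpy as np

def matrix_belief_update(B, a, y, M, T):
    """Matrix-form belief update of (20).

    B : n x r matrix whose column l is b^{theta^l}
    a : action index
    y : observed measurement
    M : array of shape (r, n_actions, n, n); M[l, a, i, j] = p(x^i | x^j, a, theta^l)
    T : callable T(y, a, l) returning the length-n update vector of (19)
    """
    new_B = np.empty_like(B)
    for l in range(B.shape[1]):
        # T_theta(y, a) ⊙ (M_theta(a) b^theta) for model theta^l
        new_B[:, l] = T(y, a, l) * (M[l, a] @ B[:, l])
    return new_B / new_B.sum()  # normalize by the denominator of (20)
```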

B. Deep Reinforcement Learning Bayesian Data Collection Policy

For an arbitrary data collection policy $\mu: \mathcal{B} \to \mathcal{A}$ defined over the belief space, we define the Q-function, or the expected discounted reward function, at belief state $b$ after taking action $a\in\mathcal{A}$ and following policy $\mu$ afterward as:

$$Q^{\mu}(b,a) = E\Big[\sum_{t=0}^{\infty} \gamma^{t} R(b_t, a_t, b_{t+1}) \,\Big|\, b_0 = b,\; a_0 = a,\; a_{1:\infty} \sim \mu \Big], \tag{22}$$

where the expectation is with respect to the stochasticity of the belief transition. The optimal Q-function, denoted by $Q^* \triangleq Q^{\mu^*}$, provides the maximum expected return under the optimal policy $\mu^*$, where $Q^{\mu^*}(b,a)=\max_{\mu} Q^{\mu}(b,a)$, for all $b\in\mathcal{B}$ and $a\in\mathcal{A}$.

Since a tabular representation of the Q-function is impossible over the continuous belief space, a deep reinforcement learning approach is employed to approximate the Q-function. This paper adopts the deep Q-network (DQN) approach [44] to learn the Bayesian data collection policy, aiming to approximate $Q^{\mu^*}(b,a)$ using neural networks. Note that other deep RL methods can also be used to derive the proposed data collection policy. The DQN method approximates this Q-function using two feed-forward deep neural networks with the same architecture: the Q-network $Q_w$ and the target network $Q_{w^-}$. The input to both networks is the belief state, and the outputs are $Q_w(b,a^1),\ldots,Q_w(b,a^{|\mathcal{A}|})$, each corresponding to an action. The initial weights of both the Q-network and the target network are set randomly.

To train the neural networks, we utilize a fixed-size replay memory $\mathcal{D}$. This replay memory stores episodes containing belief states that are generated using actions selected by the epsilon-greedy policy [43]. When the replay memory reaches its capacity, the newest experience replaces the oldest. Each episode begins with an initial belief state, which, if unknown, can be randomly chosen from the belief space $\mathcal{B}$. At step $t$ of the episode, an action is chosen based on the epsilon-greedy policy defined using the Q-network $Q_w$:

$$a_t \sim \begin{cases} \arg\max_{a\in\mathcal{A}} Q_w(b_t, a) & \text{w.p. } 1-\epsilon \\ \text{random } a \in \{a^1,\ldots,a^{|\mathcal{A}|}\} & \text{w.p. } \epsilon, \end{cases} \tag{23}$$

where $0\le\epsilon\le 1$ is a parameter that determines the degree of exploration during training.

After choosing $a_t$, the subsequent belief state is sampled from $p(\,\cdot \mid b_t, a_t)$. This sample can be obtained by initially sampling from the current belief state as follows:

$$(x_t, \theta_t) \sim \mathrm{Cat}\big(b_t^{\theta^1}(1), \ldots, b_t^{\theta^1}(n), \ldots, b_t^{\theta^r}(1), \ldots, b_t^{\theta^r}(n)\big), \tag{24}$$

where "Cat" stands for the categorical (discrete) distribution, and $(x_t, \theta_t)$ is a sample drawn from the belief state $b_t$, i.e., a realization of the possible system state and model given the information up to time step $t$ (reflected in belief state $b_t$). A sample representing the next measurement can be generated according to $(x_t, \theta_t)$ and (1) as:

$$y_{t+1} \sim p(\,\cdot \mid x_{t+1}, a_t, \theta_t), \quad \text{where} \quad x_{t+1} \sim p(\,\cdot \mid x_t, a_t, \theta_t). \tag{25}$$

Then, one can compute $b_{t+1}$ from $a_t$ and $y_{t+1}$ using (20). The reward can also be computed according to the transition as $r_{t+1}=R(b_t, a_t, b_{t+1})$.
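Putting (24), (25), and (20) together gives the simulator used to generate offline belief transitions. The sketch below is a minimal version that reuses the `matrix_belief_update` helper from the earlier sketch; the transition tensor `M`, observation sampler `sample_obs`, and reward callable `reward_fn` are assumed interfaces rather than the authors' code:

```python
import numpy as np

def simulate_belief_step(B, a, M, sample_obs, T, reward_fn, rng):
    """One simulated belief transition for offline training, following (24)-(25).

    B          : n x r belief matrix (column l is b^{theta^l})
    a          : chosen action index
    M          : transition tensor of shape (r, n_actions, n, n) as in (18)
    sample_obs : callable (x_next, a, l, rng) -> sampled measurement y_{t+1}
    T          : observation-update callable passed to matrix_belief_update
    reward_fn  : callable (B, B_next) -> immediate reward, e.g. (15) or (16)
    """
    n, r = B.shape
    # (24): draw a joint (state, model) sample from the current belief.
    flat = B.flatten(order="F")                 # all states of theta^1, then theta^2, ...
    idx = rng.choice(n * r, p=flat / flat.sum())
    x, l = idx % n, idx // n
    # (25): simulate the next state and measurement under the sampled model.
    x_next = rng.choice(n, p=M[l, a, :, x])
    y_next = sample_obs(x_next, a, l, rng)
    # Belief update via (20) and the corresponding reward.
    B_next = matrix_belief_update(B, a, y_next, M, T)
    return B_next, reward_fn(B, B_next)
```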

After reaching a pre-determined number of steps, $N_{Q\text{-}\mathrm{Update}}$, the Q-network $Q_w$ is updated based on a minibatch of experiences of size $N_{\mathrm{batch}}$, randomly chosen from the replay memory $\mathcal{D}$ as:

$$Z = \big\{(b_i, a_i, b_{i+1}, r_{i+1})\big\}_{i=1}^{N_{\mathrm{batch}}} \sim \mathcal{D}. \tag{26}$$

The target values for the Q-network update can be calculated for the minibatch samples as:

$$y_i = r_{i+1} + \gamma \max_{a\in\mathcal{A}} Q_{w^-}(b_{i+1}, a), \tag{27}$$

for $i=1,\ldots,N_{\mathrm{batch}}$. Note that the target values are calculated using the target network $Q_{w^-}$. With these values, the weights of the Q-network, represented by $w$, can be updated as follows:

$$w \leftarrow w - \alpha\, \nabla_w \sum_{i=1}^{N_{\mathrm{batch}}} \big(y_i - Q_w(b_i, a_i)\big)^2, \tag{28}$$

where $\alpha$ represents the learning rate, and the mean squared error serves as the loss function for updating the weights. The optimization described in (28) can be implemented using a stochastic gradient optimization method such as Adam [45]. After updating $w$, the weights of the target network, denoted as $w^-$, should be updated using the soft update method as:

$$w^- \leftarrow (1-\tau)\, w^- + \tau\, w, \tag{29}$$

where $\tau$ is the soft update hyperparameter. This incremental adjustment of the target network weights toward the Q-network is a critical factor in ensuring the stability of the deep reinforcement learning process. Training can halt either when performance gains become marginal or when a predetermined performance threshold is reached. After the offline training, the final Q-network can be used to approximate the optimal Bayesian data collection policy as $\mu^*(b) \approx \arg\max_{a\in\mathcal{A}} Q_w(b,a)$. This Bayesian policy associates an action with each belief state $b\in\mathcal{B}$. The policy can be utilized in real-time according to the belief states computed from data collected from the real system.
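For concreteness, the minibatch update of (26)–(29) can be written as a single training step. The following is a minimal PyTorch sketch of the standard DQN update, assuming the Q-network and target network take the flattened belief as input and output one Q-value per action; it is not the authors' implementation:

```python
import torch
import torch.nn as nn

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.95, tau=0.001):
    """One Q-network update of (26)-(29) on a sampled minibatch.

    q_net, target_net : modules mapping a flattened belief (n*r,) to |A| Q-values
    batch             : tensors (b, a, b_next, r) sampled from the replay memory;
                        a holds integer action indices
    """
    b, a, b_next, r = batch
    # (27): target values from the target network (no gradient flows through them).
    with torch.no_grad():
        y = r + gamma * target_net(b_next).max(dim=1).values
    # (28): squared-error loss between targets and Q-values of the taken actions.
    q = q_net(b).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # (29): soft update of the target network toward the Q-network.
    with torch.no_grad():
        for w_minus, w in zip(target_net.parameters(), q_net.parameters()):
            w_minus.mul_(1 - tau).add_(tau * w)
    return loss.item()
```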

Fig. 3 presents a schematic of the proposed data collection framework, showing both the offline training and real-time execution phases. Our deep RL data collection policy approximates the solution to the optimization problem in (17) by using belief transition samples, eliminating the need for costly real-world data collection during training. These offline belief trajectories are generated to train the proposed deep RL data collection policy. Fig. 3(a) shows this process, where rewards from belief transitions are used to update the policy. In real-time, as illustrated in Fig. 3(b), the trained policy selects actions based on the current belief state, while newly collected data updates the belief state, reflecting the Bayesian nature of our approach.

Fig. 3:

Schematic diagram of the proposed data collection policy: a) Offline Training: belief trajectories and rewards are generated to train the deep reinforcement learning data collection policy, where t refers to the iteration of the training process; b) Online/Real-Time Execution: the offline trained data collection policy is executable in real-time according to the obtained belief state, where k indicates the kth data collected from the real system.

The main computational complexity of the proposed method arises during the offline phase, which occurs before real-time execution. In this phase, simulated belief trajectories are used to train the deep RL policy. Although this training is computationally demanding due to neural network optimization and belief state updates, it is carried out offline, allowing the policy to be fully optimized before real-time use. Once trained, the policy can be applied to real-time data collection without further retraining, enabling efficient operation in practical applications. During real-time execution, the primary computational cost involves updating the belief state, with a time complexity of $O(r n^2)$, where $r$ is the number of possible system models and $n$ is the number of system states. This complexity arises from calculating transition probabilities and updating the joint belief state across all models and states, as detailed in (20).

V. Numerical Experiments

This section contains three sets of experiments investigating the performance of the proposed method. The results are represented in terms of data collection for enhancing the performance of point-based, distribution-based, and causal inference. The following three data collection methods are used for comparison with our proposed policy: the active learning data collection policy [28], the maximum a posteriori (MAP) data collection policy [38], and the Bayesian Offline Reinforcement Learning (BOReL) data collection policy [34]. Active learning and MAP focus on immediate data impact by minimizing uncertainty in a single time step, resulting in a primarily short-term, greedy approach. These methods are effective in scenarios where immediate system feedback is critical but are limited in managing long-term uncertainty, particularly in complex environments. In contrast, BOReL addresses offline meta-reinforcement learning, where the objective is to equip a meta-agent with strategies that generalize across new tasks by leveraging past task data. BOReL uses Bayesian reinforcement learning principles, incorporating task dynamics and environmental uncertainty to balance exploration and exploitation effectively. Through a belief-augmented neural network, BOReL generates a structured belief estimate from offline data, enabling adaptable exploration across diverse scenarios. Unlike BOReL, which addresses data and uncertainty separately, our proposed data collection policy considers the joint distribution of state and parameters, achieving exact Bayesian optimality.

Three HMMs are selected for the numerical experiments: 1) a general HMM featuring unknown state and measurement processes; 2) a gene regulatory network (p53-MDM2) characterized by an unknown state process; and 3) Littman's network, serving as a communication network example with an unknown measurement process. The hyperparameters for our approach, such as the learning rate, exploration rate, minibatch size, and replay memory size, were selected through a combination of grid search and prior research recommendations to balance training stability and performance. The final hyperparameter values are: 3 hidden layers, 128 neurons in each hidden layer, learning rate $\alpha=0.0005$, replay memory size $|\mathcal{D}|=100{,}000$, minibatch size $N_{\mathrm{batch}}=64$, discount factor $\gamma=0.95$, exploration rate $\epsilon=0.1$, Q-network update frequency $N_{Q\text{-}\mathrm{Update}}=4$, and soft update coefficient $\tau=0.001$. Moreover, the ReLU activation function is used between consecutive layers of the neural networks. Further, the training process is limited to a maximum of 100,000 episodes, at which point training is stopped.

For each HMM system, the proposed data collection policy is trained offline using synthetically generated belief state trajectories rather than real-world data. These belief trajectories are created by simulating the transitions in each system under different actions, allowing us to train our proposed policy without the need for costly or time-consuming real-time data collection. This offline training phase enables learning effective policies that can be applied to real-time data collection scenarios. The trained policies are finally tested for real-time data collection in the real systems and their performance is compared with MAP, active learning, and BOReL policies. The results are derived from the average of 1000 trials, with 95% confidence bounds also reported where possible.

A. A General HMM Problem with Unknown State and Measurement Processes

Our proposed Bayesian data collection method is first tested on an HMM problem consisting of four states. The state transition diagram of this system is visualized in Fig. 4. The system's four states and three actions are represented as $\mathcal{X}=\{x^1, x^2, x^3, x^4\}$ and $\mathcal{A}=\{a^1, a^2, a^3\}$. The unknown parts of the HMM include the state transition probabilities from state $x^1$ to the other states under actions $a^1$ and $a^2$. This leads to the following parameters for the state transitions:

$$\psi = \big(\psi_1 = (q_{11}^1, q_{12}^1, q_{13}^1, q_{14}^1),\; \psi_2 = (q_{11}^2, q_{12}^2, q_{13}^2, q_{14}^2)\big), \tag{30}$$

where $q_{ij}^l = p(x_{k+1}=x^j \mid x_k = x^i, a_k = a^l)$. Each $q_{ij}^l$ is assumed to take its value from the set $\{0, 1/3, 1/2, 1\}$, subject to $q_{11}^l + q_{12}^l + q_{13}^l + q_{14}^l = 1$, for $l=1,2$. This results in 14 possibilities for each of the parameter vectors $\psi_1$ and $\psi_2$ as:

$$\psi_i \in \Psi_i = \big\{(1/2,1/2,0,0),\, (1/2,0,1/2,0),\, (1/2,0,0,1/2),\, (0,1/2,1/2,0),\, (0,1/2,0,1/2),\, (0,0,1/2,1/2),\, (1,0,0,0),\, (0,1,0,0),\, (0,0,1,0),\, (0,0,0,1),\, (1/3,1/3,1/3,0),\, (1/3,1/3,0,1/3),\, (1/3,0,1/3,1/3),\, (0,1/3,1/3,1/3)\big\} \tag{31}$$

for $i=1,2$. Therefore, the parameter set $\Psi=\Psi_1\times\Psi_2$, consisting of all unknown parameters of the state transition, has a size of $14 \times 14 = 196$, with the true parameters $\psi^*=(\psi_1^*=(0,1/2,1/2,0),\, \psi_2^*=(0,1/3,1/3,1/3))$ according to Fig. 4. Notice that $\psi_1^*$ and $\psi_2^*$ correspond to the 4th and 14th elements of the sets in (31), respectively.
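The 14-element set in (31) is simply all length-4 vectors over $\{0, 1/3, 1/2, 1\}$ that sum to one. A short sketch (not from the paper) that reproduces this count:

```python
from fractions import Fraction
from itertools import product

# All vectors (q_11, q_12, q_13, q_14) with entries in {0, 1/3, 1/2, 1} that
# sum to 1, reproducing the 14-element set Psi_i in (31).
values = [Fraction(0), Fraction(1, 3), Fraction(1, 2), Fraction(1)]
psi_set = [v for v in product(values, repeat=4) if sum(v) == 1]
print(len(psi_set))  # 14
```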

Fig. 4:

Transition diagram of the HMM problem.

The system state is assumed to be partially observable through the following measurement process:

$$y_k = \begin{cases} x_k & \text{with probability } \phi \\ x \in \mathcal{X}\setminus\{x_k\} & \text{with probability } 1-\phi, \end{cases} \tag{32}$$

where the observation $y_k$ equals the true underlying state $x_k$ with probability $\phi$ and a non-true state with probability $1-\phi$. We also assume $\phi$ to be an unknown parameter of the system, taking its value in the measurement parameter set $\Phi=\{0.25, 0.55, 0.85\}$ with the true parameter $\phi^*=0.85$.

Given the unknown parts of the transition probabilities and the measurement process, the entire parameter space $\Theta = \Psi\times\Phi$ includes $196 \times 3 = 588$ distinct elements. Given the state space size of 4, the belief state takes its value in $\Delta_{4\times 588}$. This represents a large and continuous space for the belief state, underscoring the need for a systematic and efficient data collection strategy to accurately infer the unknown parameters.

Fig. 5 represents the average negative entropy achieved under different policies. The proposed Bayesian data collection method is trained using the reward in (16). One can see that the proposed data collection method reaches a higher average negative entropy in almost all the steps compared to the other methods. The next best performance belongs to the BOReL policy, followed by the active learning policy. The worst performance is for the MAP policy, with minimal change observed in the average negative entropy after reaching step 20. The results indicate the significance of accounting for the uncertainty in the future horizons for effective data collection during the inference process.

Fig. 5:

Performance comparison of different data collection policies for entropy reduction of the general HMM problem.

In the next experiment, we investigate the effectiveness of the proposed policy in inferring a subset of the unknown parameters rather than all of them. This is referred to as causal inference since the data collection policy should focus solely on these subsets. Given that the entire parameter vector consists of $\theta = (\psi, \phi) = (\psi_1, \psi_2, \phi)$, three causal cases for inferring $\psi$, $\psi_1$, and $\psi_2$ are considered. Specifically, the reward function for the causal inference of the parameter $\psi_1$, while the other parameters $(\psi_2, \phi)$ are also unknown, can be expressed as:

$$R_{\psi_1} = \max_{\psi_1\in\Psi_1} p(\psi_1 \mid y_{1:k+1}, a_{0:k}) - \max_{\psi_1\in\Psi_1} p(\psi_1 \mid y_{1:k}, a_{0:k-1}), \tag{33}$$

where $p(\psi_1 \mid y_{1:k}, a_{0:k-1}) = \sum_{\psi_2\in\Psi_2}\sum_{\phi\in\Phi} p(\psi_1, \psi_2, \phi \mid y_{1:k}, a_{0:k-1})$. The rewards $R_{\psi_2}$ and $R_{\psi}$ can be derived similarly. More information about the causal reward functions is provided in the supplementary document.
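Computationally, the causal reward only requires marginalizing the joint model posterior before taking the maximum. A minimal sketch, assuming the posterior is stored as a 3-D array indexed by $(\psi_1, \psi_2, \phi)$ (an illustrative layout, not the authors' data structure):

```python
import numpy as np

def causal_reward_psi1(joint_post_k, joint_post_k1):
    """Causal reward (33): gain in the maximum marginal posterior of psi_1.

    joint_post_k, joint_post_k1 : arrays of shape (|Psi_1|, |Psi_2|, |Phi|)
    holding the joint model posterior at steps k and k+1.
    """
    marg_k = joint_post_k.sum(axis=(1, 2))     # marginalize out psi_2 and phi
    marg_k1 = joint_post_k1.sum(axis=(1, 2))
    return float(marg_k1.max() - marg_k.max())
```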

Fig. 6 represents the performance of different data collection policies in terms of average true model posterior probability for inferring each subset of unknown parameters. Columns 1, 2, and 3 of this figure illustrate the average results over 100 steps for the parameter sets Ψ,Ψ1, and Ψ2, respectively. Additionally, Fig. 6(a), (b), and (c) in each of the rows showcase the data collection results for inference of unknown parameter vectors ψ,ψ1, and ψ2, correspondingly.

Fig. 6:

Performance of different data collection policies in terms of true model posterior probability for inference of three parameter vectors: a) $\psi=(q_{11}^1, q_{12}^1, q_{13}^1, q_{14}^1, q_{11}^2, q_{12}^2, q_{13}^2, q_{14}^2)$; b) $\psi_1=(q_{11}^1, q_{12}^1, q_{13}^1, q_{14}^1)$; c) $\psi_2=(q_{11}^2, q_{12}^2, q_{13}^2, q_{14}^2)$.

In Fig. 6(a), our method reaches an average true model posterior probability of 20% after 100 steps for the inference of $\psi$, which is higher than the performance obtained by all the other policies in this case. Moreover, case (a) shows a better performance for inferring $\psi$ compared to cases (b) and (c). This is because in case (a), the proposed method has superior performance in inferring both $\psi_1$ and $\psi_2$ simultaneously compared to the other policies, whereas cases (b) and (c) aim to infer only one of $\psi_1$ or $\psi_2$. Further, Fig. 6(b) shows that the proposed method achieves an average true model posterior probability of nearly 80% for inferring the unknown parameter vector $\psi_1$ after 100 steps, whereas the other policies perform much worse: the BOReL policy reaches about 40%, the active learning policy less than 40%, and the MAP policy less than 20% after 100 steps. Also, our policy in case (b) performs better in inferring $\psi_1$ compared to the other cases, which aligns with its goal; however, it has lower performance for inferring $\psi$ and $\psi_2$ compared to the proposed policy in the other cases. Finally, one can see in Fig. 6(c) that our method outperforms the other policies at all steps for inferring $\psi_2$. In this case, the proposed method reaches about 60% performance, the active learning and BOReL policies reach about 20%, and the MAP policy remains below 20% after 100 steps. Additionally, as expected, case (c) shows the best performance for inference of $\psi_2$ compared to the other cases.

For further analysis of the last experiments, a snapshot of the average posterior probabilities for the different elements of the parameter sets $\Psi_1$ and $\Psi_2$ at time step 100 is visualized for the different inference cases in Fig. 7. As a reminder, the 4th element of $\Psi_1$ and the 14th element of $\Psi_2$ correspond to $\psi_1^*$ and $\psi_2^*$, respectively. It can be seen that in case (a), these elements have the largest average posterior probabilities among all elements under the proposed policy, which shows that our method has recognized the true underlying parameters with high confidence. In contrast, the other policies show low and nearly identical posterior probabilities across the elements and fail to distinguish the true parameters. In case (b), the 4th element of $\Psi_1$ under our approach shows the highest posterior probability among all the cases. However, in this case, the average posterior probabilities of the elements of $\Psi_2$ are almost the same under all the policies. This is because in case (b), the focus was on inferring the parameter vector $\psi_1$, and given this objective, the best set of data collection actions was selected to maximize the confidence for inferring $\psi_1$ without considering the consequences for the inference of $\psi_2$. Lastly, in case (c), the 14th element of $\Psi_2$ has the largest average posterior probability among all elements under the proposed policy. This clearly shows the effectiveness of our approach in targeted data collection for inference of the parameter vector $\psi_2$.

Fig. 7:

The average posterior probabilities of the parameter sets $\Psi_1$ and $\Psi_2$ at time step 100 for three data collection objectives: a) inferring the parameter vector $\psi$; b) causal inference of $\psi_1$; c) causal inference of $\psi_2$.

B. p53-MDM2 Regulatory Network with Unknown State Process

Gene regulatory networks are commonly modeled through HMMs with binary state variables [46]–[49]. We consider the p53-MDM2 gene regulatory network [50]–[53] shown in Fig. 8 for our analysis. The suppressive interactions are shown using blunt arrows, and activating regulations are represented by standard arrows. The protein known as p53, which is produced from the p53 gene, is crucial for regulating how cells respond to different stress signals that have the potential to induce genomic instability.

Fig. 8:

Activation/repression pathway diagram for the p53-MDM2 network model.

The p53-MDM2 network consists of 4 genes, with the state vector at time step $k$ represented by $x_k=[x_k(1), x_k(2), x_k(3), x_k(4)]^T=[\mathrm{ATM}_k, \mathrm{p53}_k, \mathrm{Wip1}_k, \mathrm{MDM2}_k]^T$. Each state variable takes a value from the binary set $x_k(i)\in\{0,1\}$, where 1 and 0 represent the activation and inactivation of the $i$th gene, respectively. The state process for this gene regulatory network can be represented by [54]:

$$x_k = \overline{R\, x_{k-1}} \oplus a_{k-1} \oplus n_k, \quad \text{for } k = 1, 2, \ldots, \tag{34}$$

where $\bar{v}$ maps the positive elements of a vector $v$ to 1 and the others to 0, $\oplus$ indicates component-wise modulo-2 addition (i.e., XOR), $a_{k-1}$ is the action (i.e., perturbation or excitation) at time $k-1$, and $n_k$ is the transition noise at time $k$, modeled by independent Bernoulli random variables with intensity $p=0.001$. Additionally, $R$ in (34) is the connectivity matrix of size $4\times 4$, expressing the types of interactions between the genes, represented by:

$$R = \begin{bmatrix} 0 & +1 & 0 & -1 \\ 0 & 0 & +1 & +1 \\ -1 & -1 & 0 & +1 \\ 0 & -1 & 0 & 0 \end{bmatrix}^{T}. \tag{35}$$

The element in the $i$th row and $j$th column of the connectivity matrix $R$ is denoted as $r_{ij}$, which represents the type of regulation from gene $i$ to gene $j$. $r_{ij}$ takes its value in $\{-1, 0, +1\}$, where $+1$ and $-1$ represent positive and negative regulation, respectively, and 0 indicates no regulation. The following 5 regulations in (35) are assumed to be unknown: $\theta=(r_{21}, r_{24}, r_{42}, r_{22}, r_{41})$, with true values $\theta^*=(+1, -1, +1, 0, -1)$. Since each regulation can take three values in $\{-1, 0, +1\}$, the parameter set $\Theta$ consists of $3^5=243$ distinct parameter vectors. With $2^4=16$ states in the p53-MDM2 network, the belief space for this problem can be characterized as $\Delta_{16\times 243}$.

The following action or perturbation space is considered for our analyses: $\mathcal{A}=\{a^1=[1\,0\,0\,0]^T, a^2=[0\,1\,0\,0]^T, a^3=[0\,0\,1\,0]^T, a^4=[0\,0\,0\,1]^T, a^5=[0\,0\,0\,0]^T\}$. The first four actions alter the state value of a single gene at each time step. For instance, considering (34), $a^1$ flips the first gene's value from 0 to 1 or from 1 to 0. Note that this assumption closely aligns with common practice for data collection in gene regulatory networks in experimental settings, where drugs designed for targeted genes are used for gene excitation and data collection [55]. The fifth action, $a^5$, leaves the state of all genes unchanged.

The specific type of gene-expression data determines the characteristics of the HMM observation model. This paper considers the following Gaussian model, which is often used for representing cDNA microarrays or live-cell imaging-based assays [56], [57]:

$$y_k(i) = m + \delta\, x_k(i) + v_k(i), \quad k = 1, 2, \ldots, \tag{36}$$

for $i=1,\ldots,4$. Here, $v_k(i)\sim\mathcal{N}(0, \sigma^2)$ represents zero-mean Gaussian noise with uncorrelated elements, $m$ denotes the baseline expression, which refers to the "zero" states, and $\delta$ is the differential expression value. For our experiments, we assume that the parameters of the observation model in (36) are known and given as follows: $m=20$, $\delta=30$, and $\sigma=5$.
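For illustration, one simulated step of this HMM combines the Boolean transition (34), using the connectivity matrix in (35), with the Gaussian observation model (36). The following is a minimal sketch under the stated parameter values; the helper name and interface are illustrative, not from the paper:

```python
import numpy as np

# Connectivity matrix R of (35); genes ordered ATM, p53, Wip1, MDM2.
R = np.array([[ 0, +1,  0, -1],
              [ 0,  0, +1, +1],
              [-1, -1,  0, +1],
              [ 0, -1,  0,  0]]).T

def p53_step(x, a, rng, p=0.001, m=20.0, delta=30.0, sigma=5.0):
    """One simulated step of the p53-MDM2 HMM: Boolean transition (34)
    followed by the Gaussian observation model (36)."""
    noise = (rng.random(4) < p).astype(int)        # Bernoulli(p) noise n_k
    x_next = (R @ x > 0).astype(int) ^ a ^ noise   # overline(R x) XOR a XOR n_k
    y = m + delta * x_next + sigma * rng.standard_normal(4)
    return x_next, y

rng = np.random.default_rng(0)
x0 = np.array([1, 0, 0, 0])   # ATM active, others inactive
a1 = np.array([0, 1, 0, 0])   # perturb the p53 gene
x1, y1 = p53_step(x0, a1, rng)
```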

The performance of the proposed method using the point-based reward in (15) is assessed in terms of average maximum posterior probability for 50 time steps. Fig. 9 shows that our method achieves a maximum posterior probability of nearly 1 after about 15 steps, whereas other approaches exhibit notably lower performance, with active learning and MAP reaching around 0.3 and BOReL demonstrating a performance of 0.5 by the 15th step. Further, after 50 steps, active learning and MAP policies achieve a performance of less than 0.8, while BOReL reaches about 0.9, which still falls short of the performance demonstrated by the proposed method. The active learning and MAP policies exhibit nearly identical performance, primarily due to their greedy nature. Unlike our approach, these policies prioritize immediate gains without considering the long-term implications of their actions on the inference objective, resulting in slower improvement over time. Meanwhile, BOReL’s lack of joint modeling of state and parameter uncertainty has also led to a lower performance compared to our proposed method.

Fig. 9:

Performance of different data collection policies for inferring all unknown parameters of the p53-MDM2 network.

Four variations of the previous experiment are conducted to evaluate the robustness of our method under different noise conditions. Two process noise levels ($p=0.001, 0.05$) and two measurement noise levels ($\sigma=5, 10$) are considered, resulting in four combinations. The performance of the various data collection methods in these experiments is reported in terms of the average maximum posterior probability per step. As shown in Table I, performance declines as noise increases. However, our method consistently outperforms all other approaches across all noise scenarios, demonstrating its robustness in noisy environments.

TABLE I:

Average maximum posterior probability per step obtained using different data collection policies for inferring unknown parameters of the p53-MDM2 network under various process noise (p) and measurement noise (σ) values.

| σ | p | Proposed Policy | Active Learning | MAP Policy | BOReL Policy |
|---|---|---|---|---|---|
| 5 | 0.001 | 0.811 | 0.457 | 0.445 | 0.638 |
| 5 | 0.05 | 0.701 | 0.425 | 0.421 | 0.632 |
| 10 | 0.001 | 0.654 | 0.382 | 0.369 | 0.499 |
| 10 | 0.05 | 0.485 | 0.265 | 0.263 | 0.329 |

The performance of the proposed method for causal inference of the unknown regulation $r_{24}=-1$, while the other regulations are also unknown, is analyzed here. Fig. 10(a) shows the average maximum posterior probability over the entire parameter set (i.e., $\Theta$). The low average maximum posterior probability over the entire parameter space under the proposed method comes from the focus of the proposed policy on the inference of the regulation $r_{24}$. The average posterior probabilities of all the unknown regulations under the proposed method and the active learning method are presented in Fig. 10(b). The average posterior probabilities of regulations being $-1$, 0, and $+1$ are shown in black, blue, and red, respectively. It can be seen that the posterior probability of $r_{24}$ being $-1$ increases quickly to near 1 after collecting only 10 data points and reaches 1 after 15, whereas the same posterior probability increases very slowly using the active learning and BOReL policies. This posterior probability reaches close to 1 after 50 steps using the active learning policy, and it reaches about 0.75 using the BOReL policy. Moreover, Fig. 10(b) illustrates that the true model posterior probabilities of the regulations $r_{21}=+1$ and $r_{42}=+1$ have increased considerably under data collected for inference of $r_{24}$ by the proposed policy, whereas these data have not substantially increased the true model posterior probabilities of $r_{22}=0$ and $r_{41}=-1$. This justifies the low maximum posterior probability over the entire parameter set in Fig. 10(a), as well as the capability of the proposed policy in systematically taking actions for causal inference.

Fig. 10:

(a) The average posterior probability over the entire parameter set, and (b) the average posterior probabilities of all the unknown interactions, for causal inference of the unknown regulation $r_{24}$ in the p53-MDM2 network.

The distribution-based criterion (entropy) in (16) is considered in this part for the inference of the p53-MDM2 network. Fig. 11 shows the average distribution of the posterior probabilities under different policies at time steps 5 and 10. It can be seen that after only 5 steps/data, the posterior probabilities over the parameter set $\Theta$ are more peaked (roughly between models 150 and 243) under the proposed policy, and the true parameter belongs to this set. Further, it can be seen that with 10 data points, the posterior probability of the true parameter (model 184), shown in red, peaks at almost 1, which means that the true model has been identified with high confidence after only 10 data points. On the other hand, the other policies show a more uncertain (less peaked) distribution of the posterior probabilities at both time steps 5 and 10. More results related to the p53-MDM2 network can be found in the supplementary materials.

Fig. 11:

Average posterior probabilities of the parameter set $\Theta$ at time steps 5 and 10 while reducing the entropy using different data collection policies; the non-true models are shown in black, while the true model is indicated in red.

C. Littman’s Network with Unknown Measurement Process

Littman's network [58] is considered to study the effectiveness of our proposed data collection method in communication networks. Littman's network comprises 7 states: $\mathcal{X}=\{x^1=00, x^2=20, x^3=40, x^4=60, x^5=80, x^6=100, x^7=\mathrm{CRASH}\}$. Each of $\{00, 20, \ldots, 100\}$ corresponds to the current load of the communication network, and CRASH refers to a state where the network is overloaded and eventually crashes. Note that when the system is in states $x^1=00$ and $x^7=\mathrm{CRASH}$, the network load cannot be decreased or increased, respectively. Four actions are defined for this network: $\mathcal{A}=\{a^1=\mathrm{UNRESTRICT}, a^2=\mathrm{STEADY}, a^3=\mathrm{RESTRICT}, a^4=\mathrm{REBOOT}\}$. UNRESTRICT means that the network should increase its load; conversely, RESTRICT means that the network should decrease its current load. STEADY is an action that keeps the network in its current state, and finally, REBOOT is an action that relieves all the load from the network. It is assumed that the state transition parameters for this system are completely known. Complete information about the transition probabilities is provided in the supplementary material.

The measurement process provides the status of the network, with the measurement space $\mathcal{Y}=\{y^1=\mathrm{UP}, y^2=\mathrm{DOWN}\}$. The measurement process can be expressed using the update vectors defined in (19) as:

$$T_{\theta}(y^1, a) = \big[\,1,\; 1,\; 1,\; t_1,\; t_2,\; t_3,\; 0\,\big]^T, \qquad T_{\theta}(y^2, a) = \big[\,0,\; 0,\; 0,\; 1-t_1,\; 1-t_2,\; 1-t_3,\; 1\,\big]^T,$$

for each $a\in\mathcal{A}$, where $\theta=(t_1, t_2, t_3)$ is a single vector expressing the unknown parameters, with the true parameters being $\theta^*=(t_1^*=0.8,\, t_2^*=0.5,\, t_3^*=0.2)$. The unknown elements of the update vector can take values $t_i\in\{0.2, 0.5, 0.8\}$, leading to a parameter set $\Theta$ of size $3^3=27$. Since Littman's network has 7 states, the belief space for this problem can be described as $\Delta_{7\times 27}$.
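To make the observation model concrete, the sketch below (not from the paper) builds the two update vectors for a candidate $\theta=(t_1, t_2, t_3)$ and enumerates the 27-element parameter set; note that the CRASH entry of $T_\theta(y^2, a)$ is taken as 1 so that the UP/DOWN probabilities sum to one in every state:

```python
import numpy as np
from itertools import product

def littman_update_vectors(theta):
    """Update vectors of (19) for Littman's network under theta = (t1, t2, t3);
    entries ordered by state (00, 20, 40, 60, 80, 100, CRASH)."""
    t1, t2, t3 = theta
    T_up = np.array([1.0, 1.0, 1.0, t1, t2, t3, 0.0])  # p(UP | state)
    T_down = 1.0 - T_up                                 # p(DOWN | state)
    return T_up, T_down

# Parameter set Theta: each t_i in {0.2, 0.5, 0.8}, giving 3^3 = 27 models.
Theta = list(product([0.2, 0.5, 0.8], repeat=3))
print(len(Theta))  # 27
```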

The results of the proposed method for causal inference of each of the three unknown parameters, $t_1$, $t_2$, and $t_3$, are presented in Fig. 12. Note that each plot in Fig. 12 represents the average true posterior probability for the causal parameter that the methods aim to infer. For instance, Fig. 12(c) shows that the proposed method achieves a true posterior probability near 0.8 for $t_3$, meaning that with a confidence of nearly 80%, $t_3$ is equal to its true value. In contrast, the active learning, BOReL, and MAP policies achieve much lower confidence values of about 45%, 40%, and 30%, respectively. Similarly, the proposed policy achieves a high confidence of about 75% in case (a), while the causal inference of parameter $t_2$ in case (b) has been more challenging, with a confidence near 40% achieved after 50 data points. This can be explained by the fact that $t_1^*$ and $t_3^*$ (0.8 and 0.2) correspond to less uncertain observation parameters, whereas $t_2^*=0.5$ is the parameter with the highest uncertainty in the observation model, making its inference more challenging than the others. Additional experiments on Littman's network are included in the supplementary materials.

Fig. 12: Performance comparison of different data collection techniques for three cases of causal inference in Littman's network: (a) causal inference of parameter t1; (b) causal inference of parameter t2; (c) causal inference of parameter t3.
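The confidence values reported in Fig. 12 are marginal posterior probabilities of individual parameters, obtained by summing the joint posterior over all candidate models that share the parameter value of interest. A minimal sketch under the same assumed |𝒳| × |Θ| belief layout as above (the helper name and signature are hypothetical):

```python
import numpy as np

def causal_parameter_posterior(belief, theta_set, index, value):
    """Posterior probability that parameter t_(index+1) equals `value`.

    belief:    |X| x |Theta| joint posterior over states and candidate models.
    theta_set: ordered list of candidate parameter vectors (e.g., THETA above).
    """
    p_theta = belief.sum(axis=0)                                    # marginal over models
    mask = np.array([theta[index] == value for theta in theta_set])
    return float(p_theta[mask].sum())

# Example: confidence that t3 equals its true value 0.2
# confidence_t3 = causal_parameter_posterior(belief, THETA, index=2, value=0.2)
```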

VI. Conclusion and Future Work

This paper developed a deep reinforcement learning data collection approach for Bayesian inference of general hidden Markov models. First, a belief state was defined as the joint distribution over the possible system states and models. Using this belief state, the optimal Bayesian lookahead data collection process was formulated to maximize inference performance under system and data uncertainty. Since the belief state is continuous, a deep reinforcement learning approach was used to approximate the optimal Bayesian data collection policy. The policy can be trained offline using trajectories simulated from the belief transition model, which allows it to account for the long-term effects of data collection decisions on inference performance rather than only their short-term impacts. The trained policy can then be deployed for sequential, real-time data collection as new observations become available from the real system. The effectiveness of the proposed method for point-based, distribution-based, and causal inference objectives was demonstrated through comprehensive numerical experiments on three systems.
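As a rough illustration of the offline training idea summarized above, the sketch below simulates a single belief-space trajectory with an entropy-reduction reward. It assumes, as in the Littman's network example, that state transitions are known and only the measurement model is uncertain, and it is a schematic sketch under these assumed data layouts rather than the authors' exact procedure; a deep Q-network or similar learner would consume the generated tuples.

```python
import numpy as np

def _model_entropy(belief):
    # Entropy of the marginal posterior over candidate models.
    p = belief.sum(axis=0)
    p = p[p > 0.0]
    return float(-np.sum(p * np.log(p)))

def simulate_training_episode(prior, transition, likelihood, policy, horizon, rng):
    """Roll out one simulated trajectory in the belief space for offline training.

    prior:       |X| x |Theta| joint belief over states and candidate models.
    transition:  dict mapping action -> |X| x |X| matrix of known transition probabilities.
    likelihood:  dict mapping action -> |Y| x |X| x |Theta| array of observation probabilities.
    policy:      any exploratory mapping from belief to an action.
    Returns a list of (belief, action, reward, next_belief) training tuples.
    """
    belief = prior.copy()
    n_states, n_models = belief.shape
    # Sample a ground-truth model and initial state for this simulated episode.
    theta = rng.choice(n_models, p=belief.sum(axis=0))
    state = rng.choice(n_states, p=belief[:, theta] / belief[:, theta].sum())
    experience = []
    for _ in range(horizon):
        action = policy(belief)
        # Simulate the hidden dynamics and a noisy observation under the true model.
        state = rng.choice(n_states, p=transition[action][state])
        y = rng.choice(likelihood[action].shape[0], p=likelihood[action][:, state, theta])
        # Bayesian belief update: predict with the known dynamics, correct with the data.
        predicted = transition[action].T @ belief
        corrected = likelihood[action][y] * predicted
        corrected /= corrected.sum()
        # One possible reward: reduction in model-posterior entropy.
        reward = _model_entropy(belief) - _model_entropy(corrected)
        experience.append((belief, action, reward, corrected))
        belief = corrected
    return experience
```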

Our future work will focus on scaling the proposed data collection method to handle larger systems with continuous state and action spaces. Additionally, we aim to explore more complex data collection objectives, such as fault detection, abnormality or attack detection, and systems with semi-Markov and non-Markovian dynamics. We also plan to investigate variations of our method for different domains, such as image processing, which would require compact representations of the belief state tailored to each domain. Furthermore, we will work on accelerating the training process and improving accuracy by incorporating techniques such as softmax. Finally, we will adapt our data collection method to other learning paradigms, including supervised and unsupervised learning.

Supplementary Material

Supplement

Impact Statement—

Recent technological advances leading to data growth have prompted a simultaneous focus on studying increasingly complex systems and phenomena. Modeling complex systems from available data might require thousands or millions of data points, which are often impossible to collect due to time or cost constraints. This paper provides a systematic approach for collecting targeted data depending on the objective of the analysis. The results demonstrate that the amount of data is not the only factor in accurate modeling of complex systems; rather, targeted and informative data can shed light on the underlying mechanisms of these systems. Examples of such systems arise in ecology, economics, engineering, and medicine. In particular, the biological application of this work will help biologists collect genomic data that best identifies specific interactions between genes causing chronic diseases.

Acknowledgment

The authors acknowledge the support of the National Institutes of Health award 1R21EB032480-01, National Science Foundation award IIS-2311969, Army Research Laboratory award W911NF-23-2-0207, Army Research Office awards W911NF-24-2-0166 and W911NF2110299, and Office of Naval Research award N00014-23-1-2850.

Biographies


Mohammad Alali is a fourth-year Ph.D. candidate in Electrical Engineering at Northeastern University. He received his B.Sc. degree in Electrical Engineering from the University of Tehran, Tehran, Iran, in 2018, and his M.Sc. degree in Electrical Engineering from Montana State University, Bozeman, USA, in 2021. His research interests lie in the domains of machine learning, Bayesian statistics, reinforcement learning, and experimental design. More specifically, he is currently investigating several projects focused on fast and efficient inference of observable and partially observable Markov models. He is the recipient of the Best Paper Finalist Award at the 20th IFAC Symposium on System Identification in 2024 and at the American Control Conference in 2023.


Mahdi Imani received his Ph.D. degree in Electrical and Computer Engineering from Texas A&M University, College Station, TX, in 2019. He is currently an Assistant Professor in the Department of Electrical and Computer Engineering at Northeastern University. His research interests include machine learning, Bayesian statistics, and decision theory, with applications spanning from computational biology to cyber-physical systems. He has received several prestigious awards, including the NIH NIBIB Trailblazer Award in 2022, the Oracle Research Award in 2022, the Outstanding Associate Editor Award from IEEE Transactions on Neural Networks and Learning Systems, the NSF CISE Career Research Initiation Initiative Award in 2020, the Association of Former Students Distinguished Graduate Student Award for Excellence in Research-Doctoral in 2019, and the Best Paper Finalist Award at the American Control Conference in 2023 and the 49th Asilomar Conference on Signals, Systems, and Computers in 2015. He currently serves as an Associate Editor for IEEE Transactions on Neural Networks and Learning Systems (TNNLS) and IEEE Transactions on Vehicular Technology (TVT).

References

[1] Li W and Zhang C, “A hidden Markov model for condition monitoring of time series data in complex network systems,” IEEE Transactions on Reliability, vol. 72, no. 4, pp. 1478–1492, 2023.
[2] Kazeminajafabadi A, Ghoreishi SF, and Imani M, “Optimal detection for Bayesian attack graphs under uncertainty in monitoring and reimaging,” in 2024 American Control Conference (ACC), 2024.
[3] Liu P, Qu T, Gao H, and Gong X, “Driving intention recognition of surrounding vehicles based on a time-sequenced weights hidden Markov model for autonomous driving,” Sensors, vol. 23, no. 21, 2023.
[4] Lin Y, Ghoreishi SF, Lan T, and Imani M, “High-level human intention learning for cooperative decision-making,” in 2024 IEEE Conference on Control Technology and Applications (CCTA), pp. 209–216, IEEE, 2024.
[5] Liu X, Shi K, Wang Z, and Chen J, “Exploit camera raw data for video super-resolution via hidden Markov model inference,” IEEE Transactions on Image Processing, vol. 30, pp. 2127–2140, 2021.
[6] Perikos I, Kardakis S, and Hatzilygeroudis I, “Sentiment analysis using novel and interpretable architectures of hidden Markov models,” Knowledge-Based Systems, vol. 229, p. 107332, 2021.
[7] Kazeminajafabadi A and Imani M, “Optimal joint defense and monitoring for networks security under uncertainty: A POMDP-based approach,” IET Information Security, vol. 2024, no. 1, p. 7966713, 2024.
[8] Mor B, Garhwal S, and Kumar A, “A systematic review of hidden Markov models and their applications,” Archives of Computational Methods in Engineering, vol. 28, pp. 1429–1448, 2021.
[9] Foti N, Xu J, Laird D, and Fox E, “Stochastic variational inference for hidden Markov models,” in Advances in Neural Information Processing Systems (Ghahramani Z, Welling M, Cortes C, Lawrence N, and Weinberger K, eds.), vol. 27, Curran Associates, Inc., 2014.
[10] Wiuf C, Behera A, Singh A, and Gopalkrishnan M, “A reaction network scheme for hidden Markov model parameter learning,” Journal of The Royal Society Interface, vol. 20, no. 203, p. 20220877, 2023.
[11] Mongillo G and Deneve S, “Online learning with hidden Markov models,” Neural Computation, vol. 20, pp. 1706–1716, 2008.
[12] Dewar M, Wiggins C, and Wood F, “Inference in hidden Markov models with explicit state duration distributions,” IEEE Signal Processing Letters, vol. 19, no. 4, pp. 235–238, 2012.
[13] Fearnhead P and Clifford P, “On-line inference for hidden Markov models via particle filters,” Journal of the Royal Statistical Society Series B: Statistical Methodology, vol. 65, pp. 887–899, 2003.
[14] Owen J, Wilkinson DJ, and Gillespie CS, “Scalable inference for Markov processes with intractable likelihoods,” Statistics and Computing, vol. 25, pp. 145–156, 2015.
[15] Wieland F-G, Hauber AL, Rosenblatt M, Tönsing C, and Timmer J, “On structural and practical identifiability,” Current Opinion in Systems Biology, vol. 25, pp. 60–69, 2021.
[16] Jakaria AHM, Rahman MA, Asif M, Khalil AA, Kholidy HA, Anderson M, and Drager S, “Trajectory synthesis for a UAV swarm based on resilient data collection objectives,” IEEE Transactions on Network and Service Management, vol. 20, no. 1, pp. 138–151, 2023.
[17] Levine S, Pastor P, Krizhevsky A, Ibarz J, and Quillen D, “Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection,” The International Journal of Robotics Research, vol. 37, no. 4–5, pp. 421–436, 2018.
[18] Zheng W, Wu B, and Lin H, “POMDP model learning for human robot collaboration,” in 2018 IEEE Conference on Decision and Control (CDC), pp. 1156–1161, 2018.
[19] Atrash A and Pineau J, “A Bayesian method for learning POMDP observation parameters for robot interaction management systems,” in The POMDP Practitioners Workshop, 2010.
[20] Alali M, KazemiNajafabadi A, and Imani M, “Deep reinforcement learning sensor scheduling for effective monitoring of dynamical systems,” Systems Science & Control Engineering, 2024.
[21] Zandi R, Behzad K, Motamedi E, Salehinejad H, and Siami M, “RoboFiSense: Attention-based robotic arm activity recognition with WiFi sensing,” IEEE Journal of Selected Topics in Signal Processing, 2024.
[22] Asadi N and Ghoreishi SF, “Active learning for efficient data acquiring in coupled multidisciplinary systems,” in 4th Modeling, Estimation and Control Conference (MECC), 2024.
[23] De Oliveira ACB, Siami M, and Sontag ED, “Sensor and actuator scheduling in bilinear dynamical networks,” in 2022 IEEE 61st Conference on Decision and Control (CDC), pp. 3568–3573, 2022.
[24] Liu H, Li Y, Johansson KH, Mårtensson J, and Xie L, “Rollout approach to sensor scheduling for remote state estimation under integrity attack,” Automatica, vol. 144, p. 110473, 2022.
[25] Engström J, Wei R, McDonald AD, Garcia A, O’Kelly M, and Johnson L, “Resolving uncertainty on the fly: modeling adaptive driving behavior as active inference,” Frontiers in Neurorobotics, vol. 18, 2024.
[26] Read RJ, Oeffner RD, and McCoy AJ, “Measuring and using information gained by observing diffraction data,” Acta Crystallographica Section D: Structural Biology, vol. 76, no. 3, pp. 238–247, 2020.
[27] Tosh C, Tec M, and Tansey W, “Targeted active learning for probabilistic models,” arXiv preprint arXiv:2210.12122, 2022.
[28] Parr T, Friston K, and Zeidman P, “Active data selection and information seeking,” Algorithms, vol. 17, no. 3, 2024.
[29] Jha A, Ashwood ZC, and Pillow JW, “Active learning for discrete latent variable models,” Neural Computation, vol. 36, pp. 437–474, 2024.
[30] Alali M and Imani M, “Bayesian reinforcement learning for navigation planning in unknown environments,” Frontiers in Artificial Intelligence, vol. 7, p. 1308031, 2024.
[31] Ravari A, Ghoreishi SF, and Imani M, “Implicit human perception learning in complex and unknown environments,” in American Control Conference (ACC), IEEE, 2024.
[32] Shen H, Wang Y, Wang J, and Park JH, “A fuzzy-model-based approach to optimal control for nonlinear Markov jump singularly perturbed systems: A novel integral reinforcement learning scheme,” IEEE Transactions on Fuzzy Systems, vol. 31, no. 10, pp. 3734–3740, 2023.
[33] Ravari A, Ghoreishi SF, and Imani M, “Optimal inference of hidden Markov models through expert-acquired data,” IEEE Transactions on Artificial Intelligence, vol. 5, no. 8, pp. 3985–4000, 2024.
[34] Dorfman R, Shenfeld I, and Tamar A, “Offline meta reinforcement learning - identifiability challenges and effective data collection strategies,” in Advances in Neural Information Processing Systems (Ranzato M, Beygelzimer A, Dauphin Y, Liang P, and Vaughan JW, eds.), vol. 34, pp. 4607–4618, Curran Associates, Inc., 2021.
[35] Xue Y and Bogdan P, “Constructing compact causal mathematical models for complex dynamics,” in 2017 ACM/IEEE 8th International Conference on Cyber-Physical Systems (ICCPS), pp. 97–108, 2017.
[36] Gupta G, Yin C, Deshmukh JV, and Bogdan P, “Non-Markovian reinforcement learning using fractional dynamics,” in 2021 60th IEEE Conference on Decision and Control (CDC), pp. 1542–1547, 2021.
[37] Gupta G, Pequito S, and Bogdan P, “Learning latent fractional dynamics with unknown unknowns,” in 2019 American Control Conference (ACC), pp. 217–222, 2019.
[38] Kim T and Song E, “Optimizing input data acquisition for ranking and selection: A view through the most probable best,” in 2022 Winter Simulation Conference (WSC), pp. 2258–2269, 2022.
[39] Kantas N, Doucet A, Singh SS, Maciejowski J, and Chopin N, “On particle methods for parameter estimation in state-space models,” Statistical Science, vol. 30, no. 3, pp. 328–351, 2015.
[40] Bassett R and Deride J, “Maximum a posteriori estimators as a limit of Bayes estimators,” Mathematical Programming, vol. 174, pp. 129–144, 2019.
[41] Shannon CE, “A mathematical theory of communication,” The Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
[42] Bacci G, Ingólfsdóttir A, Larsen KG, and Reynouard R, “Active learning of Markov decision processes using Baum-Welch algorithm,” in 2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 1203–1208, IEEE, 2021.
[43] Sutton RS and Barto AG, Reinforcement Learning: An Introduction. Cambridge, MA, USA: A Bradford Book, 2018.
[44] Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
[45] Kingma DP and Ba J, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2015.
[46] Liu W and Rajapakse J, “Fusing gene expressions and transitive protein-protein interactions for inference of gene regulatory networks,” BMC Systems Biology, vol. 13, 2019.
[47] Hosseini SH and Imani M, “Learning to fight against cell stimuli: A game theoretic perspective,” in IEEE Conference on Artificial Intelligence, 2023.
[48] Pechmann M, Kenny NJ, Pott L, Heger P, Chen Y-T, Buchta T, Özüak O, Lynch J, and Roth S, “Striking parallels between dorsoventral patterning in Drosophila and Gryllus reveal a complex evolutionary history behind a model gene regulatory network,” eLife, vol. 10, p. e68287, 2021.
[49] Alali M and Imani M, “Bayesian optimization for state and parameter estimation of dynamic networks with binary space,” in 8th IEEE Conference on Control Technology and Applications (CCTA), 2024.
[50] Batchelor E, Loewer A, and Lahav G, “The ups and downs of p53: understanding protein dynamics in single cells,” Nature Reviews Cancer, vol. 9, no. 5, pp. 371–377, 2009.
[51] Hosseini SH and Imani M, “An optimal Bayesian intervention policy in response to unknown dynamic cell stimuli,” Information Sciences, vol. 666, p. 120440, 2024.
[52] Ravari A, Ghoreishi SF, and Imani M, “Optimal recursive expert-enabled inference in regulatory networks,” IEEE Control Systems Letters, vol. 7, pp. 1027–1032, 2023.
[53] Hosseini SH and Imani M, “Modeling defensive response of cells to therapies: Equilibrium interventions for regulatory networks,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, pp. 1–13, 2024.
[54] Imani M and Braga-Neto UM, “Maximum-likelihood adaptive filter for partially observed Boolean dynamical systems,” IEEE Transactions on Signal Processing, vol. 65, no. 2, pp. 359–371, 2017.
[55] Sisodiya SM, “Precision medicine and therapies of the future,” Epilepsia, vol. 62, no. S2, pp. S90–S105, 2021.
[56] Chen Y et al., “Ratio-based decisions and the quantitative analysis of cDNA microarray images,” Journal of Biomedical Optics, vol. 2, no. 4, pp. 364–374, 1997.
[57] Hua J, Sima C, Cypert M, Gooden GC, Shack S, Alla L, Smith EA, Trent JM, Dougherty ER, and Bittner ML, “Dynamical analysis of drug efficacy and mechanism of action using GFP reporters,” Journal of Biological Systems, vol. 20, no. 04, pp. 403–422, 2012.
[58] Littman ML, Algorithms for sequential decision-making. PhD thesis, Brown University, Providence, RI, USA, 1996.
