Author manuscript; available in PMC: 2019 Dec 1.
Published in final edited form as: IEEE Signal Process Lett. 2018 Oct 26;25(12):1870–1874. doi: 10.1109/LSP.2018.2878066

Optimal Query Selection Using Multi-Armed Bandits

Aziz Koçanaoğulları†,#, Yeganeh M. Marghi†,#, Murat Akçakaya, Deniz Erdoğmuş
PMCID: PMC6777547  NIHMSID: NIHMS1512096  PMID: 31588169

Abstract

Query selection for latent variable estimation is conventionally performed by opting for observations with low noise or optimizing information theoretic objectives related to reducing the level of estimated uncertainty based on the current best estimate. In these approaches, typically the system makes a decision by leveraging the current available information about the state. However, trusting the current best estimate results in poor query selection when truth is far from the current estimate, and this negatively impacts the speed and accuracy of the latent variable estimation procedure. We introduce a novel sequential adaptive action value function for query selection using the multi-armed bandit (MAB) framework which allows us to find a tractable solution. For this adaptive-sequential query selection method, we analytically show: (i) performance improvement in the query selection for a dynamical system, (ii) the conditions where the model outperforms competitors. We also present favorable empirical assessments of the performance for this method, compared to alternative methods, both using Monte Carlo simulations and human-in-the-loop experiments with a brain computer interface (BCI) typing system where the language model provides the prior information.

Keywords: Subset selection, Query optimization, Misleading prior, Multi-armed bandit framework

I. Introduction

Recursive state estimation plays a key role in signal processing and system identification. The recursive paradigm is often used to extract information about model parameters or the states of a dynamic system in real time, given noisy observations. Bayesian methods are valuable decision-making approaches, since they take into account prior knowledge about the system gathered from experience and previous observations (history). In stochastic dynamic systems, maximum a posteriori (MAP) inference is commonly used to estimate the state variables. To estimate the state with a high (usually pre-defined) confidence, the system probes the environment through multiple recursions of sequences of queries, which slows the convergence of the state estimation. Therefore, the queries need to be designed specifically to optimize both the speed and the accuracy of the state estimation. Query selection/optimization in recursive state estimation is often performed by greedy selection using: (i) expected posterior maximization [1]; (ii) Fisher information-based approaches [2], [3]; and (iii) information theory-based approaches such as entropy minimization or maximum mutual information (MMI) [4], [5], [6], [7]. It has been shown that all of these approaches to optimum sequence design through query selection lead to the selection of the N-best queries with respect to the current belief [8], [9].

In estimation problems, the system may have access to additional knowledge, called prior information, on the state of the system, which can improve the estimation process. However, imprecise prior information may lead to incorrect posterior beliefs given the same set of observations. Accordingly, choosing the N-best queries by trusting the current belief does not always offer the best query optimization. In dynamic systems, the prior information about the environment can be adversarial due to transition noise, observation noise, changes in environment distributions, or being outdated, resulting in longer decision cycles or wrong state estimates [10], [11]. Many applications involving state estimation, system identification, or sequential decision making with prior information encounter these challenges: recommender systems [12], communication networks [13], [14], radar systems [15], and clinical studies [16], [17]. The common problem in all of these applications is that misleading information can severely impact the final decision or estimate. Another category of methods to overcome misleading information is variance-based methods [15], [18], [19], which can be extended using Fisher information [20] to either explore or exploit. The main drawback of these methods is that they commit to only exploring or only exploiting in the query selection, which leads to the same solution provided by the N-best method [8], [9].

In this paper, we propose an information theoretic query selection that reduces the ambiguity in the state estimation (exploitation) while also measuring the credibility of the prior information (exploration). The proposed objective function is a linear combination of exploration and exploitation terms. Moreover, we reformulate the query selection as a multi-armed bandit (MAB) problem. We denote this framework as MAB based on State-Measurement-MI and State-Posterior-Momentum for RBSE. The MAB framework is a well-studied approach to formulating the learning process under available observations [21], [22], [23], [24]. MAB has been proposed for decision making, predictive entropy search for sequential action selection, and estimation applications [25], [26], [27], [28], [29], [30]. Such applications consider MAB-based sequential selection, repeatedly choosing among elements of a finite set of states [28], [25]. This formulation enables us to analytically demonstrate that the proposed query selection policy, which combines exploration and exploitation, performs at least as well as methods that rely only on the exploitation of the current belief. The MAB framework also makes the subset query selection optimization tractable, particularly through sequential selection with theoretical guarantees [31], [21].

The novel contributions of this paper can be summarized as: (i) introducing a new action-value function for query selection that uses changes in the posterior to encourage exploration in the MAB setting; (ii) providing a short-term policy evaluation to demonstrate that the proposed method has theoretical guarantees under certain assumptions; and (iii) evaluating the proposed method in an actual human-in-the-loop typing scheme employing a language-model-assisted electroencephalogram (EEG)-based BCI typing system called RSVP Keyboard. Because of space limitations, we present the proofs of the analytical propositions in the supplementary material. The system code is under revision and the current version can be accessed at https://github.com/BciPy [32].

II. Preliminaries

In the framework of the state estimation problem, we refer to $\sigma$ as the (unknown) state, an element of a finite set $A$. The system (learner) proceeds with the estimation through a sequential decision-making process consisting of sequences, indexed by $s$, of multiple trials, indexed by $i$. We denote a list of variables with $\{:\}$; for example, $\Phi_{0:s}$ represents a sequence of variables from 0 to $s$. The query and evidence sets at sequence $s$ are denoted by $\Phi_s \triangleq \{\phi_s^1, \ldots, \phi_s^K\}$ and $\varepsilon_s \triangleq \{\varepsilon_s^1, \ldots, \varepsilon_s^K\}$, respectively. Here, $K \in \mathbb{N}$ denotes the number of trials in a sequence. We use the query class definition presented in [8] and assume each $\sigma$ has a corresponding query defined with the class conditional representation. Therefore, without loss of generality, it is assumed that all observations are noisy and come from two unimodal probability distributions conditioned on state and query tuples. Assuming all trials are independent and the current observation is a function of only the current query, independent of the task history $\mathcal{H}_s \triangleq \{\varepsilon_{1:s}, \Phi_{1:s}\}$, the posterior probability at sequence $s$ can be expressed as:

$$p(\sigma \mid \varepsilon_s, \Phi_s, \mathcal{H}_{s-1}) = p(\sigma \mid \mathcal{H}_0) \prod_{j=1}^{s} \prod_{i=1}^{K} \frac{p(\varepsilon_j^i \mid \sigma, \phi_j^i)}{p(\varepsilon_j^i \mid \phi_j^i)}$$

where $p(\sigma \mid \mathcal{H}_0)$ is the prior information. Using maximum a posteriori (MAP) estimation [16], the learner attempts to estimate $\sigma$ through recurring evidence collection. If a decision is not possible based on the collected evidence, the system decides on a subset of queries for the upcoming sequence to improve its confidence. Accordingly, the query selection process is formulated by the following optimization:
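As an illustration, the multiplicative posterior update above can be sketched as follows. The function and variable names are ours, and the per-trial likelihoods and marginals are assumed to be supplied as arrays; this is a minimal numerical sketch, not the paper's implementation:

```python
import numpy as np

def update_posterior(prior, likelihoods, marginals):
    """One recursion of the Bayesian state-posterior update.

    prior       -- p(sigma | H_{s-1}) over the finite state set A, shape (|A|,)
    likelihoods -- p(eps_s^i | sigma, phi_s^i) for each trial i, shape (K, |A|)
    marginals   -- p(eps_s^i | phi_s^i) for each trial i, shape (K,)
    """
    posterior = prior.astype(float).copy()
    for lik, marg in zip(likelihoods, marginals):
        # multiply in the likelihood ratio of each independent trial
        posterior *= lik / marg
    return posterior / posterior.sum()  # renormalize for numerical safety
```

Because the posterior is renormalized at the end, the evidence marginals only affect intermediate scaling; keeping them makes the correspondence with the equation explicit.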

$$\Phi_s = \arg\max_{\Phi} q_s(\varepsilon_{1:s-1}, \Phi_{1:s-1}, \Phi) \quad (1)$$

where $q_s$ denotes the objective term, which we call the action-value function. Following querying, the evidence $\varepsilon_s$ is observed and the posterior is updated accordingly. In the next section, we propose an action-value function that balances exploration and exploitation.

III. Method and Analysis

By imposing the MAB setting on the state estimation problem, each query can be represented as an arm of a MAB. This reformulation allows us to solve the subset selection problem through a greedy approach, optimizing the action-value function for each arm, with theoretical guarantees [31]. Therefore, selecting the arms (queries in each sequence) with the highest action values one by one allows us to form the query subset in a computationally efficient way. In the MAB formulation, for the design of the upcoming sequence, we assume that multiple arms are pulled according to the state posterior probability, which depends on the task history. The goal is then to define an objective that specifies the subset of queries (arms) to be picked at each step. Independence between trials allows us to perform the optimization per trial at each sequence. Therefore, the set optimization in (1) reduces to single query selection.
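The greedy arm selection described above, picking the $K$ highest-valued arms one by one, can be sketched as follows (the function name is illustrative):

```python
def select_query(action_values, k):
    """Greedily pick the k arms (candidate states) with the highest
    action value; each arm index stands for one query in the sequence."""
    ranked = sorted(range(len(action_values)),
                    key=lambda arm: action_values[arm],
                    reverse=True)
    return ranked[:k]
```

For example, with action values `[0.1, 0.5, 0.3]` and `k = 2`, the selected query subset is the arms with values 0.5 and 0.3.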

Conventionally, query selection is achieved through MMI [6], [5], which is equivalent to entropy minimization when the evidence $\varepsilon_s$ corresponding to the sequence being designed has not yet been observed; hence, $\sigma$ is independent of $\phi_s$. Query selection using mutual information can be written as the following policy:

$$\phi_s^i = \arg\max_{\phi} I(\sigma; \varepsilon_s^i \mid \phi, \mathcal{H}_{s-1}) = \arg\min_{\phi} H(\sigma \mid \varepsilon_s^i, \phi, \mathcal{H}_{s-1}) \quad (2)$$
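As a concrete, simplified illustration, policy (2) can be implemented for discrete evidence by computing the expected posterior entropy of each candidate query and taking the minimizer. The discrete-evidence assumption and the function names are ours; the paper's evidence is continuous EEG scores:

```python
import numpy as np

def expected_posterior_entropy(prior, lik_given_query):
    """Expected H(sigma | eps, phi) for one candidate query phi,
    assuming a discrete evidence alphabet.

    prior           -- p(sigma | H_{s-1}), shape (|A|,)
    lik_given_query -- p(eps | sigma, phi), shape (|E|, |A|)
    """
    joint = lik_given_query * prior            # p(eps, sigma | phi)
    p_eps = joint.sum(axis=1, keepdims=True)   # p(eps | phi)
    post = joint / p_eps                       # p(sigma | eps, phi)
    h = -np.where(post > 0, post * np.log(post), 0.0).sum(axis=1)
    return float((p_eps.ravel() * h).sum())    # average over evidence outcomes

def mmi_query(prior, liks):
    """Policy (2): pick the query whose expected posterior entropy is lowest."""
    return min(range(len(liks)),
               key=lambda phi: expected_posterior_entropy(prior, liks[phi]))
```

An informative query (whose likelihood separates the states well) yields a lower expected posterior entropy than an uninformative one, so `mmi_query` selects it.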

In this paper, we consider three different action-value functions for the MAB formulation: (i) the mutual information objective in (2); (ii) a history-based objective; and (iii) a combination of (i) and (ii).

We introduce a new term called Momentum, which is a function of the posterior changes across sequences:

$$m(\phi \mid \mathcal{H}_j) = \mathbb{E}_{p(\sigma \mid \mathcal{H}_{j-1})}\left[\left(\log p(\sigma \mid \varepsilon_j^i, \phi, \mathcal{H}_{j-1}) - \log p(\sigma \mid \mathcal{H}_{j-1})\right)\mathbb{1}_{\phi}(\sigma)\right] \quad (3)$$

where $\mathbb{1}_{\phi}(\sigma)$ denotes the indicator function, which equals 1 if $\phi = \sigma$. Since $m(\phi \mid \mathcal{H}_j)$ is the summation of probability displacement multiplied by probability mass along the axes of the state space, we call it Momentum. Additionally, $m(\phi \mid \mathcal{H}_0) = 0, \forall \phi$; in words, without collecting any evidence we cannot infer the trend of a particular state in the estimation. Accordingly, for the history-based approach, the objective is defined as the average of the Momentum as follows:

$$M(\phi \mid \mathcal{H}_{s-1}) = \frac{1}{s-1}\sum_{j=1}^{s-1} m(\phi \mid \mathcal{H}_j)\,\mathbb{1}_{\Phi_j}(\phi) \quad (4)$$

For this approach, the query selection policy becomes:

$$\phi_s^i = \arg\max_{\phi} M(\phi \mid \mathcal{H}_{s-1}) \quad (5)$$
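Equations (3)-(5) can be sketched as follows. Because the indicator $\mathbb{1}_{\phi}(\sigma)$ keeps only the $\sigma = \phi$ term of the expectation, each per-sequence momentum reduces to the prior-weighted log displacement of the candidate state's own posterior mass; the data layout (lists of posteriors and of queried subsets) is our illustrative choice:

```python
import numpy as np

def momentum(posteriors, queried, phi):
    """Average momentum M(phi | H_{s-1}), a sketch of eqs. (3)-(4).

    posteriors -- posteriors p(sigma | H_j) for j = 0..s-1, each shape (|A|,)
    queried    -- query subsets Phi_j actually shown at sequence j = 1..s-1
    phi        -- candidate state (index into A)
    """
    total, s = 0.0, len(posteriors) - 1
    for j in range(1, s + 1):
        if phi not in queried[j - 1]:        # indicator 1_{Phi_j}(phi): skip
            continue                         # sequences where phi was not queried
        prev, cur = posteriors[j - 1], posteriors[j]
        # the sigma = phi term of eq. (3): mass times log displacement
        total += prev[phi] * (np.log(cur[phi]) - np.log(prev[phi]))
    return total / s if s > 0 else 0.0       # m(phi | H_0) = 0 by convention
```

A state whose posterior mass has been rising across sequences gets positive momentum, and one whose mass has been falling gets negative momentum, which is exactly what policy (5) exploits.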

We present a new action-value function for query selection based on a combination of the mutual information and the history-based objective in (4), to balance exploration and exploitation. Accordingly, the action-value function and policy are defined as:

$$q_s^i(\phi) = I(\sigma; \varepsilon_s^i \mid \phi, \mathcal{H}_{s-1}) + \lambda M(\phi \mid \mathcal{H}_{s-1}), \quad \lambda \ge 0 \quad (6)$$
$$\phi_s^i = \arg\max_{\phi} q_s^i(\phi) = \arg\max_{\phi}\left[-H(\sigma \mid \varepsilon_s^i, \phi, \mathcal{H}_{s-1}) + \lambda M(\phi \mid \mathcal{H}_{s-1})\right] \quad (7)$$

where $\lambda$ is a tuning parameter that balances the MMI and Momentum-based policies. The objective function in (7) can be written as (8) by expanding the entropy term as shown in our previous work [8].

$$\phi_s^i = \arg\max_{\phi}\; \mathbb{E}_{p(\sigma \mid \mathcal{H}_{s-1})}\mathbb{E}_{p(\varepsilon_s^i \mid \sigma, \phi)}\left[\log p(\varepsilon_s^i \mid \sigma, \phi) - \log p(\varepsilon_s^i \mid \phi)\right] + \frac{\lambda}{s-1}\sum_{j=1}^{s-1} m(\phi \mid \mathcal{H}_j)\,\mathbb{1}_{\Phi_j}(\phi) \quad (8)$$
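Given per-arm mutual information and momentum values (computed as in the previous sketches), the combined policy of (6)-(7) is a one-line argmax; the function name is ours:

```python
def combined_policy(mi_values, momentum_values, lam):
    """Policy (7): pick the arm maximizing I + lambda * M.

    mi_values       -- mutual information I(sigma; eps | phi, H) per arm
    momentum_values -- average momentum M(phi | H) per arm
    lam             -- exploration weight lambda >= 0
    """
    scores = [mi + lam * mom for mi, mom in zip(mi_values, momentum_values)]
    return max(range(len(scores)), key=lambda arm: scores[arm])
```

With `lam = 0` the policy reduces to pure MMI exploitation; a larger `lam` can flip the choice toward an arm whose posterior has been trending upward, which is the exploration behavior the lemmas below analyze.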

For the same given task history, we show that when the policy in (8) is used in the MAB formulation, the target state has a higher probability of being chosen to appear in the query subset than under the policies in (2) and (5). Here we analyze the correctness of this statement. To save space, we use the notations $I(\sigma{=}a; \varepsilon_s^i \mid \phi{=}a, \mathcal{H}_{s-1}) = I_s(a)$ and $M(\phi{=}a \mid \mathcal{H}_{s-1}) = M_s(a)$ in the following lemmas.

Lemma 1. Given $a, b \in A$ where $a \ne b$ and $\lambda \ge 0$, if $\exists \mathcal{H}_{s-1}$ s.t. $p(a \mid \mathcal{H}_{s-1}) < p(b \mid \mathcal{H}_{s-1})$, then

$$P\left(I_s(a) + \lambda M_s(a) > I_s(b) + \lambda M_s(b)\right) \ge P\left(I_s(a) > I_s(b)\right)$$

Lemma 1 shows that the probability of $a$ (assumed to be the target state) having a higher action value than $b$ is larger when policy (8) is used instead of (2), even though the probability of $a$ given the task history is lower than that of $b$. This means that even if $a$ is less likely than $b$ according to the prior information and observations, under the proposed policy $a$ has a greater chance of appearing in the query subset than under the policy in (2).

Lemma 2. Given $a, b \in A$ where $a \ne b$ and $\lambda \ge 0$, if $\exists \mathcal{H}_{s-1}$ s.t. $p(a \mid \mathcal{H}_{s-1}) > p(b \mid \mathcal{H}_{s-1})$, then

$$P\left(I_s(a) + \lambda M_s(a) > I_s(b) + \lambda M_s(b)\right) \ge P\left(M_s(a) > M_s(b)\right)$$

Lemma 2 represents the case where the prior knowledge supports $a$ rather than being adversarial. It shows that, compared to using (4) as the action-value function, when $a$ (the unknown state) has a higher probability given the task history than $b$, the action-value function in (6) has a higher probability of choosing $a$. Together, these two lemmas show that the proposed query selection policy provides a balance between (2) and (5), and accordingly between adversarial and supporting prior information. This balance is achieved through the selection of $\lambda$. Detailed proofs of the lemmas are provided in the Supplementary Materials.

The parameter $\lambda$ should be updated dynamically so that the emphasis on the mutual information component of the proposed function grows with the number of sequences; i.e., the $\lambda$ value should decrease as the number of sequences increases [33]. We introduce a theorem that defines upper and lower bounds that $\lambda$ must satisfy to guarantee that including the target state in the selected subset has higher probability when the proposed policy in (8) is used in the MAB formulation than under the policies in (2) and (5).
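One simple schedule consistent with this requirement is a hyperbolic decay in the sequence index; the specific $1/s$ form and its parameters are our illustrative choice, not prescribed by the paper:

```python
def lam_schedule(lam0, s, decay=1.0):
    """Decrease lambda as the sequence index s grows, so the policy
    shifts weight from the momentum (exploration) term toward the
    mutual information (exploitation) term.

    lam0  -- initial exploration weight at s = 1
    s     -- current sequence index (s >= 1)
    decay -- decay rate controlling how quickly lambda shrinks
    """
    return lam0 / (1.0 + decay * (s - 1))
```

Any monotonically decreasing positive schedule would serve the same purpose, provided it stays within the bounds given by Theorem 1 below.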

Theorem 1. Let $\sigma \in A$ be the target state and $|A|$ the size of the finite set $A$. Consider three query selection policies as follows:

$$\phi_{i,s}^{\pi_1} = \arg\max_{\phi} I(\sigma; \varepsilon_s^i \mid \phi, \mathcal{H}_{s-1})$$
$$\phi_{i,s}^{\pi_2} = \arg\max_{\phi} M(\phi \mid \mathcal{H}_{s-1})$$
$$\phi_{i,s}^{\pi_3} = \arg\max_{\phi}\left[ I(\sigma; \varepsilon_s^i \mid \phi, \mathcal{H}_{s-1}) + \lambda M(\phi \mid \mathcal{H}_{s-1})\right].$$

If p(ϕπ3 = σ) ≥ p(ϕπi = σ) for i = 1, 2, then ∃λ that satisfies

$$\frac{2(s-1)|A|\left(d(p_s^{\phi^{\pi_1}}, U)^2 - d(p_s^{\phi^{\pi_3}}, U)^2\right)}{\sum_{j=1}^{s-1}\left[\left\|dp_{i,j}^{\phi^{\pi_3}}\right\|^2 - \left\|dp_{i,j}^{\phi^{\pi_1}}\right\|^2\right]\mathbb{1}_{\phi}(\sigma)} \le \lambda \le \frac{(s-1)\left(d(p_s^{\pi_3}, U)^2 - d(p_s^{\pi_2}, U)^2\right)}{2\sum_{j=1}^{s-1}\log\left(\frac{p(\sigma \mid \varepsilon_j^i, \phi^{\pi_2}, \mathcal{H}_{j-1})}{p(\sigma \mid \varepsilon_j^i, \phi^{\pi_3}, \mathcal{H}_{j-1})}\right)} \quad (9)$$

where

$$d(p_{i,s}^{\pi}, U)^2 = \sum_{a \in A}\left(p(a \mid \varepsilon_s^i, \phi^{\pi}, \mathcal{H}_{s-1}) - \frac{1}{|A|}\right)^2, \qquad \left\|dp_{i,s}^{\pi}\right\|^2 = \left\|p(\sigma \mid \mathcal{H}_{s-1}) - p(\sigma \mid \varepsilon_s^i, \phi^{\pi}, \mathcal{H}_{s-1})\right\|^2.$$

This theorem shows the existence of a parameter $\lambda$ that makes the joint objective optimal. The proof of this theorem, which uses Pinsker's inequality [34] and the results of Lemmas 1 and 2, can be found in the Supplementary Materials.

IV. Results

To assess the performance of the proposed query selection method, a language-model-assisted EEG-based BCI typing system called RSVP Keyboard [35] was used as a test framework. Ten healthy participants (six females and four males), 20-35 years old, were recruited under the IRB-130107 protocol approved by Northeastern University. A DSI-24 Wearable Sensing EEG headset with active dry electrodes was used for data acquisition at a sampling rate of 300 Hz. EEG signals were acquired from 20 sensors at International 10-20 System locations: Fp1, Fp2, Fz, F3, F4, F7, F8, Cz, C3, C4, T3, T4, T5, T6, P3, P4, O1, O2, A1, and A2. All participants performed two sessions: calibration and Copy Phrase. During calibration, the users were asked to attend to predefined target symbols within randomly ordered sequences so that the system could learn the class-conditional EEG evidence distributions. Each calibration session contained 100 sequences; each sequence included five trials (letters), one of which was the target symbol displayed on the screen prior to the sequence. The time interval between trials was 200 ms. Optimal parameters for both target and non-target class distributions were learned from the calibration data and used in the Copy Phrase task. In Copy Phrase, participants were tasked with writing a missing word in a pre-defined phrase using the system (a total of 6 words with various difficulties based on the LM were typed). In addition to Copy Phrase, we also used the calibration data from each participant for BCI performance simulation. We present both simulation and real-time experiment results.

To evaluate the empirical performance of the proposed query selection, we first used simulation. For our simulations, we used the conditional evidence distributions learned in the calibration session. More details about the simulation framework can be found in [35]. In the Monte Carlo simulations, samples from these conditional distributions were drawn to type 'O' and 'C' in the phrase 'IT_OCCURRED_RANDOMLY'. These letters were chosen because they have different typing difficulties according to the language model (prior information). The number of Monte Carlo simulations was set to 500. Figure 1 shows the simulation results: the typing performance for two users with different calibration performance, quantified by the area under the receiver operating characteristic curve (AUC); AUC_U1 = 0.67 and AUC_U7 = 0.82. The bar plot next to each plot shows the prior information provided by the language model (LM) at the beginning of a decision cycle. For instance, the LM probability for 'O' is very low, since a word is unlikely to start with this letter. Accordingly, the MMI method, heavily influenced by the LM prior, needs more sequences to estimate the target letter. In the early sequences of the decision process, the Momentum-based approach is on average faster than MMI at picking the intended letter for the query subset. However, due to noise in the EEG evidence and misclassification of observations, the Momentum gets close to zero and fails to pick the intended state for the query subset. Overall, the proposed method outperforms the other methods. When there is a likely letter such as 'C', MMI and the proposed method perform similarly. Comparing the simulation results of user 1 (lower AUC) with user 7 (higher AUC) shows that all of the query selection methods are faster for user 7, who has higher calibration performance. It can also be observed that for user 1, even with more overlap between the class-conditional distributions (because of the low AUC), the proposed method estimates the target letter quite quickly, whereas the Momentum-based approach has more difficulty capturing the target letter and MMI requires more sequences.
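The Monte Carlo procedure above can be sketched with a heavily simplified stand-in for the calibration-derived evidence model: one-dimensional Gaussian scores, mean 1 for the target and 0 for non-targets, standing in for the learned class-conditional EEG distributions. All names, the score-to-likelihood mapping, and the stopping threshold are our assumptions, not the RSVP Keyboard simulator of [35]:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_decision(prior, target, n_states=4, max_seq=20, threshold=0.9):
    """One simulated decision cycle: draw per-state evidence scores each
    sequence, fold them into the posterior, and stop once the maximum
    posterior probability reaches the confidence threshold.

    Returns (number of sequences used, estimated state index)."""
    post = prior.astype(float).copy()
    for s in range(1, max_seq + 1):
        scores = rng.normal(0.0, 1.0, n_states)
        scores[target] += 1.0          # target evidence is shifted upward
        lik = np.exp(scores)           # monotone map from score to likelihood
        post = post * lik
        post /= post.sum()
        if post.max() >= threshold:
            return s, int(post.argmax())
    return max_seq, int(post.argmax())
```

Repeating `simulate_decision` many times with different priors reproduces, in miniature, the kind of accuracy-versus-sequences curves reported in Figure 1.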

Fig. 1:

Probability of letter completion over 500 Monte Carlo simulations for typing target symbols in the word 'OCCURRED'. Intended symbols: (a) 'O' and (b) 'C' in the target phrase. Simulation results are reported for two users with different calibration performances: user 1 with AUC = 0.67 has lower performance than user 7 with AUC = 0.82. Bar plots show the LM prior probability over all typing symbols.

As described above, the participants also attended Copy Phrase sessions after calibration. Each participant completed four Copy Phrase tasks, one per query selection method. The order of the tasks was randomized for each participant to avoid learning effects.

Figure 2 shows the average performance of all the query selection methods over all users, including statistical test results for the Copy Phrase sessions. We report the query selection performance in terms of two measures: the accuracy in typing a letter (ATL), defined as the total number of correctly typed letters divided by the total number of typed letters, and the information transfer rate (ITR) [36]. ITR summarizes accuracy and speed in a single metric and is commonly used to measure BCI performance. These results show that the proposed method outperforms the other methods in both speed and accuracy. We used the Wilcoxon signed-rank test as a non-parametric statistical hypothesis test to perform paired comparisons between the proposed method and the other query selection methods. The proposed method significantly enhanced the ITR compared to the other methods. Our statistical analysis also shows that the proposed method significantly improved the ATL compared to MMI and random query selection.
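A commonly used definition of ITR for an n-class BCI with selection accuracy p is the Wolpaw bits-per-selection formula; whether [36] uses exactly this variant is not stated here, so the sketch below is a general-purpose reference implementation, with the p = 0 and p = 1 edge cases handled by our convention:

```python
import math

def itr_bits_per_selection(n, p):
    """Wolpaw ITR in bits per selection for an n-class interface
    with per-selection accuracy p. Multiplying by selections per
    minute gives the familiar bits-per-minute figure."""
    if p <= 0 or p >= 1:
        # degenerate accuracies: perfect selection carries log2(n) bits,
        # zero accuracy is reported as 0 by convention here
        return math.log2(n) if p == 1 else 0.0
    return (math.log2(n)
            + p * math.log2(p)
            + (1 - p) * math.log2((1 - p) / (n - 1)))
```

For example, a binary interface at chance accuracy (n = 2, p = 0.5) transfers 0 bits per selection, and the rate grows monotonically as accuracy rises above chance.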

Fig. 2:

Average information transfer rate for four query selection methods. All results belong to the 10 users attending the Copy Phrase task in the RSVP Keyboard experiment. p-values correspond to the Wilcoxon signed-rank test.

V. Conclusion

Motivated by the MAB framework, we proposed a tractable solution to subset query optimization for recursive Bayesian state estimation that enhances estimation speed and accuracy. More specifically, a new action-value function was introduced for query selection, using a linear combination of mutual information and a momentum term that is a function of the logarithmic changes of the posterior probability across sequences. We also presented a bound on the action-value tuning parameter that guarantees the proposed query selection policy outperforms the others. A BCI typing system was used as a test framework to assess the performance of the proposed method. Our results for both simulation and the human-in-the-loop experiment showed that the proposed method outperforms the alternative methods, consistent with the analytical results.

Supplementary Material

j_kocanaogullari_marghi_queryselectionrsvp_spl_supplementary

Acknowledgments

Our work is supported by NSF (IIS-1149570, CNS-1544895, IIS-1717654), NIDLRR (90RE5017-02-01), NIH (R01DC009834), and the Air Force Office of Scientific Research (AFOSR), DDDAS Program, under Grant No. FA9550-16-1-0386.

References

  • [1] Wilson A, Fern A, and Tadepalli P, "A Bayesian approach for policy learning from trajectory preference queries," in Advances in Neural Information Processing Systems, 2012, pp. 1133–1141.
  • [2] Hoi SC, Jin R, Zhu J, and Lyu MR, "Batch mode active learning and its application to medical image classification," in Proceedings of the 23rd International Conference on Machine Learning, ACM, 2006, pp. 417–424.
  • [3] Chepuri SP and Leus G, "Sparsity-promoting sensor selection for non-linear measurement models," IEEE Transactions on Signal Processing, vol. 63, no. 3, pp. 684–698, 2015.
  • [4] Golovin D, Krause A, and Ray D, "Near-optimal Bayesian active learning with noisy observations," in Advances in Neural Information Processing Systems, 2010, pp. 766–774.
  • [5] Higger M, Quivira F, Akcakaya M, Moghadamfalahi M, Nezamfar H, Cetin M, and Erdogmus D, "Recursive Bayesian coding for BCIs," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 25, no. 6, pp. 704–714, 2017.
  • [6] Jedynak B, Frazier PI, and Sznitman R, "Twenty questions with noise: Bayes optimal policies for entropy loss," Journal of Applied Probability, vol. 49, no. 1, pp. 114–136, 2012.
  • [7] Moghadamfalahi M, Akcakaya M, Nezamfar H, Sourati J, and Erdogmus D, "An active RBSE framework to generate optimal stimulus sequences in a BCI for spelling," IEEE Transactions on Signal Processing, vol. 65, no. 20, pp. 5381–5392, 2017.
  • [8] Koçanaogulları A, Akçakaya M, and Erdogmus D, "On analysis of active querying for recursive state estimation," IEEE Signal Processing Letters, vol. 25, no. 6, p. 743, 2018.
  • [9] Tsiligkaridis T, Sadler BM, and Hero AO, "Collaborative 20 questions for target localization," IEEE Transactions on Information Theory, vol. 60, no. 4, pp. 2233–2252, 2014.
  • [10] Ungarala S, Dolence E, and Li K, "Constrained extended Kalman filter for nonlinear state estimation," IFAC Proceedings Volumes, vol. 40, no. 5, pp. 63–68, 2007.
  • [11] Schneider R and Georgakis C, "How to not make the extended Kalman filter fail," Industrial & Engineering Chemistry Research, vol. 52, no. 9, pp. 3354–3362, 2013.
  • [12] Quadrana M, Cremonesi P, and Jannach D, "Sequence-aware recommender systems," arXiv preprint arXiv:1802.08452, 2018.
  • [13] Haykin S, Huber K, and Chen Z, "Bayesian sequential state estimation for MIMO wireless communications," Proceedings of the IEEE, vol. 92, no. 3, pp. 439–454, 2004.
  • [14] Zhao T and Nehorai A, "Distributed sequential Bayesian estimation of a diffusive source in wireless sensor networks," IEEE Transactions on Signal Processing, vol. 55, no. 4, pp. 1511–1524, 2007.
  • [15] Woodward P, Probability and Information Theory, with Applications to Radar: International Series of Monographs on Electronics and Instrumentation, vol. 3, 2014.
  • [16] Gauvain J-L and Lee C-H, "Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains," IEEE Transactions on Speech and Audio Processing, vol. 2, no. 2, pp. 291–298, 1994.
  • [17] Schmidt HG, De Volder ML, De Grave WS, Moust JH, and Patel VL, "Explanatory models in the processing of science text: The role of prior knowledge activation through small-group discussion," Journal of Educational Psychology, vol. 81, no. 4, p. 610, 1989.
  • [18] Zhao B, Rubinstein BI, Gemmell J, and Han J, "A Bayesian approach to discovering truth from conflicting sources for data integration," Proceedings of the VLDB Endowment, vol. 5, no. 6, pp. 550–561, 2012.
  • [19] Zhao B and Han J, "A probabilistic model for estimating real-valued truth from conflicting sources," Proc. of QDB, 2012.
  • [20] Lidoris G, Wollherr D, and Buss M, "Bayesian state estimation and behavior selection for autonomous robotic exploration in dynamic environments," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2008, pp. 1299–1306.
  • [21] Baram Y, Yaniv RE, and Luz K, "Online choice of active learning algorithms," Journal of Machine Learning Research, vol. 5, pp. 255–291, 2004.
  • [22] Kocák T, Neu G, Valko M, and Munos R, "Efficient learning by implicit exploration in bandit problems with side observations," in Advances in Neural Information Processing Systems, 2014, pp. 613–621.
  • [23] Chu H-M and Lin H-T, "Can active learning experience be transferred?" in IEEE 16th International Conference on Data Mining (ICDM), 2016, pp. 841–846.
  • [24] Lin T, Li J, and Chen W, "Stochastic online greedy learning with semi-bandit feedbacks," in Advances in Neural Information Processing Systems, 2015, pp. 352–360.
  • [25] Lazaric A, Brunskill E, et al., "Sequential transfer in multi-armed bandit with finite set of models," in Advances in Neural Information Processing Systems, 2013, pp. 2220–2228.
  • [26] Schrater P, "Structure learning in human sequential decision-making," PLoS Computational Biology, 2010.
  • [27] Rosin CD, "Multi-armed bandits with episode context," Annals of Mathematics and Artificial Intelligence, vol. 61, no. 3, pp. 203–230, 2011.
  • [28] McInerney R, Roberts S, and Rezek I, "Sequential Bayesian decision making for multi-armed bandit," in Fifth Workshop on Multi-agent Sequential Decision Making in Uncertain Domains (MSDM), Toronto, Canada, 2010, p. 38.
  • [29] Wang E, Kurniawati H, and Kroese DP, "CEMAB: A cross-entropy-based method for large-scale multi-armed bandits," in Australasian Conference on Artificial Life and Computational Intelligence, Springer, 2017, pp. 353–365.
  • [30] Hernández-Lobato JM, Hoffman MW, and Ghahramani Z, "Predictive entropy search for efficient global optimization of black-box functions," in Advances in Neural Information Processing Systems, 2014, pp. 918–926.
  • [31] Farias VF and Madan R, "The irrevocable multiarmed bandit problem," Operations Research, vol. 59, no. 2, pp. 383–399, 2011.
  • [32] Memmott T, Kocanaogullari A, Erdogmus D, Bedrick S, Peters B, Fried-Oken M, and Oken B, "BciPy: A Python framework for brain-computer interface research," in 7th International BCI Meeting, Asilomar, CA, 2018.
  • [33] Bilmes JA, "Dynamic Bayesian multinets," in Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers Inc., 2000, pp. 38–45.
  • [34] Fedotov AA, Harremoës P, and Topsoe F, "Refinements of Pinsker's inequality," IEEE Transactions on Information Theory, vol. 49, no. 6, pp. 1491–1498, 2003.
  • [35] Orhan U, Nezamfar H, Akcakaya M, Erdogmus D, Higger M, Moghadamfalahi M, Fowler A, Roark B, Oken B, and Fried-Oken M, "Probabilistic simulation framework for EEG-based BCI design," Brain-Computer Interfaces, vol. 3, no. 4, pp. 171–185, 2016.
  • [36] Obermaier B, Neuper C, Guger C, and Pfurtscheller G, "Information transfer rate in a five-classes brain-computer interface," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 9, no. 3, pp. 283–288, 2001.
