Abstract
Motivated by a class of Partially Observable Markov Decision Processes with application in surveillance systems, in which a set of imperfectly observed state processes is to be inferred from a subset of available observations through a Bayesian approach, we formulate and analyze a special family of multi-armed restless bandit problems. We consider the problem of finding an optimal policy for observing the processes that maximizes the total expected net rewards over an infinite time horizon subject to the resource availability. From the Lagrangian relaxation of the original problem, an index policy can be derived, as long as the existence of the Whittle index is ensured. We demonstrate that such a class of reinitializing bandits, in which a project’s state deteriorates while active and resets to its initial state when passive until its completion, possesses the structural property of indexability, and we further show how to compute the index in closed form. In general, the Whittle index rule for restless bandit problems does not achieve optimality. However, we show that the proposed Whittle index rule is optimal for the problem under study in the case of stochastically heterogeneous arms under the expected total criterion, and it is further recovered by a simple tractable rule referred to as the 1-limited Round Robin rule. Moreover, we illustrate the significant suboptimality of another widely used heuristic, the Myopic index rule, by computing its suboptimality gap in closed form. We present numerical studies which illustrate, for more general instances, the performance advantages of the Whittle index rule over other simple heuristics.
1. INTRODUCTION
Modern sensing technologies offer the possibility of efficiently performing tasks by adaptively deploying their sensing resources based on the information extracted from past measurements. Yet, realizing such a system’s overall performance gains requires appropriate on-line sensing rules. Thus, the general problem in sensor management is to design sensing algorithms that allow for the fruitful adoption of cutting edge technologies. A natural procedure to derive those rules is to represent the underlying resource allocation problem by some stochastic dynamic optimization model, whose optimal solution is traditionally characterized by a dynamic programming (DP) framework. However, those formulations, at least for realistic scenarios, typically have a prohibitively large size (possibly infinite), which dramatically hinders their practical application. Thus, fully exploiting the performance advantages offered by the new technologies by means of active dynamic sensing policies remains very challenging. For this reason, the design of sensing strategies that are both computationally feasible and nearly optimal, such as the ones proposed in this paper, continues to be a highly active applied research area.
An additional challenge to the design of adequate on-line active sensing schemes is to take into account specific situations that may affect the system’s performance. For detection objectives, there have been significant efforts to deal with more general situations, for example with multiple objects, or with mobile objects, and even to include false targets. Yet, despite this abundant literature, the case in which targets react to sensing or may evade the searcher remains understudied. This paper addresses these two challenges by proposing a tractable scheduling rule for a multiple target detection problem in which targets react to sensing by remaining frozen in their current state and sensing is subject to misdetection errors. We formulate this detection problem as a partially observable Markov decision process (POMDP) with special structure, which further fits into the framework of the continuous state multi-armed restless bandit problem (MARBP).
The MARBP constitutes a theoretical framework under which resource allocation problems under uncertainty can be fruitfully analyzed. In its general version, the MARBP consists of choosing a subset of arms to activate at each period of time (out of a possibly larger set of arms), where the state of each arm evolves randomly over time, affecting their resulting flow of rewards (and/or costs). A natural goal for this problem is to choose the arms to activate so as to achieve the maximum expected total discounted or time-average rewards over an infinite time horizon.
In the so-called classic Bayesian Bernoulli version of the problem, whose origin dates back to the Second World War, arms’ states evolve stochastically only when chosen, yielding a binary random reward. Such a variant, despite its simplified dynamics, was regarded unsolvable until Gittins and Jones [2,3] showed that its optimal solution admitted a simple expression in terms of an index function attached to each arm depending on its current state. The resulting optimal index rule activates at each time period the arm whose current index value is the maximum. More than a decade later, Whittle [21] proposed and studied the more general restless case in which non-chosen arms continue to evolve, and pointed out that neither the existence of the index function extending the classic case nor the optimality of the resulting index rule was guaranteed for such variant.
This indexability property, that is, the existence of an index function, introduced by Whittle for MARBP problems, cannot be taken for granted as it needs to be established for each specific model. Niño-Mora [14,15] provided the first general sufficient indexability conditions based on the achievable region approach to stochastic optimization which can be systematically deployed under certain conditions. Furthermore, the indexability of special classes of MARBP has been specifically addressed and thoroughly studied using various approaches. These include some families of restless bandits which arise in machine maintenance and stochastic scheduling problems with switching costs, as those in Glazebrook, Ruiz-Hernandez, and Kirkbride [4], the bidirectional bandits introduced in Glazebrook, Hodge, and Kirkbride [5], the reinitializing bandits in Jacko and Sanso [7], and restless models in telecommunication and opportunistic spectrum access as in Liu and Zhao [10], among others. These papers are part of the body of literature that has contributed to a significant advance in the understanding of this property, yet as Liu, Weber, and Zhao [11] put it “[…] establishing indexability is still an open problem and often relies on numerical algorithms ”. Moreover, even when indexability is ensured, index computation usually poses further significant challenges [15].
In this paper, we establish the indexability of a class of MARBP that derives from a concrete family of POMDPs and it is motivated by a surveillance systems application. POMDPs admit a widespread range of applications, for example in navigation problems, artificial intelligence, sensor systems, machine maintenance, telecommunication networks, among others but their optimal policy is often computationally intractable. Therefore, the most commonly used solution methods seek to find good approximate solutions based on some discretization or reduction of its infinite state space (see, e.g., [12,18]). Still, the high computational cost of solving POMDPs is the main cause limiting their practical implementation. The resulting bandit formulation is one in which arms (until reaching a terminal state), generate a decreasing stream of random rewards when chosen to be active and passive arms continue to change state (even if not chosen), although they do so according to a simple transition rule: returning to its initial state (i.e., to its state at time 0 when the controller starts operating). The class of problems is introduced in Section 2, and following Jacko [6] we shall refer to them as reinitializing bandits.
These reinitializing bandits have some features in common with models previously addressed: they are similar to the reward depletion and replenishment model presented in [4], and they also share with the bidirectional bandits in [5] the property that the active and passive actions produce opposite movements on the state space. Another related application is found in [7], where a new type of congestion control scheduling method based on a MARBP is proposed, motivated by Internet flows behaving according to the Transmission Control Protocol, and thus admitting a reinitializing feature. In [10], a similar problem with applications in opportunistic spectrum access is considered. The problem is formulated as a MARBP and, using a DP-based approach similar to the one deployed in this paper, indexability is established and the Whittle index is derived in closed form. Later in [11], following the same rationale, the authors studied the case in which the active action resets the state and solved it through a Whittle index policy. Both [10,11] share with the model presented in this paper the feature of having a continuous state space. In [11], as well as in this work, a property of the problem is exploited to reduce the state space from a continuous one to a countable one.
Despite these similarities, those models and the one addressed in this paper differ in two main aspects. First, the inclusion of an absorbing state in the model is a distinctive feature of the problem addressed in this paper that has not been considered in the previous works. Second, the introduction of imperfect observability of the state instead of the perfect observability (when sensing) assumption included in the models in [10,11], makes the resulting MARBP more realistic and challenging. Another novel contribution of this work is the introduction of a target’s reacting to the sensing actions, along the lines of what was done in [9,19].
1.1. Main Contributions and Paper Structure
We start in Section 2 by describing the problem, stating the model’s assumptions and formulating it as a MARBP. In Section 3, we demonstrate the existence of the index for this class of problems by establishing the monotonicity of activation policies in an activity charge λ, using properties of the corresponding DP formulation. Once indexability is established, a closed-form formula for the Whittle index is derived which, despite the problem’s simplified dynamics, is far from trivial to deduce. Moreover, the importance of the indexability result is increased by the fact that the resulting Whittle index can be used as a well performing approximate solution method for a special family of POMDPs.
In Section 4, we proceed to study the properties of the proposed Whittle index rule, with a special emphasis on its relative performance when compared with other commonly used naive heuristics. Weber and Weiss [20] showed the asymptotic optimality of Whittle’s heuristic under certain conditions and for a limiting case in which the total number of arms and the number of arms to be activated go to infinity in a constant ratio. We prove the optimality of the Whittle Index rule for the proposed MARBP in the special case of a finite number of heterogeneous arms and when considering the Expected Total criterion (i.e., when the discount factor is 1). Moreover, in the case in which all arms can be activated at each time slot, we are able to give a closed-form expression for the suboptimality gap of other naive yet widely used index rules (e.g., the Myopic Index rule), a result that is in stark contrast to the equivalence between these rules reported for other similar models; see, for example [1,10,13,22].
To conclude the paper, in Section 5 we use numerical studies to illustrate how the theoretical results of the paper are deployed. We also analyze the performance of the alternative heuristics reviewed in Section 4 for this class of problems, and we show how well the Whittle index rule performs even in those cases in which the optimality of the rule is not analytically established.
2. PROBLEM DESCRIPTION AND MARBP FORMULATION
Consider the following problem in surveillance systems. There are N independent locations (or sites), each containing one target (or object) hidden in it. There are M (1 ≤ M ≤ N) sensors, each of which can search at most one of those locations in every discrete period. All sensors in the system are synchronized to operate over time slots t = 0, 1, …, where a time slot corresponds to a Pulse Repetition Interval. Each target can be in one of two possible visibility states: a hidden (or bad) state, in which it is completely invisible to sensors but cannot perform its tasks, and an exposed (or good) state, in which it can perform its tasks but is prone to being detected by sensors. Targets are such that: (1) they perceive if they are being sensed; (2) they do not wish to be found, but they wish to perform their tasks; (3) while being sensed they do not change their status, that is, they stay frozen at their current state; and (4) while not being sensed they alternate between the two states randomly. Thus, if a target is in site n and is not sensed in period t, it becomes exposed in period t + 1 with probability ϕ0 and hidden with probability 1 − ϕ0, regardless of its current state, where ϕ0 is the site’s probability of being exposed at time 0. The probability that a sensor searching for a target at site n finds it when it is visible is 0 < 1 − αn < 1, and hence the probability that an unfound target is visible at slot t is updated by Bayes’ theorem as the sensor’s detection output is observed. A single search of location n costs cn ≥ 0 and yields a reward rn β^t when it succeeds at finding target n in slot t, where 0 ≤ β ≤ 1 is a discount factor.
The goal is to design a tractable policy which addresses the following question: How should the N locations be scheduled for sensing so as to come close to maximizing the total expected discounted reward of finding all targets, using M or fewer sensors at each time slot?
We therefore next formulate and investigate the following MARBP problem. The N targets are represented by N independent projects (or arms) labeled by n = 1, …, N, each yielding a positive reward if completed. Sensing decisions are thus formulated by binary action processes an,t ∈ {0,1}, where an,t = 1 represents that target n is sensed in slot t and an,t = 0, otherwise. Every unfound target is an incomplete project that can be in one of two states: a “good/exposed” state, 1, or a “bad/hidden” state, 0. It is assumed that a project can only be completed while it is active and in the “good” state. If a project is rested then its state changes randomly, while if a project is active its state can only evolve from the “good” state to a completed state T or simply stay in its current state. Such a dependence of the transition probabilities on the selected action is described in Figure 1.
Figure 1.
The Markov chain associated to a generic project given each possible action, active: 1 and passive: 0. The arrows represent one-period transitions among the states 0 (bad) and 1 (good) with given probabilities under actions 0 (on the left) and 1 (on the right).
The state of a project n defines a random process sn,t which is only partially observable to the decision maker in the following sense: only under the active action at time t in project n (i.e., an,t = 1) is an imperfect measurement on,t of its current state available. Furthermore, whenever project n is at state 0 and it is activated, the measurement of its state will coincide with its true state (i.e., on,t = sn,t = 0); however, if process n is at state 1 at time t and it is activated, its measurement is correct (i.e., on,t = sn,t = 1) only with probability (1 − αn). Hence, the observation process is subject to misdetection errors with probability αn, where 0 < αn < 1. Notice that we exclude the extreme cases αn = 0 and αn = 1, which correspond respectively to complete observability of the process under activation and to measurements that always return 0 and hence carry no information, because we are interested in the non-trivial problem of how to use the partial information given by the observable process on,t to gain information on the partially observable process sn,t.
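The resulting model is thus a POMDP in which the N states of the projects are observable only through the active action, and they are imperfectly observed in the particular sense that the “good” state can be mistaken for the “bad” state with a given misdetection probability αn. We shall further consider that project n is completed as soon as a “good” state is observed, that is, when on,t = 1. In terms of the transition probabilities in Figure 1, an active project in the “good” state is completed with probability 1 − αn and stays in the “good” state with probability αn, that is, P(sn,t+1 = T | sn,t = 1, an,t = 1) = 1 − αn and P(sn,t+1 = 1 | sn,t = 1, an,t = 1) = αn, while an active project in the “bad” state stays there with probability 1. To introduce the reinitializing feature, we assume that under the passive action the probability of the project being in the “good” state resets to its value at time 0, which we denote 0 < ϕ0 < 1, regardless of its current state, that is, P(sn,t+1 = 1 | sn,t = s, an,t = 0) = ϕ0 and P(sn,t+1 = 0 | sn,t = s, an,t = 0) = 1 − ϕ0 for s ∈ {0, 1}.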
We further consider that at every discrete time slot over an infinite horizon t = 0, 1, …, at most M processes can be selected for activation/observation, with 1 ≤ M ≤ N, incurring an observation cost cn ≥ 0 per activated process and yielding a final reward rn > 0 per job completion if a process n is observed to be at state 1, that is, on,t = 1, and no more rewards thereafter. Hence, at each time slot we must decide which processes to observe so as to maximize the total expected rewards.
Observation decisions are thus formulated by binary action processes an,t ∈ {0, 1}, where an,t = 1 represents that process n is observed in slot t and an,t = 0, otherwise. Any feasible observation scheduling rule, which prescribes how to sequentially observe processes over time, will be denoted by π and belongs to the class Π(M) of admissible policies, composed of the non-anticipative scheduling policies (i.e., those based on the history of states and actions) which observe M or fewer processes per slot, that is, a1,t + ⋯ + aN,t ≤ M for every slot t = 0, 1, ….
At t = 0 process n has probability ϕ0 of being in state sn,0 = 1, since we assume that no process was activated before time t = 0. Thereafter the posterior probability that some process n is in state 1, denoted as pn,t (henceforth referred to as its belief state), must be computed conditioning on past observations and also on the selected actions via Bayes’ rule. Notice that this posterior probability is a sufficient statistic of each project’s state. Even though each project has three possible states, exposed (1), hidden (0) or completed (T), the last two states yield no reward and the state of completion T is perfectly observed (as the reward is then collected). Therefore, the only information gained from the measurements is about the exposed (1) and hidden (0) states.
Finally, given all the elements of the model, we denote by Rn(pn,t, an,t) ≜ (rn (1 − αn) pn,t − cn) an,t the one-slot expected net reward function earned when taking action an,t at time slot t on process n when its probability state is pn,t.
We shall consider that the objective of the controller is to design a policy that sequentially selects at most M out of N processes to observe at each time slot so as to maximize the total expected discounted reward over an infinite horizon, given a discount factor 0 ≤ β ≤ 1. Such an objective can be addressed by considering the following dynamic optimization problem: find an expected β-discounted optimal policy π* ∈ Π(M) that attains
\[ \max_{\pi \in \Pi(M)} \mathbb{E}^{\pi}_{\boldsymbol{\phi}_0} \left[ \sum_{t=0}^{\infty} \sum_{n=1}^{N} \beta^{t} R_n(p_{n,t}, a_{n,t}) \right] \tag{1} \]
where ϕ0 is the initial joint belief state (the vector collecting the initial belief of every process), and E^π_{ϕ0}[·] denotes expectation under policy π conditional on p0 = ϕ0. Further, as will be discussed later, (1) is bounded by a finite constant, so the problem of finding an expected total-optimal policy is well defined for this model and will be considered by letting β = 1.
The optimal scheduling problem posed by (1) describes a constrained POMDP consisting of optimally deciding which processes to observe to maximize rewards given the resource constraint and based on the current estimate of the belief state of all processes. The approach followed in this paper to address the high computational cost of optimally solving POMDPs is to exploit the fact that the POMDP in (1) can be analyzed as a MARBP with a continuous state variable p = (pn) and a reinitializing feature. Thus, each process n constitutes an independent single-bandit model, with two possible actions: “observe” and “not observe” and whose state is given by its belief state, that is, the probability pn,t of being in state 1 at time t. The optimal solution for MARBPs is also generally intractable (see, e.g., [17]), yet we shall follow the solution approach for MARBPs based on a Lagrangian approach, first proposed by Whittle [21], which often results in nearly optimal and tractable solution.
Since for the model at hand, processes’ state transitions are independent, the stochastic evolution of each arm depends only on the decisions taken for it and on its own specific parameters. Hence, each arm is a single-bandit problem which can be individually considered for establishing index existence and also for index computation. Therefore, in what follows we first describe the elements of the single-bandit problem modeling the optimal observation decisions of an individual process, and next, based on these elements, we define the indexability property that is required to hold in order to derive the Whittle Index rule for the original problem (1). Thus, the following discussion focuses on a generic single-bandit problem, and henceforth its label n is dropped to simplify the notation.
2.1. Single-Bandit POMDP: Definition
Each of the N processes is represented by a single-bandit problem, which can be defined by its composing elements as follows.
- Action Space: a binary action set at ∈ {0, 1}, where at = 1 represents that the process is observed in slot t and at = 0, otherwise;
- State Space: a continuous state space, denoted by ℙ ≜ [0, ϕ0], containing all the possible belief states of the random process. The final state of the process T, which is an absorbing state that is never left once reached, causes the process to yield neither rewards nor costs thereafter. Given that the only case in which we are certain that a project will not yield a reward (by being in the exposed state) is when it is completed (i.e., it is in state T), we shall assume that projects only have a zero probability of being in the good state if they have been successfully completed. Thus the terminal state can be conveniently represented by the belief state 0.
Since at the state T there is no actual decision to take, we adopt the convention that at p = 0 the selected action is a = 0, yielding no rewards and producing no transitions. For the rest of the belief states, there is a decision to be taken by the controller, and thus we shall refer to that set of states as the controllable set of states, that is, p ∈ ℙ \ {0}.
- State Dynamics: a transition rule that specifies how the state evolves stochastically over time depending on the selected action a and on the current state p. If p = 0, it stays at this state thereafter. For any p ∈ ℙ \ {0}, the belief state process pt evolves according to the dynamics below:
\[ p_{t+1} = \begin{cases} \phi_0, & \text{if } a_t = 0, \\ 0, & \text{with probability } (1-\alpha)p_t, \text{ if } a_t = 1, \\ \dfrac{\alpha p_t}{1-(1-\alpha)p_t}, & \text{with probability } 1-(1-\alpha)p_t, \text{ if } a_t = 1; \end{cases} \tag{2} \]
- Rewards: a one-period reward function given by R(pt, at) ≜ (r(1 − α)pt − c) at for pt ∈ ℙ \ {0} and R(0, 0) ≜ 0;
- Costs: a fixed parameter, λ ∈ ℝ, introduced to represent an extra activity charge that must be paid by the controller whenever at = 1.
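The single-bandit elements listed above can be sketched in a few lines of Python. This is only an illustrative sketch, not the paper's implementation: the function name `step` and the argument `u` (a uniform draw passed in by the caller so that the function stays deterministic) are hypothetical choices, and the belief value 0 is used to encode the terminal state T as in the convention above.

```python
def step(p, a, u, phi0, alpha, r, c):
    """One transition of the single-bandit belief process (illustrative sketch).

    p     : current belief of being in the good state (0.0 encodes the terminal state T)
    a     : action, 1 = observe (active), 0 = do not observe (passive)
    u     : a uniform(0, 1) draw supplied by the caller (hypothetical argument)
    phi0  : belief to which a passive slot reinitializes the process
    alpha : misdetection probability
    r, c  : completion reward and observation cost
    Returns (next_belief, expected_one_slot_net_reward).
    """
    if p == 0.0:
        # Terminal state: no decision, no reward, no transition.
        return 0.0, 0.0
    if a == 0:
        # Passive action: the belief reinitializes and nothing is earned.
        return phi0, 0.0
    reward = r * (1.0 - alpha) * p - c   # R(p, 1)
    detect = (1.0 - alpha) * p           # probability that the target is found
    if u < detect:
        return 0.0, reward               # project completed, belief moves to 0
    # Unsuccessful observation: Bayes update of the belief.
    return alpha * p / (1.0 - detect), reward
```

For instance, with ϕ0 = 0.5 and α = 0.2, an unsuccessful observation at p = 0.5 moves the belief to 0.1/0.6 ≈ 0.167, while a passive slot brings it back to 0.5.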
Thus, the infinite horizon single-bandit problem is formulated as
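\[ \max_{\pi} \; \mathbb{E}^{\pi}_{p_0} \left[ \sum_{t=0}^{\infty} \beta^{t} \left( R(p_t, a_t) - \lambda a_t \right) \right] \tag{3} \]
We shall further denote the optimal active set for (3) as a function of the parameter λ by A*(λ). Hence, for some p ∈ ℙ \ {0}, p ∈ A*(λ) if and only if, whenever pt = p, the optimal action is at = 1.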
Let us now introduce the definition of indexability.
Definition 2.1: Subproblem (3) is indexable if its optimal active set A*(λ) decreases monotonically from ℙ \ {0} to Ø as the parameter λ goes from −∞ to ∞.
Whittle’s original indexability definition was formulated in terms of optimal passive sets, letting the multiplier λ be a subsidy for passivity. In Definition 2.1, the parameter λ can be interpreted as a tax for activity. Such a definition ensures the existence of critical values of the multiplier λ which induce a nested ordering of the optimal active sets as a function of λ.
3. WHITTLE INDEX EXISTENCE AND INDEX CHARACTERIZATION
In this section, we shall establish the validity of the following theorem, which ensures that indexability holds for the model at hand.
Theorem 3.1: The single-bandit problem (3) is indexable according to Definition 2.1.
DP Analysis and Proof of Theorem 3.1
As in [10,11], we shall prove Theorem 3.1 following a DP approach.
First, we define b(p) ≜ 1 − (1 − α)p, which denotes the probability that the process in belief state p has not reached its final state after one active time period. Further, we shall use the notation ϕ1(p) ≜ αp/(1 − (1 − α)p) to represent the posterior probability of a misdetection when the process is observed in a belief state p, that is, the updated belief of being in the good state after an unsuccessful observation.
Hence, for some fixed parameter λ, the DP equation for the β-discounted problem (3) is written as
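\[ V^{*}_{\beta}(p, \lambda) = \max \left\{ \beta V^{*}_{\beta}(\phi_0, \lambda) \, ; \; r(1-\alpha)p - c - \lambda + \beta \, b(p) \, V^{*}_{\beta}(\phi_1(p), \lambda) \right\}, \quad p \in \mathbb{P} \setminus \{0\}, \tag{4} \]
where we have used the fact that when the final state of the process (represented by p = 0) is reached, the process yields neither rewards nor costs and the selected action by default is at = 0, by setting V*β(0, λ) = 0 for all possible β and λ.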
Next, we write the DP equations for each possible partition of the set ℙ \ {0}, that is, in terms of the optimal active set A*(λ) for a fixed λ, as follows:
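\[ V^{*}_{\beta}(p, \lambda) = r(1-\alpha)p - c - \lambda + \beta \, b(p) \, V^{*}_{\beta}(\phi_1(p), \lambda), \quad \text{for } p \in A^{*}(\lambda), \tag{5} \]
\[ V^{*}_{\beta}(p, \lambda) = \beta V^{*}_{\beta}(\phi_0, \lambda), \quad \text{for } p \in \mathbb{P} \setminus \{0\}, \; p \notin A^{*}(\lambda). \tag{6} \]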
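As a numerical companion to the DP formulation, the sketch below runs value iteration for a fixed charge λ on the countable set of beliefs reachable from ϕ0 through repeated unsuccessful observations. It is a minimal sketch under stated assumptions rather than the procedure used in the paper: the names `solve_single_bandit` and `t_star`, the grid size `K`, and the truncation rule (beyond grid point `K` the process is treated as passive) are all hypothetical, and the iteration assumes β < 1 so that the DP operator is a contraction. Ties are broken in favor of the passive action.

```python
def solve_single_bandit(phi0, alpha, r, c, lam, beta, K=60, tol=1e-12):
    """Value iteration for the single-bandit DP on a truncated belief grid.

    Grid point k is the belief reached after k unsuccessful active slots from
    phi0. Assumptions (not from the source): the grid is truncated at K, after
    which the process is treated as passive, and beta < 1.
    Returns (beliefs, values, active_flags).
    """
    # Belief after k misses: the odds p/(1-p) are multiplied by alpha each time.
    p = [alpha**k * phi0 / (1.0 - (1.0 - alpha**k) * phi0) for k in range(K + 1)]
    V = [0.0] * (K + 1)

    def q_values(V, k):
        passive = beta * V[0]                       # reinitialize to phi0
        cont = V[k + 1] if k < K else passive       # truncation assumption
        b = 1.0 - (1.0 - alpha) * p[k]              # probability of no completion
        active = r * (1.0 - alpha) * p[k] - c - lam + beta * b * cont
        return passive, active

    for _ in range(20_000):
        new_V = [max(q_values(V, k)) for k in range(K + 1)]
        diff = max(abs(x - y) for x, y in zip(new_V, V))
        V = new_V
        if diff <= tol * (1.0 + max(abs(v) for v in V)):
            break

    # A grid point is active only if the active action is strictly better.
    active_flags = []
    for k in range(K + 1):
        passive, active = q_values(V, k)
        active_flags.append(active > passive)
    return p, V, active_flags


def t_star(active_flags):
    """Number of consecutive active grid points starting from phi0."""
    count = 0
    for flag in active_flags:
        if not flag:
            break
        count += 1
    return count
```

Running the solver over an increasing sequence of charges λ and recording `t_star` gives a quick numerical check that the number of active slots from ϕ0 does not increase with λ, which is the behavior that indexability requires.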
The proof of Theorem 3.1 is based on the following property of the optimal value function.
Lemma 3.2: The function V*β(ϕ0, λ) is nonnegative, piecewise-linear in λ and non-increasing in λ.
Proof: We shall use the fact that the evolution of the state variable after t (unsuccessful) active slots starting from an initial belief state p generates an iterated mapping p ↦ ϕ1(p), that is, ϕ0(p) ≜ p and ϕt(p) ≜ ϕ1(ϕt−1(p)) for t ≥ 1. Such a mapping represents the Bayesian update of the belief state and it is decreasing in t, since for all p ∈ (0, 1) and α ∈ (0, 1) it holds that ϕ1(p) < p, and further, it defines a Möbius Transformation. Using the matrix form of such non-linear functions, we can derive by induction a closed-form expression for the tth iterate of ϕ1(p) to be
\[ \phi_t(p) = \frac{\alpha^{t} p}{1 - (1 - \alpha^{t}) p}. \tag{7} \]
Note that from (7) it can also be shown that ϕt(p) is a decreasing function in t. (See [23] for a detailed description on how to derive such closed-form expressions.)
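The closed form of the iterated Bayesian update can be checked numerically. Each unsuccessful observation multiplies the odds p/(1 − p) by α, so after t such observations the belief is α^t p / (1 − (1 − α^t) p). The short sketch below compares this expression against direct iteration of ϕ1; the function names are hypothetical and chosen only for this illustration.

```python
def phi1(p, alpha):
    """One Bayesian update of the belief after an unsuccessful observation."""
    return alpha * p / (1.0 - (1.0 - alpha) * p)


def phi_iter(p, alpha, t):
    """Belief after t unsuccessful observations, by direct iteration of phi1."""
    for _ in range(t):
        p = phi1(p, alpha)
    return p


def phi_closed(p, alpha, t):
    """Same belief from the closed form: the odds are scaled by alpha**t."""
    a_t = alpha ** t
    return a_t * p / (1.0 - (1.0 - a_t) * p)


if __name__ == "__main__":
    for t in range(6):
        print(t, phi_iter(0.7, 0.3, t), phi_closed(0.7, 0.3, t))
```

Both columns printed by the script should agree up to floating-point rounding, and the values decrease in t, in line with the deteriorating feature of the active dynamics.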
Thus, once the process leaves state ϕ0, as long as it does not reach its final state, it only returns to ϕ0 after reinitializing the process, that is, after being passive for a time slot. Hence, for any p ∈ A*(λ), we denote by t*(p, λ) the number of (unsuccessful) active time slots that, starting from a belief state p, may elapse until it is optimal to be passive, and we rewrite (5) as a function of t*(p, λ) as follows:
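\[ V^{*}_{\beta}(p, \lambda) = \sum_{s=0}^{t^{*}-1} \beta^{s} \left[ \prod_{j=0}^{s-1} b(\phi_j(p)) \right] \left( r(1-\alpha)\phi_s(p) - c - \lambda \right) + \beta^{t^{*}+1} \left[ \prod_{j=0}^{t^{*}-1} b(\phi_j(p)) \right] V^{*}_{\beta}(\phi_0, \lambda), \quad t^{*} = t^{*}(p, \lambda), \tag{8} \]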
The first term in expression (8) is the optimal expected discounted reward generated during the t*(p, λ) active time slots and the second term represents the expected discounted optimal value function starting at the reinitializing state in t*(p, λ) + 1, that is, after allowing for one passive slot.
The decreasing feature of the active dynamics (7) implies that starting from some state p in the optimal active set, the original optimization problem can be analyzed as an optimal stopping problem. From some p in the active set, the system may visit states ϕt(p) for t = 1, 2, … (by repeatedly selecting the active action starting from p) or state 0 (if it reaches its final state), but the states different from ϕ0 can only be visited in a predetermined (decreasing) order, as ϕ1(p) > ϕ2(p) > ⋯. Exploiting this deteriorating feature, the solution to problem (3) can be equivalently described in terms of optimal active sets or in terms of the optimal number of active time slots starting from a given state.
Also, notice that if ϕ0 ∉ A*(λ), it follows from (6) that V*β(ϕ0, λ) = 0 and further, given the problem’s dynamics, the optimal belief state trajectory remains constant at ϕ0, thus never activating the process, that is, A*(λ) = Ø and hence, t*(ϕ0, λ) = 0. Thus, the non-trivial case to consider corresponds to all possible λ such that ϕ0 ∈ A*(λ), that is, where t*(ϕ0, λ) > 0.
Next, we will invoke the auxiliary results in Lemma 3.3 (whose proof is deferred to the Appendix) to simplify expression (8).
Lemma 3.3: We have
For any λ such that ϕ0 ∈ A*(λ), setting p = ϕ0 in (8) and using Lemma 3.3, V*β(ϕ0, λ) can be computed in closed form as a function of t*(ϕ0, λ) with the expression below:
| (9) |
Denote by Vβ(ϕ0, λ, i) the expression (9) evaluated by setting t*(ϕ0, λ) = i. Notice that, solving problem (4), that is, finding the states that belong to A*(λ), is therefore equivalent to finding the maximum positive integer i such that it holds: Thus, if t*(ϕ0, λ) = i it must be the case that only if t ≤ i.
Thus, substituting for p in Eqs (5) and (6), and given that using (4) we have that:
| (10) |
where we have also assumed that in the case of the maximum being achieved by both actions, the selected action by default is the passive action. Further, Vβ(ϕ0, λ, i) is computed for t*(ϕ0, λ) = i using expression (9). Thus, rearranging (10) we have that
| (11) |
Therefore, for some given λ such that ϕ0 ∈ A*(λ), t*(ϕ0, λ) is determined as the maximum non-negative integer i such that (11) holds. Furthermore, for t*(ϕ0, λ) = i, where i is some positive integer, it must be the case that:
| (12) |
The first relation in (12) is a consequence of the fact that t*(ϕ0, λ) = i, while the second relation in (12), that is, follows from the fact that Notice also that if starting at ϕ0 it is optimal to be active for i time slots, then it must be optimal to be active at any time slot between 0 and i − 1.
Consider now some λ′ < λ. It follows from (9) that, for a fixed i, Vβ(ϕ0, λ, i) is a linear decreasing function of λ. Thus, when (12) holds for a given λ, it will hold also for λ′ < λ. Therefore, the set of integers for which the optimality condition (12) holds is non-increasing with respect to λ. Hence, t*(ϕ0, λ) is a non-increasing (piecewise constant) function of λ, and it further follows that V*β(ϕ0, λ) as in (9) is a nonnegative and non-increasing piecewise-linear function of λ.
Next, we announce the following corollary, which is a direct consequence of Lemma 3.2.
Corollary 3.4: If p ∈ A*(λ) for some p ∈ ℙ \ {0}, then it must be that p ∈ A*(λ′) for λ′ < λ.
Proof: The proof follows from the relation between t*(ϕ0, λ) and the optimal active set A*(λ). Suppose it is known that when the process is at state ϕ0 it is optimal to take the active action (as long as the process does not yield its final reward) for the next i steps, that is, t*(ϕ0, λ) = i. Then, it must be the case that the belief states in the sequence ϕt(ϕ0) for t = 0, …, (i − 1) belong to the active set A*(λ). Further, since t*(ϕ0, λ) is non-increasing in λ, the set A*(λ′) for λ′ < λ must (at least) include the set composed of the sequence ϕt(ϕ0) for t = 0, …, (i − 1).
Alternatively, after some algebraic manipulations, expression (12) can be written for fixed i and s = 1, …, (i − 1) as a function of λ which is also linear and decreasing in λ.
Proof of Theorem 3.1: Finally, indexability of the single-process problem (3), as defined in Definition 2.1, follows from Corollary 3.4.
3.1. The Whittle Index
Based on the indexability result established by Theorem 3.1, we announce in (13) the Whittle index closed-form expression for the single-process problem (3).
Theorem 3.5: The Whittle index, denoted by λW(p), for the single-bandit problem (3) and for p ∈ ℙ \ {0}, is computed as follows:
| (13) |
| (14) |
Proof: Once indexability has been established, the Whittle index λW(p) for some belief state p ∈ ℙ \ {0} is computed as the value of the multiplier λ such that the active and passive actions are indifferent when the process is at state p, that is, the value of V*β(p, λ) as computed by Eqs (5) (i.e., with a = 1) or (6) (i.e., with a = 0) is the same. Further, given the properties of the active dynamics previously derived, it follows that if for λ = λW(p) at p both actions are indifferent, then it must be true that for belief states larger than p it is optimal to be active (until either the process reaches its final state or it reaches a belief state below p) while for belief states smaller than p it is optimal to reinitialize the process. Moreover, for some p ∈ ℙ \ {0} there will exist a strictly positive integer t such that and takes its maximum value setting a = 1 (i.e., and it is computed using (5)) while takes its maximum value when a = 0 (i.e., as in (6)).
The above reasoning allows us to conclude that t*(ϕ0, λW(p)) is exactly the number of active slots required to make the state go from the initial state ϕ0 to a state at most equal to p, that is, t*(ϕ0, λW(p)) = d(ϕ0, p) ≜ min{t : ϕt(ϕ0) ≤ p}. Notice that λW(p) can be interpreted as the value of λ such that the optimal maximum number of (unsuccessful) periods that a process starting at the initial state must be activated before advising to reinitialize the process is exactly t*(ϕ0, λW(p)). From expression (7) and given its definition, d(ϕ0, p) can be computed in closed form using the corresponding expression (14).
Therefore from Eqs (5) and (6), using the fact that for λ = λW (p) it holds that t*(ϕ0, λW (p)) = d(ϕ0, p), we write the DP equations for some as:
| (15) |
| (16) |
The critical value λW (p) is such that (15) equals (16). Thus,
| (17) |
Next, we compute using expression (9) setting t*(ϕ0, λW (p)) = d(ϕ0, p) as in (14) and substitute it in (17). After tedious yet straightforward algebraic manipulations expression (13) is obtained.
Corollary 3.6: The Whittle Index defined in (13) for the single-bandit problem (3) is a continuous and monotone increasing function in p, for any p ∈ \ {0}.
Proof: Both properties can be shown through algebraic manipulations of expression (13). The function d(ϕ0, p) is a piecewise constant (left continuous) function. In particular, it remains constant for all p ∈ \ {0} such that p ≠ ϕt(ϕ0) for t = 0, 1, …, while it has decreasing jump discontinuities in the set p = ϕt(ϕ0) as t = 0, 1, ….
Thus, within the belief state intervals [ϕt(ϕ0) ; ϕt−1(ϕ0) ), with ϕt(ϕ0) given by expression (7), for all natural t ≥ 1 it therefore holds that d(ϕ0, p) remains constant and λ*(p) is a linear, continuous and increasing function in p.
In order to show continuity of the index function λ*(p), it remains to establish the continuity for the set of points in which the function d(ϕ0, p) has jump discontinuities. For such a purpose, using the fact that at those critical values of p the active and passive actions are indifferent to the decision maker, it can be shown that
which completes the proof of Corollary 3.6.
Corollary 3.7: The single-bandit problem (3) is optimally solvable by threshold policies, that is, for every λ ∈ ℝ there exists a threshold p*(λ) such that for any p ∈ \ {0} it is optimal to activate the process if and only if p > p*(λ).
Proof: It follows from the analysis used to derive the Whittle index in Theorem 3.5 that the optimal policy for problem (3) can be expressed as follows: a* = 1 for all and a* = 0, otherwise.
3.2. Example
To illustrate the previous analysis with an example, consider a process with the following parameters: ϕ0 = 0.95, r = 1, α = 0.35 and c = 0. Figure 2 plots the corresponding Whittle index function, given by expression (13), for instances with discount factors β ∈ {0, 0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 0.99}.
Figure 2.
(Color online) The Whittle index of a process with parameters: ϕ0 = 0.95, r = 1, α = 0.35 and c = 0, and computed for instances with discount factors β ∈ {0, 0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 0.99}. The optimal active set in this example is the set of p for which λW (p) > 0.
Notice that the Whittle index for the instance β = 0 reduces to the index λW (p) = R(p, 1), commonly known as the Myopic index, and thus henceforth denoted as λM (p) ≜ R(p, 1). Further, when β = 0 the optimal policy advises to observe the process regardless of its current state p, that is, the optimal active set is \ {0} or the optimal stopping time for any p ∈ \ {0} is infinite. However, as β increases, the optimal active set becomes a subset of the controllable states, for example, when β = 0.3 it holds that, starting from ϕ0 it is optimal to be active for 4 periods (until state is reached), while for β = 0.99 it is optimal to be active for 3 periods (until state is reached). In the limit, for β = 1 the Whittle index converges to an index that takes value 0 for all \ {0, ϕ0}, while in the reinitializing state ϕ0 it takes the value R(ϕ0, 1).
It follows from the previously mentioned equivalence between the optimal active sets and the optimal stopping times that, for every β, the Whittle index policy can be equivalently expressed in terms of the optimal maximum possible number of observations starting from the initial state ϕ0 until the process yields its final reward, which we shall denote by , therefore admitting a simpler and tractable expression alternative to the computation of expression (13). Thus, for β = 1 the Whittle Index rule is equivalent to a 1-limited Round Robin observation rule, for β = 0 it is equivalent to an ∞-limited Round Robin rule and for some general 0 < β < 1 it is equivalent to a -limited Round Robin rule.
4. PROPERTIES OF THE WHITTLE INDEX RULE
We now explicitly define a Whittle Index rule for the multi-armed problem (1) based on the index expression (13), and two alternative naive heuristic index rules. We shall further establish the optimality of the Whittle Index rule for solving problem (1) for the special case of N stochastically heterogeneous processes (i.e., having distinct parameter specifications) under the Expected Total criterion, that is, the case corresponding to letting β = 1. We shall also give a closed-form expression for the suboptimality gap of the alternative index rules for the special case in which there is no constraint on the number of processes that can be simultaneously observed, that is, for M = N.
Definition 4.1: The Whittle Index rule for the multi-armed problem (1) is implemented as follows: at time t, the index is computed using expression (13) for each of the N processes independently, and the M processes yielding the highest index values, as long as they are positive (i.e., ), are observed at time t. Further, in the case of a tie among two or more (positive) index values, we shall choose to observe the process that has been least (unsuccessfully) observed up to time t. If processes have been previously observed the same number of periods, ties are broken arbitrarily.
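The selection step in Definition 4.1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the author's implementation: the function name `select_processes`, its arguments, and the convention that the caller passes only processes that have not yet yielded their final reward are all assumptions. The index function is supplied by the caller, since expression (13) is not reproduced here; the usage example plugs in the Belief index λB(p) = p purely as a stand-in.

```python
from typing import Callable, List, Sequence


def select_processes(
    beliefs: Sequence[float],
    times_observed: Sequence[int],
    index_fn: Callable[[float], float],
    m: int,
) -> List[int]:
    """Return the positions of at most m processes to observe in this slot.

    beliefs[i] is the current belief state of process i and times_observed[i]
    is how often it has been (unsuccessfully) observed so far. Both names are
    hypothetical; only uncompleted processes are assumed to be passed in.
    """
    # Compute the index of every process independently.
    scored = [
        (index_fn(p), n_obs, i)
        for i, (p, n_obs) in enumerate(zip(beliefs, times_observed))
    ]
    # Only processes with a strictly positive index may be observed.
    eligible = [s for s in scored if s[0] > 0]
    # Highest index first; among equal indices, the least observed first.
    # Remaining ties keep their input order (one arbitrary choice).
    eligible.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, _, i in eligible[:m]]


# Usage with the Belief index as a placeholder for expression (13).
chosen = select_processes(
    beliefs=[0.5, 0.8, 0.5, 0.0],
    times_observed=[2, 0, 1, 0],
    index_fn=lambda p: p,
    m=2,
)
print(chosen)
```

In this example processes 0 and 2 share the same index value, so the tie-breaking rule prefers process 2, which has been observed fewer times.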
The use of such a problem specific tie-breaking rule is a novel feature proposed in this paper. Any identical processes, when in the same state at time t, will have the same index value at that time, although they may have been (unsuccessfully) observed a different number of times. For instance, if at time t the controller must choose between two processes, both at a common state ϕ0, but process 1 has been observed before while process 2 has never been observed, the expected net reward of observing each of them respectively is R(ϕ0, 1)b(ϕ0) and R(ϕ0, 1). Naturally, if the least observed process has a higher priority despite the fact that their Whittle Index value is the same, then we expect to obtain a higher immediate expected reward by observing it. This difference will be particularly important for the discounted case, in which the time of job completion affects the rewards obtained from them. Moreover, the inclusion of this additional tie-breaking rule can be used for simplifying some of the optimality results’ proofs presented in this section.
We shall further define two alternative well-known index-based heuristics for the multi-armed problem (1): the Myopic and the Belief Index rules, respectively taking index λM(p) ≜ R(p, 1) and λB(p) ≜ p. For the sake of fairness in the comparison, they will be implemented in a way analogous to the Whittle index (i.e., using the same tie-breaking rule). In cases in which neither the optimal policy for the MARBP nor the Whittle Index rule for a single-bandit subproblem is easily derived, these simpler rules are the ones most commonly implemented. For an example of the application of these two alternative index-based policies and a comparison of their performance against the Whittle Index rule see, for example, [16] or [8].
Furthermore, we propose the following tractable heuristic rule:
The -limited Round Robin rule observes the M least observed processes whose state is greater than as long as
Theorem 4.2: The Whittle index rule in Definition 4.1 is optimal for problem (1) for the case of N stochastically heterogeneous processes for any 1 ≤ M ≤ N when β = 1. Further, it is equivalent to the following simple 1-limited Round Robin rule: at each time slot t, observe (at most) M processes only if they are in their reset state , as long as .
Proof: Following an approach similar to the one in [10], we shall show optimality by deriving and comparing relevant bounds on the resulting value functions under different rules. Consider first the case in which c = 0. A natural upper bound for the objective function (1) when β = 1 and for any 1 ≤ M ≤ N is such that . Given that each one of the N processes generates a reward rn when observed at state s = 1, the best that any observation scheduling rule could do is to succeed with all of them, hence is the (obvious) maximum attainable value for the total expected objective function.
Next, we compute the expected value of the objective function under the Whittle Index rule (as in Definition 4.1).
For β = 1, the Whittle rule induces a 1-limited Round Robin scheme in which every process is observed once every two slots, as long as it is in the state and until it yields its final reward. Under such a rule, every process will yield its final reward in finite time with probability 1, given that the probability of completing a job by time t is . So for any possible 1 ≤ M ≤ N, all the N processes will be eventually operated under this rule, though at different moments of time, and all the possible rewards will be achieved in a finite time.
Using the above reasoning, for β = 1 and operated under the Whittle Index rule the expected flow of rewards yields the following value for the objective function, denoted by :
| (18) |
By Lemma 3.3, (18) is reduced to rN, which coincides with the upper bound of the objective function (1). Hence, given that , the Whittle Index rule is optimal. Further, notice that because the 1-limited Round Robin scheme can be implemented for every process in finite time, such an optimality result is true regardless of the value of M.
Regarding the case in which problem (1) is considered for c > 0 or under the β-discounted criterion, we cannot show optimality using the rough bound on the value function for the expected total case, yet an upper bound can be derived solving the Whittle relaxation of problem (1) and using a Lagrangian approach to solve it, that is,
| (19) |
where is defined as in (3) for every n = 1, …, N.
By solving the convex optimization problem posed by (19), it is derived that for the case β → 1 results in λ* = 0 and where J is the set of processes for which it holds that By an analogous reasoning to the one used for computing the total expected value function under the Whittle index rule in (18), it follows that the 1-limited Round Robin scheme induced by this rule will yield all the final rewards incurring an expected cost equal to the expected cost of the induced cycles, that is, Thus, again
From the above proof it follows as well that when c = 0 the optimality result will hold for any -limited Round Robin rule such that the length of its active cycle is finite. Given that the Myopic and Belief rules are equivalent to an ∞-limited Round Robin rule, we can expect them to be suboptimal in that case. Next, we introduce a theorem stating their suboptimality gap in closed form for that case and under certain assumptions.
Theorem 4.3: The suboptimality gap for the total expected performance achieved under the Myopic index rule or the Belief index rule for the special case of stochastically heterogeneous processes with c = 0, β = 1 and
(a) M = N, is
(b) M = 1 and processes such that: and is where rmin ≤ r1 ≤ rmax.
Proof: First, we compute the value of the total expected performance under the Myopic index rule or the Belief index rule under the assumptions in (a). Notice that both rules are equivalent in the case of heterogeneous processes with c = 0 and M = N. Both index functions are strictly increasing in the belief state, and both are strictly positive at any state in the set of controllable states, therefore inducing identical decisions over time. However, notice that neither policy is equivalent to the Whittle index rule, because the Myopic and Belief index rules are equivalent to an ∞-limited Round Robin rule while the Whittle index rule is equivalent to a 1-limited Round Robin rule.
For β = 1, c = 0 and M = N, operated under the Myopic (or the Belief) index rules, the expected flow of rewards yields the following expected value for the objective function:
| (20) |
As a consequence of Lemma 3.3, (20) reduces to Hence, the suboptimality gap for these two simple index rules is computed in closed-form for the case M = N and c = 0 to be Thus, as it will be illustrated through the computational experiments, the gap decreases with
Next, we compute the value of the total expected performance under the Myopic index rule or the Belief index rule under the assumptions in (b). The relation between the restarting states (together with the tie-breaking rule) ensures that under the Myopic or Belief rule no process will be activated two consecutive times until there is only one process that has not yet yielded its final reward. Once there is only one process, which we denote by the subscript l, with the possibility of yielding a reward, the Myopic (and the Belief) policy would activate that process over an infinite period of time. As a consequence of Lemma 3.3, (20) reduces to
For the case of stochastically identical processes, (20) for case (b) reduces to and the suboptimality gap to (1 − ϕ0)r.
Regarding the case in which problem (1) is considered for β = 1 and c > 0, following an analogous reasoning, it can be shown that the simpler index rules will also be suboptimal in this case. Let be the maximum number of consecutive active slots for process n in excess of 1 that are prescribed under the Myopic or Belief index rules when the cost is c. The simpler index rules can be shown to attain the following expected performance values: Therefore, it follows not only that the Whittle index rule is optimal for any c when β = 1 but also that the other index rules are suboptimal, since it holds that
For the more general case, in which M < N and β = 1, the closed-form computation of the total expected value function under the Myopic or Belief index rule is less straightforward. Yet, it can be intuitively argued that both of them will also be suboptimal. Under any of these heuristics the first M terms of the sum defining coincide with the ones achieved by the Whittle index rule. From the (M + 1)th term onwards, both the Myopic and the Belief index rules, with a strictly positive probability, take an action that is different from the optimal action (i.e., the action prescribed by the Whittle index rule), thus generating a total expected reward strictly less than rN.
In fact, denoting by the expected number of processes which have not yet yielded their final reward at time t for some 1 ≤ M ≤ N, the probability that at time t the naive index rules diverge from the optimal action (i.e., the action prescribed by the Whittle index rule) is at least equal to: When the event occurs, the simpler index rules prescribe to activate all of the processes, even if they are not in their initial state (i.e., they never prescribe to reinitialize those processes). Furthermore, from the time at which the (M + 1)th term of the sum is added onwards, the event may occur with a strictly positive probability for any 1 ≤ M ≤ N, that is, with the probability that in the previous M observations, at least (N − M) and at most (N − 1) processes have reached their final state.
The number of processes reaching their final state at a given time defines a binomial random variable, hence: Notice further that for M = N, that probability is equal to 1, which yields a value function as in (20). Also, as M diminishes, the minimum for the probability of divergence from the optimal action decreases, and since this probability tends to zero as t → ∞, we can expect that the best performance of the simpler index rules will be achieved when the number of possible observations per period is the lowest. We would like to emphasize that this suboptimality result is also noteworthy since it implies that the Myopic Index and the Belief Index rules perform worse as the resource constraint becomes less binding.
For the β-discounted criterion, for each β the Whittle index rule is equivalent to a simple -limited round robin heuristic rule. In general, for every β the Whittle index rule will be equivalent to a heuristic that advises to observe processes starting at the initial state at most for slots (unless the process reaches its final state) and resting for one slot after that. Given the fact that the Whittle index rule in this case also induces cycles equivalent to a -limited Round Robin policy, we expect that this simple rule will also outperform the other rules for the β-discounted criterion, especially as β → 1 and M increases. Also, following this reasoning it can be argued that this will be the case in any other situation in which the observation rule selected tends to reinitialize processes less often than the optimal rule would do.
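The work-then-rest cycle described above can be illustrated for a single process with a minimal Python sketch. It assumes, for simplicity, that the process never yields its final reward and that the resource constraint M is not binding, so it shows only the shape of the cycle; the function name `limited_round_robin_actions` and its parameters are hypothetical.

```python
import math
from typing import List


def limited_round_robin_actions(limit: float, horizon: int) -> List[int]:
    """Actions (1 = observe, 0 = rest) for one process over `horizon` slots.

    The process is observed for at most `limit` consecutive slots and then
    rests for exactly one slot, after which the cycle starts again.
    Assumption: the process never completes during the horizon.
    """
    actions = []
    consecutive_active = 0
    for _ in range(horizon):
        if consecutive_active < limit:
            actions.append(1)
            consecutive_active += 1
        else:
            actions.append(0)  # one passive slot; the state is reinitialized
            consecutive_active = 0
    return actions


print(limited_round_robin_actions(1, 6))         # 1-limited cycle
print(limited_round_robin_actions(2, 6))         # 2-limited cycle
print(limited_round_robin_actions(math.inf, 6))  # never rests
```

With `limit = 1` the process is observed once every two slots, with `limit = math.inf` it is never rested, and with `limit = 0` it is never observed.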
5. NUMERICAL STUDY
We present the results of some of the simulation studies performed with the goal of illustrating the ideas presented in the previous sections. The experiments are based on MATLAB implementation codes developed by the author, where the relative performance of the proposed Whittle index rule is compared against the other previously described naive index policies and to the corresponding upper bound. For each instance, 10⁴ independent simulation runs were performed on a horizon of T = 10⁴ time slots. In each experiment, the resulting mean total reward under the different rules is reported together with 95% confidence intervals around that mean to evaluate the statistical significance of the results. All the results reported in this section are statistically significant at a 95% confidence level.
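The text does not state how the confidence intervals were computed. As a hedged illustration only, the following Python sketch uses the usual normal approximation (mean ± 1.96 standard errors), which is reasonable for a large number of independent runs; the function name `mean_with_ci95` and the sample values are made up for the example and are not the paper's data.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_with_ci95(total_rewards: Sequence[float]) -> Tuple[float, float]:
    """Return (sample mean, half-width of a 95% normal-approximation CI).

    total_rewards holds one total reward per independent simulation run.
    Assumption: runs are i.i.d. and numerous enough for the normal
    approximation to be adequate.
    """
    mean = statistics.fmean(total_rewards)
    # Standard error of the mean, using the sample standard deviation.
    std_err = statistics.stdev(total_rewards) / math.sqrt(len(total_rewards))
    return mean, 1.96 * std_err


# Illustrative values only.
mean, half_width = mean_with_ci95([98.0, 100.0, 102.0, 100.0])
print(f"{mean:.2f} +/- {half_width:.2f}")
```

Two rules can then be called statistically equivalent, in the sense used in this section, when their intervals overlap.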
In all instances, the rules considered are: the Whittle index rule, the myopic index rule, the belief state index rule (both as defined in Section 4), the t*-limited round robin rule, based on the simple rule equivalent of the Whittle index rule proposed in Section 4, the -limited round robin rule with and the random selection policy which picks a process to observe at random, with each process having the same probability of being selected.
5.1. Experiment #1
In this experiment we illustrate the optimality result stated in Theorem 4.2 and the suboptimality gap of the naive index rules presented in Theorem 4.3 while studying the effect of varying the hard sample-path resource constraint in an instance with non-identical processes for the case β = 1. We considered a total of N = 100 processes, where 75 have a reinitializing state equal to ϕ0 = 0.5 while for the 25 remaining it is ϕ0 = 0.8. For the N processes we have considered a common misdetection error α = 1/3, r = 1 and c = 0. This base instance was modified by letting M increase in steps of 1 unit from 1 to N. The upper bound on the objective function is in this case rN = 100.
Following the arguments introduced in Section 4, it can be shown that for the case β = 1 and c = 0 any -limited round robin observation scheduling rule with will be optimal. However, the lifetime of the system (i.e., the time until all processes yield their final reward) will grow significantly the more the cycle diverges from the optimal one. By the same reasoning, the optimality of other finite limited Round Robin rules different from the t*-limited rule will not hold for c ≠ 0 or β < 1; however, we expect that these rules, though suboptimal, may outperform the Myopic and Belief rules in some instances. These issues shall be explored in Experiment #3 (Section 5.3).
In this experiment, we illustrate the optimality of other -limited round robin rules by implementing a rule with a finite cycle of (at most) consecutive active time slots, that is, a 3-limited Round Robin rule. Results displayed in Figure 3(a) show that the Whittle index rule is statistically identical to the upper bound on the total expected rewards. Furthermore, the -limited round robin policy, which was computed in this experiment both for and (though the figure displays only the 3-limited), was also optimal in both cases in terms of the objective function.
Figure 3.
(Color online) Computational experiments: performance results (a) Mean Total Reward vs. M (Non-Identical Processes). (b) Mean Total Reward vs. M (Identical Processes). (c) Mean Total Discounted Reward versus c. (d) Mean Total Discounted Reward versus α.
Results displayed in Figure 3(a) show that both the Whittle index rule and the 3-limited Round Robin rule outperform all the other rules, with the largest suboptimality gap (of around 43%) occurring again for M ≥ 70. The Myopic and the Belief index rules are statistically equivalent for practically all values of M, with their resulting mean performance deteriorating as M increases. All the policies are statistically equivalent (and moreover optimal) only when M takes very small values (M ≤ 3), and the variability of the performance attained by the optimal rules is relatively constant over the values of M whereas it decreases for the suboptimal ones as M grows. It is also noteworthy that over almost the whole range of values of M these two naive index rules do not result in statistically significant improvements over the random policy.
5.2. Experiment #2
In this experiment we consider a base instance with N = 100 identical processes, each with parameters ϕ0 = 0.6, α = 0.25, r = 1 and c = 0, for the expected total criterion (i.e., for β = 1). This base instance was modified to assess the effect of varying the hard sample-path resource constraint, specifically by letting M (the number of processes that can be observed at a time slot t) increase by 1 unit from 1 to N. Again, the rough upper bound on the objective function is in this case rN = 100. Results are displayed in Figure 3(b).
As expected, the Myopic and the Belief index rules are statistically equivalent policies for the identical processes case and they are both suboptimal. Once more, their suboptimality gap grows as M increases, reaching its theoretical maximum value of 40% = (1 − ϕ0) (corresponding to M = N) for M ≈ 70. Remarkably, for the range of values of M smaller than that and until the set of values for which all the policies are statistically equivalent and optimal (roughly M ≤ 3), these two heuristic rules do yield statistically significant improvements over the random policy, contrary to what occurred in the previous experiment for non-identical processes. Finally, notice that the variability of the performance achieved by all the rules exhibits the same behavior as in the previous experiment.
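The maximum gaps quoted for Experiments #1 and #2 can be checked with a short calculation. The sketch below assumes that, for M = N, c = 0 and β = 1, a rule that never rests a process collects an expected reward of ϕ0·r from it, so that the gap with respect to the upper bound is the sum of (1 − ϕ0)·r over all processes. This assumption is consistent with the 40% = (1 − ϕ0) value stated above, but it is not a transcription of the closed-form expression in Theorem 4.3; the function name `gap_when_all_observed` is hypothetical.

```python
from typing import Sequence, Tuple


def gap_when_all_observed(
    phi0: Sequence[float], rewards: Sequence[float]
) -> Tuple[float, float]:
    """Return (absolute gap, relative gap) for never-resting rules, M = N.

    Assumption: each process n contributes phi0[n] * rewards[n] in
    expectation, while the upper bound is the sum of all rewards.
    """
    upper_bound = sum(rewards)
    never_rest_value = sum(p * r for p, r in zip(phi0, rewards))
    gap = upper_bound - never_rest_value
    return gap, gap / upper_bound


# Experiment #2: 100 identical processes with phi0 = 0.6 and r = 1.
gap2, rel2 = gap_when_all_observed([0.6] * 100, [1.0] * 100)

# Experiment #1: 75 processes with phi0 = 0.5 and 25 with phi0 = 0.8, r = 1.
gap1, rel1 = gap_when_all_observed([0.5] * 75 + [0.8] * 25, [1.0] * 100)

print(f"Experiment #1: {rel1:.1%}, Experiment #2: {rel2:.1%}")
```

Under this assumption the relative gaps come out at roughly 42.5% and 40%, in line with the "around 43%" and 40% figures reported for the two experiments.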
5.3. Experiment #3
In this experiment we study the effect of including a strictly positive observation cost c > 0 when the discount factor equals β = 0.95 and for a case of identical processes each with parameters ϕ0 = 0.55, α = 1/3, r = 1, M = 100. This base instance was modified to assess the effect of including a strictly positive observation cost, specifically by letting c increase in steps of 0.05 units from 0 to 0.75. We expect that the naive index rules will perform better as the cost increases, and even outperform the 3-limited Round Robin rule when c becomes sufficiently large. The upper bound on the objective function is in this case computed for each instance using the Lagrangian relaxation approach described in Section 4.
Results displayed in Figure 3(c) show that the Whittle Index rule is statistically equivalent to the upper bound for any value of the observation cost c while, as expected, the suboptimality gap of the Myopic and Belief index rules decreases as c grows. Moreover, for small values of c (approximately less than 0.03) the 3-limited Round Robin rule performs better than the other naive heuristics, while it is statistically equivalent to them when c is around 0.06 and it is outperformed by them for larger values of c. The poor performance attained by the random rule is mainly explained by the fact that its definition does not take into account the observation cost.
In fact, it turns out that as c varies each of the index rules becomes equivalent to a simple -limited Round Robin rule. Thus, to explain such results, we summarize the equivalence relations in Table 1. For c = 0 and β = 0.95 the Whittle index rule is equivalent to a 2-limited Round Robin rule, while the Myopic and Belief index rules are ∞-limited, therefore resulting in the maximum possible divergence among these index rules. When the observation cost is in the range [0.2895, 0.367) all three index policies become equivalent to a 1-limited Round Robin rule but the total expected discounted rewards are reduced to an approximate value of 20. Also, from the table it follows that for values of c less than 0.0433 both naive index policies are outperformed by the 3-limited Round Robin rule.
Table 1.
Equivalence between index rules and -limited Round Robin rules as a function of c, where a is an integer such that a ≥ 4
| c | Whittle index rule | Myopic index rule | Belief index rule |
|---|---|---|---|
| [0.55, ∞) | 0-limited | 0-limited | 0-limited |
| [0.367, 0.55) | 0-limited | 0-limited | 1-limited |
| [0.2895, 0.367) | 1-limited | 1-limited | 1-limited |
| [0.1923, 0.2895) | 1-limited | 1-limited | 2-limited |
| [0.1195, 0.1923) | 1-limited | 2-limited | 2-limited |
| [0.07971, 0.1195) | 1-limited | 2-limited | 3-limited |
| [0.06, 0.07971) | 1-limited | 3-limited | 3-limited |
| [0.0433, 0.06) | 2-limited | 3-limited | 3-limited |
| [0.02887, 0.0433) | 2-limited | 3-limited | 4-limited |
| [0.0148, 0.02887) | 2-limited | 4-limited | 4-limited |
| (0, 0.0148) | 2-limited | a-limited | a-limited |
| 0 | 2-limited | ∞-limited | ∞-limited |
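
For readers who want to reuse Table 1 programmatically, the following Python sketch encodes its rows as a lookup by observation cost. The breakpoints and entries are copied from the table and therefore apply only to the instance of this experiment (β = 0.95, ϕ0 = 0.55, α = 1/3, r = 1); the function name `equivalent_limits` and the string encoding of the limits are assumptions made for the example.

```python
from bisect import bisect_right
from typing import Tuple

# Lower endpoints of the half-open cost intervals in Table 1, increasing.
_BREAKS = [0.0148, 0.02887, 0.0433, 0.06, 0.07971,
           0.1195, 0.1923, 0.2895, 0.367, 0.55]

# (Whittle, Myopic, Belief) limits; row k covers [_BREAKS[k-1], _BREAKS[k]).
_ROWS = [
    ("2", "a", "a"),  # (0, 0.0148), with a an integer, a >= 4
    ("2", "4", "4"),  # [0.0148, 0.02887)
    ("2", "3", "4"),  # [0.02887, 0.0433)
    ("2", "3", "3"),  # [0.0433, 0.06)
    ("1", "3", "3"),  # [0.06, 0.07971)
    ("1", "2", "3"),  # [0.07971, 0.1195)
    ("1", "2", "2"),  # [0.1195, 0.1923)
    ("1", "1", "2"),  # [0.1923, 0.2895)
    ("1", "1", "1"),  # [0.2895, 0.367)
    ("0", "0", "1"),  # [0.367, 0.55)
    ("0", "0", "0"),  # [0.55, infinity)
]


def equivalent_limits(c: float) -> Tuple[str, str, str]:
    """Return the Round Robin limits (Whittle, Myopic, Belief) for cost c."""
    if c < 0:
        raise ValueError("the observation cost must be non-negative")
    if c == 0:
        return ("2", "inf", "inf")  # last row of Table 1
    return _ROWS[bisect_right(_BREAKS, c)]


print(equivalent_limits(0.3))   # all three rules coincide
print(equivalent_limits(0.05))  # naive rules match the 3-limited rule
```

Using `bisect_right` places a cost equal to a breakpoint in the interval that starts at that breakpoint, matching the closed-left, open-right intervals of the table.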
5.4. Experiment #4
In this final experiment we study the effect of differences in the misdetection error probability α in an instance of identical processes for the case β = 0.9. Once more, we have considered a total of N = 100 processes, where in the base instance all of them have ϕ0 = 0.5, α = 0.05, r = 1 and c = 0 and M = N. Then we modify the base instance by letting the misdetection probability vary from α = 0.05 to 0.5 in increments of 0.05. The upper bound on the objective function in each case is computed using the Lagrangian relaxation value.
Results displayed in Figure 3(d) show that the performance measure decreases for all rules as the misdetection error probability grows. The naive index rules are statistically equivalent to the random policy and their variability is smaller than the variability of the Whittle index rule and the 3-limited round robin rule.
Further, the Whittle index rule is statistically equal to the upper bound over all values of the parameter α, outperforming all the other rules. The 3-limited round robin rule is equivalent to the Whittle index policy for values of α larger than 0.35 and it further outperforms the other index rules for all values of α.
6. CONCLUDING REMARKS
In this paper, we have proposed a simple yet intractable POMDP model with application in surveillance systems dealing with the detection and expulsion of smart intruders. The model admits other applications in similar contexts. For example, consider a supervisory control system in which multiple processes are monitored and controlled. The state 1 represents an abnormal state and 0 the normal state. While a process is being monitored, its state can only change if an abnormality is detected and corrected. The objective is to control the processes, identifying and rectifying processes that are in the abnormal state to ensure the quality or security of the system.
For solving the original model, we reformulated it as a MARBP with reinitializing states and we introduced a novel dynamic scheduling policy based on an index function which was further shown to be optimal in some special cases. Moreover, this optimality property was observed in more general scenarios through simulations. For the proposed model, the paper established analytically the existence of the Whittle index, obtaining a closed-form expression for it and analytically showing the suboptimality of other widely used heuristics under some conditions.
Besides the above-mentioned theoretical results that are of concern for the model introduced by this work, we believe two important conclusions can be drawn from these results which have a relevance that goes beyond the scope of this paper. The first one relates to the design of simple tractable heuristics based on myopic approaches, and is in stark contrast to the results valid for the MARBP models studied in [10,11], in which myopic rules were optimal or nearly optimal. This paper shows that in problems in which the passive action has a recovery effect on the states of arms (a situation likely to occur when arms suffer from exhaustion, such as human resources), those policies which do not advise arms to rest are very likely to be substantially far from optimal. Hence, if we must use a heuristic policy in such instances, it is more reasonable to deploy a heuristic that is defined in such a way that it will cyclically alternate between working and resting every arm. As shown in this paper, for the present model and in the case β = 1 and c = 0 any cycled policy of this sort, regardless of the cycle composition, will be optimal in terms of the objective function.
The second conclusion is regarding the potentially good performance of the Whittle index policy as an approximate solution method for POMDP models. In this particular case, given the simplicity of the model, this rule turns out to be optimal in many instances. We believe that the results reported in this paper suggest that in more complex cases (such as the model in [16]), the application of this approach may lead to designing a rule that may offer significant performance gains at a feasible computational cost. We regard this direction as a highly fruitful one to continue research.
Finally, it is our hope that this work will help stimulate the development of tractable decision-making rules for models that extend the present formulation. Examples include a more complex problem that includes false positives, the case in which targets can re-appear in the sites after being detected, or even a model in which smart targets have a different reaction than the one modelled here.
Acknowledgements
The author is grateful to Peter Jacko for his contributions to the author's thinking about bandit problems and for his immense encouragement for the completion of this paper. This work was partially supported by grant SA-2012/00331 of the Department of Industry, Innovation, Trade and Tourism (Basque Government).
APPENDIX
Proof of Lemma 3.3
Lemma 3.3 states the validity of the following results:
Proof: We show these results by induction. We start with part (a), and show that
For t = 1, by definition of it holds
Next, if we let it be true for some t ≥ 1, we then have that
Thus, it holds that
Next, using the expression (7) that
| (A.1) |
we conclude that
Finally, straightforward algebra yields that
which completes the proof of part (a).
Next, we shall prove part (b).
Notice that admits expression (7) because is the tth iterate of a Möbius Transformation defined by whose associated matrix is
Thus, by properties of Möbius Transformations, ϕt(p) is also a Möbius Transformation with associated matrix (Φ1)t.
We shall also prove part (b) by induction. By definition and it holds for t = 1:
Next, let it be true for t ≥ 1, that is,
Then, it holds that
Next, substituting ϕt+1(p) by its equal according to expression (7), we get that
which shows that part (b) holds for all natural t, and hence completes the proof of Lemma 3.3.
References
- 1. Ahmad S, Liu M, Javidi T, Zhao Q, Krishnamachari B. Optimality of myopic sensing in multichannel opportunistic access. IEEE Transactions on Information Theory. 2009;55(9):4040–4050.
- 2. Gittins JC, Jones DM. A dynamic allocation index for the sequential design of experiments. In: Gani J, editor. Progress in Statistics. Amsterdam: North-Holland; 1974. pp. 241–266.
- 3. Gittins JC, Jones DM. Bandit processes and dynamic allocation indices (with discussion). Journal of the Royal Statistical Society B. 1979;41:148–177.
- 4. Glazebrook KD, Ruiz-Hernandez D, Kirkbride C. Some indexable families of restless bandit problems. Advances in Applied Probability. 2006;38(3):643–672.
- 5. Glazebrook KD, Hodge DJ, Kirkbride C. Monotone policies and indexability for bidirectional restless bandits. Advances in Applied Probability. 2013;45(1):51–85.
- 6. Jacko P. Optimal index rules for single resource allocation to stochastic dynamic competitors. Proceedings of ValueTools; 2011. pp. 425–433.
- 7. Jacko P, Sanso B. Optimal anticipative congestion control of flows with time-varying input stream. Performance Evaluation. 2012;69(2):86–101.
- 8. Jacko P, Villar SS. Opportunistic schedulers for optimal scheduling of flows in wireless systems with ARQ feedback. 24th International Teletraffic Conference (ITC). IEEE; 2012. pp. 1–8.
- 9. Kreucher C, Blatt D, Hero A, Kastella K. Adaptive multi-modality sensor scheduling for detection and tracking of smart targets. Digital Signal Processing. 2006;16(5):546–567.
- 10. Liu K, Zhao Q. Indexability of restless bandit problems and optimality of Whittle index for dynamic multichannel access. IEEE Transactions on Information Theory. 2010;56(11):5547–5567.
- 11. Liu K, Weber R, Zhao Q. Indexability and Whittle index for restless bandit problems involving reset processes. Proceedings of the 50th IEEE Conference on Decision and Control and European Control Conference (CDC-ECC); 2011. pp. 7690–7696.
- 12. Lovejoy WS. Computationally feasible bounds for partially observed Markov decision processes. Operations Research. 1991;39(1):162–175.
- 13. Mansourifard P, Javidi T, Krishnamachari B. Optimality of myopic policy for a class of monotone affine restless multi-armed bandits. Proceedings of the 51st IEEE International Conference on Decision and Control (CDC); 2012.
- 14. Niño-Mora J. Restless bandits, partial conservation laws and indexability. Advances in Applied Probability. 2001;33(1):76–98.
- 15. Niño-Mora J. Dynamic priority allocation via restless bandit marginal productivity indices. Top. 2007;15(2):161–198.
- 16. Niño-Mora J, Villar SS. Sensor scheduling for hunting elusive hiding targets via Whittle's restless bandit index policy. 5th International Conference on Network Games, Control and Optimization (NetGCooP); 2011. pp. 1–8.
- 17. Papadimitriou CH, Tsitsiklis JN. The complexity of optimal queuing network control. Mathematics of Operations Research. 1999;24(2):293–305.
- 18. Sondik EJ. The optimal control of partially observable Markov processes over the infinite horizon: discounted costs. Operations Research. 1978;26(2):282–304.
- 19. Villar SS. Restless bandit models for sensor management. Germany: LAP LAMBERT Academic Publishing; 2012.
- 20. Weber RR, Weiss G. On an index policy for restless bandits. Journal of Applied Probability. 1990;27(3):637–648.
- 21. Whittle P. Restless bandits: Activity allocation in a changing world. Journal of Applied Probability. 1988;25:287–298.
- 22. Zhao Q, Krishnamachari B, Liu K. On myopic sensing for multi-channel opportunistic access: Structure, optimality, and performance. IEEE Transactions on Wireless Communications. 2008;7(12):5431–5440.
- 23. Mathpages, Algebra, Linear Fractional Transformations. Available from http://www.mathpages.com/home/kmath464/kmath464.htm.



