A Predictive Approach to Nonparametric Inference for Adaptive Sequential Sampling of Psychophysical Experiments

Stephan Poppe; Philipp Benner; Tobias Elze

doi:10.1016/j.jmp.2012.04.002

Author manuscript; available in PMC: 2013 Jun 1.

Published in final edited form as: J Math Psychol. 2012 Jun 1;56(3):179–195. doi: 10.1016/j.jmp.2012.04.002

PMCID: PMC3399698 NIHMSID: NIHMS374479 PMID: 22822269

Abstract

We present a predictive account of adaptive sequential sampling of stimulus-response relations in psychophysical experiments. Our discussion applies to experimental situations with ordinal stimuli when there is only weak structural knowledge available, such that parametric modeling is not an option. By introducing a certain form of partial exchangeability, we successively develop a hierarchical Bayesian model based on a mixture of Pólya urn processes. Suitable utility measures permit us to optimize the overall experimental sampling process. We provide several measures that are based either on simple count statistics or on more elaborate information theoretic quantities. The actual computation of information theoretic utilities often turns out to be infeasible. This is not the case with our sampling method, which relies on an efficient algorithm to compute exact solutions of our posterior predictions and utility measures. Finally, we demonstrate the advantages of our framework on a hypothetical sampling problem.

Keywords: Adaptive Sequential Sampling, Optimal Design, Active Learning, Predictive Inference, Psychophysics, Efficient Statistical Computations

1. Introduction and Motivation

The application of adaptive measurement methods has a long tradition in psychophysics. The need for such methods is mainly due to the limited number of measurements that can be taken during experiments. Most of the classical methods are motivated by their simplicity, both conceptually and computationally. With the advent of modern computers and the continuing progress in statistical theory, the development of more sophisticated adaptive sampling procedures has recently seen much progress.

Especially the consideration of Bayesian experimental designs based on the information theoretic description of experimental objectives and their numerical approximation (cf. MacKay (1992); Chaloner and Verdinelli (1995)) has moved into the focus of contemporary research. Recent developments are, for instance, the Ψ-method introduced in Kontsevich and Tyler (1999), the consideration of multidimensional stimulus spaces in Kujala and Lukka (2006), and the framework of adaptive design optimization (ADO) for model discrimination proposed in Cavagnaro et al. (2010). Another interesting example is given by Kujala (2010) who considers random cost as a further constraint for experimental observations.

Our contribution to the field is twofold. First, we address the case where no particular statistical model in the form of parametric curves can be assumed. We present a complete formal description of a suitable framework for such a nonparametric setting. Second, we do not rely on any numerical approximation: the quantities of interest can be computed efficiently and exactly.

Our method applies to the following experimental setting: In a psychophysical experiment, the causal relation X → Y between a physical stimulus X and the psychological response Y of an observer is investigated. Before the experiment, a discrete set of L stimuli 𝒳 = {x₁, …, x_L} and a set of K possible responses 𝒴 = {y₁, …, y_K} are determined. The stimuli are considered to be ordinal, i.e. the set of stimuli is assumed to be linearly ordered, for instance by strength or any other property associated with the physical parameters. The actual experiment is then performed in a sequential manner. That is, at the n-th stage of the experiment, a particular stimulus X_n ∈ 𝒳 is set and the participant’s response Y_n to that stimulus is recorded.

Figure 1 outlines such an experiment for illustrative purposes, taken from Elze et al. (2011), Experiment 2. Figure 1A shows the experimental setup that involves a two-alternative-forced-choice (2AFC) discrimination task: An observer has to report which of two possible target stimuli has been presented on a computer monitor. The discrimination performance was impaired by the presentation of a second stimulus, the so-called mask stimulus, at a position close to the target location. Within this experimental setting, the actual stimulus of interest X is the time interval between the offset of the target and the onset of the mask, the so-called interstimulus interval (ISI), which takes values in 𝒳 = {20 ms, 40 ms, …, 200 ms}, whereas the response Y takes values in 𝒴 = {correct, incorrect}, depending on whether or not the observer reported the correct target stimulus.

Figure 1. Illustrative example of a simple psychophysical experiment. A: The observer has to report which of two possible target stimuli was presented on a computer display. The detection of the target stimulus is impaired by a mask stimulus which is presented in close proximity to the position of the target after a variable interstimulus interval (ISI). B: The observed relative frequency of correct responses of a single observer for 20 different discrete ISIs ≤ 200 ms, which have been uniformly sampled. Each ISI occurred 16 times.

One of the most often considered approaches to model such an experimental situation is to assume a multinomial sampling law. More specifically, given any stimulus {X = x}, the generation of the response Y is thought to be described by multinomial parameters $p_{x, Y} = {(p_{x, y})}_{y \in Y} \in Δ_{Y}$, where

Δ_{Y} = {p_{Y} = {(p_{y})}_{y \in Y} \in R^{Y} ∣ p_{y} \geq 0 for all y \in Y and \sum_{y \in Y} p_{y} = 1}

(1)

is the probability simplex, such that the conditional probability of the event {Y = y} given {X = x} is

P [Y = y ∣ X = x, p_{X, Y}] = p_{x, y} .

Within this setting, the statistical task is then entirely focused on the estimation of the psychometric rates $p_{X, Y} = {(p_{x, Y})}_{x \in X} \in Δ_{Y}^{X}$. A low-dimensional parametric family of functions ℱ = {f_θ | θ ∈ Θ} can be used to introduce dependencies among the psychometric rates, such that

P [Y = y ∣ X = x, θ] = f_{θ} (x, y) .

This facilitates the inference task by exploiting structural knowledge about the interlink between the stimuli. The function f_θ is commonly termed psychometric function in the particular case of a binary response, e.g. a 2AFC experiment as outlined above. Especially sigmoid curves are a common choice if the stimulus X can be considered to be real-valued. Their geometric parameters such as slope and threshold can serve to actually define relevant psychophysical quantities of interest. There exists a vast literature on the statistical inference of psychometric rates and functions, see for instance Wichmann and Hill (2001); Kuss et al. (2005), which provide a good entry point into the literature.

From a purely statistical point of view, this parametric approach seems to be a reasonable strategy as it allows sharing statistical strength across stimuli. By learning the psychometric rate for a particular stimulus we also learn about all other rates because dependencies are introduced by the parametric family. Hence, the parametric approach allows seemingly good estimates even with few experimental data. Nevertheless, this modeling approach is unsuitable and can even bear the risk of a severe bias if there is no or only vague knowledge about the potential shapes of the functions f_θ and no member of the proposed parametric family ℱ matches the actual psychometric rates. For instance, in the above example it seems hard to motivate any plausible regular functional form (see Figure 1B).

This form of bias is of course avoided by allowing the psychometric rates to freely range over $Δ_{Y}^{X}$ , which we refer to as the nonparametric approach¹. Clearly, much more data is then needed to draw informative inferences.

In this paper, we use an intermediate approach fairly balancing the advantages and disadvantages of both approaches by exploiting the fact that 𝒳 is of ordinal structure. Loosely speaking, we allow neighboring stimuli to share statistical strength by joining their respective psychometric rates. Each possible way of joining neighboring stimuli imposes a particular partition on the stimulus space 𝒳, which is then assessed by a suitable Bayesian inference scheme. The resulting model is a variant of the product partition model proposed by Hartigan (1990) and the inhomogeneous Bernoulli process with piecewise constant probabilities described in Endres et al. (2008).

In our description of the model and the respective adaptive sampling procedures we follow the predictive paradigm as pioneered, for instance, in Roberts (1965); de Finetti (1974); Geisser (1993), by putting special emphasis on the prediction for the observables, which are the stimulus-response outcomes of the sequential experiment. By imposing a particular epistemic condition of partial exchangeability, the psychometric rates naturally emerge as a particular limiting statistic of the data rather than an external quantity. The corresponding Bayesian model matches extensionally with a multinomial sampling model with unknown parameters.

2. A Predictive Perspective on Sequential Experiments

2.1. Sequential Construction of Adaptive Sampling Processes

A reasonable adaptive experimental design requires that we have at least partial control of the sampling process concerning the presentation of stimuli. This experimental controllability can be subject to uncertainty, e.g. the experimental setup might be prone to errors in generating the required stimulus. For simplicity, we shall nonetheless assume that we can adjust the stimulus in the way we want. Our personal action policy, which is the subjective assessment of which stimulus might be best to choose, is then expressed by a probability measure P [X]. By setting a stimulus X w.r.t. P [X], we subsequently observe a respective response Y, where our prediction is described by a conditional measure P [Y |X].

In the following, we shall extend this scenario to sequential sampling schemes. We consider that at any sampling step n ∈ ℕ we can freely choose a stimulus X_n that results in the observation of an instance (X_n, Y_n). Let E_n = (X_n, Y_n) denote the n-th experiment, such that we refer to the first n experiments of the overall experimental process E = (E_n)_{n∈ℕ} by 𝐄_n = (E₁, …, E_n) = (𝐗_n, 𝐘_n), where 𝐗_n = (X₁, …, X_n) and 𝐘_n = (Y₁, …, Y_n). Two important statistics that summarize the data from the experiments 𝐄_n are given by the total count statistic $n_{X, Y} = {(n_{x, Y})}_{x \in X} = {(n_{x, y})}_{x \in X, y \in Y}$, where $n_{x, y}$ is the (absolute) frequency of the event {X = x, Y = y} in 𝐄_n, and the related stimulus count statistic $n_{X} = {(n_{x})}_{x \in X}$, i.e. $n_{x} = \sum_{y \in Y} n_{x, y}$. We illustrate this notation by the following short example.

Example 1

Suppose we run an experiment with a set of stimuli 𝒳 = {a, b, c, d} and dichotomous responses 𝒴 = {0, 1}. If in eight trials we observe that

\mathbf{E}_{8} = \begin{pmatrix} \mathbf{X}_{8} \\ \mathbf{Y}_{8} \end{pmatrix} = \begin{pmatrix} X_{1} & X_{2} & X_{3} & X_{4} & X_{5} & X_{6} & X_{7} & X_{8} \\ Y_{1} & Y_{2} & Y_{3} & Y_{4} & Y_{5} & Y_{6} & Y_{7} & Y_{8} \end{pmatrix} = \begin{pmatrix} a & b & a & c & d & b & b & c \\ 1 & 1 & 0 & 1 & 0 & 1 & 0 & 1 \end{pmatrix},

(2)

then the two statistics just defined are

n_{X, Y} = \begin{pmatrix} n_{a, Y} & n_{b, Y} & n_{c, Y} & n_{d, Y} \end{pmatrix} = \begin{pmatrix} n_{a, 0} & n_{b, 0} & n_{c, 0} & n_{d, 0} \\ n_{a, 1} & n_{b, 1} & n_{c, 1} & n_{d, 1} \end{pmatrix} = \begin{pmatrix} 1 & 1 & 0 & 1 \\ 1 & 2 & 2 & 0 \end{pmatrix}, \qquad n_{X} = (n_{a}, n_{b}, n_{c}, n_{d}) = (2, 3, 2, 1) .
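
The two statistics of Example 1 can be recomputed mechanically. The following Python sketch tallies them from the trial sequence in (2); the function name `count_statistics` and the dictionary-based layout are illustrative choices and not part of the original formulation.

```python
from collections import Counter

def count_statistics(stimuli, responses):
    """Return (n_xy, n_x): joint stimulus-response counts and stimulus counts."""
    n_xy = Counter(zip(stimuli, responses))  # n_{x,y}
    n_x = Counter(stimuli)                   # n_x = sum over y of n_{x,y}
    return n_xy, n_x

# Trial sequence of Example 1, see (2)
X8 = ["a", "b", "a", "c", "d", "b", "b", "c"]
Y8 = [1, 1, 0, 1, 0, 1, 0, 1]

n_xy, n_x = count_statistics(X8, Y8)

# Rows are responses 0 and 1, columns are stimuli a, b, c, d
table = [[n_xy[(x, y)] for x in "abcd"] for y in (0, 1)]
stim_counts = [n_x[x] for x in "abcd"]
```

Pairs that never occurred are reported with count zero, because a `Counter` returns 0 for missing keys.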

We now take a sequential and prediction oriented perspective. We fix our expectations about the experimental course E by specifying a sequence of conditional measures in form of kernels π_n(E_n₊₁ | E_n), n ∈ ℕ₀, where each π_n describes our uncertainty about the outcome of the n + 1-th experiment given the experimental data from the previous n sampling steps. Each such sequence of kernels π_n(· | ·), n ∈ ℕ₀, defines a unique measure π on the space of experimental courses, such that marginally

π (E_{n}) = \prod_{i = 0}^{n - 1} π_{i} (E_{i + 1} ∣ E_{i}), n \in N .

Thus, we can formally describe the experimental course E as a random process, which we call the adaptive sampling process E distributed with respect to π (i.e. E ~ π). This process describes our belief dynamics since it is constructed from our experimental predictions (π_n)_{n∈ℕ₀}. Each experimental prediction π_n can in turn be constructed from two separate kernels

π_{n} (E_{n + 1} ∣ E_{n}) = π_{n}^{X} (X_{n + 1} ∣ E_{n}) π_{n}^{Y} (Y_{n + 1} ∣ X_{n + 1}, E_{n}) .

There exists a very natural interpretation for each of these kernels in terms of experimental design and prediction:

The Stimulus Placement Rule: $π_{n}^{X} (X_{n + 1} ∣ E_{n})$ specifies our action policy in the n-th sampling step. It determines which stimulus X_n₊₁ we select given the n previous experiments E_n.
The Response Prediction Rule: $π_{n}^{Y} (Y_{n + 1} ∣ X_{n + 1}, E_{n})$ is our prediction of the response knowing the outcome of the last n experiments E_n, i.e. which response Y_n₊₁ we expect given the actual stimulus X_n₊₁.

Given a particular assignment for the prediction rule $π_{n}^{Y}$ , we want to learn the stimulus-response relation X → Y in an optimal manner with regard to the inference scheme and external constraints. Thus, we want to derive a placement rule $π_{n}^{X}$ that allows us to adapt the experimental course to our objectives.

In order to understand the logic of adaptive sampling strategies it is worthwhile to first consider the case of a non-adaptive design. Such a conventional design consists of a fixed sequence of stimuli $x^{*} = {(x_{n}^{*})}_{n \in N}$ , such that

π_{n}^{X} (X_{n + 1} = x ∣ E_{n}) = \begin{cases} 1 & \text{if } x = x_{n + 1}^{*} \\ 0 & \text{otherwise} \end{cases} .

Clearly, such an action policy does not take any information about the already collected data into account. A more reasonable strategy allows the decision for a stimulus $x_{n + 1}^{*}$ to depend on the outcomes of the n foregoing experiments E_n, i.e. $x_{n + 1}^{*} = x_{n + 1}^{*} (E_{n})$ . More precisely, instead of describing one fixed sequence of stimuli we rather specify a decision rule that determines a stimulus $x_{n + 1}^{*}$ on the basis of the previous experiments E_n. Many of the classical adaptive procedures for testing psychometric functions, such as PEST (Taylor and Creelman (1967)), QUEST (Watson and Pelli (1983)) and the up-down procedures (Levitt (1971)) can be described that way. A comprehensive review of these methods can be found in Leek (2001).

One principled way to obtain a decision rule is to specify a utility measure U_n₊₁(x, E_n), which quantifies the utility of a stimulus x based on the outcome of the previous experiments E_n. Given such a measure, an optimal stimulus is determined by

x_{n + 1}^{*} (E_{n}) = \underset{x \in X}{arg max} U_{n + 1} (x, E_{n}) .

(3)

A proper placement rule in the case of multiple optimal stimuli, collected in the set $O_{n + 1} (E_{n})$ ⊆ 𝒳, is given by

π_{n}^{X} (X_{n + 1} ∣ E_{n}) = \begin{cases} \frac{1}{∣ O_{n + 1} (E_{n}) ∣} & \text{if } X_{n + 1} \in O_{n + 1} (E_{n}) \\ 0 & \text{otherwise} \end{cases} .

Hence, by taking on the utility-oriented perspective, the problem of determining a placement rule becomes a problem of choosing suitable utility measures U_n₊₁, n ∈ ℕ₀. Many possible utility measures exist and which one we choose depends solely on our objectives. For instance, if we want to place the stimuli in a random but balanced manner, then we could choose the utility measure

U_{n + 1} (x, E_{n}) = - n_{x} (E_{n}),

where n_x is the stimulus count statistic of x. Clearly, the respective placement rule selects stimuli that have seen the least trials. This random uniform sampling scheme has also been called method of constant stimuli (cf. McKee et al. (1985)) within the context of psychophysical experiments. It is usually considered as a non-adaptive strategy (Watson and Fitzhugh (1990)).
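
A utility-based placement rule with uniform tie-breaking can be sketched in a few lines of Python. The helper names `optimal_set` and `place_stimulus` are hypothetical, and the counts are those of Example 1; the sketch only illustrates rule (3) together with the balanced utility, it is not a full experimental controller.

```python
import random

def optimal_set(utility, stimuli):
    """Stimuli that maximize the utility, i.e. the set of optimal stimuli."""
    values = {x: utility(x) for x in stimuli}
    best = max(values.values())
    return [x for x in stimuli if values[x] == best]

def place_stimulus(utility, stimuli, rng=random):
    """Placement rule: draw uniformly from the set of optimal stimuli."""
    return rng.choice(optimal_set(utility, stimuli))

# Balanced sampling: U_{n+1}(x, E_n) = -n_x(E_n)
stimuli = ["a", "b", "c", "d"]
n_x = {"a": 2, "b": 3, "c": 2, "d": 1}  # stimulus counts from Example 1

def balanced_utility(x):
    return -n_x[x]

next_stimulus = place_stimulus(balanced_utility, stimuli)
```

With these counts the optimal set contains only stimulus d, which has been presented least often; after one more trial at d, stimuli a, c and d are tied and the rule picks among them at random.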

Another common and more sensible strategy to obtain a placement rule is suggested by the theory of optimal sequential decisions under uncertainty (cf. DeGroot (2004); Berger (1993)). In principle we should consider that only a finite number of experiments is performed, say N ∈ ℕ, such that we should formulate our objectives in the form of a global utility measure u(E_N) for the outcome of the overall experimental course E_N. As a matter of rationality, one should choose a sequence of decision rules $x_{N}^{*} (E_{N - 1}) = (x_{1}^{*}, x_{2}^{*} (E_{1}), \dots, x_{N}^{*} (E_{N - 1}))$, such that the expected global utility

\bar{u} = \sum_{y_{N} \in Y^{N}} u (x_{N}^{*}, y_{N}) \prod_{n = 0}^{N - 1} π_{n + 1}^{Y} (Y_{n + 1} = y_{n + 1} ∣ X_{n + 1} = x_{n + 1}^{*}, X_{n} = x_{n}^{*}, Y_{n} = y_{n})

is maximized. This optimization problem is highly non-trivial, but can be solved, at least in principle, with backward induction (Berger (1993); Bernardo and Smith (1995); DeGroot (2004)). For a concise description of the backward induction method see in particular Müller et al. (2007). This procedure leads to a sequence of local utility measures u_n₊₁(x, y, E_n) that are induced from both the global utility u and the predictions π^Y. The optimal stimulus for the n + 1-th experiment is determined by maximizing the expected local utility, i.e.

x_{n + 1}^{*} (E_{n}) = \underset{x \in X}{arg max} \sum_{y \in Y} u_{n + 1} (x, y, E_{n}) π_{n}^{Y} (Y_{n + 1} = y ∣ X_{n + 1} = x, E_{n}) .

(4)

The local utility measure u_n₊₁(x, y, E_n) for a particular stimulus-response (x, y) is the expected global utility ū as of the n + 1-th sampling step given that all subsequent decisions are made in the same optimal manner. Although this scheme leads in principle to an optimal design, in most cases, except for trivial settings, the actual computation turns out to be infeasible.

It is primarily for this reason that various approximations to such an optimal design have been developed. One of them is to resort to a myopic adaptive optimal design by directly specifying local utility measures u_n₊₁(x, y, E_n), n = 0, …, N − 1, in an attempt to optimize the global utility. Likewise, the optimal stimulus is determined by (4), such that we optimize the next step, but with the hope that this also maximizes the global utility. We shall discuss particular choices for such measures based on information theoretical considerations in section 3.7.
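
The myopic rule (4) can be sketched as follows. Both the predictive probabilities and the local utility below are assumptions made purely for illustration: the numbers in `pred` are invented, and the surprisal −log π is just one possible local utility, whose expectation is the entropy of the prediction. The function names are hypothetical.

```python
import math

def expected_local_utility(x, responses, predict, local_utility):
    """Sum over y of u(x, y) * pi(y | x), the objective maximized in (4)."""
    return sum(local_utility(x, y) * predict(x, y) for y in responses)

def myopic_choice(stimuli, responses, predict, local_utility):
    """Stimulus with maximal expected local utility (first maximizer on ties)."""
    return max(
        stimuli,
        key=lambda x: expected_local_utility(x, responses, predict, local_utility),
    )

# Hypothetical predictive probabilities pi(Y = y | X = x, E_n)
pred = {"a": {0: 0.5, 1: 0.5}, "b": {0: 0.1, 1: 0.9}}

def predict(x, y):
    return pred[x][y]

def surprisal(x, y):
    # Illustrative local utility: surprise of observing y after choosing x
    return -math.log(predict(x, y))

choice = myopic_choice(["a", "b"], [0, 1], predict, surprisal)
```

Under this particular utility the rule selects the stimulus whose response is currently hardest to predict, here stimulus a.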

2.2. Partial Exchangeable Response Processes

Concerning the choice of a proper prediction rule $π_{n}^{Y}$, we already mentioned in the introduction that the response generation X → Y is usually thought to be governed by a multinomial sampling law described by some unknown psychometric rates $p_{X, Y} \in Δ_{Y}^{X}$. From a predictive perspective this amounts to the assumption of a specific form of partial exchangeability, which has been introduced in de Finetti (1980) as a generalization of the concept of exchangeability (de Finetti (1937)). Roughly speaking, we have to assume that particular temporal orderings within any finite sequence of experiments E_n do not provide relevant information for our predictions.

In order to make this more precise, we need to introduce the following statistics and respective variables. We define the response statistic $y_{n_{X}}^{X} = {(y_{n_{x}}^{x})}_{x \in X}$, where $y_{n_{x}}^{x} = (y_{1}^{x}, \dots, y_{n_{x}}^{x})$. Here $y_{i}^{x} = y$ indicates that in the i-th trial in which {X = x} occurred the response outcome was the event {Y = y}. It is crucial to notice that the count statistic $n_{X, Y}$ can also be computed from the response statistic. The random variables related to the response statistic are denoted by $Y = {(Y^{x})}_{x \in X}$ with $Y^{x} = {(Y_{i}^{x})}_{i \in N}$, such that $Y_{i}^{x} = y_{i}^{x}$. Since we often need to refer to a particular subset of Y, we also define $Y_{n_{X}}^{X} = {(Y_{n_{x}}^{x})}_{x \in X}$, where $Y_{n_{x}}^{x} = (Y_{1}^{x}, \dots, Y_{n_{x}}^{x})$, x ∈ 𝒳, i.e. $Y_{n_{X}}^{X} = y_{n_{X}}^{X}$.

Example 2

In our previous example the response statistic $y_{n_{x}}^{x}$ is given by

y_{2}^{a} = (1, 0), y_{3}^{b} = (1, 1, 0), y_{2}^{c} = (1, 1), y_{1}^{d} = (0),

such that we observed that

(Y_{2}^{a}, Y_{3}^{b}, Y_{2}^{c}, Y_{1}^{d}) = \begin{pmatrix} Y_{1}^{a} & Y_{1}^{b} & Y_{1}^{c} & Y_{1}^{d} \\ Y_{2}^{a} & Y_{2}^{b} & Y_{2}^{c} & \\ & Y_{3}^{b} & & \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & 0 \\ 0 & 1 & 1 & \\ & 0 & & \end{pmatrix} .

We now proceed as follows. Instead of specifying the prediction rule $π_{n}^{Y}$ directly, we rather fix a probability measure P for Y, i.e. Y ~ P, such that we treat Y as a random process called the response process. This response process is meant to describe our personal beliefs about the observational part of the experiments irrespective of the actual underlying sequence of stimuli X_N, which clearly depends on our action policy. A proper prediction rule is then given by

π_{n}^{Y} (Y_{n + 1} = y ∣ X_{n + 1} = x, E_{n}) = \frac{P [Y_{n_{X}}^{X}, Y_{n_{x} + 1}^{x} = y]}{P [Y_{n_{X}}^{X}]} .

(5)

Furthermore, and more importantly, the response process Y can be utilized to derive an intrinsic statistical model for X → Y by requiring the following form of partial exchangeability (cf. Link (1980); de Finetti (1980)):

The response process Y is said to be partial exchangeable iff

P [Y_{n_{X}}^{X} = y_{n_{X}}^{X}] = P [Y_{n_{X}}^{X} = {\tilde{y}}_{n_{X}}^{X}]

for every two responses $y_{n_{X}}^{X}$ and ${\tilde{y}}_{n_{X}}^{X}$ that share the same count statistic $n_{X, Y}$. This is equivalent to requiring that the response process Y is summarized by the count statistic $n_{X, Y}$ (cf. Lauritzen (1974)), i.e. every sequence $y_{n_{X}}^{X}$ with the same count statistic is predicted to be equally likely. We illustrate this notion of exchangeability by the following example.

Example 3

Suppose we observed the data set in (2), see Example 1. The alternative observations

(Y_{2}^{a}, Y_{3}^{b}, Y_{2}^{c}, Y_{1}^{d}) = \begin{pmatrix} 0 & 1 & 1 & 0 \\ 1 & 1 & 1 & \\ & 0 & & \end{pmatrix} \quad \text{or} \quad \begin{pmatrix} 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & \\ & 1 & & \end{pmatrix}

both preserve the count statistic $n_{X, Y}$, such that we would judge them and the original observation to be equally likely under the condition of partial exchangeability. Note that all of these equivalent observations are related by particular permutations, i.e. by exchanging positions within each of the columns $Y_{n_{x}}^{x}$, x ∈ 𝒳, but not across them.
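
The invariance used in Example 3 can be checked directly. The Python sketch below stores one column of responses per stimulus (the response statistic of Example 2 and the two alternatives above) and compares the resulting count statistics; the helper name `counts` and the extra data set `moved` are illustrative additions.

```python
from collections import Counter

def counts(columns):
    """Count statistic n_{x,y} computed from one response column per stimulus."""
    return {x: Counter(ys) for x, ys in columns.items()}

observed = {"a": (1, 0), "b": (1, 1, 0), "c": (1, 1), "d": (0,)}
alt_1 = {"a": (0, 1), "b": (1, 1, 0), "c": (1, 1), "d": (0,)}
alt_2 = {"a": (0, 1), "b": (1, 0, 1), "c": (1, 1), "d": (0,)}

# Counter-example: a response 1 is moved from column b to column a
moved = {"a": (1, 1), "b": (1, 0, 0), "c": (1, 1), "d": (0,)}

same_counts = counts(observed) == counts(alt_1) == counts(alt_2)
different_counts = counts(moved) != counts(observed)
```

Permuting responses within a column leaves the counts unchanged, whereas exchanging responses across columns generally does not, which is why partial exchangeability only identifies the former.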

An important subclass of partially exchangeable response processes is described by the multinomial sampling laws², where

Y^{x} ~ Mult [p_{X, Y}],

such that the marginal probability mass function is given by

f_{Mult [p_{X, Y}]} [y_{n_{X}}^{X}] = \prod_{x \in X, y \in Y} p_{x, y}^{n_{x, y}} .

This class is especially important because for each partial exchangeable random process there exists a mixing measure μ on $Δ_{Y}^{X}$ (Link (1980); Bernardo and Smith (1995)), also called a de Finetti measure, such that

P [Y_{n_{X}}^{X} = y_{n_{X}}^{X}] = \int_{Δ_{Y}^{X}} f_{Mult [p_{X, Y}]} [y_{n_{X}}^{X}] μ (d p_{X, Y}) .

Therefore each partial exchangeable random process is a mixture of the multinomial sampling laws. Within this context, each $p_{X, Y} \in Δ_{Y}^{X}$ can be interpreted as a possible limit of the relative frequencies, i.e.

lim_{n_{x} \to \infty} \frac{n_{x, Y}}{n_{x}} = p_{x, Y}, x \in X,

such that the measure μ expresses our prediction for these limits. Hence, under the condition of partial exchangeability we can formally identify these limits as the psychometric rates we are uncertain about, but with the important distinction that the rates are not external quantities but asymptotic statistics of the response process itself. The corresponding Bayesian model can thus be described by

Y^{X} ~ Mult [p_{X, Y}],

whereas

p_{X, Y} ~ μ .

By conditioning on finite data $Y_{n_{X}}^{X}$ , the resulting response process is still partial exchangeable, such that the respective de Finetti measure $μ (\cdot ∣ Y_{n_{X}}^{X})$ describes our posterior belief about the psychometric rates and can be seen to be a Bayesian posterior of p.

3. Response Processes with Proximally Related Stimuli

Here, we introduce a particularly useful instance of a partial exchangeable response process, namely the multibin Pólya mixture urn process, which allows a flexible modeling of similarities between stimuli. We first consider the case of only one stimulus and continue with the easiest non-trivial case of two stimuli before we develop the full process with multiple stimuli. Afterwards we shall consider the construction of respective adaptive sampling procedures.

3.1. One stimulus

Let us consider only one stimulus, i.e. 𝒳 = {x}, with the following probability assignment for the response process Y^x: We say that Y^x is a Pólya urn process with parameters $α_{x, Y} = {(α_{x, y})}_{y \in Y}$, where $α_{x, y} > 0$, if

P [Y_{n_{x}}^{x}] = \frac{Beta (n_{x, Y} + α_{x, Y})}{Beta (α_{x, Y})},

where Beta is the multinomial beta function. The corresponding prediction rule is

π_{n}^{Y} (Y_{n + 1} = y ∣ X_{n + 1} = x, X_{n}, Y_{n}) = \frac{n_{x, y} + α_{x, y}}{n_{x} + α_{x}},

where $α_{x} = \sum_{y \in Y} α_{x, y}$, which is commonly known as the (generalized) Bayes-Laplace rule. The parameters $α_{x, Y}$ have a natural interpretation as pseudo counts added to the actual counts $n_{x, Y}$. The term Pólya urn stems from the original introduction as an urn model in Eggenberger and Pólya (1923). In this picture, balls are successively drawn from an urn that is initially filled with K balls of different colors. Each color indicates a particular y ∈ 𝒴 and the respective ball is assigned an initial weight of $α_{x, y}$. At each sampling step, a ball is drawn with a chance proportional to its individual mass and put back into the urn together with another ball of the same color and mass one. The sequence of colors Y^x generated in this way is then described by the above Pólya urn process, which is partial exchangeable.

The respective de Finetti measure is given by a Dirichlet distribution with parameters $α_{x, Y}$, i.e.

p_{x, Y} ~ Dir [α_{x, Y}],

where the respective density is

f (p_{x, Y}) = \frac{1}{Beta (α_{x, Y})} \prod_{y \in Y} p_{x, y}^{α_{x, y} - 1} .

Conditional on finite data $Y_{n_{x}}^{x}$ , the resulting response process is again a Pólya urn process with parameters $α_{x, Y}^{'}$ given by the update rule

α_{x, y}^{'} = α_{x, y} + n_{x, y}, y \in Y,

i.e. the observed counts $n_{x, Y}$ are just added to the pseudo counts $α_{x, Y}$. Hence, assuming a Pólya urn process for describing our personal predictions justifies the formal adoption of the standard Bayesian model of multinomial sampling and a subjective prior in form of the Dirichlet distribution.
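
The relation between the sequence probability and the Bayes-Laplace rule can be verified numerically. The following Python sketch multiplies the one-step predictions along a response sequence and compares the result with the ratio of multinomial beta functions. The function names and the choice of pseudo counts α = (1, 1) are assumptions for illustration; the response sequence is the one recorded for stimulus b in Example 1.

```python
import math

def log_mbeta(alpha):
    """Log of the multinomial beta function: sum lgamma(a_y) - lgamma(sum a_y)."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))

def log_marginal(counts, alpha):
    """log P[Y^x] = log Beta(n + alpha) - log Beta(alpha)."""
    updated = [n + a for n, a in zip(counts, alpha)]  # update rule alpha' = alpha + n
    return log_mbeta(updated) - log_mbeta(alpha)

def predict(y, counts, alpha):
    """Bayes-Laplace rule: (n_{x,y} + alpha_{x,y}) / (n_x + alpha_x)."""
    return (counts[y] + alpha[y]) / (sum(counts) + sum(alpha))

alpha = [1.0, 1.0]    # illustrative pseudo counts for responses 0 and 1
sequence = [1, 1, 0]  # responses to stimulus b in Example 1

# Chain rule: accumulate the one-step predictions along the sequence
counts, log_p = [0, 0], 0.0
for y in sequence:
    log_p += math.log(predict(y, counts, alpha))
    counts[y] += 1
```

Because the product depends on the sequence only through the final counts, any reordering of `sequence` yields the same `log_p`, which is the exchangeability property of the urn process.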

3.2. Two stimuli

We now consider a stimulus space with two elements, say 𝒳 = {x₁, x₂}. Based on the previously described Pólya urn process we can think of two extremal urn models for the response process Y. First, we consider the case where the underlying mechanisms of (X = x₁) → Y and (X = x₂) → Y are identical. More precisely, we say that x₁ and x₂ are similar, denoted x₁ ~ x₂, if we expect that there is no difference in the generation of Y given either x₁ or x₂. This similarity allows us to define a bin b = {x₁, x₂}, such that we can exploit the similarity by joining the observations from both stimuli³. We call this binning scheme B₁ and conditional on it the response process Y is characterized by

P [Y_{n_{X}}^{X} ∣ B_{1}] = \frac{Beta (n_{b, Y} + α_{b, Y})}{Beta (α_{b, Y})},

where $n_{b, Y} = {(n_{b, y})}_{y \in Y}$ is the total count of observations joined in bin b, i.e. $n_{b, y} = n_{x_{1}, y} + n_{x_{2}, y}$, y ∈ 𝒴, and $α_{b, Y}$ describes the pseudo counts. The induced prediction rule is

π_{n}^{Y} (Y_{n + 1} = y ∣ X_{n + 1} = x, E_{n}, B_{1}) = \frac{n_{b, y} + α_{b, y}}{n_{b} + α_{b}},

where x ∈ b. The so described response process is partial exchangeable and the respective de Finetti measure μ(·|B₁) can informally be described by the density

f (p_{X, Y}) = \frac{1}{Beta (α_{b, Y})} \prod_{y \in Y} p_{b, y}^{α_{b, y} - 1} \prod_{x \in b} δ (p_{b, Y} - p_{x, Y}),

where δ is the Dirac delta function. That is, the measure μ(·|B₁) concentrates on the diagonal of the product simplex $Δ_{Y}^{X} = Δ_{Y}^{x_{1}} \times Δ_{Y}^{x_{2}}$ , which is the set

{p_{X, Y} \in Δ_{Y}^{X} ∣ p_{x_{1}, y} = p_{x_{2}, y}, y \in Y},

such that if we identify both psychometric rates $p_{x_{1}, Y}$ and $p_{x_{2}, Y}$ by just one rate $p_{b, Y} \in Δ_{Y}$, then this rate is described by a Dirichlet distribution with parameters $α_{b, Y}$. Similar to the case of a single stimulus, the update rule for the parameters is given by

α_{b, y}^{'} = α_{b, y} + n_{b, y}, y \in Y .

The second case we have to consider is that both mechanisms (X = x₁) → Y and (X = x₂) → Y are independent. In order to describe a respective response process, consider two independent Pólya urns both filled with the same kind of colored balls, where we mark the two urns x₁ and x₂, respectively. In each sampling step we first decide from which urn to sample and then proceed as before. To be consistent with our notation we introduce a binning scheme B₂ with two separate bins b₁ = {x₁} and b₂ = {x₂}. Instead of assigning weights to the urns directly, we assign them to our bins, i.e. $α_{b_{1}, Y}$ for the first bin and $α_{b_{2}, Y}$ for the second one. The respective response process is given by the product measure

P [Y_{n_{X}}^{X} ∣ B_{2}] = \frac{Beta (n_{b_{1}, Y} + α_{b_{1}, Y})}{Beta (α_{b_{1}, Y})} \frac{Beta (n_{b_{2}, Y} + α_{b_{2}, Y})}{Beta (α_{b_{2}, Y})} .

The so induced prediction rule is

π_{n}^{Y} (Y_{n + 1} = y ∣ X_{n + 1} = x, E_{n}, B_{2}) = \frac{n_{I_{B_{2}} (x), y} + α_{I_{B_{2}} (x), y}}{n_{I_{B_{2}} (x)} + α_{I_{B_{2}} (x)}},

(6)

where x ∈ 𝒳 and I_B₂(x) tells us the bin to which x is assigned. This response process is partially exchangeable and the respective de Finetti measure μ(·|B₂) is a product measure on $Δ_{Y}^{X}$, where $p_{x_{1}, Y} ~ Dir [α_{b_{1}, Y}]$ and $p_{x_{2}, Y} ~ Dir [α_{b_{2}, Y}]$ are independently distributed, i.e. the respective density is given by the product density

f (p_{X, Y}) = \frac{1}{Beta (α_{b_{1}, Y})} \prod_{y \in Y} p_{x_{1}, y}^{α_{b_{1}, y} - 1} \times \frac{1}{Beta (α_{b_{2}, Y})} \prod_{y \in Y} p_{x_{2}, y}^{α_{b_{2}, y} - 1} .

Conditional on finite data $Y_{n_{X}}^{X}$, the respectively updated parameters $α_{B, Y}^{'}$ are

α_{b_{i}, y}^{'} = α_{b_{i}, y} + n_{b_{i}, y}, y \in Y, i \in {1, 2} .

We can utilize both schemes B₁ and B₂ to model our beliefs about the similarity and dissimilarity between the stimuli. If B₁ represents the hypothesis that x₁ ~ x₂, then B₂ is the alternative hypothesis that x₁ ≁ x₂. If we assess our a priori belief in B₁ with a probability of P [B₁], then our a priori belief for model B₂ is P [B₂] = 1 − P [B₁]. Our overall prediction for the response process is then given by the mixture measure

P [Y_{n_{X}}^{X}] = P [Y_{n_{X}}^{X} ∣ B_{1}] P [B_{1}] + P [Y_{n_{X}}^{X} ∣ B_{2}] P [B_{2}] .

(7)

The induced prediction rule is

π_{n}^{Y} (Y_{n + 1} = y ∣ X_{n + 1} = x, E_{n}) = \frac{P [Y_{n_{X}}^{X}, Y_{n_{x} + 1}^{x} = y]}{P [Y_{n_{X}}^{X}]},

which can be rewritten as

π_{n}^{Y} (Y_{n + 1} = y ∣ X_{n + 1} = x, E_{n}) = \sum_{i = 1}^{2} π_{n}^{Y} (Y_{n + 1} = y ∣ X_{n + 1} = x, E_{n}, B_{i}) P [B_{i} ∣ Y_{n_{X}}^{X}],

(8)

where

P [B_{i} ∣ Y_{n_{X}}^{X}] = \frac{P [Y_{n_{X}}^{X} ∣ B_{i}]}{P [Y_{n_{X}}^{X}]} P [B_{i}], i \in {1, 2} .

(9)

The latter probability measure $P [B_{i} ∣ Y_{n_{X}}^{X}]$ can be interpreted as the Bayesian posterior assessment for each binning scheme B_i. The respective de Finetti measure for p is given by

μ (\cdot) = P [B_{1}] μ (\cdot ∣ B_{1}) + P [B_{2}] μ (\cdot ∣ B_{2}),

which can be seen to be a model average over B₁ and B₂. One of our major goals is to generalize this mixture to multiple stimuli and responses, for which we want to develop adaptive sequential sampling strategies.

3.3. Multiple stimuli

Here, we briefly discuss the multibin Pólya process on discrete finite stimulus spaces X = {x₁, …, x_L}, where we introduce an equivalence relation ~_B that describes similarities in X. This relation induces a partition B of X into |B| bins, called a multibin. We then bin the data with respect to the resulting multibin $B = {b_{1}, …, b_{∣ B ∣}}$ , such that we get the binned count statistic $n_{B, Y} = {(n_{b, Y})}_{b \in B}$ . In full analogy to the scenario of two stimuli, we set pseudo counts $α_{B, Y} = {(α_{b, Y})}_{b \in B}$ and fix the multibin Pólya urn process Y with

P [Y_{n_{X}}^{X} ∣ B] = \prod_{b \in B} \frac{Beta (n_{b, Y} + α_{b, Y})}{Beta (α_{b, Y})} .

The induced prediction rule is

π_{n}^{Y} (Y_{n + 1} = y ∣ X_{n + 1} = x, E_{n}, B) = \frac{n_{I_{B} (x), y} + α_{I_{B} (x), y}}{n_{I_{B} (x)} + α_{I_{B} (x)}},

where I_B (x) denotes the bin of stimulus x given the multibin B, i.e. if x ∈ b then I_B (x) = b. As in the above case of two separate stimuli, the respective de Finetti measure μ(·|B) can be informally described by the density

f (p_{X, Y}) = \prod_{b \in B} \frac{1}{Beta (α_{b, Y})} \prod_{y \in Y} p_{b, y}^{α_{b, y} - 1} \prod_{x \in b} δ (p_{b, Y} - p_{x, Y}),

such that conditional on some finite data $Y_{n_{X}}^{X}$ the relevant parameters $α_{B, Y}$ are simply updated according to

α_{b, y}^{'} = α_{b, y} + n_{b, y}, b \in B, y \in Y .
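The marginal likelihood and the prediction rule for a fixed multibin can be sketched in a few lines. This is only an illustration under stated assumptions: the data layout (a list of per-stimulus response counts), the 0-based stimulus indices, the representation of bins as tuples, and all function names are hypothetical and not taken from the paper.

```python
import math

def log_beta(a):
    """Log of the multivariate Beta function: sum(lgamma(a_y)) - lgamma(sum(a_y))."""
    return sum(math.lgamma(v) for v in a) - math.lgamma(sum(a))

def pooled_counts(counts, b):
    """Binned count statistic n_{b,Y}: response counts summed over the stimuli in bin b."""
    return [sum(counts[x][y] for x in b) for y in range(len(counts[0]))]

def log_marginal(counts, multibin, alpha):
    """log P[Y | B] as a sum over bins of log Beta(n_b + alpha_b) - log Beta(alpha_b)."""
    total = 0.0
    for b in multibin:
        n_b = pooled_counts(counts, b)
        total += log_beta([n + a for n, a in zip(n_b, alpha[b])]) - log_beta(alpha[b])
    return total

def predict(counts, multibin, alpha, x, y):
    """Prediction rule for a fixed multibin: pooled counts plus pseudo counts, normalised."""
    b = next(b for b in multibin if x in b)
    n_b = pooled_counts(counts, b)
    return (n_b[y] + alpha[b][y]) / (sum(n_b) + sum(alpha[b]))

# Hypothetical data: three stimuli, two responses; stimuli 0 and 1 share a bin.
counts = [[3, 1], [2, 2], [0, 4]]
multibin = [(0, 1), (2,)]
alpha = {(0, 1): [1.0, 1.0], (2,): [1.0, 1.0]}
print(predict(counts, multibin, alpha, x=0, y=0))  # evaluates (5 + 1) / (8 + 2)
```

Working with log Beta values rather than the Beta values themselves is a design choice of this sketch; it avoids underflow when the counts grow.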

3.4. A Hierarchical Bayesian Model for Proximally Related Stimuli

We assumed that the stimulus space X = {x₁, x₂, …, x_L} exhibits some well-ordering, as indicated by the indexing. We need to introduce some suitable terminology. Let $C (X) = {b_{i, j} = {x_{i}, x_{i + 1}, …, x_{j}} ⊆ X ∣ i ≤ j}$ denote the class of consecutive bins, where we call each partition B of X a proximal multibin if it consists only of consecutive bins. The class of all proximal multibins with m bins is denoted $P_{m} (X)$ , such that $P (X) = \cup_{m = 1}^{L} P_{m} (X)$ constitutes the class of all proximal multibins. These definitions are illustrated by the following example.

Example 4

Consider a set of stimuli X = {1, 2, 3}, such that

\begin{array}{l} C (X) = {b_{1, 1}, b_{2, 2}, b_{3, 3}, b_{1, 2}, b_{2, 3}, b_{1, 3}} \\ = {{1}, {2}, {3}, {1, 2}, {2, 3}, {1, 2, 3}} . \end{array}

Furthermore, P(X) = {B₁, B₂, B₃, B₄}, where

\begin{array}{l} B_{1} = {b_{1, 1}, b_{2, 2}, b_{3, 3}}, & B_{2} = {b_{1, 2}, b_{3, 3}}, \\ B_{3} = {b_{1, 1}, b_{2, 3}}, & B_{4} = {b_{1, 3}} . \end{array}
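The classes in Example 4 can be enumerated mechanically. The following sketch assumes that a consecutive bin b_{i,j} is represented by the index pair (i, j) and that a proximal multibin is obtained by choosing cut points between neighboring stimuli; the function names are hypothetical.

```python
from itertools import combinations

def consecutive_bins(L):
    """All consecutive bins b_{i,j} with 1 <= i <= j <= L, as index pairs (i, j)."""
    return [(i, j) for i in range(1, L + 1) for j in range(i, L + 1)]

def proximal_multibins(L):
    """All partitions of {1, ..., L} into consecutive bins, ordered by number of bins."""
    result = []
    for m in range(1, L + 1):
        # An m-bin multibin is fixed by m - 1 cut points between neighboring stimuli.
        for cuts in combinations(range(1, L), m - 1):
            bounds = [0, *cuts, L]
            result.append([(lo + 1, hi) for lo, hi in zip(bounds, bounds[1:])])
    return result

for B in proximal_multibins(3):
    print(B)
```

Since each of the L − 1 gaps between neighboring stimuli either is a cut point or is not, the number of proximal multibins is 2^(L−1), which is why the enumeration is only practical for small L.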

For each consecutive bin b ∈ C(X) we choose pseudo counts $α_{b, Y}$ and construct the following response process: By fixing a priori beliefs P [B], B ∈ P(X), we can describe the full response process Y as the mixture

P [Y_{n_{X}}^{X}] = \sum_{B \in P (X)} P [B] P [Y_{n_{X}}^{X} ∣ B] = \sum_{B \in P (X)} P [B] \prod_{b \in B} \frac{Beta (n_{b, Y} + α_{b, Y})}{Beta (α_{b, Y})} .

(10)

We shall refer to this process as the multibin Pólya mixture process. The induced prediction rule is

π_{n}^{Y} (Y_{n + 1} = y ∣ X_{n + 1} = x, E_{n}) = \frac{P [Y_{n_{X}}^{X}, Y_{n_{x} + 1}^{x} = y]}{P [Y_{n_{X}}^{X}]},

(11)

which can be rewritten as

π_{n}^{Y} (Y_{n + 1} = y ∣ X_{n + 1} = x, E_{n}) = \sum_{B \in P (X)} π_{n}^{Y} (Y_{n + 1} = y ∣ X_{n + 1} = x, E_{n}, B) P [B ∣ Y_{n_{X}}^{X}],

(12)

where

P [B ∣ Y_{n_{X}}^{X}] = \frac{P [Y_{n_{X}}^{X} ∣ B]}{P [Y_{n_{X}}^{X}]} P [B], B \in P (X),

(13)

is the posterior for the multibins B ∈ P(X). The Bayesian model that corresponds to the multibin Pólya mixture process is described by the hierarchical model

\begin{array}{l} B ~ P [\cdot] \\ p_{X, Y} ~ μ (\cdot ∣ B) \\ Y^{X} ~ Mult [p_{X, Y}] . \end{array}

(14)
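For a handful of stimuli, equations (10) to (13) can be evaluated by direct enumeration. The sketch below assumes a uniform prior P [B] over all proximal multibins and identical pseudo counts a for every bin and response; both are simplifying assumptions of this illustration, as are the function names and the 0-based half-open ranges (lo, hi) used to represent bins.

```python
import math
from itertools import combinations

def log_beta(a):
    """Log of the multivariate Beta function."""
    return sum(math.lgamma(v) for v in a) - math.lgamma(sum(a))

def multibins(L):
    """All proximal multibins of {0, ..., L-1} as lists of half-open ranges (lo, hi)."""
    for m in range(1, L + 1):
        for cuts in combinations(range(1, L), m - 1):
            bounds = [0, *cuts, L]
            yield list(zip(bounds, bounds[1:]))

def bin_weight(counts, lo, hi, a):
    """Beta(n_b + alpha_b) / Beta(alpha_b) with alpha_{b,y} = a for every response y."""
    K = len(counts[0])
    n_b = [sum(counts[x][y] for x in range(lo, hi)) for y in range(K)]
    return math.exp(log_beta([n + a for n in n_b]) - log_beta([a] * K))

def posterior(counts, a=1.0):
    """Equation (13) under a uniform prior P[B] (an assumption of this sketch)."""
    Bs = list(multibins(len(counts)))
    w = [math.prod(bin_weight(counts, lo, hi, a) for lo, hi in B) for B in Bs]
    Z = sum(w)
    return Bs, [v / Z for v in w]

def predict(counts, x, y, a=1.0):
    """Equation (12): posterior-weighted average of the per-multibin predictions."""
    Bs, post = posterior(counts, a)
    K = len(counts[0])
    total = 0.0
    for B, p in zip(Bs, post):
        lo, hi = next(r for r in B if r[0] <= x < r[1])
        n_by = sum(counts[s][y] for s in range(lo, hi))
        n_b = sum(sum(counts[s]) for s in range(lo, hi))
        total += p * (n_by + a) / (n_b + K * a)
    return total

# Hypothetical counts: stimuli 0 and 1 respond alike, stimulus 2 differs.
data = [[5, 0], [4, 1], [0, 5]]
Bs, post = posterior(data)
for B, p in zip(Bs, post):
    print(B, round(p, 3))
```

In this toy data set most of the posterior mass should fall on the multibin that joins the first two stimuli and separates the third, which illustrates how the mixture shares statistical strength only where the data support it.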

There are many ways to look at the hierarchical model described above. For instance, the model can be seen as a Bayesian regression model for the psychometric rates, where a non-trivial prior of the form

μ (\cdot) = \sum_{B \in P (X)} μ (\cdot ∣ B) P [B]

is chosen. This prior assigns probability mass to particular diagonals of the product simplex, such that the psychometric rates become piecewise constant w.r.t. a particular multibin, which is in turn assumed to be a random quantity. From that point of view, the described model can be seen as a generalization of the inhomogeneous Bernoulli process described in Endres et al. (2008). Likewise, the model can be interpreted as a particular clustering model in which observations are clustered with respect to an unknown partition of the data space. From this point of view, the model is related to the so-called product partition model as introduced by Hartigan (1990), which requires a particular prior structure for P [B] that will be discussed next.

3.5. Efficient Model Evaluation

The actual computation of the sum-product $\sum_{B \in P (X)} \prod_{b \in B}$ in equation (10) can become computationally very demanding and infeasible as it may take up to O(2^{L−1}) steps. This is especially problematic for adaptive sampling where the relevant quantities have to be computed as quickly as possible. However, based on the computational approaches taken by Yao (1984); Barry and Hartigan (1992); Endres and Földiák (2005); Fernhead (2006); Hutter (2007) it can be shown that particular prior structures of P [B] lead to a drastic reduction of the computational effort. In fact, it can be reduced to O(L³) if

P [B] = \frac{1}{c (β, γ)} β_{∣ B ∣} \prod_{b \in B} γ_{b},

(15)

where |B| is the number of bins in B. The respective computational algorithm, to which we refer as Proximal Multibin Summation (ProMBS), is given in lemma 1 (see appendix). It is an abstracted and generalized version of the algorithm presented in Endres and Földiák (2005). The parameters $γ = {(γ_{b})}_{b \in C (X)}$ , with γ_b ≥ 0, assess the a priori importance of each consecutive bin b ∈ C(X) and have also been called cohesions in Barry and Hartigan (1992), whereas β = (β₁, …, β_L), with β_l ≥ 0, determine a relative weight for each class $P_{m} (X)$ , m = 1, …, L, in P(X).

Given a set of parameters (α, β, γ) a particular multibin Pólya mixture process is fixed that describes our a priori belief about the response process Y. All relevant posterior quantities are determined by the updated parameters

\begin{array}{l} α_{b, y}^{'} = α_{b, y} + n_{b, y} \\ β_{m}^{'} = \frac{β_{m}}{d (α, β, γ, n_{X, Y})} \\ γ_{b}^{'} = γ_{b} \frac{Beta (n_{b, Y} + α_{b, Y})}{Beta (α_{b, Y})} \end{array}

where

d (α, β, γ, n_{X, Y}) : = \sum_{B \in P (X)} β_{∣ B ∣} \prod_{b \in B} γ_{b} \frac{Beta (n_{b, Y} + α_{b, Y})}{Beta (α_{b, Y})} .

The ProMBS algorithm can be used to efficiently compute d(α, β, γ, n), which will reappear in many other relevant expressions. For instance, we can rewrite equation (10) as

P [Y_{n_{X}}^{X}] = \frac{d (α, β, γ, n_{X, Y})}{c (β, γ)},

whereas the normalization constant in (15) is given by

c (β, γ) = d (α, β, γ, 0),

such that the prediction rule in equation (11) becomes

π_{n}^{Y} (Y_{n + 1} = y ∣ X_{n + 1} = x, E_{n}) = \frac{d (α, β, γ, n_{X, Y}^{+ (x, y)})}{d (α, β, γ, n_{X, Y})},

where $n_{X, Y}^{+ (x, y)}$ is the count statistic n incremented by one count for the event {X = x, Y = y}. Likewise, given a stimulus x ∈ X, the marginal posterior density of the respective psychometric rate $p_{x, Y}$ is given by

f (p_{x, Y} ∣ Y_{n_{X}}^{X}) = \frac{d (α, β, \tilde{γ} (p_{x, Y}), n_{X, Y})}{d (α, β, γ, n_{X, Y})},

where

{\tilde{γ}}_{b} (p_{x, Y}) = {\begin{array}{l} γ_{b} \frac{\prod_{y \in Y} p_{x, y}^{n_{b, y} + α_{b, y} - 1}}{Beta (n_{b, Y} + α_{b, Y})} & if x \in b \\ γ_{b} & otherwise \end{array} .

The pointwise evaluation of this density can get computationally very expensive, but is nevertheless possible without Monte Carlo methods. Alternatively, the density $f (p_{x, Y} ∣ Y_{n_{X}}^{X})$ can be described by its moments. Here, the k-th raw moment is

E [{(p_{x, y})}^{k} ∣ Y_{n_{X}}^{X}] = \frac{d (α, β, γ, n_{X, Y}^{+ k (x, y)})}{d (α, β, γ, n_{X, Y})},

where we add k events {X = x, Y = y} to the count statistic n. We can also compute the posterior for the multibins (13)

P [B ∣ Y_{n_{X}}^{X}] = \frac{β_{∣ B ∣} \prod_{b \in B} γ_{b} \frac{Beta (n_{b, Y} + α_{b, Y})}{Beta (α_{b, Y})}}{d (α, β, γ, n_{X, Y})},

and by introducing a variable M ∈ {1, …, L} that restricts the model to multibins from $P_{M} (X)$ , we can compute

P [M = m ∣ Y_{n_{X}}^{X}] = \frac{\sum_{B \in P_{m} (X)} β_{m} \prod_{b \in B} γ_{b} \frac{Beta (n_{b, Y} + α_{b, Y})}{Beta (α_{b, Y})}}{d (α, β, γ, n_{X, Y})}

which is the posterior assessment that X → Y is described by a multibin model with m bins. Since neighboring stimuli potentially share strength, it is of interest to quantify to which extent this is happening. Such an informative statistic is the effective count n̄_x, which we define as the expectation value

{\bar{n}}_{x} (Y_{n_{X}}^{X}) = \sum_{B \in P (X)} n_{I_{B} (x)} P [B ∣ Y_{n_{X}}^{X}],

(16)

where I_B (x) = b if x ∈ b and b ∈ B. An efficient evaluation is possible because

{\bar{n}}_{x} (Y_{n_{X}}^{X}) = \frac{d (α, β, \tilde{γ}, n_{X, Y})}{d (α, β, γ, n_{X, Y})},

where

{\tilde{γ}}_{b} = {\begin{array}{l} γ_{b} n_{b} & if x \in b \\ γ_{b} & otherwise \end{array} .
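The way in which several posterior quantities reduce to ratios of d can be checked numerically on a small example. In the sketch below d is evaluated by brute-force enumeration rather than by ProMBS, the pseudo counts are the same value alpha for every bin and response, and the cohesions are passed as a function of the bin; these choices and all names are assumptions made for illustration only.

```python
import math
from itertools import combinations

def log_beta(a):
    """Log of the multivariate Beta function."""
    return sum(math.lgamma(v) for v in a) - math.lgamma(sum(a))

def multibins(L):
    """All proximal multibins of {0, ..., L-1} as lists of half-open ranges (lo, hi)."""
    for m in range(1, L + 1):
        for cuts in combinations(range(1, L), m - 1):
            bounds = [0, *cuts, L]
            yield list(zip(bounds, bounds[1:]))

def d(counts, beta, gamma, alpha=1.0):
    """Brute-force d(alpha, beta, gamma, n); gamma(lo, hi) is the cohesion of a bin."""
    K = len(counts[0])
    total = 0.0
    for B in multibins(len(counts)):
        term = beta[len(B) - 1]
        for lo, hi in B:
            n_b = [sum(counts[x][y] for x in range(lo, hi)) for y in range(K)]
            ratio = math.exp(log_beta([n + alpha for n in n_b]) - log_beta([alpha] * K))
            term *= gamma(lo, hi) * ratio
        total += term
    return total

def predict(counts, beta, gamma, x, y):
    """Prediction rule as the ratio d(n + one count at (x, y)) / d(n)."""
    plus = [row[:] for row in counts]
    plus[x][y] += 1
    return d(plus, beta, gamma) / d(counts, beta, gamma)

def effective_count(counts, beta, gamma, x):
    """Effective count: cohesions of bins containing x are multiplied by n_b."""
    def gamma_tilde(lo, hi):
        n_b = sum(sum(counts[s]) for s in range(lo, hi))
        return gamma(lo, hi) * (n_b if lo <= x < hi else 1.0)
    return d(counts, beta, gamma_tilde) / d(counts, beta, gamma)

n = [[3, 1], [2, 2], [0, 4]]      # hypothetical counts for three stimuli
one = lambda lo, hi: 1.0          # all cohesions equal to one
beta = [1.0, 0.5, 1.0]            # inverse binomial coefficients for L = 3
print(predict(n, beta, one, 1, 0), effective_count(n, beta, one, 1))
```

Two limiting cases make the effective count easy to interpret: if only the one-bin model has prior weight, every stimulus has the total number of measurements as its effective count, and if only the model with L singleton bins has prior weight, the effective count equals the raw count n_x.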

3.6. Prior Selection

For the multibin Pólya mixture process we need to select parameters (α, β, γ). With β_m we specify the importance of models consisting of exactly m bins. For instance, if we want to fall back to a simpler problem with exactly l < L bins we can set β_l = 1 and β_j = 0 for j ≠ l. A reasonable choice is often given by

β_{m} = {(\begin{array}{l} L - 1 \\ m - 1 \end{array})}^{- 1}, for m = 1, 2, \dots, L .

The binomial coefficient gives the number of multibin models that consist of m bins. Hence, if all γ_b = 1, b ∈ C(X), then the above prior assigns a uniform distribution to P [M = m], the a priori probability of multibins with m bins (for a general constant γ_b = c, the product of the cohesions contributes a factor c^m to the weight of the m-bin models). In any case, there are only L values that we have to specify for β. However, for $α_{b, Y}$ and γ_b we need to select values for each b ∈ C(X). If we are dealing with many stimuli, the assignment of parameters can become a very extensive task. A first and very convenient choice is to set all $α_{b, Y}$ and γ_b to one, which means that we use an uninformative prior for the pseudo counts and we do not have any preference for specific bins. A more elaborate and natural way is to compute $α_{b, Y}$ in a hierarchical manner, i.e. if we require that

α_{b, Y} = \frac{1}{∣ b ∣} \sum_{x \in b} α_{x, Y},

then we only need to specify $α_{x, Y}$ for each x ∈ X.
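The effect of these prior choices can be verified with a few lines of arithmetic. The sketch below computes the induced prior over the number of bins for constant cohesions c and the averaged pseudo counts of a bin; the function names and the list-based layout of the pseudo counts are hypothetical. With c = 1 the prior over the number of bins comes out uniform, whereas any other constant leaves a factor c^m in the weight of the m-bin models.

```python
from math import comb

def beta_weights(L):
    """beta_m = 1 / binom(L - 1, m - 1) for m = 1, ..., L."""
    return [1.0 / comb(L - 1, m - 1) for m in range(1, L + 1)]

def prior_bin_number(L, beta, c=1.0):
    """P[M = m] when every cohesion equals c.

    There are binom(L - 1, m - 1) multibins with m bins, and each of them has
    the unnormalised prior weight beta_m * c**m.
    """
    w = [beta[m - 1] * c**m * comb(L - 1, m - 1) for m in range(1, L + 1)]
    total = sum(w)
    return [v / total for v in w]

def pooled_alpha(alpha_x, b):
    """alpha_{b,Y}: per-stimulus pseudo counts averaged over the stimuli in bin b."""
    K = len(alpha_x[0])
    return [sum(alpha_x[x][y] for x in b) / len(b) for y in range(K)]

print(prior_bin_number(5, beta_weights(5)))          # uniform over m = 1, ..., 5
print(prior_bin_number(3, beta_weights(3), c=2.0))   # proportional to 2, 4, 8
print(pooled_alpha([[100.0, 1.0], [1.0, 1.0]], (0, 1)))
```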

3.7. Placement Rules for Adaptive Sampling

We discussed in section 2.1 a utility based approach concerning the choice of a suitable action policy for choosing the placement rule $π_{n}^{X}$ . We shall assume that our objective is to become most informed about the stimulus-response relation modeled by the multibin Pólya mixture process. We present here several proposals for how this might be achieved.

As already mentioned earlier, the most trivial sampling scheme is to distribute measurements uniformly, where the respective utility is

U_{n + 1} (x, E_{n}) = - n_{x},

(17)

which guarantees at least some homogeneity in the data, but does not take any properties of the underlying model into account. A more sensible utility is given by

U_{n + 1} (x, E_{n}) = - {\bar{n}}_{x},

where n̄_x is the effective count. The resulting placement rule essentially follows the logic of random uniform sampling, but takes the sharing of strength between neighboring stimuli into account. The attractive feature of this adaptive scheme is of course its conceptual and computational simplicity.

More elaborate adaptive strategies can be obtained by considering utilities based on information-theoretical quantities. The idea of using information theoretic utility measures for experimental designs was probably first considered by Cronbach (1953). A more detailed study of their application in experimental designs was later given by Lindley (1956), whereas explications of Lindley’s ideas within the context of Bayesian experimental design can be found in Bernardo (1979); Chaloner and Verdinelli (1995), but see also the application for optimizing sequential experimental designs in DeGroot (1962).

Consider that we want to learn as much as possible about which multibin model B ∈ P(X) describes the data best. Our a priori expectation is given by the probability P [B], whereas our a posteriori assessment is expressed by $P [B ∣ Y_{n_{X}}^{X} = y_{n_{X}}^{X}]$ . The information we would gain about B if we learn that { $Y_{n_{X}}^{X} = y_{n_{X}}^{X}$ } is then quantified by the Kullback-Leibler divergence⁴

D_{KL} (P [B ∣ Y_{n_{X}}^{X} = y_{n_{X}}^{X}] | | P [B]) = - \sum_{B \in P (X)} P [B ∣ Y_{n_{X}}^{X} = y_{n_{X}}^{X}] log (\frac{P [B]}{P [B ∣ Y_{n_{X}}^{X} = y_{n_{X}}^{X}]}),

which measures the deviation between both measures. Hence, if we want to learn as much as possible about B then it seems reasonable to adopt the Kullback-Leibler divergence as a global utility measure. This, however, leads to almost intractable computations, as mentioned already in section 2.1. Nonetheless, it motivates the following myopic adaptive sampling strategy with local utility measure

u_{n + 1}^{MB} (x, y, E_{n}) = D_{KL} (P [B ∣ Y_{n_{X}}^{X}, Y_{n_{x} + 1}^{x} = y] | | P [B ∣ Y_{n_{X}}^{X}]),

i.e. we try to improve the incremental information in every sampling step. This divergence can be computed with

u_{n + 1}^{MB} (x, y, E_{n}) = \frac{d (α, β, \tilde{γ}, n_{X, Y})}{d (α, β, γ, n_{X, Y})},

where

{\tilde{γ}}_{b} = {\begin{array}{l} γ_{b} ln (\frac{n_{b, y} + α_{b, y}}{n_{b} + α_{b}} \frac{d (α, β, γ, n_{X, Y})}{d (α, β, γ, n_{X, Y}^{+ (x, y)})}) & if x \in b \\ γ_{b}, & otherwise . \end{array}

The respective expected local utility is

U_{n + 1}^{MB} (x, E_{n}) = \sum_{y \in Y} u_{n + 1}^{MB} (x, y, E_{n}) π_{n}^{Y} (Y_{n + 1} = y ∣ X_{n + 1} = x, E_{n}),

such that the optimal stimulus is determined by

x_{n + 1}^{*} (E_{n}) = \underset{x \in X}{arg max} U_{n + 1}^{MB} (x, E_{n}) .

It is noteworthy that the expected gain can be rewritten in terms of a mutual information⁵, namely

U_{n + 1}^{MB} (x, E_{n}) = MI (B : Y_{n_{x} + 1}^{x} ∣ Y_{n_{X}}^{X}),

such that by maximizing the expected local utility we do in fact maximize the mutual information between B and $Y_{n_{x} + 1}^{x}$ conditional on $Y_{n_{X}}^{X}$ . The so obtained adaptive sampling scheme, shortly denoted u^MB, is fully equivalent to the Bayesian framework of adaptive design optimization (ADO) for model discrimination (cf. Cavagnaro et al. (2010)).
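The myopic strategy for model discrimination can be illustrated by brute force on a small stimulus space. The sketch below computes the expected local utility as the predictive average of the Kullback-Leibler divergences between successive posteriors over multibins. It assumes a uniform prior P [B], identical pseudo counts a, and direct enumeration of all multibins instead of ProMBS; the function names are hypothetical.

```python
import math
from itertools import combinations

def log_beta(a):
    """Log of the multivariate Beta function."""
    return sum(math.lgamma(v) for v in a) - math.lgamma(sum(a))

def multibins(L):
    """All proximal multibins of {0, ..., L-1} as lists of half-open ranges (lo, hi)."""
    for m in range(1, L + 1):
        for cuts in combinations(range(1, L), m - 1):
            bounds = [0, *cuts, L]
            yield list(zip(bounds, bounds[1:]))

def pooled(counts, lo, hi):
    """Response counts summed over the stimuli lo, ..., hi - 1."""
    return [sum(counts[x][y] for x in range(lo, hi)) for y in range(len(counts[0]))]

def posterior(counts, a=1.0):
    """P[B | data] for a uniform prior over multibins (an assumption of this sketch)."""
    Bs = list(multibins(len(counts)))
    K = len(counts[0])
    w = [math.prod(math.exp(log_beta([n + a for n in pooled(counts, lo, hi)])
                            - log_beta([a] * K)) for lo, hi in B) for B in Bs]
    Z = sum(w)
    return Bs, [v / Z for v in w]

def expected_utility_mb(counts, x, a=1.0):
    """Expected KL gain about B from one further measurement at stimulus x."""
    Bs, post = posterior(counts, a)
    K = len(counts[0])
    U = 0.0
    for y in range(K):
        lik = []
        for B in Bs:
            lo, hi = next(r for r in B if r[0] <= x < r[1])
            n_b = pooled(counts, lo, hi)
            lik.append((n_b[y] + a) / (sum(n_b) + K * a))
        p_y = sum(p * l for p, l in zip(post, lik))        # mixture prediction
        post_y = [p * l / p_y for p, l in zip(post, lik)]  # posterior after seeing y
        kl = sum(q * math.log(q / p) for q, p in zip(post_y, post))
        U += p_y * kl
    return U

# Hypothetical state: five stimuli, one success observed at the middle stimulus.
counts = [[0, 0], [0, 0], [1, 0], [0, 0], [0, 0]]
utilities = [expected_utility_mb(counts, x) for x in range(5)]
best = max(range(5), key=utilities.__getitem__)
print(utilities, best)
```

In this toy setting the utility vanishes at the stimulus that was already measured, because every multibin makes the same prediction there, and it is largest at its direct neighbors, where the predictions of the competing multibins differ most.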

The very same line of reasoning applies to the case where we do have strong evidence for a particular multibin model, say B ∈ P(X), but want to optimally learn the psychometric rates p. The respective local utility measures are given by

u_{n + 1}^{Ψ ∣ B} (x, y, E_{n}) = D_{KL} (μ [\cdot ∣ Y_{n_{X}}^{X}, Y_{n_{x} + 1}^{x} = y, B] | | μ [\cdot ∣ Y_{n_{X}}^{X}, B]),

where $μ [\cdot ∣ Y_{n_{X}}^{X}, B]$ and $μ [\cdot ∣ Y_{n_{X}}^{X}, Y_{n_{x} + 1}^{x} = y, B]$ are the respective successive posterior measures for the psychometric rates p conditional on B. These utility measures can be computed with

u_{n + 1}^{Ψ ∣ B} (x, y, E_{n}) = \frac{d (α, β, \tilde{γ}, n_{X, Y})}{d (α, β, γ, n_{X, Y})},

where

{\tilde{γ}}_{b} = {\begin{array}{l} γ_{b} (- ln (\frac{n_{b, y} + α_{b, y}}{n_{b} + α_{b}}) + ψ (n_{b, y} + α_{b, y} + 1) - ψ (n_{b} + α_{b} + 1)) & if x \in b \\ γ_{b} & otherwise \end{array},

and ψ is the digamma function. We expect the resulting adaptive sampling scheme, denoted u^{Ψ|B}, to optimize the inferential task for the psychometric rate p given that B is the ‘true’ underlying model.
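The bracketed term in the modified cohesion is the Kullback-Leibler divergence between two Dirichlet distributions whose parameters differ by a single count. The following sketch compares it with the standard closed form for the divergence between two arbitrary Dirichlet distributions. It reads both digamma arguments as counts for the observed response y and the bin total, writes a_y for n_{b,y} + α_{b,y}, and uses a small hand-written digamma approximation because the Python standard library does not provide one; these are assumptions of the illustration, not part of the paper.

```python
import math

def digamma(x):
    """Digamma for x > 0: recurrence psi(x) = psi(x + 1) - 1/x, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series

def kl_dirichlet(a_new, a_old):
    """KL(Dir(a_new) || Dir(a_old)) from the standard closed form."""
    s_new, s_old = sum(a_new), sum(a_old)
    value = math.lgamma(s_new) - math.lgamma(s_old)
    for p, q in zip(a_new, a_old):
        value += math.lgamma(q) - math.lgamma(p)
        value += (p - q) * (digamma(p) - digamma(s_new))
    return value

def kl_one_step(a, y):
    """Information gained about the rates of one bin when a single response y is added."""
    a_y, a_0 = a[y], sum(a)
    return -math.log(a_y / a_0) + digamma(a_y + 1.0) - digamma(a_0 + 1.0)

a = [3.5, 2.0]                 # hypothetical posterior parameters of one bin
a_plus = [4.5, 2.0]            # the same bin after one more response y = 0
print(kl_one_step(a, 0), kl_dirichlet(a_plus, a))
```

For a uniform prior with two responses, a = (1, 1), the one-step gain evaluates to ln 2 − 1/2, which gives a feeling for the scale of these utilities.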

In practice, we usually have only vague knowledge about the underlying multibin model. Hence, it seems reasonable to consider the local utility measures

u_{n + 1}^{Total} (x, y, E_{n}) = D_{KL} (μ [\cdot ∣ Y_{n_{X}}^{X}, Y_{n_{x} + 1}^{x} = y] | | μ [\cdot ∣ Y_{n_{X}}^{X}]),

where $μ [\cdot ∣ Y_{n_{X}}^{X}]$ and $μ [\cdot ∣ Y_{n_{X}}^{X}, Y_{n_{x} + 1}^{x} = y]$ are the respective successive posterior measures for the psychometric rates p. This measure decomposes as

u_{n + 1}^{Total} (x, y, E_{n}) = u_{n + 1}^{MB} (x, y, E_{n}) + u_{n + 1}^{Ψ} (x, y, E_{n}),

where

u_{n + 1}^{Ψ} (x, y, E_{n}) : = \sum_{B \in P (X)} P [B ∣ Y_{n_{X}}^{X}, Y_{n_{x} + 1}^{x} = y] u_{n + 1}^{Ψ ∣ B} (x, y, E_{n})

is the model-averaged local utility for optimizing the inference of the psychometric rates given a multibin model. Hence, if we want to learn about p given that B is uncertain, then we also have to make inference about B. That is, the uncertainty about B also influences our uncertainty about p. We thus expect this strategy, called u^Total, to optimize the inference of B and p. It might also be worthwhile to base the sampling process solely on u^Ψ because it allows us to optimize the inference of the psychometric rates regardless of the underlying multibin model.

The actual effects of the proposed adaptive sampling strategies are difficult to describe and we shall proceed by discussing a simple example. In general, it can be said that there is no such thing as a universal criterion for optimal adaptive sampling as long as we do not clearly formulate what we want to achieve. Whether or not a chosen strategy is appropriate for the experiment at hand must be carefully assessed from case to case, for example with simulation studies or experimental pre-studies.

4. A Practical Demonstration

In order to demonstrate our framework we consider a hypothetical 2AFC experiment with 35 stimuli X = {x₁, x₂, …, x₃₅} and a dichotomous outcome Y = {s, f}. The stimulus-response relation (ground truth) is given by an asymmetric U-shaped function with a small irregularity on the right side (see Figure 2). We assume that we have no a priori knowledge about the curve, except that proximal stimuli are likely to cause similar responses. In a naïve approach we would simply distribute N measurements uniformly at random over X, see (17), and infer the psychometric rates for each x ∈ X independently. Figure 3 shows the result of such an experiment with 200 and 400 samples, where we chose pseudo counts $α_{x, Y}$ = (1, 1), i.e. $α_{x, y}$ = 1 for y ∈ Y. The variance of the estimate is large because there are only very few samples for each stimulus x. The estimate of the stimulus-response relation is therefore a highly irregular curve, and many measurements are taken at regions that are quite uninformative. This motivates two advantages of our proposed framework: First, we can use samples from neighboring stimuli to share statistical strength. Second, we can distribute the samples such that we maximize the information that we gain with each measurement.

Hypothetical stimulus-response curve used as ground truth for the practical demonstration. The curve shows the probability of the response s (success) for a given stimulus x.

A hypothetical experiment with (a) 200 and (b) 400 samples uniformly distributed over the stimulus space X. The stimulus-response relation (ground truth), shown as dashed line, is inferred without taking information from proximal x into account. The thick continuous line shows the first moment $E [p_{x, s} ∣ Y_{n_{X}}^{X}]$ of the stimulus-response function given all outcomes of previous measurements. The standard deviation is shown as a thin continuous line around the expectation. The marginal posterior density $f (p_{x, s} ∣ Y_{n_{X}}^{X})$ is plotted as shadings in the back of the figure. The number of measurements at each x ∈ X is shown as a bar plot in the lower plot with utility U (x) (thin continuous line), which is the negative number of counts, see (17).

With our framework we can optimize the experimental process. We choose the following parameters:

α_{x, Y} = (1, 1), β_{m} = {(\begin{array}{l} L - 1 \\ m - 1 \end{array})}^{- 1}, γ_{b} = 1 for b \in C (X) and m = 1, 2, \dots, L

With $α_{x, Y}$ = (1, 1) we assign equal pseudo counts to all stimuli and responses, since we have no a priori knowledge about the shape of the curve. The particular choice of β_m and γ_b assigns a uniform distribution to P [M = m], which is the a priori probability of multibins with m bins (see section 3.6). Finally, we set all γ_b = 1 since all bins should receive an equal weight.

With this prior setting we observe a smoothing of the inferred stimulus-response curve (see Figure 4). However, in contrast to common kernel smoothing approaches, our method does not smear sharp transitions but represents a higher degree of uncertainty whenever necessary.

A hypothetical experiment with (a) 200 and (b) 400 samples uniformly distributed. The ground truth is inferred by taking information from neighboring x into account. This sharing of statistical strength allows a much more accurate inference of the stimulus-response function as compared to Figure 2.

In order to distribute samples more efficiently, we can utilize the adaptive sampling strategies described in section 3.7. Unfortunately, it is difficult to compare the performance of different strategies since there is no general criterion for optimality. A good intuition can however be gained from the distribution of measurements.

Figures 5, 6, 7, and 8 show the time course of the experiment for the adaptive sampling schemes u^Ψ, u^MB, u^Total, and the strategy based on the effective count n̄_x. All simulations were initialized with the same random seeds for each stimulus x ∈ X, such that results can be better compared. The uninformative parameter setting led to a uniformly distributed expected local utility for all four strategies before the first experiment. Therefore, the first measurement was always taken at x = 20.

Adaptive sampling in a hypothetical experiment with scheme u^Ψ. The figure shows the experiment after 1, 2, 10, 50, 100, and 200 samples. At first, samples are uniformly distributed. In (a) only one measurement was taken and the expected utility is largest at the left boundary. (d) shows the experiment after 50 samples where the algorithm starts to locate measurements at sloped regions. The general shape of the stimulus-response function is already well established after 100 samples (e). Many measurements are also taken at x = 1 and x = 35 since those x have only one neighbor.

Adaptive sampling in a hypothetical experiment with scheme u^MB. The figure shows the experiment after 1, 2, 10, 50, 100, and 200 samples. After the first sample (a) two maxima of the expected utility arise next to the measurement. The first 10 samples (c) are allocated near the initial measurement, whereas afterwards the algorithm starts to move further right (d) until almost the full stimulus space has been explored (e–f).

Adaptive sampling in a hypothetical experiment with scheme u^Total. The figure shows the experiment after 1, 2, 10, 50, 100, and 200 samples. The sampling behavior clearly shows a mixture of both schemes u^MB and u^Ψ.

Adaptive sampling in a hypothetical experiment with the strategy based on the effective count n̄_x. The figure shows the experiment after 1, 2, 10, 50, 100, and 200 samples. In (a) no measurements are taken and we can see our prior expectation. The respective utility is uniformly distributed on X. The general shape of the stimulus-response function is already well established after 100 samples. Measurements are mostly allocated at regions where the slope of the stimulus-response function is high. Many measurements are also taken at x = 1 and x = 35 since those x have only one neighbor.

Already after the first sample one can observe a striking difference between u^Ψ and u^MB. Whereas the expected utility for u^Ψ is maximal at the very left stimulus x = 1, the expected utility for u^MB shows two maxima directly next to the previous measurement at x = 20. This reveals the very distinct properties of both schemes, which are better seen after 10 and 50 samples. u^Ψ causes a uniform distribution of the first samples on X. On the other hand, the scheme u^MB places all measurements around the initial sample and only gradually moves further away from x = 20. The U-shape of the stimulus-response function is already very well established after 100 samples and after 200 samples one can see that both measures place most samples at positions where the psychometric function is highly sloped. This behavior is expected since those regions allow less sharing of statistical strength. For the scheme u^Total we observe the same initial behavior as for u^Ψ, but the count statistic after 200 samples shows characteristics of both u^Ψ and u^MB, which is expected from its definition. It is however quite noteworthy that the effective counts show a behavior very similar to that of u^Ψ. Figure 9 summarizes the differences between the four adaptive sampling strategies by showing the experiment after 200 samples.

Direct comparison of the four adaptive sampling strategies: (a) u^Ψ, (b) u^MB, (c) u^Total, and (d) effective counts. A relatively balanced distribution of measurements is observed with u^Ψ and the effective counts. On the other hand, u^MB and therefore also u^Total lead to a more peaked distribution.

Note also that many measurements are taken at the boundaries. This is because these stimuli have only one neighbor, which limits the extent to which statistical strength can be shared. We now assume that we have prior knowledge about the ground truth, i.e. we have a strong belief about its value at x₁ and x₃₅. We incorporate this information into our model and thereby optimize the sampling process further. That is, we set $α_{1, Y} = α_{35, Y} = (100, 1)$ . This parameter setting alters our prior expectation about the stimulus-response relation, i.e. $p_{x, s}$ is expected to be close to one at x₁ and x₃₅, see Figure 10(a). The expected utility to sample at these points is substantially reduced, since we have a strong belief about the response. Figure 10(b) shows the experiment after 200 samples. One can see that all measurements that were previously allocated at the boundaries are now distributed elsewhere. Further optimizations of the sampling process are possible if more prior knowledge is available.

Demonstration of using prior knowledge to limit sampling at the boundaries with scheme u^Ψ. (a) shows the prior belief before any measurements are made and (b) shows the posterior after 200 samples. The prior setting leads to a strong belief about the response at the boundaries which substantially reduces the expected utility at x₁ and x₃₅. The result is that all measurements that were previously allocated at the boundaries are now placed elsewhere.

5. Conclusion

We have introduced a framework for adaptive sequential sampling which helps to optimize the measurement process in a wide range of psychophysical experiments, especially when there is only vague prior knowledge about the relation X → Y. Our framework consists of two major components. The first is a response process that we use to make predictions based on a finite number of observations. We termed it the multibin Pólya mixture process as it consists of a mixture of binned Pólya urns. On top of the response process we defined various adaptive sampling processes, which are equipped with utility measures to actively guide the course of an experiment. We also demonstrated the effect of several sampling strategies on a hypothetical experiment and how prior knowledge can be used to further optimize the allocation of measurements. During experiments it is of great importance that decisions for the next stimulus are computed fast. Although our model is computationally demanding, we provide an algorithm that makes it applicable in typical psychophysics experiments.

Highlight.

We present a predictive account on adaptive sampling in psychophysical experiments.
Our method applies to situations where there is only weak knowledge available.
We demonstrate the advantages of our framework on a hypothetical sampling problem.

Acknowledgments

All three authors were supported by the Max Planck Society. T. E. has been supported by NIH grant R01 EY018664. We thank Claudia Freigang, Pierre-Yves Bourguignon, and Wiktor Mlynarski for most helpful suggestions. We would also like to thank the anonymous reviewers whose suggestions have led to a considerable improvement of the manuscript.

Appendix

In Yao (1984); Barry and Hartigan (1992); Endres and Földiák (2005); Fernhead (2006); Hutter (2007) various algorithms are presented, which in their given contexts all address the very same computational problem. Within our terminology of proximal multibins the general problem can be described as follows:

Let f: C(X) → ℝ be any function from consecutive proximal bins to real numbers and g = (g₁, …, g_L) ∈ ℝ^L be any real valued vector. Define the sum-products

S_{m} [f] : = \sum_{B \in P_{m} (X)} \prod_{b \in B} f (b), m = 1, \dots, L,

and

S [f, g] : = \sum_{B \in P (X)} g_{∣ B ∣} \prod_{b \in B} f (b) .

Each S_m[f] consists of $(\begin{array}{l} L - 1 \\ m - 1 \end{array})$ terms, whereas the weighted total sum-product S[f, g] consists of 2^{L−1} terms. For large L a computationally intractable effort of O(2^{L−1}) is expected. However, the following lemma provides a simple and efficient algorithm, which computes the sum-product in O(L³).

Lemma 1 (The Proximal Multibin Summation (ProMBS) Algorithm)

Define the upper triangular matrices $A_{l} = {(a_{i, j}^{l})}_{L \times L}$ , l = 1, …, L, recursively by

a_{i, j}^{1} = {\begin{array}{l} f (b_{i, j}) & if i \leq j; \\ 0 & otherwise; \end{array} and a_{i, j}^{l} = {\begin{array}{l} a_{i - 1, j - 1}^{l - 1} & i, j > 1; \\ 1 & i = j = 1; \\ 0 & else; \end{array}

and define recursively the matrix-vector product

v^{l + 1} = {(v_{1}^{l + 1}, \dots, v_{L}^{l + 1})}^{T} = A_{l} \times v^{l}, l = 1, \dots, L,

where the initial vector is v¹ = (0, …, 0, 1)^T. For every m, l ∈ ℕ with m ≤ l ≤ L it holds that

v_{m}^{l + 1} = S_{m} [f],

such that

S [f, g] = \sum_{m = 1}^{L} g_{m} S_{m} [f] .

A simple example might help to illustrate the abstract formulation of the ProMBS algorithm. Consider the case with |X| = 3 and let us abbreviate $f (b_{i, j})$ simply by $f_{ij}$ . In the first step we compute

v^{2} = A_{1} v^{1} = (\begin{matrix} f_{11} & f_{12} & f_{13} \\ 0 & f_{22} & f_{23} \\ 0 & 0 & f_{33} \end{matrix}) (\begin{matrix} 0 \\ 0 \\ 1 \end{matrix}) = (\begin{matrix} f_{13} \\ f_{23} \\ f_{33} \end{matrix}) .

Note that $v_{1}^{2} = f_{13} = S_{1} [f]$ . In the next step we compute

v^{3} = A_{2} v^{2} = (\begin{matrix} 1 & 0 & 0 \\ 0 & f_{11} & f_{12} \\ 0 & 0 & f_{22} \end{matrix}) (\begin{matrix} f_{13} \\ f_{23} \\ f_{33} \end{matrix}) = (\begin{matrix} f_{13} \\ f_{11} f_{23} + f_{12} f_{33} \\ f_{22} f_{33} \end{matrix}),

where $v_{1}^{3} = f_{13} = S_{1} [f]$ and $v_{2}^{3} = f_{11} f_{23} + f_{12} f_{33} = S_{2} [f]$ . In the last step we find that

v^{4} = A_{3} v^{3} = (\begin{matrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & f_{11} \end{matrix}) (\begin{matrix} f_{13} \\ f_{11} f_{23} + f_{12} f_{33} \\ f_{22} f_{33} \end{matrix}) = (\begin{matrix} f_{13} \\ f_{11} f_{23} + f_{12} f_{33} \\ f_{11} f_{22} f_{33} \end{matrix}),

such that (v⁴)^T = (S₁[f ], S₂[f ], S₃[f ]) and S[f, g] = g₁S₁[f ]+ g₂S₂[f ]+ g₃S₃[f ].
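
The recursion of Lemma 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `prombs_sums`, `brute_force_sums`, and `weighted_sum` are hypothetical, and `f(i, j)` is assumed to return the value of the bin $b_{i,j}$ covering stimuli $i, \dots, j$ (1-based). The brute-force routine enumerates all partitions into consecutive bins and is included only as a cross-check for small $L$.

```python
from itertools import combinations
from math import isclose, prod


def prombs_sums(f, L):
    """Return [S_1[f], ..., S_L[f]] via the recursion of Lemma 1.

    f(i, j) is assumed to give the value of the bin b_{i,j}
    covering stimuli i..j (1-based, i <= j).
    """
    # A_1: upper triangular matrix with entries f(b_{i,j}).
    A = [[f(i + 1, j + 1) if i <= j else 0.0 for j in range(L)]
         for i in range(L)]
    v = [0.0] * (L - 1) + [1.0]  # v^1 = (0, ..., 0, 1)^T
    for _ in range(L):
        # v^{l+1} = A_l v^l  (O(L^2) per step, O(L^3) in total)
        v = [sum(A[i][j] * v[j] for j in range(L)) for i in range(L)]
        # A_{l+1}: shift A_l one step down and right, put 1 at top-left.
        A = [[1.0 if i == j == 0
              else A[i - 1][j - 1] if i > 0 and j > 0
              else 0.0
              for j in range(L)] for i in range(L)]
    return v


def brute_force_sums(f, L):
    """Same sums by explicit enumeration (exponential; for checking only)."""
    sums = [0.0] * L
    for m in range(1, L + 1):
        # Choose m - 1 cut positions between the L stimuli.
        for cuts in combinations(range(1, L), m - 1):
            bounds = [0, *cuts, L]
            sums[m - 1] += prod(f(lo + 1, hi)
                                for lo, hi in zip(bounds, bounds[1:]))
    return sums


def weighted_sum(f, g, L):
    """S[f, g] = sum_m g_m * S_m[f]."""
    return sum(g_m * s_m for g_m, s_m in zip(g, prombs_sums(f, L)))
```

With $f \equiv 1$ each $S_m[f]$ simply counts the partitions into $m$ consecutive bins, so the result should reproduce the binomial coefficients $\binom{L-1}{m-1}$, which gives a quick sanity check.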

Of course, the real computational power of the ProMBS algorithm becomes evident only when the set $X$ is large, but the case $|X| = 3$ suffices to show its general logic. Finally, we should mention that numerical imprecision can occur if the values of f become small. In such cases it is advisable to implement the ProMBS algorithm on a logarithmic number scale.
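
One possible way to carry out the recursion on a logarithmic scale is sketched below. It is an assumption-laden illustration rather than the authors' code: it presumes $f > 0$, takes a hypothetical function `log_f(i, j)` returning $\ln f(b_{i,j})$, and replaces every sum of products by a log-sum-exp of sums. The names `logsumexp` and `prombs_log_sums` are chosen here for illustration only.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    if m == -math.inf:  # every term is log(0)
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))


def prombs_log_sums(log_f, L):
    """Return [log S_1[f], ..., log S_L[f]], assuming f > 0.

    log_f(i, j) is assumed to give log f(b_{i,j}) for the bin
    covering stimuli i..j (1-based, i <= j).
    """
    neg_inf = -math.inf  # log(0) stands in for the zero entries
    log_A = [[log_f(i + 1, j + 1) if i <= j else neg_inf for j in range(L)]
             for i in range(L)]
    log_v = [neg_inf] * (L - 1) + [0.0]  # log of v^1 = (0, ..., 0, 1)^T
    for _ in range(L):
        # log of (A_l v^l): products become sums, sums become logsumexp.
        log_v = [logsumexp([log_A[i][j] + log_v[j] for j in range(L)])
                 for i in range(L)]
        # log of A_{l+1}: shifted copy with log(1) = 0 at the top-left.
        log_A = [[0.0 if i == j == 0
                  else log_A[i - 1][j - 1] if i > 0 and j > 0
                  else neg_inf
                  for j in range(L)] for i in range(L)]
    return log_v
```

For instance, with all bin values equal to `1e-200` a direct product of two or more factors underflows to zero in double precision, whereas the log-scale version still returns finite values such as $\ln 3 + 2 \ln 10^{-200}$ for $S_2[f]$ with $L = 4$.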

Footnotes

¹ There is much ambiguity in the usage of the term nonparametric, as different fields of statistics assign different meanings to it. Here, we simply mean that no constraint in the form of a parametric family is imposed that would restrict the topological support of the psychometric rates p in $\Delta_X^Y$.

² The multinomial sampling laws described here apply to the categorical process Y and are not to be confused with the related multinomial distribution for the respective count statistic n.

³ The illustrative metaphor of a bin is taken from Endres and Földiák (2005) and Endres et al. (2008); the terms cluster, block, or component might serve equally well.

⁴ The Kullback-Leibler divergence is formally defined as follows: If μ, ν are two measures on a measurable space over S, such that ν is absolutely continuous w.r.t. μ, then $D_{KL}(\mu \,\|\, \nu) = -\int_S \ln\left(\frac{d\nu}{d\mu}(s)\right) \mu(ds)$ is the Kullback-Leibler divergence from μ to ν, where $\frac{d\nu}{d\mu}$ is the respective Radon-Nikodym derivative of ν w.r.t. μ.

⁵ The mutual information between two variables X and Y conditional on a third variable Z is formally defined as follows: If $\mu_{X,Y \mid Z}$ is the joint measure of X, Y conditional on Z and $\mu_{X \mid Z}$, $\mu_{Y \mid Z}$ are the respective conditional marginal measures, such that $\mu_{X \times Y \mid Z}$ is the product measure constructed from both marginal measures, then the mutual information between X and Y conditional on Z is defined as $MI(X : Y \mid Z) := D_{KL}(\mu_{X,Y \mid Z} \,\|\, \mu_{X \times Y \mid Z})$, i.e. the Kullback-Leibler divergence from the conditional product measure to the conditional joint measure.

Contributor Information

Stephan Poppe, Email: stephan.poppe@mis.mpg.de.

Philipp Benner, Email: philipp.benner@mis.mpg.de.

Tobias Elze, Email: tobias.elze@schepens.harvard.edu.

References

Barry D, Hartigan JA. Product partition models for change point problems. The Annals of Statistics. 1992;20 (1):260–279.
Berger JO. Statistical decision theory and Bayesian analysis. 2nd ed. Springer-Verlag; New York: 1993.
Bernardo JM. Expected information as expected utility. The Annals of Statistics. 1979;7 (3):686–690.
Bernardo JM, Smith AFM. Bayesian theory. Wiley; Chichester: 1995.
Cavagnaro DR, Myung JI, Pitt MA, Kujala JV. Adaptive design optimization: A mutual information-based approach to model discrimination in cognitive science. Neural Computation. 2010;22:887–905. doi: 10.1162/neco.2009.02-09-959.
Chaloner K, Verdinelli I. Bayesian experimental design: A review. Statistical Science. 1995;10 (3):273–304.
Cronbach LJ. Tech Rep. Vol. 1. Illinois University; Urbana: Bureau of Research and Service; 1953. A consideration of information theory and utility theory as tools for psychometric problems.
de Finetti B. La prévision: ses lois logiques, ses sources subjectives. Ann Inst Poincare. 1937;7 (2):1–68.
de Finetti B. Theory of probability: A critical introductory treatment. Vol. 1. Wiley; London, New York: 1974.
de Finetti B. On the condition of partial exchangeability. In: Jeffrey RC, editor. Studies in inductive logic and probability. University of California Press; Berkeley: 1980. pp. 193–205.
DeGroot MH. Uncertainty, information, and sequential experiments. The Annals of Mathematical Statistics. 1962;33 (2):404–419.
DeGroot MH. Optimal statistical decisions. Wiley Classics Library. Wiley-Interscience; Hoboken, NJ: 2004.
Eggenberger F, Pólya G. Über die Statistik verketteter Vorgänge. ZAMM - Zeitschrift für Angewandte Mathematik und Mechanik. 1923;3 (4):279–289.
Elze T, Song C, Stollhoff R, Jost J. Chinese characters reveal impacts of prior experience on very early stages of perception. BMC Neuroscience. 2011;12:14. doi: 10.1186/1471-2202-12-14.
Endres D, Földiák P. Bayesian bin distribution inference and mutual information. IEEE Transactions on Information Theory. 2005;51 (11):3766–3779.
Endres D, Oram M, Schindelin J, Földiák P. Bayesian binning beats approximate alternatives: estimating peri-stimulus time histograms. In: Platt J, Koller D, Singer Y, Roweis S, editors. Advances in Neural Information Processing Systems 20. MIT Press; Cambridge, MA: 2008. pp. 393–400.
Fearnhead P. Exact and efficient Bayesian inference for multiple change-point problems. Statistics and Computing. 2006;16 (2):203–213.
Geisser S. Predictive inference: An introduction. Chapman & Hall; London: 1993.
Hartigan JA. Partition models. Communications in Statistics. Theory and Methods. 1990;19 (8):2745–2756.
Hutter M. Exact Bayesian regression of piecewise constant functions. Bayesian Analysis. 2007;2 (4):635–664.
Kontsevich LL, Tyler CW. Bayesian adaptive estimation of psychometric slope and threshold. Vision Research. 1999;39 (16):2729–2737. doi: 10.1016/s0042-6989(98)00285-5.
Kujala JV. Obtaining the best value for money in adaptive sequential estimation. Journal of Mathematical Psychology. 2010;54:475–480.
Kujala JV, Lukka TJ. Bayesian adaptive estimation: The next dimension. Journal of Mathematical Psychology. 2006;50:369–389.
Kuss M, Jäkel F, Wichmann FA. Bayesian inference for psychometric functions. Journal of Vision. 2005;5 (5):8. doi: 10.1167/5.5.8.
Lauritzen SL. Tech Rep. Vol. 18. Stanford University, Department of Statistics; 1974. On the interrelationships among sufficiency, total sufficiency and some related concepts.
Leek M. Adaptive procedures in psychophysical research. Attention, Perception, & Psychophysics. 2001;63:1279–1292. doi: 10.3758/bf03194543.
Levitt H. Transformed up-down methods in psychoacoustics. The Journal of the Acoustical Society of America. 1971;49 (2B):467–477.
Lindley DV. On a measure of the information provided by an experiment. The Annals of Mathematical Statistics. 1956;27 (4):986–1005.
Link G. Representation theorems of the de Finetti type for (partially) symmetric probability measures. In: Jeffrey RC, editor. Studies in inductive logic and probability. University of California Press; Berkeley: 1980. pp. 207–231.
MacKay DJC. Information-based objective functions for active data selection. Neural Computation. 1992;4:590–604.
McKee S, Klein S, Teller D. Statistical properties of forced-choice psychometric functions: Implications of probit analysis. Attention, Perception, & Psychophysics. 1985;37 (4):286–298. doi: 10.3758/bf03211350.
Müller P, Berry DA, Grieve AP, Smith M, Krams M. Simulation-based sequential Bayesian design: Special issue: Bayesian inference for stochastic processes. Journal of Statistical Planning and Inference. 2007;137 (10):3140–3150.
Roberts HV. Probabilistic prediction. Journal of the American Statistical Association. 1965;60 (306):50–62.
Taylor MM, Creelman CD. PEST: Efficient estimates on probability functions. The Journal of the Acoustical Society of America. 1967;41 (4A):782–787.
Watson A, Fitzhugh A. The method of constant stimuli is inefficient. Attention, Perception, & Psychophysics. 1990;47 (1):87–91. doi: 10.3758/bf03208169.
Watson A, Pelli D. QUEST: A Bayesian adaptive psychometric method. Attention, Perception, & Psychophysics. 1983;33:113–120. doi: 10.3758/bf03202828.
Wichmann F, Hill N. The psychometric function: I. Fitting, sampling, and goodness of fit. Attention, Perception, & Psychophysics. 2001;63:1293–1313. doi: 10.3758/bf03194544.
Yao YC. Estimation of a noisy discrete-time step function: Bayes and empirical Bayes approaches. The Annals of Statistics. 1984;12 (4):1434–1447.
