Author manuscript; available in PMC: 2024 Mar 8.
Published in final edited form as: Electron J Stat. 2023 Sep 3;17(2):1996–2043. doi: 10.1214/23-ejs2151

Adversarial meta-learning of Gamma-minimax estimators that leverage prior knowledge

Hongxiang Qiu 1, Alex Luedtke 2
PMCID: PMC10923594  NIHMSID: NIHMS1923277  PMID: 38463692

Abstract

Bayes estimators are well known to provide a means to incorporate prior knowledge that can be expressed in terms of a single prior distribution. However, when this knowledge is too vague to express with a single prior, an alternative approach is needed. Gamma-minimax estimators provide such an approach. These estimators minimize the worst-case Bayes risk over a set Γ of prior distributions that are compatible with the available knowledge. Traditionally, Gamma-minimaxity is defined for parametric models. In this work, we define Gamma-minimax estimators for general models and propose adversarial meta-learning algorithms to compute them when the set of prior distributions is constrained by generalized moments. Accompanying convergence guarantees are also provided. We also introduce a neural network class that provides a rich, but finite-dimensional, class of estimators from which a Gamma-minimax estimator can be selected. We illustrate our method in two settings, namely entropy estimation and a prediction problem that arises in biodiversity studies.

Keywords: Gamma-minimax estimation, machine learning

1. Introduction

A variety of principles can be used to guide the search for a suitable statistical estimator. Asymptotic efficiency (Pfanzagl, 1990), minimaxity (Wald, 1945) and Bayes optimality (Berger, 1985) are popular examples of such principles. Defining the performance criteria underlying these principles requires specifying a model space, that is, a collection of possible data-generating mechanisms known to contain the true, underlying distribution.

It is often desirable to incorporate prior information about the true data-generating mechanism into a statistical procedure. This might be done differently in different statistical paradigms. For frequentist methods, such as those based on the asymptotic efficiency or minimax principle, the primary way to incorporate this information is via the definition of the model space. In the Bayesian paradigm, such information may be represented by further specifying a prior distribution (or prior for short) over the model space and aiming for an estimator that minimizes the induced Bayes risk. However, in many cases, there may be several priors that are compatible with the available information, especially in the context of rich model spaces.

The Gamma-minimax paradigm, proposed by Robbins (1951), provides a principled means to overcome the challenge of specifying a single prior distribution. Under this paradigm, a statistician first specifies a set Γ of all priors that are consistent with the available prior information and subsequently seeks an estimator that minimizes the worst-case Bayes risk over this set of priors. The Gamma-minimax paradigm may be viewed as a robust version of the Bayesian paradigm that is less sensitive to misspecification of a prior distribution (Vidakovic, 2000). When it is infeasible to specify a prior due to the complexity of the model space, the Gamma-minimax paradigm may also be viewed as a feasible substitute for the Bayesian paradigm. The Gamma-minimax paradigm is closely related to Bayes and minimax paradigms: when the set of priors consists of a single prior, a Gamma-minimax estimator is Bayes with respect to that prior; when the set Γ of priors is the entire set of possible prior distributions, a Gamma-minimax estimator is also minimax.

Gamma-minimax estimators have been studied for a variety of problems. Some explicit forms of Gamma-minimax estimators have been obtained. For example, Olman and Shmundak (1985) studied Gamma-minimax estimation of the mean of a normal distribution for the set of symmetric and unimodal priors on an interval and obtained an explicit form when this interval is sufficiently small. Eichenauer-Herrmann (1990) generalized this result to more general parametric models and Eichenauer-Herrmann, Ickstadt and Weiß (1994) obtained a further generalization with the requirement of symmetry on the priors dropped. Chen, Eichenauer-Herrmann and Lehn (1988) studied Gamma-minimax estimation for multinomial distributions and the set of priors with bounded mean. Chen et al. (1991) studied Gamma-minimax estimation for one-parameter exponential families and the set of priors that place certain bounds on the first two moments. These results do not deal with general model spaces, such as semiparametric or nonparametric models, and general forms of the set of priors that may not directly impose bounds on prior moments of the parameters of interest. One reason for this lack of generality might be that, in the existing literature, Gamma-minimaxity is defined only for parametric models. However, an issue with parametric models is that they often fail to contain the true data-generating mechanism, in which case output from the aforementioned statistical procedures may no longer be interpretable. Another possible reason is that it is typically intractable to analytically derive Gamma-minimax estimators, even for parametric models.

To overcome this lack of analytical tractability, meta-learning algorithms to compute a minimax or Gamma-minimax estimator have been proposed. Still, most of these works focus on parametric models. For example, Nelson (1966) and Kempthorne (1987) each proposed an algorithm to compute a minimax estimator. Bryan et al. (2007) and Schafer and Stark (2009) proposed an algorithm to compute an approximate confidence region of optimal expected size in the minimax sense. Noubiap and Seidel (2001) proposed an iterative algorithm to compute a Gamma-minimax decision for the set of priors constrained by generalized moment conditions. More recent works explored computing estimators under more general models. For example, Luedtke et al. (2020) introduced an approach, termed Adversarial Monte Carlo meta-learning (AMC), for constructing minimax estimators. In the special case of prediction problems with mean-squared error, Luedtke, Chung and Sofrygin (2020) studied the invariance properties of the decision problem and their implications for AMC.

In this paper, we make the following contributions:

  1. We propose iterative adversarial meta-learning algorithms for constructing Gamma-minimax estimators for a general model space and class of estimators. We further provide convergence guarantees for these algorithms.

To the best of our knowledge, this is the first algorithm to compute Gamma-minimax estimators under general models, including infinite-dimensional models. We also show that, for certain problems, there is a unique Gamma-minimax estimator and our computed estimator converges to this estimator as the number of iterations increases to infinity.

Like the approach proposed in Noubiap and Seidel (2001), we consider sets of priors characterized by (in)equality constraints on prior generalized moments and our proposed iterative algorithm involves solving a discretized Gamma-minimax optimization problem in each intermediate step. However, we explicitly describe algorithms to solve these minimax problems, which facilitates the use of our approach by practitioners. When the space of estimators can be parameterized by a Euclidean space and gradients are available, we propose to use a gradient-based algorithm or a stochastic variant thereof. When gradients are unavailable, we propose to instead use fictitious play (Brown, 1951; Robinson, 1951) to compute a stochastic estimator, which is a mixture of deterministic estimators belonging to some specified collection. We also provide a convergence result that is applicable even when this collection has infinite cardinality. This is in contrast to the results in Robinson (1951), which require that each player has only finitely many possible deterministic strategies.

  2. We propose a Markov chain Monte Carlo (MCMC) method to construct the approximating grids defining the discretized Gamma-minimax problems used in our iterative scheme.

Like the approach proposed in Noubiap and Seidel (2001), our proposed iterative algorithm relies on increasingly fine finite grids over the model space. However, since we allow the model space to be high or even infinite-dimensional, randomly adding points to the grid may lead to unacceptably slow convergence. To overcome this challenge, we propose to use MCMC to efficiently construct such grids.

Our theoretical results allow for many different choices of classes of estimators. Our final contribution concerns the introduction of one such class:

  3. We introduce a new neural network architecture that can be used to parameterize statistical estimators and argue that this class represents an appealing class to optimize over.

For this final point, we build on existing works in adversarial learning (e.g., Goodfellow et al., 2014; Luedtke et al., 2020; Luedtke, Chung and Sofrygin, 2020) and extreme learning machines (Huang, Zhu and Siew, 2006). Thanks to the universal approximation properties of neural networks (e.g., Hornik, 1991; Csáji, 2001) and extreme learning machines (Huang, Chen and Siew, 2006), we also show that both of these parameterizations can achieve good performance for sufficiently large networks. Furthermore, inspired by pre-training (e.g., Erhan et al., 2010) and transfer learning (e.g., Torrey and Shavlik, 2009), we recommend leveraging knowledge of existing estimators as inputs to the network in settings where this is possible. Under such choices of the space of estimators, we can expect to obtain a useful estimator even if the associated nonconvex-concave minimax problems prove to be difficult.

This paper is organized as follows. In Section 2, we introduce the framework of Gamma-minimax estimation and regularity conditions that we assume throughout the paper. In Section 3, we describe our proposed iterative adversarial meta-learning algorithms. In Section 4, we discuss considerations when choosing hyperparameters in the algorithms. In Section 5, we demonstrate our method in three simulation studies. We conclude with a discussion in Section 6. Proof sketches of key results are provided in the main text, and complete proofs can be found in the appendix. We also provide a table summarizing the frequently used symbols in Table 7 in the appendix. The code for our simulations is available on GitHub (Qiu, 2022).

2. Problem setup

Let ℳ be a separable Hausdorff space of data-generating mechanisms that contains the truth P0 and is equipped with a metric ρ. Under a data-generating mechanism P ∈ ℳ, let X* ∈ 𝒳* denote the random data being generated, where 𝒳* is the space of values that the random data takes. Let 𝒞 denote a known coarsening mechanism such that the observed data X = 𝒞(X*) ∈ 𝒳, where 𝒳 is the space of observed data. In some cases, the coarsening mechanism will be the identity map, whereas in other settings, such as those in which missing, censored or truncated data is present, the coarsening mechanism will be nontrivial (e.g., Birmingham, Rotnitzky and Fitzmaurice, 2003; Gill, van der Laan and Robins, 1997; Heitjan and Rubin, 1991; Heitjan, 1993, 1994). Let 𝒟 denote the space of estimators (or decision functions) equipped with a metric ϱ. In practice, for computational feasibility, we will mainly consider an estimator space 𝒟 that contains estimators parameterized by a Euclidean space, such as linear estimators or neural networks, and that approximates a more general space 𝒟_0, for example, the space of all estimators satisfying certain smoothness conditions. We discuss considerations concerning the choice of 𝒟 in Section 4.2 and note that our proposed methods may be applied to broader estimator classes. We treat 𝒟 as fixed throughout this paper. Let R : 𝒟 × ℳ → ℝ denote a risk function that measures the performance of an estimator under a data-generating mechanism such that smaller risks are preferable. We suppose throughout that ℳ and 𝒟 are equipped with the topologies induced by ρ and ϱ, respectively.

We now present three examples in which we formulate statistical decision problems in the above form. The first example is a general example of point estimation. We use this example to illustrate how the Gamma-minimax estimation framework naturally fits into many statistical problems. The other two examples are more concrete and we will study them in the simulations and data analyses.

Example 1 (Point estimation).

Suppose that ℳ is a statistical model, which may be parametric, semiparametric, or nonparametric (Bickel et al., 1993). The data X* consists of n independently and identically distributed (iid) random variables O_i, i = 1, ..., n, following the true distribution P0. We set 𝒞 to be the identity function so that X = X*. We wish to estimate an aspect Ψ(P0) ∈ ℝ of P0. Then, we can consider 𝒟 to be a set of 𝒳 → ℝ functions and the mean-squared error risk R(d, P) = E_P[{d(X) − Ψ(P)}²]. Some specific examples of estimands include:

  1. Mean: Ψ(P) = E_P[O_i];

  2. Cumulative distribution function at a point o: Ψ(P) = Pr_P(O_i ≤ o);

  3. Correlation: with O_i = (X_i, Y_i) ∈ ℝ², Ψ(P) = E_P[X_i Y_i] − E_P[X_i] E_P[Y_i].

Example 2 (Predicting the expected number of novel categories to be observed in a new sample).

Suppose that ℳ consists of multinomial distributions with an unknown number of categories. Let an iid random sample of size n be generated from the true multinomial distribution, so that X* is a multiset containing the number X_k of observations in each category k. Suppose that only categories with nonzero occurrences are observed, so that X is a left-truncated version of X*. In other words, X is the multiset 𝒞(X*) = {X_k : X_k > 0}. Then, we may wish to predict the number of new categories that would be observed if a new sample of size m were collected. This problem has been extensively studied in the literature, with applications in microbiome data, species taxonomic surveys, and assessment of vocabulary size, among other areas (e.g., Shen, Chao and Lin, 2003; Bunge, Willis and Walsh, 2014; Orlitsky, Suresh and Wu, 2016). This prediction problem can be formulated in our framework. For each P ∈ ℳ, let p_k (k = 1, ..., K_P) be the probability of category k, and let Ψ(P)(X*) be Σ_{k=1}^{K_P} 1(X_k = 0){1 − (1 − p_k)^m}, the expected number of new observed categories given the current full data X*. We consider 𝒟 to be a set of 𝒳 → ℝ functions and set the risk to be the mean-squared error, that is, R(d, P) = E_P[{d(X) − Ψ(P)(X*)}²]. This prediction problem is known to be intrinsically difficult when the future sample size m is greater than the observed sample size n, and we might expect prior information to substantially improve prediction.
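To make the prediction target concrete, the following sketch evaluates Ψ(P)(X*) for a multinomial P. It is our illustration only; the function name and the use of numpy are not taken from the paper's code.

import numpy as np

def expected_new_categories(p, x_star, m):
    """Psi(P)(X*) = sum_k 1(X_k = 0) * (1 - (1 - p_k)^m): the expected number of
    categories unseen in the current full data X* that would appear in a new
    sample of size m drawn from P."""
    p = np.asarray(p, dtype=float)    # category probabilities p_1, ..., p_K
    x_star = np.asarray(x_star)       # counts X_1, ..., X_K in the current sample
    unseen = x_star == 0              # indicator 1(X_k = 0)
    return float(np.sum(unseen * (1.0 - (1.0 - p) ** m)))

# Example: with p = [0.5, 0.3, 0.2], x_star = [4, 3, 0] and m = 5, only the third
# category is unseen, so the value is 1 - 0.8**5, approximately 0.672.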

Example 3 (Entropy estimation).

Consider the same data-generating mechanism and observed data as in Example 2. We may wish to estimate the Shannon entropy (Shannon, 1948) Ψ(P) = −Σ_{k=1}^{K_P} p_k log p_k, a measure of diversity. We consider 𝒟 to be a set of 𝒳 → ℝ functions and set the risk to be the mean-squared error, that is, R(d, P) = E_P[{d(X) − Ψ(P)}²]. Jiao et al. (2015) proposed a rate-minimax estimator. Thus, in contrast to Example 2, this is an example of a statistical problem with a satisfactory solution. For such problems, we might not expect prior information to substantially improve estimation.

We now define Gamma-minimaxity within our decision-theoretic framework. We assume that ℳ is equipped with the Borel σ-field ℬ and let Π denote the set of all probability distributions on the measurable space (ℳ, ℬ). We also assume that, for any d ∈ 𝒟 and any π ∈ Π, P ↦ R(d, P) is π-integrable. The Bayes risk corresponding to an estimator d and a prior π is defined as r : (d, π) ↦ ∫ R(d, P) π(dP). Let Γ ⊆ Π be the set of priors such that all π ∈ Γ are consistent with the available prior information. An estimator is called a Γ-minimax estimator if it is in the set

argmin_{d ∈ 𝒟} sup_{π ∈ Γ} r(d, π).  (1)

Throughout the rest of this paper, we assume the existence of this solution set and other solution sets to minimax problems, and that sup_{π ∈ Γ} r(d, π) is finite for any d ∈ 𝒟.

In this paper, we consider the case in which Γ is characterized by finitely many generalized moment conditions, that is,

Γ = {π ∈ Π : Φ_k ∈ L¹(π), ∫ Φ_k(P) π(dP) ≤ c_k, k = 1, ..., K},

where each Φ_k : ℳ → ℝ is a prespecified function that extracts an aspect of a data-generating mechanism and c_k ∈ ℝ is a prespecified constant. The validity of our proposed template to find a Γ-minimax estimator in Section 3.1 does not require Γ to take this form, but our proposed algorithms in Sections 3.2 and 3.3 are computationally feasible for such constraints because these linear constraints lead to linear programs, which can be solved efficiently (e.g., Jiang et al., 2020). In principle, more general constraints can be handled by using suitable minimax problem solvers. Such constraints were considered in Noubiap and Seidel (2001) and can represent a variety of forms of prior information. For example, with Φ_k = ±Ψ^κ for some κ ≥ 1, Γ imposes bounds on prior moments of Ψ(P); with Φ_k(P) = ±1(Ψ(P) ∈ I) for some known interval I, Γ imposes bounds on the prior probability of Ψ(P) lying in I. Similar prior information on aspects of P0 other than Ψ(P0) can also be represented. In addition, since an equality can be equivalently expressed by two inequalities, Γ may also impose equality constraints on prior generalized moments. Such information is commonly used to choose prior distributions in Bayesian settings (Sarma and Kay, 2020). Since we do not require specifying a parametric model or specifying an entire prior distribution for any finite-dimensional summary of P0, specifying a set Γ of prior distributions in the above form is no more difficult, and often easier, than specifying a single prior distribution, as would be required in a Bayesian approach.
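For a finite grid of candidate distributions, maximizing the Bayes risk over priors satisfying such generalized moment constraints is a linear program in the probability masses. The sketch below is ours and assumes precomputed risks R(d, P) and constraint values Φ_k(P) on the grid; scipy.optimize.linprog is one possible solver, not necessarily the one used in the paper's code.

import numpy as np
from scipy.optimize import linprog

def worst_case_prior(risks, phi, c):
    """Maximize sum_t pi_t * risks[t] over prior masses pi on the grid subject to
    the generalized moment constraints sum_t pi_t * phi[k, t] <= c[k].

    risks: (T,) array of (estimated) risks R(d, P^(t)) on the grid.
    phi:   (K, T) array with phi[k, t] = Phi_k(P^(t)).
    c:     (K,) array of bounds c_k."""
    T = len(risks)
    result = linprog(
        c=-np.asarray(risks, dtype=float),       # linprog minimizes, so negate
        A_ub=np.asarray(phi, dtype=float),       # generalized moment constraints
        b_ub=np.asarray(c, dtype=float),
        A_eq=np.ones((1, T)), b_eq=[1.0],        # masses sum to one
        bounds=[(0.0, 1.0)] * T,                 # masses are nonnegative
    )
    return result.x                              # worst-case prior (pi_1, ..., pi_T)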

3. Proposed meta-learning algorithms to compute a Γ-minimax estimator

Since both the model space ℳ and the estimator space 𝒟 may be infinite, it is computationally infeasible to directly solve the minimax problem (1) defining a Γ-minimax estimator. Similarly to Noubiap and Seidel (2001), our general strategy is to discretize ℳ and thus consider prior distributions with discrete supports. Once the supports of prior distributions are discrete, the optimization over prior distributions only involves finitely many parameters, namely the probability masses at the support points, and thus is computationally feasible. We will show that, when the grid is sufficiently fine, a solution to the discretized minimax problem is close to a solution to the original minimax problem.

Our proposed algorithm consists of two main steps. The first step is to discretize the model space and consider an approximating grid ℳ_ℓ instead of the original complicated model space ℳ. This discretization is illustrated in Fig. 1. We describe it in more detail in Section 3.1. In the second step, we consider the set Γ_ℓ of priors in Γ with support contained in ℳ_ℓ and compute a Γ_ℓ-minimax estimator. We describe two classes of algorithms to solve this discretized minimax problem in Sections 3.2 and 3.3, respectively.

FIG 1.


Illustration of a grid ℳ_ℓ = {P^(1), P^(2), P^(3), ..., P^(T)} approximating the entire model space ℳ. Examples of densities of distributions P^(t) (t = 1, ..., T) in the grid are displayed. A prior distribution with support in ℳ_ℓ is parameterized by the probability mass at each distribution P^(t). An example of a prior distribution is displayed as black bars with heights proportional to the probability masses.

3.1. Grid-based approximation of Γ-minimax estimators

We first define the discretization of the model space ℳ that we will use. Let {ℳ_ℓ}_{ℓ=1}^∞ be an increasing sequence of finite subsets of ℳ such that ∪_{ℓ=1}^∞ ℳ_ℓ is dense in ℳ. That is, {ℳ_ℓ}_{ℓ=1}^∞ is an increasingly fine grid over ℳ. Since ℳ is separable, such a sequence {ℳ_ℓ}_{ℓ=1}^∞ necessarily exists. Define

Γ_ℓ := {π ∈ Γ : π has support in ℳ_ℓ}  and  r_sup(d, Γ′) := sup_{π ∈ Γ′} r(d, π)

for any d ∈ 𝒟 and Γ′ ⊆ Π.

Algorithm 1 describes how the grids ℳ_ℓ are used to compute an approximately Γ-minimax estimator in our proposed algorithms. We will show that the approximation error decays to zero as ℓ grows to infinity. Here and in the rest of the algorithms in the paper, for any real-valued function f, when we assign argmin_x f(x) or argmax_x f(x) to a variable, we arbitrarily pick a minimizer or maximizer if there are multiple optimizers. In practice, the user may stop the iteration at some ℓ and use a Γ_ℓ-minimax estimator d_ℓ* as the output estimator. We discuss the stopping criterion in more detail at the end of this section.

Algorithm 1.

Iteratively approximate a Γ-minimax estimator over an increasingly fine grid.

1: for ℓ = 1, 2, ... do
2:   Construct a grid ℳ_ℓ such that ℳ_{ℓ−1} ⊆ ℳ_ℓ
3:   d_ℓ* ← argmin_{d ∈ 𝒟} sup_{π ∈ Γ_ℓ} r(d, π)

We note that the minimax problem in Line 3 of Algorithm 1 is nontrivial to solve, and therefore we propose two algorithms that can solve this minimax problem in Sections 3.2 and 3.3.
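The following skeleton (ours) shows how the pieces of Algorithm 1 fit together; the three callables, their names, and the simplified stopping rule based on the change in maximal Bayes risk (discussed at the end of this section) are placeholders, not the paper's implementation.

def iterate_gamma_minimax(make_grid, solve_discretized_minimax, sup_bayes_risk,
                          n_rounds=10, eps=1e-4):
    """Skeleton of Algorithm 1.  make_grid(previous_grid, estimator) constructs the
    next grid (e.g., via the MCMC scheme of Section 4.1); solve_discretized_minimax
    solves Line 3 (e.g., via SGDmax or fictitious play); sup_bayes_risk(d, grid)
    approximates the maximal Bayes risk of d over priors supported on the grid.
    Stop once enlarging the grid no longer raises the maximal Bayes risk of the
    current estimator by more than eps."""
    grid = make_grid(previous_grid=None, estimator=None)
    estimator = None
    for _ in range(n_rounds):
        estimator = solve_discretized_minimax(grid, warm_start=estimator)
        next_grid = make_grid(previous_grid=grid, estimator=estimator)
        if sup_bayes_risk(estimator, next_grid) - sup_bayes_risk(estimator, grid) <= eps:
            break
        grid = next_grid
    return estimator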

We assume that the following conditions hold throughout the rest of the paper.

Condition 1.

There exists a limit point d* ∈ 𝒟 of the sequence {d_ℓ*}_{ℓ=1}^∞.

Condition 1 holds if the sequence {d_ℓ*}_{ℓ=1}^∞ eventually falls in a compact set. For example, if 𝒟 is a space of neural networks and we take ϱ to be the Euclidean norm in the coefficient space, then we expect Condition 1 to hold if all coefficients are restricted to fall in a bounded set, which is a common restriction in theoretical analyses of neural networks (see, e.g., Goel et al., 2016; Zhang, Lee and Jordan, 2016; Eckle and Schmidt-Hieber, 2019) and often leads to desirable generalization bounds (see, e.g., Bartlett, 1997; Bartlett, Foster and Telgarsky, 2017; Neyshabur et al., 2017). Our theoretical results hold for any limit point d* in Condition 1, even if there is more than one of them.

Condition 2.

The mapping d ↦ R(d, P) is continuous at d* for all P ∈ ℳ.

Condition 2 also often holds. For example, when estimators are parameterized using neural networks, all estimators are continuous functions of the coefficients for common activation functions such as the sigmoid or the rectified linear unit (ReLU) (Glorot, Bordes and Bengio, 2011) function, and therefore d ↦ R(d, P) is continuous everywhere.

We next present a sufficient condition to ensure that d* is Γ-minimax, so that d_ℓ* is approximately Γ-minimax for sufficiently large ℓ.

Condition 3.

We assume that there exists an increasing sequence {Ω_ℓ}_{ℓ=1}^∞ of subsets of ℳ such that

  1. ∪_{ℓ=1}^∞ Ω_ℓ = ℳ;

  2. for all ℓ = 1, 2, ... and all d ∈ 𝒟, define Γ̃_ℓ := {π ∈ Γ : π has support in Ω_ℓ} and Γ̃_{ℓ,i} := {π ∈ Γ : π has support in ℳ_i ∩ Ω_ℓ}. For any π ∈ Γ̃_ℓ with a finite support, there exists a sequence π_i ∈ Γ̃_{ℓ,i} such that r(d, π_i) → r(d, π) as i → ∞.

We note that, in contrast to ℳ_ℓ, Ω_ℓ may be an infinite set. We may expect Condition 3 to hold in many cases, especially when P ↦ R(d, P) is continuous at each d ∈ 𝒟 and the grid ℳ_ℓ contains a variety of distributions that are consistent with the prior information represented by Γ. We illustrate this point with two counterexamples in Appendix A. We will check the plausibility of Condition 3 for Example 2 in our simulation and data analysis in Section 5.1 for exemplar prior information; an almost identical argument shows the plausibility of Condition 3 for Example 3.

We now present the theorem on Γ-minimaxity of d*.

Theorem 1 (Validity of grid-based approximation).

Under Conditions 1–3, d* is Γ-minimax and

r_sup(d_ℓ*, Γ_ℓ) → min_{d ∈ 𝒟} r_sup(d, Γ)  as ℓ → ∞.

To prove Theorem 1, we utilize a result in Pinelis (2016) to establish that, for any d ∈ 𝒟, r_sup(d, Γ) can be approximated arbitrarily well by a discrete prior in Γ. This is a key ingredient in the proof of Lemma 1, which states that, for any d ∈ 𝒟, r_sup(d, Γ̃_ℓ) converges to r_sup(d, Γ). Then, we show that the sequence {r_sup(d_ℓ*, Γ_ℓ)}_{ℓ=1}^∞ is nondecreasing and upper bounded by inf_{d ∈ 𝒟} r_sup(d, Γ), which is less than or equal to the Γ-maximal Bayes risk r_sup(d*, Γ) of the limit point d* of {d_ℓ*}_{ℓ=1}^∞ in Condition 1. Therefore, r_sup(d_ℓ*, Γ_ℓ) converges to a limit. We finally use a contradiction argument to prove that this limit is greater than or equal to r_sup(d*, Γ), which implies Theorem 1.

We have the following corollary on the uniqueness of the Γ-minimax estimator and the convergence of {d_ℓ*}_{ℓ=1}^∞ for certain problems.

Corollary 1 (Convergence of Γ-minimax estimator).

Suppose that 𝒟 is a convex subset of a vector space, d ↦ R(d, P) is strictly convex for each P ∈ ℳ, and r_sup(d, Γ) is attainable for each d ∈ 𝒟 in the sense that, for all d ∈ 𝒟, there exists a π ∈ Γ such that r(d, π) = r_sup(d, Γ). Under Conditions 1–3, d* is the unique Γ-minimax estimator and

d_ℓ* → d*  as ℓ → ∞.

We prove Corollary 1 by establishing that d ↦ r_sup(d, Γ) is strictly convex.

In practice, the user also needs to specify a stopping criterion for Algorithm 1. In Noubiap and Seidel (2001), the authors recommended computing or approximating r_sup(d_ℓ*, Γ) and stopping once r_sup(d_ℓ*, Γ_ℓ) is sufficiently close to r_sup(d_ℓ*, Γ). However, the procedure to approximate r_sup(d_ℓ*, Γ) in that work relies on the compactness of ℳ, and we do not want to assume this condition because it may restrict the applicability of the method. Therefore, we propose to use the following alternative criterion: stop if r_sup(d_ℓ*, Γ_{ℓ+1}) − r_sup(d_ℓ*, Γ_ℓ) ≤ ϵ for a prespecified tolerance level ϵ > 0. This criterion was proposed but not recommended in Noubiap and Seidel (2001) because it does not guarantee that r_sup(d_ℓ*, Γ_ℓ) is close to r_sup(d_ℓ*, Γ). For example, if ℳ_{ℓ+1} ∖ ℳ_ℓ is small, it is even possible that r_sup(d_ℓ*, Γ_{ℓ+1}) − r_sup(d_ℓ*, Γ_ℓ) = 0 while d_ℓ* is far from being Γ-minimax. In contrast, we recommend this criterion for our proposed methods because we allow more flexibility in model specification, that is, ℳ need not be compact. We discuss this issue in more detail in Section 4.1.

We finally remark that r_sup(d, Γ_ℓ) may be difficult to evaluate exactly. Since the risk is often an expectation, we recommend approximating r_sup(d, Γ_ℓ) for any given d via Monte Carlo as follows: first, estimate the risks R(d, P) for all P ∈ ℳ_ℓ with a large number of Monte Carlo runs; second, estimate the corresponding least favorable prior π_{d,ℓ} ∈ argmax_{π ∈ Γ_ℓ} r(d, π) using the estimated risks; third, estimate the risks R(d, P) (P ∈ ℳ_ℓ) again with independent Monte Carlo runs; and finally, calculate r(d, π_{d,ℓ}) with the second set of estimated risks and the estimated least favorable prior. Using two independent estimates of the risk removes the positive bias that would otherwise arise from using the same data to estimate the risks and the least favorable prior.
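A minimal sketch of this two-stage approximation is given below; it reuses the hypothetical worst_case_prior helper sketched in Section 2 and assumes a user-supplied simulate_risk(d, P, n_runs) that returns a Monte Carlo estimate of R(d, P).

import numpy as np

def approx_sup_bayes_risk(d, grid, phi, c, simulate_risk, n_runs=2000):
    """Approximate the maximal Bayes risk of d over priors supported on a finite grid.
    Two independent sets of Monte Carlo risk estimates are used: the first to find the
    estimated least favorable prior, the second to evaluate its Bayes risk, avoiding
    the optimism bias of reusing the same draws."""
    risks_first = np.array([simulate_risk(d, P, n_runs) for P in grid])
    pi_hat = worst_case_prior(risks_first, phi, c)      # estimated least favorable prior
    risks_second = np.array([simulate_risk(d, P, n_runs) for P in grid])
    return float(pi_hat @ risks_second)                 # Bayes risk under pi_hat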

3.2. Computation of an estimator on a grid via stochastic gradient descent with max-oracle

In this section, we present methods to compute a Γ_ℓ-minimax estimator, which corresponds to Line 3 in Algorithm 1. Gradient descent with max-oracle (GDmax) and its stochastic variant (SGDmax), which were presented in Lin, Jin and Jordan (2020), can be used to solve general minimax problems in Euclidean spaces. We focus on SGDmax in the main text and present GDmax in Appendix B. To apply these algorithms to find a Γ_ℓ-minimax estimator, we need to assume that 𝒟 can be parameterized by a subset of a Euclidean space, that is, that for any d ∈ 𝒟, there exists a real vector-valued coefficient β ∈ ℝ^D such that d may be written as d(β). For example, 𝒟 may be a neural network class. More discussion of the parameterization of 𝒟 can be found in Section 4.2. In this section, in a slight abuse of notation, we define R(β, P) := R(d(β), P), r(β, π) := r(d(β), π) and r_sup(β, Γ_ℓ) := r_sup(d(β), Γ_ℓ) for a coefficient β ∈ ℝ^D, a data-generating mechanism P and a prior π ∈ Γ_ℓ. We assume that β ↦ R(β, P) is differentiable for all P, and hence so is β ↦ r(β, π) for all π ∈ Γ_ℓ.

It is often the case that R(β, P) is expressed as an expectation. In this case, R(β, P) may instead be approximated using Monte Carlo techniques. With ξ being an exogenous source of randomness generated according to a law Ξ, let R̂(β, P, ξ) be an unbiased approximation of R(β, P) with E[‖∇_β{R̂(β, P, ξ) − R(β, P)}‖²] ≤ σ² < ∞, where ‖·‖ denotes the ℓ₂-norm in Euclidean spaces. Let r̂(β, π, ξ) := ∫ R̂(β, P, ξ) π(dP) for π ∈ Γ_ℓ. In this case, SGDmax (Algorithm 2) may be used to find a (locally) Γ_ℓ-minimax estimator. Note that Algorithm 2 represents a generalization of the nested minimax AMC strategy in Luedtke et al. (2020) to Γ-minimax problems.

Algorithm 2.

Stochastic gradient descent with max-oracle (SGDmax) to compute a Γ_ℓ-minimax estimator

1: Initialize β^(0) ∈ ℝ^D. Set learning rate η > 0, max-oracle accuracy ζ > 0 and batch size J.
2: for t = 1, 2, ... do
3:   Stochastic maximization: use a stochastic procedure to find π^(t) ∈ Γ_ℓ such that E[r(β^(t−1), π^(t))] ≥ max_{π ∈ Γ_ℓ} r(β^(t−1), π) − ζ, where the expectation is over the randomness in the stochastic maximization (e.g., variants of stochastic gradient ascent).
4:   Generate iid copies ξ_1, ..., ξ_J of ξ.
5:   Stochastic gradient descent: β^(t) ← β^(t−1) − (η/J) Σ_{j=1}^J ∇_β r̂(β, π^(t), ξ_j)|_{β=β^(t−1)}
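The following PyTorch-style sketch of one SGDmax iteration is ours and only illustrates the structure of Lines 3–5; the max-oracle and the risk simulator are assumed to be supplied by the user, and details such as the max-oracle accuracy ζ are left abstract.

import torch

def sgdmax_step(net, optimizer, grid, max_oracle, simulate_loss, batch_size=30):
    """One SGDmax iteration (Lines 3-5 of Algorithm 2) for a neural-network
    estimator `net`.  max_oracle(net, grid) returns approximate least favorable
    prior masses over the grid (e.g., via a linear program on Monte Carlo risk
    estimates); simulate_loss(net, P) returns an unbiased, differentiable Monte
    Carlo estimate of R(beta, P)."""
    with torch.no_grad():
        pi = torch.as_tensor(max_oracle(net, grid), dtype=torch.float32)  # Line 3
    draws = []
    for _ in range(batch_size):                        # Line 4: iid copies of xi
        risks = torch.stack([simulate_loss(net, P) for P in grid])
        draws.append((pi * risks).sum())               # r_hat(beta, pi, xi_j)
    optimizer.zero_grad()
    torch.stack(draws).mean().backward()               # Line 5: stochastic gradient step
    optimizer.step()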

We next present two conditions needed for the validity of Algorithm 2.

Condition 4.

For each ℓ = 1, 2, ... and all β ∈ ℝ^D, β ↦ R(β, P) is Lipschitz continuous with a universal Lipschitz constant L_1 independent of P ∈ ℳ_ℓ.

Note that Condition 4 differs from Condition 2 in that the former relies on the parameterization of 𝒟 in a Euclidean space equipped with the Euclidean norm, while the latter may rely on a different metric on 𝒟 such as an L2-distance.

Condition 5.

For each ℓ = 1, 2, ... and all β ∈ ℝ^D, ∇_β R(β, P) is bounded; β ↦ ∇_β R(β, P) is Lipschitz continuous with a universal Lipschitz constant L_2 independent of P ∈ ℳ_ℓ.

Under these conditions, using the results in Lin, Jin and Jordan (2020), we can show that SGDmax yields an approximation to a local minimum of β ↦ r_sup(β, Γ_ℓ) when the algorithms' hyperparameters are suitably chosen. Before we formally present the theorem, we introduce some definitions related to the local optimality of a potentially nondifferentiable and nonconvex function. A real-valued function f is called q-weakly convex (q > 0) if x ↦ f(x) + (q/2)‖x‖² is convex. The Moreau envelope of a real-valued function f with parameter q > 0 is f_q : x ↦ min_{x′} f(x′) + ‖x′ − x‖²/(2q). A point x is an ϵ-stationary point (ϵ ≥ 0) of a q-weakly convex function f if ‖∇f_{1/(2q)}(x)‖ ≤ ϵ. Similarly, a random point x is an ϵ-stationary point (ϵ ≥ 0) of a q-weakly convex function f in expectation if E[‖∇f_{1/(2q)}(x)‖] ≤ ϵ. If x is an ϵ-stationary point in expectation, we may conclude that it is an ϵ-stationary point with high probability by Markov's inequality. Lemma 3.8 in Lin, Jin and Jordan (2020) shows that an ϵ-stationary point of f is close to a point x′ at which f has at least one small subgradient for small ϵ, so that f(x) is close to a local minimum. In other words, if an algorithm outputs an estimator d̂ = d(β̂) such that β̂ is an ϵ-stationary point of β ↦ r_sup(β, Γ_ℓ), then we know that r_sup(β̂, Γ_ℓ) is close to a local minimum of β ↦ r_sup(β, Γ_ℓ).

We next present the validity result for Algorithm 2.

Theorem 2 (Validity of SGDmax (Algorithm 2)).

Suppose that Conditions 1–2 and 4–5 hold. Let ϵ > 0 be fixed and define Δ := (r_sup)_{1/(2L_1)}(β^(0)) − min_{β ∈ ℝ^D} (r_sup)_{1/(2L_1)}(β), where we recall that (r_sup)_{1/(2L_1)} is the Moreau envelope of β ↦ r_sup(β, Γ_ℓ) with parameter 1/(2L_1). In Algorithm 2, with η = ϵ²/[L_1(L_2² + σ²)], ζ = ϵ²/(24L_1) and J = 1, β^(t) is an ϵ-stationary point of β ↦ r_sup(β, Γ_ℓ) in expectation for t = O(L_1(L_2² + σ²)Δ/ϵ⁴), and is thus close to a local minimum of β ↦ r_sup(β, Γ_ℓ) with high probability.

The assumption that the batch size J = 1 is purely for convenience, since increasing J corresponds to decreasing the variance σ². To run Algorithm 2 in practice, the user only needs to specify the tuning parameters in Line 1; the other constants in Theorem 2 need not be known. In general, a small learning rate η, a stringent accuracy ζ, and a large batch size J make Algorithm 2 likely to eventually reach an approximation of a local minimum of β ↦ r_sup(β, Γ_ℓ), but the computation time might increase. As with most numerical optimization algorithms, fine-tuning is needed to balance the convergence guarantee against computation time, but a conservative choice of tuning parameters would typically result in convergence at the cost of computation time.

We note that Line 3 in Algorithm 2 may be inconvenient to implement because linear program solvers often do not use stochastic optimization. Therefore, we propose a convenient variant (Algorithm 6 in Appendix B), in which the stochastic maximization step (Line 3 in Algorithm 2) is replaced by solving a linear program whose objective is approximated via Monte Carlo. This variant has similar validity under similar conditions. We also note that the two uniform Lipschitz continuity conditions (Conditions 4 and 5) rely heavily on the fact that ℳ_ℓ is finite and on the compactness of a set containing the coefficients. Nevertheless, the latter compactness restriction is common in theoretical analyses of neural networks (see, e.g., Goel et al., 2016; Zhang, Lee and Jordan, 2016; Eckle and Schmidt-Hieber, 2019). Moreover, these two conditions are sufficient conditions for the validity of the gradient-based methods, namely SGDmax, our variant of SGDmax and GDmax; a guarantee similar to these validity results might still hold when these two conditions are violated.

We finally remark that other algorithms similar to SGDmax can be applied, for example, (stochastic) gradient descent ascent with projection (Lin, Jin and Jordan, 2020), (stochastic) mirror descent ascent, or accelerated (stochastic) mirror descent ascent (Huang, Wu and Huang, 2021). It is of future research interest to develop gradient-based methods to solve minimax problems with convergence guarantees under weaker conditions.

3.3. Computation of an estimator on a grid via fictitious play

The algorithms in Section 3.2 may be convenient in many cases, but the requirements such as parameterization of the space 𝒟 of estimators in a Euclidean space, differentiability of the risk function R with respect to the coefficients β, and uniform Lipschitz continuity may be restrictive for certain problems. In this section, we propose an alternative algorithm, fictitious play, that avoids these requirements. We also present its convergence results.

Brown (1951) introduced fictitious play as a means to find the value of a zero-sum game, that is, the optimal mixed strategy for both players and their expected gains. Robinson (1951) then proved that fictitious play can be used to iteratively solve a two-player zero-sum game for a saddle point that is a pair of mixed strategies when both players have finitely many pure strategies. Our problem of finding a Γ-minimax estimator may also be viewed as a two-player zero-sum game where one player chooses a prior from Γ and the other player chooses an estimator from 𝒟. If we assume that, for the Γ-minimax problem at hand, the pair of both players' optimal strategies is a saddle point, which holds in many minimax problems (e.g., v. Neumann, 1928; Fan, 1953; Sion, 1958), then fictitious play may also be used to find a Γ-minimax estimator. Since Γ may be too rich to allow for a feasible implementation of fictitious play, we propose to use this algorithm to find a Γ_ℓ-minimax estimator.

In the fictitious play algorithm in Robinson (1951), the two players take turns playing the best pure strategy against the mixture of the opponent's historical pure strategies, and the final output is a pair of mixtures of the two players' historical pure strategies. Since this algorithm aims to find minimax mixed strategies, we consider stochastic estimators. That is, we equip 𝒟 with a Borel σ-field ℬ_𝒟 and let Π_𝒟 denote the set of all probability distributions on the measurable space (𝒟, ℬ_𝒟). Each ϖ ∈ Π_𝒟 defines a stochastic estimator d(ϖ) of the following form: first draw an estimator from 𝒟 according to ϖ with an exogenous random mechanism, and then use the drawn estimator to obtain an estimate based on the data. We consider such stochastic estimators throughout this section, with the definition of Γ-minimaxity extended in the natural way, so that d(ϖ_ℓ*) is Γ_ℓ-minimax among stochastic estimators if r_sup(d(ϖ_ℓ*), Γ_ℓ) = min_{ϖ ∈ Π_𝒟} r_sup(d(ϖ), Γ_ℓ); we similarly extend all other definitions from Section 2. We assume that there exists π_ℓ* ∈ Γ_ℓ (ℓ = 1, 2, ...) such that

r(d(ϖ_ℓ*), π_ℓ*) = sup_{π ∈ Γ_ℓ} inf_{ϖ ∈ Π_𝒟} r(d(ϖ), π) = inf_{ϖ ∈ Π_𝒟} sup_{π ∈ Γ_ℓ} r(d(ϖ), π).  (2)

In other words, (d(ϖ_ℓ*), π_ℓ*) is a saddle point of r over stochastic estimators and Γ_ℓ. Under this condition and the further conditions that 𝒟 is convex and d ↦ R(d, P) is convex for all P, it is possible to use a Γ-minimax estimator over the richer class of stochastic estimators to derive a Γ-minimax estimator over the original class 𝒟. Indeed, for any ϖ ∈ Π_𝒟 and P, by Jensen's inequality, R(d(ϖ), P) = ∫ R(d′, P) ϖ(dd′) ≥ R(d_ϖ, P), where d_ϖ := ∫ d′ ϖ(dd′) ∈ 𝒟 is the average of the stochastic estimator d(ϖ); that is, the risk of d_ϖ is never greater than that of d(ϖ). Therefore, we may use the fictitious play algorithm to compute d(ϖ_ℓ*) for each ℓ and further apply Algorithm 1 to compute d(ϖ*). After that, we may take d_{ϖ*} as the final output deterministic estimator.

Algorithm 3 presents the fictitious play algorithm for finding a Γ_ℓ-minimax stochastic estimator. Note that Γ_ℓ is convex, and hence the mixture prior π̄^(t) always lies in Γ_ℓ throughout the iterations. In practice, we may initialize ϖ^(0) as a point mass at an initial estimator in 𝒟. In addition, similarly to Robinson (1951), we may replace Line 5 with d^(t) ← argmin_{d ∈ 𝒟} r(d, π̄^(t)), that is, minimizing the Bayes risk with the most recently updated prior rather than with the previous one.

Algorithm 3.

Fictitious play to compute a Γ_ℓ-minimax stochastic estimator

1: Initialize ϖ^(0) ∈ Π_𝒟 and π̄^(0) ∈ Γ_ℓ.
2: for t = 1, 2, ... do
3:   π^(t) ← argmax_{π ∈ Γ_ℓ} r(d(ϖ^(t−1)), π)
4:   π̄^(t) ← ((t−1)/t) π̄^(t−1) + (1/t) π^(t)
5:   d^(t) ← argmin_{d ∈ 𝒟} r(d, π̄^(t−1))
6:   ϖ^(t) ← ((t−1)/t) ϖ^(t−1) + (1/t) δ(d^(t)), where δ(d) denotes a point mass at d ∈ 𝒟.
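The sketch below mirrors Algorithm 3 with plain Python lists; best_response_prior and best_response_estimator stand in for the two inner optimizations (the linear program over Γ_ℓ and a Bayes-risk minimizer over 𝒟) and are our placeholders, not functions from the paper's code.

def fictitious_play(grid, best_response_prior, best_response_estimator,
                    init_estimator, init_prior, n_iter=100):
    """Fictitious play (Algorithm 3) on a finite grid.  The stochastic estimator is
    represented by the list of historical best responses d^(1), ..., d^(t), each with
    weight 1/t; priors are vectors of masses over the grid."""
    responses = []                         # d^(1), ..., d^(t)
    mixture = [init_estimator]             # support of the current uniform mixture
    prior_bar = list(init_prior)           # running prior mixture pi_bar^(t)
    for t in range(1, n_iter + 1):
        prior_t = best_response_prior(mixture, grid)                 # Line 3
        new_prior_bar = [(t - 1) / t * pb + pt / t
                         for pb, pt in zip(prior_bar, prior_t)]      # Line 4
        responses.append(best_response_estimator(prior_bar, grid))   # Line 5: previous prior mixture
        prior_bar = new_prior_bar
        mixture = responses                # Line 6: uniform mixture over d^(1), ..., d^(t)
    return responses, prior_bar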

We next present a convergence result for this algorithm.

Theorem 3 (Validity of fictitious play (Algorithm 3)).

Assume that there exists a compact subset of 𝒟 that contains all d^(t) (t = 1, 2, ...). Under Conditions 1–2, it holds that

r(d^(t), π̄^(t−1)) ≤ r(d(ϖ_ℓ*), π_ℓ*) ≤ r(d(ϖ^(t−1)), π^(t))

for all t and

lim_{t→∞} {r(d(ϖ^(t−1)), π^(t)) − r(d^(t), π̄^(t−1))} = 0.

Consequently, the Γ_ℓ-maximal Bayes risk of d(ϖ^(t)) converges to the Γ_ℓ-minimax risk, that is,

r_sup(d(ϖ^(t−1)), Γ_ℓ) → r_sup(d(ϖ_ℓ*), Γ_ℓ)  as t → ∞.

Robinson (1951) proved a similar result for two-player zero-sum games in which each player has finitely many pure strategies. In contrast, in our problem, each player may have infinitely many pure strategies. A natural attempt to prove Theorem 3 would be to consider finite covers of 𝒟 and Γ_ℓ, that is, 𝒟 = ∪_{i=1}^I 𝒟_i and Γ_ℓ = ∪_{j=1}^J Π_j, such that the range of r(d, π) over each 𝒟_i and each Π_j is small (say less than ϵ), bin pure strategies into these subsets, and then apply the argument in Robinson (1951) to these bins. The collections {𝒟_i} and {Π_j} may be viewed as finitely many approximate pure strategies for 𝒟 and Γ_ℓ, respectively, up to accuracy ϵ. Unfortunately, we found that this approach fails. The problem arises because Robinson (1951) inducted on I and J, and, after each induction step, the corresponding upper bound becomes twice as large. Unlike the case with finitely many pure strategies considered in Brown (1951) and Robinson (1951), as the desired approximation accuracy ϵ approaches zero, the numbers of approximate pure strategies, I and J, may diverge to infinity, and so does the number of induction steps. Therefore, the resulting final upper bound is of order 2^{I+J} ϵ and generally does not converge to zero as ϵ tends to zero. To overcome this challenge, we instead control the increase in the relevant upper bound after each induction step more carefully, so that the final upper bound converges to zero as ϵ decreases to zero, despite the fact that I and J may diverge to infinity.

We remark that, because Line 5 of Algorithm 3 typically involves another layer of iteration in addition to that over t, this algorithm will often be more computationally intensive than SGDmax. Nevertheless, Algorithm 3 provides an approach to construct Γ-minimax estimators in cases where these other algorithms cannot be applied, for example, in settings where the risk is not differentiable in the parameters indexing the estimator or uniform Lipschitz conditions fail. In our numerical experiments, we have implemented this algorithm in the context of mean estimation (Appendix C).

4. Considerations in implementation

4.1. Considerations when constructing the grid over the model space

By Theorem 1, r_sup(d_ℓ*, Γ_ℓ) → min_{d ∈ 𝒟} r_sup(d, Γ) whenever Conditions 1–3 hold and the increasing sequence {ℳ_ℓ}_{ℓ=1}^∞ is such that ∪_{ℓ=1}^∞ ℳ_ℓ is dense in ℳ. Though this guarantee holds for all such sequences {ℳ_ℓ}_{ℓ=1}^∞, in practice, judiciously choosing this sequence of grids of distributions can lead to faster convergence. In particular, it is desirable that the least favorable prior in Γ_ℓ put mass on some of the distributions in ℳ_ℓ ∖ ℳ_{ℓ−1} since, if this is not the case, then d_ℓ* will be the same as d_{ℓ−1}*. While we may try to arrange for this to occur by adding many new points when enlarging ℳ_{ℓ−1} to ℳ_ℓ, it may not be likely that any of these points will actually modify the least favorable prior unless they are carefully chosen.

To better address this issue, we propose to add grid points using a Markov chain Monte Carlo (MCMC) method. Our intuition is that, given an estimator d, the maximal Bayes risk is likely to increase significantly if we add distributions that (i) have a high risk for d, and (ii) are consistent with prior information, so that there exists some prior under which these distributions lie in a high-probability region. We propose to use the MCMC algorithm to bias the selection of distributions in favor of those with the above characteristics. Let τ : ℳ → [0, ∞) denote a function such that τ(P) > τ(P′) if P is more consistent with prior information than P′. For example, given a prior mean μ of some real-valued summary Ψ(P) of P and an interval I that contains Ψ(P) with prior probability at least 95%, we may choose τ : P ↦ ϕ(Ψ(P)), where ϕ is the density of a normal distribution that has mean μ and places 95% of its probability mass in I. We call τ a pseudo-prior. Then, with the current estimator being d, we wish to select distributions P for which R(d, P)τ(P) is large. We may use the Metropolis-Hastings-Green algorithm (Metropolis et al., 1953; Hastings, 1970; Green, 1995) to draw samples from a density proportional to P ↦ R(d, P)τ(P). We then let ℳ_ℓ be the union of ℳ_{ℓ−1} and the set containing all unique distributions in this sample.
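As a concrete illustration of such a pseudo-prior (our own sketch; Psi is a user-supplied summary function and μ is assumed to be the midpoint of I), the following maps P to the normal density evaluated at Ψ(P):

from scipy.stats import norm

def make_pseudo_prior(Psi, mu, interval):
    """Pseudo-prior tau(P) = phi(Psi(P)), where phi is the normal density with mean mu
    that places 95% of its mass in `interval` (here mu is taken to be the midpoint of
    the interval, so sd = half-width / 1.96)."""
    lo, hi = interval
    sd = (hi - lo) / (2 * 1.96)
    return lambda P: norm.pdf(Psi(P), loc=mu, scale=sd)

# e.g., tau = make_pseudo_prior(Psi, mu=47.5, interval=(40.0, 55.0))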

Details of the proposed scheme are provided in Algorithm 4. To use this algorithm, we rely on it being possible to define a sequence of parametric models {Ω̃_ℓ}_{ℓ=1}^∞ such that ℳ̃ := ∪_{ℓ=1}^∞ Ω̃_ℓ is dense in ℳ; this is possible in many interesting examples (see, e.g., Chen, 2007). When combined with the separability of ℳ, this condition enables the definition of an increasing sequence of grids of distributions {ℳ_ℓ}_{ℓ=1}^∞ such that, for each ℓ, ℳ_ℓ ⊆ ℳ̃.

Algorithm 4.

MCMC algorithm to construct ℳ_ℓ

Require: ℳ_{ℓ−1}, current estimator d_{ℓ−1}* and number T of iterations. We define ℳ_0 := ∅. An initial estimator d_0* must be available if ℓ = 1.
1: Initialize P^(0) ∈ ℳ̃.
2: for t = 1, 2, ..., T do
3:   Propose a distribution P ∈ ℳ̃ from P^(t−1)
4:   Calculate the MCMC acceptance probability p_accept of P for the target density P ↦ R(d_{ℓ−1}*, P)τ(P)
5:   With probability p_accept, accept P and set P^(t) ← P
6:   if P is not accepted then
7:     P^(t) ← P^(t−1)
8: ℳ_ℓ ← unique elements of the multiset ℳ_{ℓ−1} ∪ {P^(1), P^(2), ..., P^(T)}
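A minimal sketch of Algorithm 4 with a symmetric random-walk proposal (a simple special case of the Metropolis-Hastings-Green algorithm cited above) is given below; the parameterization of the working model, the proposal, and the risk and pseudo-prior evaluations inside log_target are all our assumptions, and parameters are assumed hashable so duplicates can be dropped.

import numpy as np

def mcmc_grid_points(previous_grid, log_target, propose, theta0, T=1000, seed=0):
    """Symmetric-proposal Metropolis sampler targeting a density proportional to
    R(d, P_theta) * tau(P_theta).

    log_target(theta): log R(d, P_theta) + log tau(P_theta), up to an additive constant.
    propose(theta, rng): draw a candidate parameter from a symmetric proposal.
    Returns previous_grid augmented with the unique visited parameters."""
    rng = np.random.default_rng(seed)
    theta, visited = theta0, []
    for _ in range(T):
        candidate = propose(theta, rng)
        # Symmetric proposal: accept with probability min(1, target(candidate) / target(theta)).
        if np.log(rng.uniform()) < log_target(candidate) - log_target(theta):
            theta = candidate
        visited.append(theta)
    return list(previous_grid) + list(dict.fromkeys(visited))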

The following theorem on distributional convergence follows from the corresponding result for the Metropolis-Hastings-Green algorithm (see Sections 3.2 and 3.3 of Green, 1995).

Theorem 4 (Validity of MCMC algorithm (Algorithm 4)).

Suppose that P ↦ R(d_{ℓ−1}*, P)τ(P) is bounded and integrable with respect to some measure μ on ℳ̃, and let ℙ denote the probability law on ℳ̃ whose density function with respect to μ is proportional to this function. Suppose that the MCMC is constructed such that the Markov chain is irreducible and aperiodic. Then, P^(t) converges weakly to ℙ as t → ∞.

Therefore, if ℙ corresponds to a continuous distribution with nonzero density over the parameter space of ℳ̃, then Theorem 4 implies that ∪_{ℓ=1}^∞ ℳ_ℓ is dense in ℳ, as required by Algorithm 1.

Implementing Algorithm 4 relies on the user making several decisions. These decisions include the choice of the pseudo-prior τ and the technique used to approximate the risk R(d, P) to a reasonable accuracy. Fortunately, regardless of the decisions made, Theorem 1 suggests that r_sup(d_ℓ*, Γ_ℓ) → min_{d ∈ 𝒟} r_sup(d, Γ) for a wide range of sequences {ℳ_ℓ}_{ℓ=1}^∞. Indeed, all that theorem requires of this sequence is that the grid becomes arbitrarily fine as ℓ increases. Though the final decisions made are not important when ℓ is large, we still comment briefly on the decisions that we have made in our experiments. First, we have found it effective to approximate R(d, P) via a large number of Monte Carlo draws. Second, in a variety of settings, we have also identified, via numerical experiments, candidate pseudo-priors that balance high risk and consistency with prior information (see Sections 5.1 and 5.2 for details).

4.2. Considerations when choosing the space of estimators

It is desirable to consider a rich space 𝒟_0 of estimators in order to obtain an estimator with low maximal Bayes risk, and thus good general performance. However, to make numerically constructing these estimators computationally feasible, we usually have to consider a restricted space 𝒟 of estimators. This approximation is justified because, if estimators in 𝒟 can approximate the Gamma-minimax estimator in 𝒟_0 well, then we expect the resulting excess maximal Bayes risk to be small.

Feedforward neural networks (or neural networks for short) are natural options for the space of estimators because of their universal approximation property (e.g., Hornik, 1991; Csáji, 2001; Hanin and Sellke, 2017; Kidger and Lyons, 2020). However, training commonly used neural networks can be computationally intensive. Moreover, a space of neural networks is typically nonconvex, and hence it may be difficult to find a global minimizer of the maximal Bayes risk even if the risk is convex in the estimator. Therefore, the learned estimator might not perform well.

To help overcome this challenge, we advocate for utilizing available statistical knowledge when designing the space of estimators. We call estimators that take this form statistical knowledge networks. In particular, if a simple estimator is already available, we propose to use neural networks with such an estimator as a node connected to the output node. An example of such an architecture is presented in Fig. 2. In this sample architecture, each node is an activation function such as the sigmoid or the rectified linear unit (ReLU) (Glorot, Bordes and Bengio, 2011) function applied to an affine transformation of the vector containing the ancestors of the node. The only exception is the output node, which is again an affine transformation of its ancestors but uses the identity activation function. When training the neural network, we may initialize the affine transformation in the output layer to only give weight to the simple estimator. Under this approach, the space of estimators is a set of perturbations of an existing simple estimator. Although we may still face the challenge of nonconvexity and local optimality, we can at least expect to improve the initial simple estimator.

FIG 2.


Example of neural network estimator architecture utilizing an existing estimator. The arrows from the input nodes to the existing estimator are omitted from this graph.
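A PyTorch sketch of such an architecture is given below; it is our illustration rather than the paper's implementation. The existing estimator enters as an extra node feeding the output layer, and the output weights are initialized so that the network initially reproduces the existing estimator.

import torch
import torch.nn as nn

class KnowledgeNetwork(nn.Module):
    """Neural network estimator with an existing simple estimator wired into the
    output layer, initialized to reproduce that estimator exactly."""

    def __init__(self, input_dim, existing_estimator, hidden_dim=50):
        super().__init__()
        self.existing_estimator = existing_estimator           # e.g., an available simple estimator
        self.hidden = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.out = nn.Linear(hidden_dim + 1, 1)                # identity activation at the output
        with torch.no_grad():                                  # start exactly at the existing estimator
            self.out.weight.zero_()
            self.out.weight[0, -1] = 1.0
            self.out.bias.zero_()

    def forward(self, x):
        simple = self.existing_estimator(x).reshape(-1, 1)     # existing-estimator node
        features = torch.cat([self.hidden(x), simple], dim=1)
        return self.out(features)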

In the simulation we describe in Appendix C, we compared the empirical performance of several spaces of estimators. This simulation concerns the simple problem of estimating the mean of a true distribution whose support has known bounds (Example 1), and the existing simple estimator we use in the statistical knowledge network is the sample mean. Fig. 3 presents the trajectory of estimated Bayes risks. As shown in subfigures (b)–(d), with the statistical knowledge network, the estimator is almost Γ-minimax after a few iterations; on the other hand, it took about 1000 iterations for the feedforward neural network to reach an approximately Γ-minimax estimator. Therefore, in this simple problem, where the true Γ-minimax estimator is a shifted and scaled sample mean, statistical knowledge substantially reduced the number of iterations required to obtain an approximately Γ-minimax estimator. For more complicated problems, we expect statistical knowledge to further help improve the performance of the computed estimator.

FIG 3.


Estimated Bayes risks of the estimator over iterations when computing a Γ_1-minimax estimator. The lines are the current Bayes risks (y-axis) over iterations (x-axis) (unbiased estimates with 50 Monte Carlo runs for Algorithm 6; exact values for Algorithm 3). The solid lines are the Bayes risks after an update in the estimator to decrease the Bayes risk. The dashed lines are the Bayes risks after an update in the prior to increase the Bayes risk. The two horizontal lines are the Bayes risk of the sample mean (dashed) and d* (solid), respectively, for π*. For ease of visualization, in subfigures (a) and (b), the Bayes risks are plotted every 50 iterations; in subfigures (c) and (d), the Bayes risks are plotted every 200 iterations; subfigure (d) contains the part of subfigure (c) after 500 iterations.

We note that we might overcome the challenge of nonconvexity and local optimality by using an extreme learning machine (ELM) (Huang, Zhu and Siew, 2006) to parameterize the estimator. ELMs are neural networks for which the weights in the hidden nodes are randomly generated and held fixed, and only the weights in the output layer are trained. Thus, the space of ELMs with a fixed architecture and fixed hidden-layer weights is convex. Like traditional neural networks, ELMs have the universal approximation property (Huang, Chen and Siew, 2006). In addition, Corollary 1 may be applied to an ELM, so that the Γ_ℓ-minimax estimator may converge to the Γ-minimax estimator. As for traditional neural networks, we may incorporate knowledge of existing statistical estimators into an ELM.
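A sketch of such an ELM-style parameterization (ours, not the paper's code) fixes randomly drawn hidden weights and trains only the output layer, so that the estimator is linear in the trainable coefficients:

import torch
import torch.nn as nn

class ELMEstimator(nn.Module):
    """Extreme-learning-machine estimator: a randomly initialized, frozen hidden layer
    followed by a trainable linear output layer, so that the estimator is linear in
    the trainable coefficients."""

    def __init__(self, input_dim, hidden_dim=200):
        super().__init__()
        self.hidden = nn.Linear(input_dim, hidden_dim)
        for param in self.hidden.parameters():
            param.requires_grad_(False)          # hidden weights stay at their random draws
        self.out = nn.Linear(hidden_dim, 1)      # only these coefficients are trained

    def forward(self, x):
        return self.out(torch.relu(self.hidden(x)))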

We finally remark that, besides computational intensity when constructing (i.e., learning) a Γ-minimax estimator, another important factor to be considered when choosing 𝒟 is the computational intensity to evaluate the learned estimator at the observed dataset. This is another reason for our choosing neural networks or ELMs as the space of estimators. Indeed, existing software packages (e.g., Paszke et al., 2019) make it easy to leverage graphics processing units to efficiently evaluate the output of neural networks for any given input. Therefore, if the existing estimator being used is not too difficult to compute, then estimators parameterized using similar architectures to that displayed in Figure 2 will be able to be computed efficiently in practice. This efficiency may be especially important in settings where the estimator will be applied to many datasets, so that the cost of learning the estimator is amortized and the main computational expense is evaluating the learned estimator.

5. Simulations and data analyses

We illustrate our methods in Examples 1–3. A toy example based on Example 1 is presented in Appendix C. We focus on the more complex Examples 2 and 3 in this section.

5.1. Prediction of the expected number of new categories

We apply our proposed method to Example 2. In the simulation, we set the true population to be an infinite population with the same categories and the same proportions as the sample studied in Miller and Wiegert (1989), which consists of 1088 observations in 188 categories. This setting is the same as the simulation setting in Shen, Chao and Lin (2003). We set the sample size to be n = 100 and the size of the new sample to be m = 200. In this setting, the expected number of new categories in the new sample, unconditional on the observed sample, namely Φ(P0) := E_{P0}[Ψ(P0)(X*)], can be analytically computed and equals 48.02. We note that this quantity can also be computed via simulation: (i) sample n and m individuals with replacement from the dataset in Miller and Wiegert (1989), (ii) count the number of new categories in the second sample, and (iii) repeat steps (i) and (ii) many times and compute the average.
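The resampling scheme just described can be coded directly; the sketch below is ours, with the category counts of the Miller and Wiegert (1989) sample as a placeholder input.

import numpy as np

def simulate_expected_new_categories(counts, n=100, m=200, n_rep=10000, seed=1):
    """Approximate Phi(P0) = E[Psi(P0)(X*)]: draw a sample of size n and a further
    sample of size m with replacement from the population defined by `counts`
    (number of individuals in each category), count the categories appearing only
    in the second sample, and average over repetitions."""
    rng = np.random.default_rng(seed)
    p = np.asarray(counts, dtype=float)
    p /= p.sum()
    new_categories = []
    for _ in range(n_rep):
        seen_first = rng.multinomial(n, p) > 0     # categories seen in the first sample
        seen_second = rng.multinomial(m, p) > 0    # categories seen in the new sample
        new_categories.append(np.sum(seen_second & ~seen_first))
    return float(np.mean(new_categories))

# With the 1088-observation, 188-category counts from Miller and Wiegert (1989),
# this should return a value close to the analytic 48.02.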

It is well known that this prediction problem is difficult when m>n, and we run this simulation to investigate the potential gain from leveraging prior information by computing a Gamma-minimax estimator for such difficult or even ill-posed problems. We consider three sets of prior information:

  1. strongly informative: prior mean of Φ(P) in [45, 50], ≥ 95% prior probability that Φ(P) lies in [40, 55];

  2. weakly informative: prior mean of Φ(P) in [40, 55], ≥ 95% prior probability that Φ(P) lies in [30, 65]; and

  3. almost noninformative: prior mean of Φ(P) in [35, 60], ≥ 95% prior probability that Φ(P) lies in [20, 75].

We note that a traditional Bayesian approach would require specifying a prior on ℳ, including the total number of categories and the proportion of each category, which may be difficult in practice.

We check the plausibility of Condition 3 in this context. We take the strongly informative prior information as an example. Take Ω_ℓ to be the collection of multinomial distributions with at most ℓ categories. It is obvious that ∪_{ℓ=1}^∞ Ω_ℓ = ℳ. Let d ∈ 𝒟 be fixed and let π ∈ Γ̃_ℓ be a fixed prior with finite support, that is, π = Σ_{j=1}^J q_j δ(Q_j), where δ(·) denotes the point mass distribution, Q_j ∈ Ω_ℓ, q_j > 0 and Σ_{j=1}^J q_j = 1. Let ϵ > 0 be an arbitrarily small number such that Σ_{j=1}^J q_j Φ(Q_j) ≤ 50 − ϵ or Σ_{j=1}^J q_j Φ(Q_j) ≥ 45 + ϵ. Since ∪_{ℓ=1}^∞ ℳ_ℓ is dense in ℳ and Φ is continuous, there exists a sufficiently large i such that, for every distribution Q_j, there exists P_j ∈ ℳ_i ∩ Ω_ℓ satisfying the following:

  • |Φ(P_j) − Φ(Q_j)| ≤ ϵ;

  • if Φ(Q_j) ∈ [40, 55], then Φ(P_j) ∈ [40, 55];

  • |R(d, P_j) − R(d, Q_j)| ≤ ϵ.

Take π_i to be Σ_{j=1}^J q_j δ(P_j). Then it is easy to verify that |Σ_{j=1}^J q_j Φ(P_j) − Σ_{j=1}^J q_j Φ(Q_j)| ≤ ϵ and thus Σ_{j=1}^J q_j Φ(P_j) ∈ [45, 50]; moreover, Φ(Q_j) ∈ [40, 55] implies that Φ(P_j) ∈ [40, 55], and therefore Σ_{j=1}^J q_j 1(Φ(P_j) ∈ [40, 55]) ≥ Σ_{j=1}^J q_j 1(Φ(Q_j) ∈ [40, 55]) ≥ 95%. Thus, π_i ∈ Γ̃_{ℓ,i}. Moreover, |r(d, π) − r(d, π_i)| ≤ ϵ. Therefore, r(d, π_i) → r(d, π) as i → ∞ and Condition 3 holds.

We design the architecture of the neural network estimator as in Fig. 4. We choose two existing estimators (referred to as the OSW and SCL estimators, respectively) proposed by Orlitsky, Suresh and Wu (2016) and Shen, Chao and Lin (2003) as human knowledge inputs to the architecture. We use the ReLU activation function. There are 50 hidden nodes in the first hidden layer. We initialize the neural network that we train to output the average of these two existing estimators.

FIG 4.


Architecture of the neural network estimator of the expected number of new categories. Xk : number of categories with k observations; OSW: the estimator proposed in Orlitsky, Suresh and Wu (2016); SCL: the estimator proposed in Shen, Chao and Lin (2003). The arrows from data X1,...,Xn to the OSW and SCL estimators are omitted from this graph.

We use Algorithm 4 to construct ℳ_ℓ. There are 2000 grid points in ℳ_1, and we add 1000 grid points each time we enlarge the grid. When generating ℳ_1, we chose the starting point to be a distribution P^(0) with 146 categories and Φ(P^(0)) = 49.9. The choice of this starting point P^(0) was quite arbitrary. We first generated a sample from P0 and treated it as data from a pilot study. We then came up with a distribution P^(0) such that five random samples generated from P^(0) all appeared qualitatively similar to the pilot data. In practice, this starting point can be chosen based on prior knowledge. Our chosen grid sizes for Algorithm 4 were also quite arbitrary. For ℳ_1, the generated distributions P^(t) appear similar for all t, and thus the initial grid size of 2000 and the increment size of 1000 appeared sufficient. Smaller grid sizes would simply lead to more iterations in Algorithm 1, which effectively increases the grid size. We selected the log pseudo-prior to be a weighted sum of two log density functions: (i) a normal distribution with mean equal to the midpoint of the interval constraint on the prior mean of Φ(P) and central 95% probability interval equal to the interval with at least 95% prior probability, and (ii) a negative-binomial distribution on the total number of categories with success probability 0.995 and 2 failures until the Bernoulli trial is stopped, so that the mode and the variance are approximately 200 and 8 × 10⁴, respectively. These log-densities are given weights 30 and 10, respectively. We selected the weights based on the empirical observation that distributions with only a few categories tend to have high risks, but these distributions are relatively inconsistent with prior information and may well be given almost negligible probability weight in a computed least favorable prior, thus contributing little to computing a Γ-minimax estimator. We chose the aforementioned weights so that Algorithm 4 can explore a fairly large range of distributions and does not generate too many distributions with too few categories.

We use Algorithm 6 with learning rate η = 0.005 and batch size J = 30 to compute Γ_ℓ-minimax estimators. The number of iterations is 4,000 for Γ_1 and 200 for Γ_ℓ (ℓ > 1). The stopping criterion in Algorithm 1 is that the estimated maximal Bayes risk, computed with 2000 Monte Carlo runs, increases by no more than 2% in relative terms or 0.0001 in absolute terms. We chose the aforementioned tuning parameters based on the prior belief that at least one of the OSW and SCL estimators should perform reasonably well, but the performance of SGDmax (Algorithm 6) and Algorithm 4 might be sensitive to tuning parameters. Thus, the network we used is neither deep nor wide. We chose a moderately small learning rate and a large number of iterations for SGDmax. Our chosen learning rate and number of iterations led to a trajectory of estimated Bayes risks that approximately reached a plateau with small fluctuations, suggesting that the obtained estimator is approximately Γ_1-minimax (see Fig. 5). In practice, such trajectory plots may help tune the learning rate and the number of iterations.

FIG 5.

Estimated Bayes risks of the estimator over iterations when computing a Γ1-minimax estimator. The lines are unbiased estimates of the current Bayes risks (y-axis), each computed with 30 Monte Carlo runs, over iterations (x-axis). The two dashed horizontal lines are the risks of the OSW (upper) and SCL (lower) estimators, respectively, under P0 in the simulation. The solid horizontal line is the risk of the computed Γ-minimax estimator under P0. For clarity of visualization, the estimated Bayes risks are plotted every 50 iterations.

We also ran additional simulations to investigate the sensitivity of our methods to the selection of tuning parameters; these simulations are presented in Appendix D. The results suggest that our methods may be insensitive to these selections.

We examine the performance of the OSW estimator, the SCL estimator and our trained Γ-minimax estimator by comparing their risks under our chosen data-generating mechanism, computed with 20,000 Monte Carlo runs. We also compare their Bayes risks, computed with 20,000 Monte Carlo runs, under the prior output by Algorithm 6 for the last and finest grid used in the computation. We present the results in Table 1. In this simulation experiment, our Γ-minimax estimator substantially reduces the risk compared to the two existing estimators, and it also has the lowest Bayes risk in all cases. Therefore, incorporating fairly informative prior knowledge into the estimator may lead to a significant improvement in predicting the number of new categories. We expect similarly substantial improvements for other difficult or even ill-posed statistical problems when prior knowledge is incorporated.

Table 1.

Risks and Bayes risks of estimators. R(d, P0): risk of the estimator d under the true data-generating mechanism P0; r(d, π̂*): Bayes risk under the prior π̂* output by Algorithm 6 for the last and finest grid used in the computation.

Strength of prior   Estimator    R(d, P0)   r(d, π̂*)

strong              OSW               265        303
                    SCL               146        159
                    Γ-minimax          18         35
weak                OSW               265        328
                    SCL               146        184
                    Γ-minimax          17         61
almost none         OSW               265        293
                    SCL               146        124
                    Γ-minimax          24         81

Fig. 5 presents the unbiased estimates of the Bayes risk over iterations when computing a Γ1-minimax estimator. The Bayes risks appear to have a decreasing trend and to approach a limiting value, decreasing by a considerable amount over the iterations. The limiting value appears to be slightly higher than the risk of the computed Γ-minimax estimator under P0, which might indicate that P0 is not an extreme distribution that yields a high risk.

We also apply the above methods to analyze the dataset studied in Miller and Wiegert (1989), which is used as the true population in the simulation. Based on this sample consisting of n = 1088 observations in 188 categories, we use the various methods to predict the number of new categories that would be observed if another m = 2000 observations were to be collected. We train the Gamma-minimax estimators using exactly the same tuning parameters as in the above simulation, except that the starting point in Algorithm 4 has more categories. The predictions of all methods are presented in Table 2. The Γ-minimax estimators output predictions closer to that of the SCL estimator than to that of the OSW estimator. This similarity appears different from our observation in the simulation, but it can be explained by the fact that having more observations (n = 1088 vs n = 100; m = 2000 vs m = 200) decreases the variance of the number of newly observed categories and thus reduces the discrepancies between the predictions from these methods. Since the SCL estimator outperforms the OSW estimator in the above simulation, in which this dataset is the true population, we expect the SCL estimator to achieve reasonably good performance in this application. Moreover, given that the Γ-minimax estimators outperform the SCL estimator in the above simulation, we expect that 57 or 58 represents an improved prediction of the number of new categories compared to the SCL prediction of 51 when limited prior information is available.

Table 2.

Predicted number of new categories (rounded to the nearest integer) in a new sample with size 2000 based on the sample with size 1088 studied in Miller and Wiegert (1989). The strength of prior information in Γ-minimax estimators is shown in brackets.

Estimator Predicted # new categories

OSW 72
SCL 51
Γ-minimax (strong) 57
Γ-minimax (weak) 57
Γ-minimax (almost none) 58

The time needed to compute an approximate Γ-minimax estimator was about five to seven hours on an AWS EC2 instance (Amazon, 2019) with at least 4 vCPUs and at least 8 GiB of memory, depending on the number of times the grid was enlarged. As shown in Fig. 5, far fewer iterations are needed for SGDmax to output a good approximation of a Γ1-minimax estimator, which is itself quite close to Γ-minimax. Therefore, with suitably less conservative tuning parameters or more adaptive minimax problem solvers, the computation time might decrease drastically. Moreover, the time needed to evaluate the computed Γ-minimax estimator at any given sample is negligible.

5.2. Estimation of the entropy

We also apply our method to estimate the entropy of a multinomial distribution (Example 3). The data-generating mechanism is the same as that described in Example 2, and the estimand of interest is the Shannon entropy (Shannon, 1948), that is, Ψ(P0) = −∑_{k=1}^K pk log pk. In the simulation, we choose the same true population and the same sample size n = 100 as in Section 5.1. The true entropy Ψ(P0) is 4.57. As a reference, the entropy of the uniform distribution with the same number of categories, which is the maximum entropy among multinomial distributions with that total number of categories, is 5.24.
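As a quick illustration of the estimand (not of the JVHW estimator itself), the plug-in Shannon entropy of a probability vector and the uniform-distribution reference value can be computed as follows; log(188) ≈ 5.24 matches the reference entropy quoted above for the 188-category population.

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy -sum_k p_k log p_k (natural log), ignoring zero cells."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# Maximum entropy for K = 188 categories is attained by the uniform distribution.
K = 188
print(shannon_entropy(np.full(K, 1.0 / K)))  # log(188) ~ 5.24
```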

Jiao et al. (2015) developed a minimax rate optimal estimator of the Shannon entropy, and we run this simulation to investigate the potential gain of computing a Gamma-minimax estimator in well-posed problems with satisfactory solutions. As in Section 5.1, we consider three sets of prior information:

  1. Strongly informative: Prior mean of Ψ(P) in [4.3, 4.7], ≥ 95% probability that Ψ(P) lies in [4, 5];

  2. Weakly informative: Prior mean of Ψ(P) in [4, 5], ≥ 95% probability that Ψ(P) lies in [3.5, 5.5];

  3. Almost noninformative: Prior mean of Ψ(P) in [3.7, 5.3], ≥ 95% probability that Ψ(P) lies in [3, 6].

The architecture of our neural network estimator is almost identical to that in Section 5.1, except that the existing estimator being used is the one proposed in Jiao et al. (2015) (referred to as the JVHW estimator), and we initialize the network to return the JVHW estimator. We use Algorithm 4 to construct the grids and Algorithm 6 to compute a Γ-minimax estimator. The tuning parameters in the algorithms are identical to those used in Section 5.1 except that, in Algorithm 6, (i) the learning rate is η = 0.001, and (ii) the number of iterations is 6,000 for Γ1. We change these tuning parameters because the JVHW estimator is already minimax in terms of its convergence rate (Jiao et al., 2015), and we may need to update the estimator more cautiously in Algorithm 6 to obtain any possible improvement. The trajectories of the estimated Bayes risks (Fig. 6) all appear to approximately reach a plateau, suggesting that the obtained estimator is approximately Γ1-minimax and that our choice of a smaller learning rate and a larger number of iterations is reasonable. Because of the additional complexity of the JVHW estimator, we ran our simulations on an AWS EC2 instance (Amazon, 2019) with 4 vCPUs and 32 GiB of memory. The computation time was ten to seventeen hours, depending on the number of times the grid was enlarged. The longer computation time compared with Section 5.1 is primarily due to the larger number of iterations in SGDmax and the additional complexity of the JVHW estimator.

FIG 6.

Estimated Bayes risks of the estimator over iterations when computing a Γ1-minimax estimator. The lines are unbiased estimates of the current Bayes risks (y-axis), each computed with 30 Monte Carlo runs, over iterations (x-axis). The horizontal lines are the risks of the JVHW (dashed) and the computed Γ-minimax (solid) estimators, respectively, under P0 in the simulation. For clarity of visualization, the estimated Bayes risks are plotted every 100 iterations.

We compare the risk of the JVHW estimator and our trained Γ-minimax estimator under our chosen data-generating mechanism, computed with 20,000 Monte Carlo runs. We also compare their Bayes risks, computed with 20,000 Monte Carlo runs, under the prior output by Algorithm 6 for the last and finest grid used in the computation. The results are summarized in Table 3. In this simulation experiment, our Γ-minimax estimator reduces the risk by a fair percentage compared with the JVHW estimator and achieves a lower Bayes risk under the computed prior. According to these simulation results, incorporating informative prior knowledge into the estimator may result in some improvement in estimating entropy. Thus, for well-posed statistical problems that already have satisfactory solutions, we expect modest or no substantial improvement, but also little deterioration, from using a Gamma-minimax estimator.

Table 3.

Risks and Bayes risks of estimators. R(d, P0): risk of the estimator d under the true data-generating mechanism P0; r(d, π̂*): Bayes risk under the prior π̂* output by Algorithm 6 for the last and finest grid used in the computation.

Strength of prior   Estimator    R(d, P0)   r(d, π̂*)

strong              JVHW            0.041      0.035
                    Γ-minimax       0.036      0.021
weak                JVHW            0.041      0.028
                    Γ-minimax       0.018      0.024
almost none         JVHW            0.041      0.031
                    Γ-minimax       0.025      0.016

Fig. 6 presents the unbiased estimates of the Bayes risk over iterations when computing a Γ1-minimax estimator. With strongly informative prior information, the Bayes risks appear to fluctuate without a clear increasing or decreasing trend at the beginning and to decrease slowly after several thousand iterations. With weakly informative or almost no prior information, the Bayes risks also decrease slowly. One reason may be that the JVHW estimator is already minimax rate optimal (Jiao et al., 2015). The computed Γ-minimax estimators also appear to be fairly similar to the JVHW estimator: in the output layer of the three settings with different prior information, the coefficients for the JVHW estimator are 0.97, 0.90 and 0.89, respectively; the coefficients for the previous hidden layer are 0.17, 0.17 and 0.20, respectively; and the intercepts are 0.06, 0.30 and 0.30, respectively.

We further use the above methods to estimate the entropy based on the dataset used as the true population in the simulation. The tuning parameters of the Γ-minimax estimators are exactly the same as those in the above simulation, except that the starting point in Algorithm 4 has more categories. The estimates are presented in Table 4. All methods produce almost identical estimates. Because the sample size is more than ten times that in the simulation and the JVHW estimator is minimax rate optimal (Jiao et al., 2015), we expect the JVHW estimator to have little room for improvement, which explains why the three Γ-minimax estimators perform similarly to the JVHW estimator. In other words, the Gamma-minimax estimators appear to maintain, if not improve upon, the performance of the original JVHW estimator.

Table 4.

Estimated entropy based on the sample with size 1088 studied in Miller and Wiegert (1989). The strength of prior information in Γ-minimax estimators is shown in brackets.

Estimator Estimated entropy

JVHW 4.709
Γ-minimax (strong) 4.709
Γ-minimax (weak) 4.708
Γ-minimax (almost none) 4.703

6. Discussion

We propose adversarial meta-learning algorithms to compute a Gamma-minimax estimator with theoretical guarantees under fairly general settings. These algorithms still leave room for improvement. As we discussed in Section 3.1, the stopping criterion we employ does not necessarily indicate that the maximal Bayes risk is close to the true minimax Bayes risk; in future work, it would be interesting to derive a criterion that does indicate this near-optimality. Our algorithms also require the user to choose increasingly fine approximating grids for the model space. Although we propose a heuristic algorithm for this procedure that performed well in our experiments, we have not provided optimality guarantees for this scheme. It may also be possible to improve our proposed algorithms for solving the intermediate minimax problems in Section 3.1 by utilizing recent and ongoing advances from the machine learning literature on training generative adversarial networks.

We do not explicitly consider uncertainty quantification such as confidence intervals or credible intervals under a Gamma-minimax framework. Uncertainty quantification is important in practice since it provides more information than a point estimator and can be used for decision-making. In theory, our method may be directly applied if such a problem can be formulated into a Gamma-minimax problem. However, such a formulation remains unclear. The most challenging part is to identify a suitable risk function that correctly balances the level of uncertainty and the size of the output interval/region. Though the risk function used in Schafer and Stark (2009) appears to provide one possible starting point, it is not clear how to extend this approach to nonparametric settings.

It is possible to allow the space of estimators 𝒟 to increase as the grids increase. For example, we may specify an increasing sequence of estimator spaces {𝒟ℓ}ℓ≥1 whose limit is dense in a general space 𝒟0; then, in Line 3 of Algorithm 1, we compute the grid-based Γ-minimax estimator in 𝒟ℓ, namely, we replace 𝒟 with 𝒟ℓ. This sequence of estimators might converge to a Γ-minimax estimator in 𝒟0. One possible choice of 𝒟ℓ (ℓ > 1) in this approach is a space of statistical knowledge networks with the input estimator being the computed Γ1-minimax estimator in 𝒟1. It is of future interest to investigate the properties of such an approach.

In conclusion, we propose adversarial meta-learning algorithms to compute a Gamma-minimax estimator under general models, incorporating prior information in the form of generalized moment conditions. These algorithms can be useful when a parametric model is undesirable, when semiparametric efficiency theory does not apply, or when we wish to utilize prior information to improve estimation.

Acknowledgments

Generous support was provided by Amazon through an AWS Machine Learning Research Award and the NIH under award number DP2-LM013340. The content is solely the responsibility of the authors and does not necessarily represent the official views of Amazon or the NIH.


Appendix A: Two counterexamples of Condition 3

We provide two counterexamples to Condition 3 to illustrate that this condition fails only in extremely ill-behaved cases.

In the first counterexample, P ↦ R(d, P) is discontinuous: we set R(d, P*) to be one for a fixed P* and R(d, P) to be zero for all other P. If we choose the grids to be dense in the model space but to never contain P*, then Condition 3 does not hold, since rsup(d, Γ̃ℓ) = 1 for all sufficiently large ℓ such that P* ∈ Ωℓ, but rsup(d, Γ̃ℓ,i) = 0 for all i and ℓ. This issue can be resolved by choosing a continuous risk function.

In the second counterexample, the grids do not contain distributions that are consistent with the prior information. Suppose that Γ = {π ∈ Π : ∫ Φ(P) π(dP) = 0}, where Φ(P) := E_P[X²]. In other words, it is known that the true data-generating mechanism P0 must be a point mass at zero, and thus Γ only contains the prior that is a point mass at P0. If Φ(P) ≠ 0 for every P in the union of all grids, then, even if this union is dense in the model space, the sets Γ̃ℓ,i are empty and thus Condition 3 does not hold. This issue can be resolved by rewriting the problem so that such hard constraints are incorporated into the specification of the model space rather than into Γ.

Appendix B: Additional gradient-based algorithms

If we can evaluate R(β,P) exactly for all β and P, then the GDmax algorithm (Algorithm 5) may be used. Note that Line 3 can be formulated into a linear program, which can always be solved in polynomial time with an interior point method (e.g., Jiang et al., 2020) and often be solved in polynomial time with a simplex method (Spielman and Teng, 2004).

Algorithm 5.

Gradient descent with max-oracle (GDmax) to compute a Γ-minimax estimator

1: Initialize β(0) ∈ R^D. Set learning rate η > 0 and max-oracle accuracy ζ > 0.
2: for t = 1, 2, ... do
3:   Maximization: find π(t) ∈ Γ such that r(β(t−1), π(t)) ≥ max_{π∈Γ} r(β(t−1), π) − ζ.
4:   Gradient descent: β(t) ← β(t−1) − η ∇_β r(β, π(t))|_{β=β(t−1)}.
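For a finite grid {P1, ..., PΛ} and generalized moment constraints, the Bayes risk is linear in the prior weights, so the maximization in Line 3 is a linear program, as noted above. Below is a minimal sketch using scipy; the constraint matrices and the risk vector are placeholders that would be computed from the problem at hand, and the function name and argument layout are ours.

```python
import numpy as np
from scipy.optimize import linprog

def max_bayes_risk_prior(risks, moment_matrix, moment_lb, moment_ub):
    """Find prior weights pi over a finite grid maximizing the Bayes risk
    sum_l pi_l * R(beta, P_l), subject to pi being a probability vector and
    moment_lb <= moment_matrix @ pi <= moment_ub (generalized moment constraints).

    risks: length-Lambda vector of risks R(beta, P_l) on the grid.
    moment_matrix: (m, Lambda) matrix whose (j, l) entry is Phi_j(P_l).
    """
    n = len(risks)
    c = -np.asarray(risks, dtype=float)  # linprog minimizes, so negate to maximize
    # Two-sided moment bounds encoded as A_ub @ pi <= b_ub.
    A_ub = np.vstack([moment_matrix, -np.asarray(moment_matrix)])
    b_ub = np.concatenate([np.asarray(moment_ub, dtype=float),
                           -np.asarray(moment_lb, dtype=float)])
    # Probability weights sum to one.
    A_eq = np.ones((1, n))
    b_eq = np.array([1.0])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0.0, None)] * n, method="highs")
    return res.x, -res.fun  # maximizing prior and the maximal Bayes risk
```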

We have the following result on the validity of GDmax.

Theorem 5 (Validity of GDmax (Algorithm 5)).

Under the conditions in Theorem 2, in Algorithm 5, with η = ε²/(L1 L2²) and ζ = ε²/(24 L1), β(t) is an ε-stationary point of β ↦ rsup(β, Γ) for t = O(L1 L2² Δ/ε⁴), and is thus close to a local minimum of β ↦ rsup(β, Γ).

Therefore, we propose a variant (Algorithm 6) in which this line is replaced by Lines 3 and 4 so that ordinary linear program solvers can be directly applied. The following theorem justifies this variant.

Algorithm 6.

Convenient variant of SGDmax (Algorithm 2) to compute a Γ-minimax estimator

1: Initialize β(0) ∈ R^D. Set learning rate η > 0 and batch sizes J, J′.
2: for t = 1, 2, ... do
3:   Generate iid copies ξ1, ..., ξJ of ξ.
4:   Stochastic maximization: π(t) ← argmax_{π∈Γ} (1/J) ∑_{j=1}^J r̂(β(t−1), π, ξj).
5:   Generate iid copies ξ_{J+1}, ..., ξ_{J+J′} of ξ.
6:   Stochastic gradient descent: β(t) ← β(t−1) − (η/J′) ∑_{j=J+1}^{J+J′} ∇_β r̂(β, π(t), ξj)|_{β=β(t−1)}.
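A compact sketch of one iteration of this variant is given below, assuming a differentiable Monte Carlo risk estimate implemented in PyTorch and the linear-program max step sketched earlier (`max_bayes_risk_prior`). The helper `estimate_risks`, the representation of β as a list of parameter tensors, and the argument names are our own assumptions for illustration.

```python
import torch

def sgdmax_step(params, grid, gamma_constraints, estimate_risks,
                eta=0.005, J=30, J_prime=30):
    """One iteration: stochastic maximization over priors on the grid,
    then a stochastic gradient step on the estimator parameters."""
    # Stochastic maximization: estimate grid risks from J Monte Carlo draws
    # and solve the linear program for the maximizing prior.
    with torch.no_grad():
        risk_estimates = estimate_risks(params, grid, n_monte_carlo=J)
    pi, _ = max_bayes_risk_prior(risk_estimates.detach().cpu().numpy(),
                                 *gamma_constraints)
    # Stochastic gradient descent on the Bayes risk under the chosen prior,
    # using a fresh batch of J_prime Monte Carlo draws.
    risk_estimates = estimate_risks(params, grid, n_monte_carlo=J_prime)
    bayes_risk = torch.as_tensor(pi, dtype=risk_estimates.dtype) @ risk_estimates
    grads = torch.autograd.grad(bayes_risk, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= eta * g
    return params
```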

Theorem 6 (Validity of convenient variant of SGDmax (Algorithm 6)).

Suppose that {ξ ↦ r̂(β, π, ξ) : β ∈ R^D, π ∈ Γ} is a Ξ-Glivenko-Cantelli class (van der Vaart and Wellner, 2000). Then, for any ζ > 0, there exists a sufficiently large J such that

E[r(β(t−1), π(t))] ≥ max_{π∈Γ} r(β(t−1), π) − ζ

for all t, where the expectation is taken over π(t) with β(t−1) fixed. Therefore, with the parameters chosen as in Theorem 2, we may choose a sufficiently large J so that β(t) is an ε-stationary point of β ↦ rsup(β, Γ) in expectation for t = O(L1 (L2² + σ²) Δ/ε⁴), and is thus close to a local minimum of β ↦ rsup(β, Γ) with high probability.

We prove Theorem 6 by showing that max_{π∈Γ} r(β(t−1), π) − E[r(β(t−1), π(t))] converges to 0 as J → ∞. The proof is essentially an application of empirical process theory to the study of an M-estimator.

Appendix C: Additional simulation: mean estimation

In this appendix, we illustrate our proposed methods via simulation in a special case of Example 1, namely estimating the mean of a distribution. We assume that the model space consists of all probability distributions defined on the Borel σ-algebra on [0, 1], and we observe X = (X1, X2, ..., Xn), where X1, ..., Xn ~iid P0. Here we take n = 10. The estimand is Ψ(P0) = ∫ x P0(dx). We use the mean squared error risk introduced in Example 1. Suppose that we represent the prior information by Γ = {π ∈ Π : ∫ Ψ(P) π(dP) = 0.3}, which corresponds to the set of prior distributions in Π that satisfy an equality constraint on the prior mean of Ψ(P).

We apply our method to three spaces of estimators separately. The first space, 𝒟linear, is the set of affine transformations of the sample mean, that is, 𝒟linear = {d : d(X) = β0 + β1 ∑_{i=1}^n Xi/n, β0, β1 ∈ R}. As shown in Proposition 1 in Appendix E.5, there is an estimator d* in 𝒟linear that is Γ-minimax in the space of all estimators that are square-integrable with respect to all P in the model space, so we consider this simple space to better compare our computed estimator with that theoretical Γ-minimax estimator. When computing a Γ-minimax estimator in 𝒟linear, we initialize the estimator to be the sample mean, that is, we let β0 = 0 and β1 = 1.

The second space, 𝒟skn (statistical knowledge network), is a set of neural networks designed based on statistical knowledge that includes the sample mean as an input. We consider this space to illustrate our proposal in Section 4.2. More precisely, we use the architecture in Fig. 7, which is similar to the deep set architecture (Zaheer et al., 2017; Maron et al., 2019) and is a permutation invariant neural network. We use such an architecture to account for the fact that the sample is iid. In this architecture, the sample mean node is used as an augmenting node to an ordinary deep set network and is combined with the output of that ordinary network, represented by the fourth hidden layer, to obtain the final output. Note that 𝒟skn ⊇ 𝒟linear. When computing a Γ-minimax estimator for this class, we also initialize the network to be exactly the sample mean, which is a reasonable choice given that the sample mean is known to be a sensible estimator. In this simulation experiment, we choose the dimensionality of the nodes in each hidden layer in Fig. 7 as follows: each node in the first, second, third and fourth hidden layer represents a vector in R^10, R^5, R^10 and R, respectively. We do not use larger architectures because the sample mean is usually already a good estimator, and we expect to obtain a useful estimator as a small perturbation of this estimator. We also use the ReLU as the activation function. We did not use ELMs in this and the following simulations because we found that neural networks perform well.
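The sketch below illustrates one way such a permutation-invariant network with an explicit sample-mean input could be set up in PyTorch. The layer widths follow the description above (R^10, R^5, R^10 and R for the four hidden layers); the exact wiring of the output layer (taking the fourth hidden node and the sample mean as its two ancestors) is our reading of the figure description, and initializing the network to reproduce the sample mean would amount to setting the output-layer weight on the sample mean to 1 and all other output-layer parameters to 0.

```python
import torch
import torch.nn as nn

class MeanSKN(nn.Module):
    """Permutation-invariant 'statistical knowledge network' for estimating a mean.
    Per-observation features are summed (giving permutation invariance) and the
    sample mean enters as an explicit augmenting input to the output layer."""

    def __init__(self):
        super().__init__()
        # First two hidden layers: the same function applied to every observation.
        self.per_obs = nn.Sequential(nn.Linear(1, 10), nn.ReLU(),
                                     nn.Linear(10, 5), nn.ReLU())
        # Third hidden layer acts on the summed per-observation features.
        self.after_sum = nn.Sequential(nn.Linear(5, 10), nn.ReLU())
        # Fourth hidden layer (a scalar): output of the ordinary deep-set part.
        self.deepset_head = nn.Linear(10, 1)
        # Output layer combines the fourth hidden node and the sample mean.
        self.out = nn.Linear(1 + 1, 1)

    def forward(self, x):                         # x: (batch, n) observations
        sample_mean = x.mean(dim=1, keepdim=True)
        feats = self.per_obs(x.unsqueeze(-1)).sum(dim=1)   # sum over observations
        hidden4 = self.deepset_head(self.after_sum(feats))
        return self.out(torch.cat([hidden4, sample_mean], dim=-1))
```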

The third space, 𝒟nn, is a set of neural networks that do not utilize knowledge of the sample mean. We consider this space to illustrate our method without utilizing existing estimators. These estimators are also deep set networks with an architecture similar to that of 𝒟skn in Fig. 7; the main difference is that the explicit sample mean node and the fourth hidden layer are removed. When computing a Γ-minimax estimator in 𝒟nn, we randomly initialize the network, unlike for 𝒟linear and 𝒟skn, in order not to input statistical knowledge. Because the ReLU activation function is used, 𝒟nn ⊇ 𝒟linear, and so we do not expect optimizing over 𝒟nn to lead to a Γ-minimax estimator with worse performance than those in 𝒟linear and 𝒟skn.

FIG 7.

Architecture of the permutation invariant neural network estimator of the mean in 𝒟skn. Xi: observation i in the sample; Σ: the node that sums up all ancestor nodes. In the first two hidden layers, all input nodes are transformed by the same function. The arrows from the input nodes to the sample mean estimator are omitted from this graph. Each node in the hidden layers represents a vector.

To construct the grids for this problem, we use a simpler method than Algorithm 4. As indicated by Lemma 6 in Appendix E.5, for estimators in 𝒟linear, Bernoulli distributions tend to have high risks since all of their probability mass lies on the boundary of [0, 1]; in addition, a prior π* for which d* is Bayes is a Beta prior over Bernoulli distributions. Therefore, we randomly generate 2,000 Bernoulli distributions as grid points in the initial grid. We also include two degenerate distributions in this grid, namely the distribution that places all of its mass at 0 and the one that places all of its mass at 1. When constructing each subsequent grid from the previous one, we still add more complicated distributions to make the grids dense in the limit: we first randomly generate 500 discrete distributions supported on the existing support points; we then randomly generate 10 new support points in [0, 1] and 1,000 distributions supported on the union of the new support points and the existing support points.
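A minimal sketch of this grid construction is given below, representing each distribution by its support points and probability weights. The uniform draws for the Bernoulli success probabilities and the Dirichlet draws for the random weights are our own choices; the counts (2,000, 500, 10 and 1,000) follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def initial_grid(n_points=2000):
    """Random Bernoulli distributions on {0, 1}, plus the two point masses at 0 and 1."""
    grid = [(np.array([0.0, 1.0]), np.array([1 - p, p])) for p in rng.uniform(size=n_points)]
    grid.append((np.array([0.0]), np.array([1.0])))
    grid.append((np.array([1.0]), np.array([1.0])))
    return grid

def enlarge_grid(grid, existing_support):
    """Add 500 distributions on the existing support points, then 1000 distributions
    on the union of 10 new support points and the existing ones."""
    new_grid = list(grid)
    for _ in range(500):
        new_grid.append((existing_support, rng.dirichlet(np.ones(len(existing_support)))))
    support = np.concatenate([existing_support, rng.uniform(size=10)])
    for _ in range(1000):
        new_grid.append((support, rng.dirichlet(np.ones(len(support)))))
    return new_grid, support
```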

When computing the Γ-minimax estimators, for each grid, we compute the corresponding grid-based Γ-minimax estimator for all three estimator spaces with Algorithm 6. We set the learning rate η = 0.005, the batch size J = 50, and the number of iterations to 200 for each grid after the first. The number of iterations for Γ1 is larger because, in our experiments, we saw that a Γ1-minimax estimator is already close to a Γ-minimax estimator, and using a large number of iterations in this step can improve the initial estimator substantially. For 𝒟linear and 𝒟skn, the number of iterations for Γ1 is 2,000; the corresponding number for 𝒟nn is 6,000, to account for the lack of human knowledge input. We also use Algorithm 3 with 10,000 iterations to compute a Γ-minimax estimator in 𝒟linear for illustration. In this setup, as described in Section 3.3, we take the average of the computed Γ-minimax stochastic estimator as the final output estimator in 𝒟linear. We do not apply Algorithm 3 to 𝒟skn or 𝒟nn because it is computationally intractable for these estimator spaces.

We set the stopping criterion in Algorithm 1 as follows. When Algorithm 6 is used to compute the Γ-minimax estimators, we estimate the worst-case Bayes risk of the current estimator over both the current and the enlarged set of priors with 2,000 Monte Carlo runs, as described in Section 3.1; when Algorithm 3 is used, these worst-case Bayes risks are computed exactly because R(d, P) has a closed-form expression for all d ∈ 𝒟linear and all P. We set the tolerance ε equal to 0.0001, and we stop Algorithm 1 if the worst-case Bayes risk over the enlarged set of priors exceeds that over the current set of priors by no more than ε.

After computation, we report the Bayes risk of the computed and theoretical Γ-minimax estimators under π*, the prior such that r(d*, π*) = inf_{d∈𝒟} rsup(d, Γ). For the estimators in 𝒟linear, we further report their coefficients. We also report two coefficients of the computed estimator in 𝒟skn as follows. Since 𝒟linear ⊂ 𝒟skn and we initialize the estimator to be the sample mean for 𝒟skn, we would expect that the intercept β0 and the weight β1 on the sample mean in the output layer of the computed Γ-minimax estimator in 𝒟skn may correspond to those in 𝒟linear. Therefore, we also report these two coefficients β0 and β1 for 𝒟skn. This correspondence may not hold for 𝒟nn because the sample mean is not explicit in its parameterization and all coefficients are randomly initialized, so we do not report any coefficients for 𝒟nn.

Table 5 presents the computation results. By Theorem 7 in Appendix E.5, these computed estimators are all approximately Γ-minimax, since their Bayes risks under π* are all close to that of a theoretical Γ-minimax estimator. The coefficients β0 and β1 of the computed estimators in 𝒟linear and 𝒟skn are also close to those of the theoretically derived estimator. For the computed estimator in 𝒟skn, the weight of the other ancestor node in the output layer (i.e., the node in the 4th hidden layer in Fig. 7) is 0.000. Therefore, our computed Γ-minimax estimator in 𝒟skn is also close to the theoretically derived Γ-minimax estimator.
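As a quick numerical check, the closed form from Proposition 1 in Appendix E.5 (with the √n factors as derived there) reproduces the theoretical coefficients and Bayes risk in the first row of Table 5 for n = 10 and prior mean 0.3:

```python
import math

n, mu = 10, 0.3
beta1 = math.sqrt(n) / (1 + math.sqrt(n))            # weight on the sample mean
beta0 = mu / (1 + math.sqrt(n))                      # intercept
bayes_risk = mu * (1 - mu) / (1 + math.sqrt(n)) ** 2
print(round(beta0, 3), round(beta1, 3), round(bayes_risk, 3))  # 0.072 0.76 0.012
```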

In our experiments, Algorithm 1 converged after computing a Γ1-minimax estimator, except when using Algorithm 6 for 𝒟linear. Even in this exceptional case, the computed Γ1-minimax estimator is still approximately Γ-minimax. We believe the algorithm did not stop at that point because of Monte Carlo error in the estimates of the worst-case Bayes risks used in the stopping criterion.

Fig. 3 presents the Bayes risks (or their unbiased estimates) over iterations when computing a Γ1-minimax estimator. In all cases using Algorithm 6, the Bayes risks appear to decrease and converge. When using Algorithm 3, the upper and lower bounds both converge to the same limit. The limiting values of the Bayes risks in all cases are close to r(d*, π*) because Γ1 can approximate π* well.

Table 5.

Coefficients and Bayes risks of estimators of the mean. Unrestricted space: the space of all estimators that are square-integrable with respect to all P in the model space.

Estimator space      Method to obtain d*       β0      β1      r(d, π*)

Unrestricted space   Theoretical derivation    0.072   0.760   0.012
𝒟linear              Algorithms 1 & 6          0.072   0.763   0.012
𝒟skn                 Algorithms 1 & 6          0.071   0.767   0.012
𝒟nn                  Algorithms 1 & 6          -       -       0.012
𝒟linear              Algorithms 1 & 3          0.072   0.760   0.012

Table 6.

Table similar to Table 1 for sensitivity analysis with strongly informative prior information.

Varied tuning parameter                    R(d, P0)   r(d, π̂*)

Initial distribution in MCMC                     19         44
Grid size                                        15         34
Statistical knowledge network structure          17         38

Appendix D: Sensitivity analysis for tuning parameter selection

For the simulation in Section 5.1 with strongly informative prior information, we conduct three additional simulations to investigate the sensitivity of our proposed method to the selection of tuning parameters. In each simulation below, we vary one set of tuning parameters and rerun the algorithm to obtain an estimator. In the first simulation, we vary the starting point of Algorithm 4 used to construct the first grid; the new starting point is a distribution with 173 categories and Φ(P(0)) = 61, so this starting point is qualitatively different from the one chosen in the original simulation. In the second simulation, we vary the grid sizes: there are 500 grid points in the initial grid, and we add 500 grid points each time we enlarge the grid. In the third simulation, we choose a wider and deeper statistical knowledge network (see Fig. 8): compared to the original simulation, we add one more hidden layer and increase the number of hidden nodes in the first two hidden layers to 100. As shown in Table 6, the results in these sensitivity simulations appear similar to those in Section 5.1, within the variation due to the randomness in MCMC (Algorithm 4) and SGDmax (Algorithm 6).

Appendix E: Proofs

E.1. Proof of Theorem 1 and Corollary 1

Lemma 1.

If {Ωℓ}ℓ≥1 is an increasing sequence of subsets of the model space such that ∪ℓ≥1 Ωℓ equals the model space, then, for any d ∈ 𝒟, rsup(d, Γ̃ℓ) → rsup(d, Γ) as ℓ → ∞.

Proof of Lemma 1.

Since Γ̃ℓ ⊆ Γ̃ℓ+1 ⊆ Γ, it holds that

rsup(d, Γ̃ℓ) ≤ rsup(d, Γ̃ℓ+1) ≤ rsup(d, Γ),

and so we only need to lower bound rsup(d, Γ̃ℓ). Fix ε > 0. By Corollary 5 of Pinelis (2016), rsup(d, Γ) can be approximated arbitrarily well by r(d, ν) for priors ν ∈ Γ with finite support; that is, there exists ν ∈ Γ with finite support such that r(d, ν) ≥ rsup(d, Γ) − ε. For sufficiently large ℓ, Ωℓ contains all support points of ν, and hence rsup(d, Γ̃ℓ) ≥ r(d, ν) ≥ rsup(d, Γ) − ε. The desired result follows. □

FIG 8.

Architecture of the deeper and wider neural network estimator of the expected number of new categories.

Lemma 2.

Under Condition 2, for any Γ′ ⊆ Γ and any ε > 0, there exists δ > 0 such that rsup(d*, Γ′) − rsup(d, Γ′) ≤ ε for all d ∈ 𝒟 such that ϱ(d, d*) ≤ δ.

Proof of Lemma 2.

By Corollary 5 of Pinelis (2016), there exists ν ∈ Γ′ with finite support such that rsup(d*, Γ′) ≤ r(d*, ν) + ε/2. By Condition 2 and the fact that ν has finite support, there exists δ > 0 such that, for any d ∈ 𝒟 with ϱ(d, d*) ≤ δ, |r(d, ν) − r(d*, ν)| ≤ ε/2. Since ν ∈ Γ′, we have that rsup(d, Γ′) ≥ r(d, ν), and thus rsup(d*, Γ′) − rsup(d, Γ′) ≤ r(d*, ν) + ε/2 − r(d, ν) ≤ ε for any d ∈ 𝒟 such that ϱ(d, d*) ≤ δ. □

Lemma 3.

Under Condition 3, it holds that lim_{i→∞} rsup(d, Γ̃ℓ,i) = rsup(d, Γ̃ℓ).

Proof of Lemma 3.

Let d ∈ 𝒟 and ε > 0 be fixed. By Corollary 5 of Pinelis (2016), rsup(d, Γ̃ℓ) ≤ r(d, π) + ε/2 for some π ∈ Γ̃ℓ with finite support. Under Condition 2, there exists a sequence πi ∈ Γ̃ℓ,i such that, for all sufficiently large i, r(d, πi) ≥ r(d, π) − ε/2. For such i, rsup(d, Γ̃ℓ) ≤ r(d, πi) + ε. Since rsup(d, Γ̃ℓ) ≥ rsup(d, Γ̃ℓ,i) ≥ r(d, πi), we have that r(d, πi) ≤ rsup(d, Γ̃ℓ,i) ≤ rsup(d, Γ̃ℓ) ≤ r(d, πi) + ε for all sufficiently large i, and thus we have proved Lemma 3. □

Proof of Theorem 1.

Let ε > 0. There exists d′ ∈ 𝒟 such that

rsup(d′, Γ) ≤ inf_{d∈𝒟} rsup(d, Γ) + ε.

Moreover, there exists π′ ∈ Γℓ such that

rsup(d′, Γℓ) ≤ r(d′, π′) + ε.

Using the fact that dℓ* is Γℓ-minimax and the definition of rsup, it holds that

rsup(dℓ*, Γℓ) ≤ rsup(d′, Γℓ) ≤ r(d′, π′) + ε ≤ rsup(d′, Γ) + ε ≤ inf_{d∈𝒟} rsup(d, Γ) + 2ε.

Since this inequality holds for any ε > 0, we have that

rsup(dℓ*, Γℓ) ≤ inf_{d∈𝒟} rsup(d, Γ).

An almost identical argument shows that the sequence {rsup(dℓ*, Γℓ)}ℓ≥1 is non-decreasing. Therefore, this sequence converges to some limit ℒ satisfying

ℒ ≤ inf_{d∈𝒟} rsup(d, Γ) ≤ rsup(d*, Γ).

We next prove that rsup(d*, Γ) ≤ ℒ. Let ε > 0. Without loss of generality, we may assume that Ωℓ ⊆ Ωℓ+1 for all ℓ = 1, 2, ... in Condition 3. (Otherwise, we may instead consider the sequence {Ω̃ℓ}ℓ≥1 where Ω̃ℓ := ∪_{ℓ′≤ℓ} Ωℓ′; note that Condition 3 also holds for {Ω̃ℓ}ℓ≥1.) By Lemma 1, there exists ℓ0 such that rsup(d*, Γ̃ℓ0) ≥ rsup(d*, Γ) − ε/3. By Condition 3, there exists i1 such that rsup(d*, Γ̃ℓ0,i1) ≥ rsup(d*, Γ̃ℓ0) − ε/3. Without loss of generality, suppose that dℓ* → d* (otherwise, take a subsequence converging to this limit point). This implies that there exists i2 > i1 such that ϱ(di2*, d*) is sufficiently small that, by Lemma 2, rsup(di2*, Γ̃ℓ0,i1) ≥ rsup(d*, Γ̃ℓ0,i1) − ε/3. Moreover, since Γ̃ℓ0,i1 ⊆ Γ̃i1 ⊆ Γ̃i2, it holds that rsup(di2*, Γ̃i2) ≥ rsup(di2*, Γ̃ℓ0,i1). Therefore, rsup(di2*, Γi2) ≥ rsup(d*, Γ) − ε. Since the sequence {rsup(dℓ*, Γℓ)}ℓ≥1 is nondecreasing, it holds that rsup(dℓ*, Γℓ) ≥ rsup(d*, Γ) − ε for all ℓ ≥ i2. Since ε is arbitrary, we have that liminf_ℓ rsup(dℓ*, Γℓ) ≥ rsup(d*, Γ), and hence ℒ ≥ rsup(d*, Γ).

Combining the results from the preceding two paragraphs, ℒ = inf_{d∈𝒟} rsup(d, Γ) = rsup(d*, Γ). Consequently, d* is Γ-minimax. Moreover, as {rsup(dℓ*, Γℓ)}ℓ≥1 increases to ℒ, this sequence also increases to rsup(d*, Γ). This concludes the proof. □

Proof of Corollary 1.

We first establish the strict convexity of d ↦ r(d, π) for any π ∈ Γ. We then establish the strict convexity of d ↦ rsup(d, Γ). Finally, we establish that there is a unique minimizer of d ↦ rsup(d, Γ) and show that the desired result follows from Theorem 1.

Let d1, d2 ∈ 𝒟 and c ∈ (0, 1) be arbitrary. Then, by the convexity of 𝒟 and the strict convexity of d ↦ R(d, P) for each P,

r(c d1 + (1 − c) d2, π) = ∫ R(c d1 + (1 − c) d2, P) π(dP) < ∫ [c R(d1, P) + (1 − c) R(d2, P)] π(dP) = c r(d1, π) + (1 − c) r(d2, π).

Therefore, d ↦ r(d, π) is strictly convex for any π ∈ Γ.

Let d1, d2 ∈ 𝒟 be distinct and c ∈ (0, 1) be arbitrary. Since rsup(d, Γ) is attainable for any d ∈ 𝒟, there exists π̃ ∈ Γ such that

rsup(c d1 + (1 − c) d2, Γ) = r(c d1 + (1 − c) d2, π̃) < c r(d1, π̃) + (1 − c) r(d2, π̃) ≤ c rsup(d1, Γ) + (1 − c) rsup(d2, Γ).

Thus, d ↦ rsup(d, Γ) is strictly convex.

As d ↦ rsup(d, Γ) is strictly convex and 𝒟 is convex, this function achieves exactly one minimum on 𝒟. By Theorem 1, any limit point d* of {dℓ*}ℓ≥1 is a minimizer of d ↦ rsup(d, Γ), and so any limit point of this sequence is the unique Γ-minimax estimator. □

E.2. Proof of Theorems 2 & 5

We prove Theorems 2 and 5 by checking that Assumptions 3.1 and 3.6 in Lin, Jin and Jordan (2020) are satisfied and then using Theorems E.3 and E.4 in Lin, Jin and Jordan (2020), respectively. Since Assumption 3.1 is satisfied by our construction of R̂, we focus on Assumption 3.6 for the rest of this section.

Let the current grid be {P1, P2, ..., PΛ}. For any π ∈ Γℓ, let πλ denote the probability weight that π places on Pλ (λ = 1, ..., Λ). For the rest of this section, we also use π to denote the vector (π1, ..., πΛ). We also use ≲ to denote "less than or equal to, up to a universal positive constant that may depend on ℓ". Then, straightforward calculations imply that ∇β r(β, π) = ∑_{λ=1}^Λ πλ ∇β R(β, Pλ) and ∇π r(β, π) = (R(β, P1), ..., R(β, PΛ)).

For each ℓ = 1, 2, ..., for any β1, β2 and π1, π2 ∈ Γℓ, by Conditions 4 and 5,

‖∇β r(β, π)|β=β1,π=π1 − ∇β r(β, π)|β=β2,π=π2‖
  = ‖∑_{λ=1}^Λ [π1λ ∇β R(β, Pλ)|β=β1 − π2λ ∇β R(β, Pλ)|β=β2]‖
  ≤ ∑_{λ=1}^Λ π1λ ‖∇β R(β, Pλ)|β=β1 − ∇β R(β, Pλ)|β=β2‖ + ∑_{λ=1}^Λ |π1λ − π2λ| ‖∇β R(β, Pλ)|β=β2‖
  ≲ ‖β1 − β2‖ + ‖π1 − π2‖ ≲ ‖(β1, π1) − (β2, π2)‖,

and similarly for ∇π r(β, π),

‖∇π r(β, π)|β=β1,π=π1 − ∇π r(β, π)|β=β2,π=π2‖
  = ‖(R(β1, P1) − R(β2, P1), R(β1, P2) − R(β2, P2), ..., R(β1, PΛ) − R(β2, PΛ))‖
  ≲ ‖β1 − β2‖ ≤ ‖(β1, π1) − (β2, π2)‖.

This implies that, for each ℓ, the gradient of (β, π) ↦ r(β, π), β ∈ R^D, π ∈ Γℓ, is Lipschitz continuous.

For each ℓ = 1, 2, ..., for any β1, β2 and π ∈ Γℓ, Condition 4 implies that

|r(β1, π) − r(β2, π)| = |∑_{λ=1}^Λ πλ [R(β1, Pλ) − R(β2, Pλ)]| ≤ ∑_{λ=1}^Λ πλ |R(β1, Pλ) − R(β2, Pλ)| ≲ ‖β1 − β2‖.

Therefore, β ↦ r(β, π) is Lipschitz continuous with a universal Lipschitz constant independent of π ∈ Γℓ.

Finally, it is straightforward to check that (i) π ↦ r(β, π) is concave for any β, and (ii) Γℓ is parameterized by a convex subset of a simplex in a Euclidean space, which is a convex and bounded set. These results show that Assumption 3.6 in Lin, Jin and Jordan (2020) is satisfied for Algorithms 2 and 5.

E.3. Proof of Theorem 6

Proof of Theorem 6.

Let π(t),0 denote a maximizer of π ↦ r(β(t−1), π). It holds that

0 ≤ r(β(t−1), π(t),0) − r(β(t−1), π(t))
  ≤ [ (1/J) ∑_{j=1}^J r̂(β(t−1), π(t), ξj) − (1/J) ∑_{j=1}^J r̂(β(t−1), π(t),0, ξj) ] + r(β(t−1), π(t),0) − r(β(t−1), π(t))
  = (1/J) ∑_{j=1}^J { r̂(β(t−1), π(t), ξj) − r̂(β(t−1), π(t),0, ξj) } − E[ r̂(β(t−1), π(t), ξ) − r̂(β(t−1), π(t),0, ξ) ]
  ≤ sup_{β∈R^D, π1,π2∈Γ} | (1/J) ∑_{j=1}^J { r̂(β, π1, ξj) − r̂(β, π2, ξj) } − E[ r̂(β, π1, ξ) − r̂(β, π2, ξ) ] |.

Note that the right-hand side does not depend on t. Therefore,

0 ≤ sup_t { r(β(t−1), π(t),0) − E[r(β(t−1), π(t))] } ≤ E* sup_{β∈R^D, π1,π2∈Γ} | (1/J) ∑_{j=1}^J { r̂(β, π1, ξj) − r̂(β, π2, ξj) − E[ r̂(β, π1, ξ) − r̂(β, π2, ξ) ] } |,

where E* stands for outer expectation. We may apply Corollary 9.27 in Kosorok (2008) to ℱ := {ξ ↦ r̂(β, π, ξ) : β ∈ R^D, π ∈ Γ} and show that ℱ′ := {f1 − f2 : f1, f2 ∈ ℱ} = {ξ ↦ r̂(β, π1, ξ) − r̂(β, π2, ξ) : β ∈ R^D, π1, π2 ∈ Γ} is a Ξ-Glivenko-Cantelli class. Therefore,

sup_{β∈R^D, π1,π2∈Γ} | (1/J) ∑_{j=1}^J { r̂(β, π1, ξj) − r̂(β, π2, ξj) − E[ r̂(β, π1, ξ) − r̂(β, π2, ξ) ] } | ≤ sup_{f∈ℱ′} | (1/J) ∑_{j=1}^J f(ξj) − E[f(ξ)] | →* a.s. 0

as J → ∞. Here, X* stands for the minimal measurable majorant with respect to Ξ of a (possibly non-measurable) mapping X (van der Vaart and Wellner, 2000).

By Problem 1 of Section 2.4 in van der Vaart and Wellner (2000), there exists a random variable F such that Fsupff(ξ)EfξΞ-almost surely and E[F]<. Then,

supf|1Jj=1JfξjEfξj|F

Ξ-almost surely. By dominated convergence theorem,

E* sup_{β∈R^D, π1,π2∈Γ} | (1/J) ∑_{j=1}^J { r̂(β, π1, ξj) − r̂(β, π2, ξj) − E[ r̂(β, π1, ξj) − r̂(β, π2, ξj) ] } | → 0

as J → ∞, and so does sup_t { r(β(t−1), π(t),0) − E[r(β(t−1), π(t))] }. Thus, for any ζ > 0, there exists a sufficiently large J such that E[r(β(t−1), π(t))] ≥ r(β(t−1), π(t),0) − ζ for all t. □

E.4. Proof of Theorem 3

Our proof of Theorem 3 builds on that of Robinson (1951). Major modifications are needed to allow for more general definitions that can accommodate potentially infinite spaces of pure strategies, as well as a more careful control of a bound on r(d(ϖ(t−1)), π(t)) − r(d(t), π(t−1)) towards the end of the proof.

In this appendix, we slightly abuse notation and use 𝒟 to denote the compact set containing all d(t) (t = 1, 2, ...). We first introduce the notion of cumulative Bayes risk functions. Under Algorithm 3, we let U0: 𝒟 → R and V0: Γ → R be any two continuous functions such that

min_{d∈𝒟} U0(d) = max_{π∈Γ} V0(π)     (3)

and recursively define

U_{t+1}(d) := U_t(d) + r(d, π(t)),    V_{t+1}(π) := V_t(π) + r(d(t), π)     (4)

for d ∈ 𝒟 and π ∈ Γ. Here, we let π(t) ∈ argmax_{π∈Γ} V_{t−1}(π) and d(t) ∈ argmin_{d∈𝒟} U_{t−1}(d). Note that the choices of π(t) and d(t) in Algorithm 3 correspond to setting U0 ≡ 0 and V0 ≡ 0, in which case U_t(d) = t·r(d, π̄(t)) and V_t(π) = t·r(d(ϖ(t)), π), where π̄(t) denotes the average of π(1), ..., π(t) and ϖ(t) denotes the empirical distribution of d(1), ..., d(t). In general,

U_t(d) = U0(d) + t·r(d, π̄(t)),    V_t(π) = V0(π) + t·r(d(ϖ(t)), π)     (5)

for some π̄(t) ∈ Γ and d(ϖ(t)) ∈ 𝒟. Later in this section, we will also make use of U_t and V_t with other initializations U0 and V0.

To make notation concise, we define min_{𝒟′} U_t := min_{d∈𝒟′} U_t(d) for any 𝒟′ ⊆ 𝒟, and define max_{𝒟′} U_t, min_{Π′} V_t and max_{Π′} V_t (Π′ ⊆ Γ) similarly. We also drop the subscript denoting the set when the set is the whole space we consider, that is, 𝒟 or Γ. Note that, for any t1, t2 = 1, 2, ..., under the setting of Algorithm 3 and (2), it holds that

min U_{t1}/t1 = min_{d∈𝒟} r(d, π̄(t1)) ≤ max_{π∈Γ} min_{d∈𝒟} r(d, π) = r(d(ϖ*), π*) = min_{d∈𝒟} max_{π∈Γ} r(d, π) ≤ max_{π∈Γ} r(d(ϖ(t2)), π) = max V_{t2}/t2.

Therefore, to prove the first result in Theorem 3, it suffices to show that limsup_{t→∞} (max V_t − min U_t)/t ≤ 0.

We next introduce additional definitions related to the iterations. We say that π ∈ Γ is eligible in the interval [t1, t2] if there exists t ∈ [t1, t2] such that V_t(π) = max V_t; we say that d ∈ 𝒟 is eligible in the interval [t1, t2] if there exists t ∈ [t1, t2] such that U_t(d) = min U_t. We also define eligibility for sets. We say that Π′ ⊆ Γ is eligible in the interval [t1, t2] if there exists π ∈ Π′ that is eligible in that interval; we say that 𝒟′ ⊆ 𝒟 is eligible in the interval [t1, t2] if there exists d ∈ 𝒟′ that is eligible in that interval. In addition, for any 𝒟′ ⊆ 𝒟, we define the maximum variation MV_t(𝒟′) := sup_{d∈𝒟′} U_t(d) − inf_{d∈𝒟′} U_t(d), and define MV_t(Π′) similarly for any Π′ ⊆ Γ. By Condition 2, there exists B ∈ (0, ∞) such that the risk R takes values in [−B, B]. Note that, by Condition 1 and Lemma 2, given an arbitrary desired approximation accuracy ε > 0, 𝒟 can be covered by finitely many compact subsets with the maximum variation of each subset bounded by εt for all t; by Condition 2, since Γ is parameterized by a compact subset of a simplex in a Euclidean space, Γ can also be covered by finitely many compact subsets with the maximum variation of each subset bounded by εt for all t. These covers can be viewed as discrete finite approximations to 𝒟 and Γ, respectively.

All of the above definitions are associated with the space of estimators 𝒟 and the set of priors Γ. We call (U_t, V_t)_t a pair of cumulative Bayes risk functions constructed from the pair (𝒟, Γ) of the space of estimators and the set of priors, and we will consider pairs of cumulative Bayes risk functions constructed from other pairs (𝒟′, Π′) of spaces of estimators and sets of priors in the subsequent proof. The above quantities are defined similarly in such cases.

The following lemma gives an upper bound on the maximum variation of Us+t and Vs+t over the corresponding entire space from which they are constructed after t iterations from s when essentially all parts of these spaces are eligible in [s,s+t].

Lemma 4.

Suppose that (U_t, V_t)_t is a pair of cumulative Bayes risk functions constructed from (𝒟′, Π′). Suppose that 𝒟′ = ∪_{i=1}^I 𝒟i and Π′ = ∪_{j=1}^J Πj, where

sup_{i,t} MV_t(𝒟i)/t ≤ A,    sup_{j,t} MV_t(Πj)/t ≤ A

for some A < ∞. If all 𝒟i and Πj are eligible in [s, s+t], then max_{𝒟′} U_{s+t} − min_{𝒟′} U_{s+t} ≤ (2B + A)t and max_{Π′} V_{s+t} − min_{Π′} V_{s+t} ≤ (2B + A)t.

Proof of Lemma 4.

Without loss of generality, assume that d˜argmaxd𝒟Us+t𝒟1. Since 𝒟1 is eligible in [s,t], there exists t˜[s,s+t] such that argmind𝒟Ut˜𝒟1. By the recursive definition of the sequence Utt in (4), the bound on the risk, and the assumption that supi,tMVt𝒟i/tA, we have that max𝒟Us+t=Us+t(d˜)Ut˜(d˜)+B(s+tt˜)min𝒟Ut˜+At+B(s+tt˜)min𝒟Ut˜+(A+B)t. Letting d˜argmind𝒟Us+t, by the bound on the risk, we can derive that min𝒟Us+t=Us+t(d˜)Ut˜(d˜)B(s+tt˜)min𝒟Ut˜Bt. Combine these two inequalities and we have that max𝒟Us+tmin𝒟Us+t(2B+A)t. An identical argument applied to the sequence Vtt shows that maxΠVs+tminΠVs+t(2B+A)t. □

The next lemma builds on the previous lemma and provides an upper bound on maxVs+tminUs+t under the same conditions.

Lemma 5.

Under the same setup and conditions as in Lemma 4, max_{Π′} V_{s+t} − min_{𝒟′} U_{s+t} ≤ (4B + 2A)t.

Proof of Lemma 5.

Summing the two inequalities in Lemma 4 and rearranging terms, we have that max_{Π′} V_{s+t} − min_{𝒟′} U_{s+t} ≤ (4B + 2A)t + min_{Π′} V_{s+t} − max_{𝒟′} U_{s+t}. It therefore suffices to show that min_{Π′} V_{s+t} ≤ max_{𝒟′} U_{s+t}.

Let τ := s + t. There exist π̄ ∈ Π′ and a stochastic strategy d̄ ∈ 𝒟 such that U_τ(d) = U0(d) + τ·r(d, π̄) for all d ∈ 𝒟′ and V_τ(π) = V0(π) + τ·r(d̄, π) for all π ∈ Π′. Therefore, for this choice of π̄ and d̄, using (3), min_{Π′} V_τ ≤ V_τ(π̄) = V0(π̄) + τ·r(d̄, π̄) ≤ max_{Π′} V0 + τ·r(d̄, π̄) = min_{𝒟′} U0 + τ·r(d̄, π̄) ≤ U0(d̄) + τ·r(d̄, π̄) = U_τ(d̄) ≤ max_{𝒟′} U_τ. □

Proof of Theorem 3.

It suffices to show that limsuptmaxVtminUt/t0 by letting U00 and V00, which corresponds to Algorithm 3. Let ϵ>0. Note that r is Lipschitz continuous by Lemma 2 and the fact that r(d,π) is an average of bounded risks with weights π. Furthermore, 𝒟 and Γ are both compact. In addition, U0 and V0 are both continuous. Therefore, there exist covers 𝒟=i=1I𝒟i and Γ=j=1JΠj such that (i) 𝒟i and Πj are all compact, and (ii) supi,tMVt𝒟i/tϵ, supj,tMVtΠj/tϵ. (Note that I and J may depend on ϵ.) For index sets {1,2,,I} and 𝒥{1,2,,J}, define 𝒟:=i𝒟i and Π𝒥:=j𝒥Πj. We show that maxVtminUtCϵt for an absolute constant C and all sufficiently large t via induction on the sizes of and 𝒥.

Let Ut,Vtt be a pair of cumulative Bayes risk functions constructed from 𝒟,Π𝒥 where ||=|𝒥|=1. By (5) and the fact that MVt𝒟ϵt and MVtΠ𝒥ϵt, we have that

min𝒟Ut=mind𝒟U0(d)+trd,π(t)Ed~ϖ(t)U0(d)+trdϖ(t),π(t)ϵtmind𝒟U0(d)+trdϖ(t),π(t)ϵt=maxπΠ𝒥V0(π)+trdϖ(t),π(t)ϵtV0π(t)+trdϖ(t),π(t)ϵtmaxπΠ𝒥V0(π)+trdϖ(t),π2ϵt=maxΠ𝒥Vt2ϵt.

Therefore, maxΠ𝒥Vtmin𝒟Ut2ϵt.

Let ϵ>0 be arbitrary. Suppose that there exists t0 such that, for any and 𝒥𝒥 such that or 𝒥𝒥, for any pair of cumulative Bayes risk functions Ut,Vtt constructed from (𝒟,Π𝒥), it holds that maxΠ𝒥Vtmin𝒟Utϵt for all tt0. We next obtain a slightly greater bound on maxΠ𝒥Vtmin𝒟Ut for all sufficiently large t.

We first prove that if, for a given pair of cumulative Bayes risk functions Ut,Vtt constructed from (𝒟,Π𝒥), there exists i or j𝒥 such that 𝒟i or Πj is not eligible in an interval s,s+t0, then

maxΠ𝒥Vs+t0min𝒟Us+t0maxΠ𝒥Vsmin𝒟Us+ϵt0. (6)

Suppose that 𝒟i is not eligible in s,s+t0, then define Ut:=Us+t and Vt:=Vs+tmaxΠ𝒥Vs+min𝒟Us for all t0. It is straightforward to check that Ut,Vtt=0t0 satisfies the recursive definition of a pair of cumulative Bayes risk functions constructed from (𝒟i,Π𝒥). By the induction hypothesis, maxΠ𝒥Vt0min𝒟iUt0ϵt0. Therefore, maxΠ𝒥Vs+t0min𝒟Us+t0=maxΠ𝒥Vt0min𝒟iUt0+maxΠ𝒥Vsmin𝒟UsmaxΠ𝒥Vsmin𝒟Us+ϵt0. Similar argument can be applied if Πj is not eligible in s,s+t0.

Now we obtain a bound on maxΠ𝒥Vtmin𝒟Ut. Let t>t0, 𝒬:=t/t01 and :=t/t0𝒬[0,1). There are two cases.

Case 1:

There exists s0𝒬 such that 𝒟i and Πj are eligible in s01+)t0,s0+t0 for all i and j𝒥. Take s0 to be the largest such integer. Then, repeatedly apply (6) to intervals s0+t0,s0+1+t0,s0+1+)t0,s0+2+t0,,(𝒬1+)t0,(𝒬+)t0=tt0,t and we derive that

maxΠ𝒥Vtmin𝒟UtmaxΠ𝒥Vs0+t0min𝒟Us0+t0+ϵ𝒬s0t0.

By Lemma 5, maxΠ𝒥Vs0+t0min𝒟Us0+t0(4B+ϵ)t0. Therefore,

maxΠ𝒥Vtmin𝒟Ut(4B+ϵ)t0+ϵ𝒬s0t0(4B+ϵ)t0+ϵt.
Case 2:

There is no integer s0 satisfying the condition in Case 1. Then, repeatedly apply (6) to intervals t0,(1+)t0,(1+)t0,(2+)t0,...,[(𝒬1+)t0,(𝒬+)t0=tt0,t, we derive that

maxΠ𝒥Vtmin𝒟UtmaxΠ𝒥Vt0min𝒟Ut0+ϵ𝒬t0.

By the bound on the risk, maxΠ𝒥Vt0Bt0 and min𝒟Ut0Bt0. Hence,

maxΠ𝒥Vtmin𝒟Ut2Bt0+ϵ𝒬t0(4B+ϵ)t0+ϵt.

Thus, in both cases, it holds that maxΠ𝒥Vtmin𝒟Ut(4B+ϵ)t0+ϵt for t>t0. Let C>0 be any constant (which may depend on ϵ, the approximation error of the covers, that is, the bound on MVt/t). The following holds for any sufficiently large t,

maxΠ𝒥Vtmin𝒟Ut4B+ϵt0+ϵt1+Cϵt. (7)

In other words, we show that after increasing the size of either index set by 1, for all sufficiently large t, we obtain a bound on maxΠ𝒥Vtmin𝒟Ut that grows by a multiplicative factor of (1+C) relative to the original bound.

It takes finitely many, say N, steps to induct from the initial case where the sizes of both index sets are one to the case of interest with index sets {1,,I} and {1,,J}. (Note that N may also depend on ϵ through its dependence on I and J.) Take C=1/N in (7) and we derive that, for all sufficiently large t,

maxVtminUt=maxΠ{1,,J}Vtmin𝒟{1,,I}Ut(1+1/N)N2ϵt2eϵt

where e is the base of natural logarithm. Since ϵ is arbitrary, we show that limsuptmaxVtminUt/t0. □

E.5. Derivation of a Γ-minimax estimator of the mean in Appendix C

In this section, we show that, for the problem of estimating the mean in Appendix C, one Γ-minimax estimator lies in 𝒟linear. This is formally presented below.

Proposition 1.

Let the model space consist of all probability distributions defined on the Borel σ-algebra on [0, 1]. Let X1, ..., Xn ~iid P0 and X = (X1, X2, ..., Xn) be the observed data. Let Ψ: P ↦ ∫ x P(dx) denote the mean parameter and Γ = {π ∈ Π : ∫ Ψ(P) π(dP) = μ} be the set of priors that represent the prior information. Let 𝒟 denote the space of estimators that are square-integrable with respect to all P in the model space. Consider the risk in Example 1, R: (d, P) ↦ E_P[(d(X) − Ψ(P))²]. Define X̄ = ∑_{i=1}^n Xi/n and d0: X ↦ (μ + √n X̄)/(1 + √n). Then d0 ∈ 𝒟linear is Γ-minimax over 𝒟.

We first present a theorem on a criterion of Γ-minimaxity.

Theorem 7.

Suppose that d0 ∈ 𝒟 is a Bayes estimator in 𝒟 for π0 ∈ Γ and r(d0, π0) = rsup(d0, Γ). Then d0 is a Γ-minimax estimator in 𝒟.

Proof of Theorem 7.

Clearly rsup(d0, Γ) ≥ inf_{d∈𝒟} rsup(d, Γ). Fix d ∈ 𝒟. Then rsup(d, Γ) ≥ r(d, π0) ≥ r(d0, π0) = rsup(d0, Γ), where the second inequality holds because d0 is Bayes for π0. Since d is arbitrary, this shows that inf_{d∈𝒟} rsup(d, Γ) ≥ rsup(d0, Γ). Thus, rsup(d0, Γ) = inf_{d∈𝒟} rsup(d, Γ) and d0 is Γ-minimax. □

We now present a lemma that is used to prove Proposition 1.

Lemma 6.

Let a < b and suppose that the model space consists of all probability distributions defined on the Borel σ-algebra on [a, b] ⊂ R with mean μ ∈ [a, b]. Let X denote a generic random variable generated from some P in this model space. Then max_P Var_P(X) = Var_{P*}(X) = (b − μ)(μ − a), where P* is defined by P*(X = a) = (b − μ)/(b − a) and P*(X = b) = (μ − a)/(b − a).

Proof of Lemma 6.

Without loss of generality, we may assume that a = −1 and b = 1. Note that for any P, it holds that Var_P(X) = E_P[X²] − E_P[X]² = E_P[X²] − μ² ≤ 1 − μ², where the inequality becomes an equality if P(X ∈ {−1, 1}) = 1. Therefore, the maximum variance is achieved at the distribution with the specified mean μ and support contained in {a, b}, that is, at the distribution P* defined in the lemma statement. Straightforward calculations show that Var_{P*}(X) = (b − μ)(μ − a). □

Proof of Proposition 1.

Let ℬ := {Bernoulli(θ) : θ ∈ (0, 1)} and let π0 be a prior distribution over ℬ such that the prior distribution on the success probability θ is Beta(μ√n, (1 − μ)√n). By Theorem 1.1 in Chapter 4 of Lehmann and Casella (1998), a Bayes estimator for π0 minimizes the risk under the posterior distribution; for our choice of risk, the minimizer over 𝒟 is the posterior mean, which equals d0. That is, d0 is a Bayes estimator in 𝒟 for π0.
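To spell out the posterior-mean computation (a standard Beta-Bernoulli conjugacy step, included here only for completeness):

```latex
% Prior theta ~ Beta(mu*sqrt(n), (1-mu)*sqrt(n)); X_1,...,X_n | theta iid Bernoulli(theta).
% The posterior is Beta(mu*sqrt(n) + n*Xbar, (1-mu)*sqrt(n) + n - n*Xbar), so
\[
  \mathbb{E}[\theta \mid X]
  = \frac{\mu\sqrt{n} + n\bar X}{\mu\sqrt{n} + (1-\mu)\sqrt{n} + n}
  = \frac{\sqrt{n}\,(\mu + \sqrt{n}\,\bar X)}{\sqrt{n}\,(1 + \sqrt{n})}
  = \frac{\mu + \sqrt{n}\,\bar X}{1 + \sqrt{n}}
  = d_0(X).
\]
```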

We next show that r(d0, π0) = sup_{π∈Γ} r(d0, π). Let π ∈ Γ be arbitrary. Since E_P[X̄] = Ψ(P) and Var_P(X̄) = Var_P(X1)/n, we can derive that

r(d0, π) = ∫ E_P[((μ + √n X̄)/(1 + √n) − Ψ(P))²] π(dP)
         = ∫ E_P[({√n (X̄ − Ψ(P)) + μ − Ψ(P)}/(1 + √n))²] π(dP)
         = ∫ [Var_P(X1) + (μ − Ψ(P))²]/(1 + √n)² π(dP).
Table 7.

Summary of frequently used symbols

Symbol              Meaning

P0                  True data-generating mechanism
                    Space of data-generating mechanisms containing P0
X* and X = 𝒞(X*)    Full generated data and coarsened data
𝒟                   Space of candidate estimators or decision functions (e.g., neural networks)
R                   Risk function
r                   Bayes risk function r: (d, π) ↦ ∫ R(d, P) π(dP)
Γ (⊆ Π)             Set of prior distributions consistent with prior knowledge
Ψ                   Functional defining the estimand Ψ(P0) in Examples 1-3
                    An increasing sequence of finite subsets (grids) of the model space
Γℓ                  Set of priors in Γ with support in the ℓ-th grid
rsup                Worst-case Bayes risk function rsup: (d, Γ′) ↦ sup_{π∈Γ′} r(d, π)
dℓ*                 Γℓ-minimax estimator in 𝒟
d*                  A limit point of the sequence {dℓ*}ℓ≥1, which is Γ-minimax in 𝒟 by Theorem 1
β ∈ R^D             Coefficient vector of a finite-dimensional estimator (e.g., neural network)
ξ ~ Ξ               Exogenous randomness
R̂(β, P, ξ)          Unbiased approximation of R(β, P)
d(ϖ)                Stochastic estimator following distribution ϖ over 𝒟
                    Space of stochastic estimators d(ϖ)
d(ϖ*)               Γ-minimax estimator among stochastic estimators

Applying Lemma 6 to Var_P(X1), the display continues as

≤ ∫ [Ψ(P)(1 − Ψ(P)) + (μ − Ψ(P))²]/(1 + √n)² π(dP) = (1/(1 + √n)²) ∫ [μ² + (1 − 2μ)Ψ(P)] π(dP) = μ(1 − μ)/(1 + √n)².

This upper bound is attained by any π ∈ Γ with support contained in ℬ, for example π0. Therefore, rsup(d0, Γ) = r(d0, π0). By Theorem 7, d0 is Γ-minimax over 𝒟. □

References

  1. Amazon (2019). Amazon EC2 Instance Types - Amazon Web Services.
  2. Bartlett PL (1997). For valid generalization, the size of the weights is more important than the size of the network. Advances in Neural Information Processing Systems 134–140.
  3. Bartlett PL, Foster DJ and Telgarsky M (2017). Spectrally-normalized margin bounds for neural networks. Advances in Neural Information Processing Systems 2017-December 6241–6250.
  4. Berger JO (1985). Statistical Decision Theory and Bayesian Analysis. Springer Series in Statistics. Springer New York, New York, NY.
  5. Bickel PJ, Klaassen CAJ, Ritov Y and Wellner JA (1993). Efficient and Adaptive Estimation for Semiparametric Models 4. Johns Hopkins University Press, Baltimore.
  6. Birmingham J, Rotnitzky A and Fitzmaurice GM (2003). Pattern-mixture and selection models for analysing longitudinal data with monotone missing patterns. Journal of the Royal Statistical Society, Series B: Statistical Methodology 65 275–297.
  7. Brown GW (1951). Iterative solution of games by fictitious play. Activity Analysis of Production and Allocation 13 374–376.
  8. Bryan B, McMahan HB, Schafer CM and Schneider J (2007). Efficiently computing minimax expected-size confidence regions. ACM International Conference Proceeding Series 227 97–104.
  9. Bunge J, Willis A and Walsh F (2014). Estimating the number of species in microbial diversity studies. Annual Review of Statistics and Its Application 1 427–445.
  10. Chen X (2007). Chapter 76: Large sample sieve estimation of semi-nonparametric models. Handbook of Econometrics 6 5549–5632.
  11. Chen L, Eichenauer-Herrmann J and Lehn J (1988). Gamma-minimax estimators for the parameters of a multinomial distribution. Applicationes Mathematicae 20 561–564.
  12. Chen L, Eichenauer-Herrmann J, Hofmann H and Kindler J (1991). Gamma-minimax estimators in the exponential family. Polska Akademia Nauk, Instytut Matematyczny.
  13. Csáji B (2001). Approximation with artificial neural networks. Technical Report.
  14. Eckle K and Schmidt-Hieber J (2019). A comparison of deep networks with ReLU activation function and linear spline-type methods. Neural Networks 110 232–242.
  15. Eichenauer-Herrmann J (1990). A gamma-minimax result for the class of symmetric and unimodal priors. Statistical Papers 31 301–304.
  16. Eichenauer-Herrmann J, Ickstadt K and Weiss E (1994). Gamma-minimax results for the class of unimodal priors. Statistical Papers 35 43–56.
  17. Erhan D, Courville A, Bengio Y and Vincent P (2010). Why does unsupervised pre-training help deep learning? Technical Report.
  18. Fan K (1953). Minimax theorems. Proceedings of the National Academy of Sciences of the United States of America 39 42.
  19. Gill RD, van der Laan MJ and Robins JM (1997). Coarsening at random: characterizations, conjectures, counter-examples. 255–294. Springer, New York, NY.
  20. Glorot X, Bordes A and Bengio Y (2011). Deep sparse rectifier neural networks. Technical Report.
  21. Goel S, Kanade V, Klivans A and Thaler J (2016). Reliably learning the ReLU in polynomial time.
  22. Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A and Bengio Y (2014). Generative adversarial nets. Technical Report.
  23. Green PJ (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82 711.
  24. Hanin B and Sellke M (2017). Approximating continuous functions by ReLU nets of minimal width. arXiv preprint arXiv:1710.11278v2.
  25. Hastings WK (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57 97.
  26. Heitjan DF (1993). Ignorability and coarse data: some biomedical examples. Biometrics 49 1099.
  27. Heitjan DF (1994). Ignorability in general incomplete-data models. Biometrika 81 701–708.
  28. Heitjan DF and Rubin DB (1991). Ignorability and coarse data. Technical Report No. 4.
  29. Hornik K (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks 4 251–257.
  30. Huang GB, Chen L and Siew CK (2006). Universal approximation using incremental constructive feedforward networks with random hidden nodes. IEEE Transactions on Neural Networks 17 879–892.
  31. Huang F, Wu X and Huang H (2021). Efficient mirror descent ascent methods for nonsmooth minimax problems. Advances in Neural Information Processing Systems 34.
  32. Huang GB, Zhu QY and Siew CK (2006). Extreme learning machine: theory and applications. Neurocomputing 70 489–501.
  33. Jiang S, Song Z, Weinstein O and Zhang H (2020). Faster dynamic matrix inverse for faster LPs. arXiv preprint arXiv:2004.07470v1.
  34. Jiao J, Venkat K, Han Y and Weissman T (2015). Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory 61 2835–2885.
  35. Kempthorne PJ (1987). Numerical specification of discrete least favorable prior distributions. SIAM Journal on Scientific and Statistical Computing 8 171–184.
  36. Kidger P and Lyons T (2020). Universal approximation with deep narrow networks. Technical Report.
  37. Kosorok MR (2008). Introduction to Empirical Processes and Semiparametric Inference. Springer Series in Statistics. Springer, New York.
  38. Lehmann EL and Casella G (1998). Theory of Point Estimation. Springer.
  39. Lin T, Jin C and Jordan MI (2020). On gradient descent ascent for nonconvex-concave minimax problems. 37th International Conference on Machine Learning, ICML 2020, 6039–6049.
  40. Luedtke A, Chung I and Sofrygin O (2020). Adversarial Monte Carlo meta-learning of optimal prediction procedures. arXiv preprint arXiv:2002.11275v1.
  41. Luedtke A, Carone M, Simon N and Sofrygin O (2020). Learning to learn from data: using deep adversarial learning to construct optimal statistical procedures. Science Advances 6 eaaw2140.
  42. Maron H, Fetaya E, Segol N and Lipman Y (2019). On the universality of invariant networks.
  43. Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH and Teller E (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics 21 1087–1092.
  44. Miller RI and Wiegert RG (1989). Documenting completeness, species-area relations, and the species-abundance distribution of a regional flora. Ecology 70 16–22.
  45. Nelson W (1966). Minimax solution of statistical decision problems by iteration. The Annals of Mathematical Statistics 37 1643–1657.
  46. Neyshabur B, Bhojanapalli S, McAllester D and Srebro N (2017). Exploring generalization in deep learning. Advances in Neural Information Processing Systems 2017-December 5948–5957.
  47. Noubiap RF and Seidel W (2001). An algorithm for calculating Γ-minimax decision rules under generalized moment conditions. Annals of Statistics 29 1094–1116.
  48. Olman V and Shmundak A (1985). Minimax Bayes estimation of mean of normal law for the class of unimodal a priori distributions. Proc. Acad. Sci. Estonian Physics Math 34 148–153.
  49. Orlitsky A, Suresh AT and Wu Y (2016). Optimal prediction of the number of unseen species. Proceedings of the National Academy of Sciences of the United States of America 113 13283–13288.
  50. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A, Kopf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J and Chintala S (2019). PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32 (Wallach H, Larochelle H, Beygelzimer A, d'Alché-Buc F, Fox E and Garnett R, eds.) 8024–8035. Curran Associates, Inc.
  51. Pfanzagl J (1990). Estimation in Semiparametric Models. 17–22. Springer, New York, NY.
  52. Pinelis I (2016). On the extreme points of moments sets. Mathematical Methods of Operations Research 83 325–349.
  53. Qiu H (2022). QIU-Hongxiang-David/Gamma-minimax-learning: simulation code for "Constructing Gamma-minimax estimators to leverage vague prior information". https://github.com/QIU-Hongxiang-David/Gamma-minimax-learning/. [Online; accessed 2022-03-14].
  54. Robbins H (1951). Asymptotically subminimax solutions of compound statistical decision problems. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability 131–149.
  55. Robinson J (1951). An iterative method of solving a game. The Annals of Mathematics 54 296.
  56. Sarma A and Kay M (2020). Prior setting in practice: strategies and rationales used in choosing prior distributions for Bayesian analysis. In Conference on Human Factors in Computing Systems - Proceedings. Association for Computing Machinery.
  57. Schafer CM and Stark PB (2009). Constructing confidence regions of optimal expected size. Journal of the American Statistical Association 104 1080–1089.
  58. Shannon CE (1948). A mathematical theory of communication. Bell System Technical Journal 27 379–423.
  59. Shen TJ, Chao A and Lin CF (2003). Predicting the number of new species in further taxonomic sampling. Ecology 84 798–804.
  60. Sion M (1958). On general minimax theorems. Pacific Journal of Mathematics 8 171–176.
  61. Spielman DA and Teng SH (2004). Smoothed analysis of algorithms: why the simplex algorithm usually takes polynomial time. Journal of the ACM 51 385–463.
  62. Torrey L and Shavlik J (2009). Transfer learning. Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques 11 242–264.
  63. von Neumann J (1928). Zur Theorie der Gesellschaftsspiele. Mathematische Annalen 100 295–320.
  64. van der Vaart A and Wellner J (2000). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. Springer.
  65. Vidakovic B (2000). Γ-minimax: a paradigm for conservative robust Bayesians. 241–259. Springer, New York, NY.
  66. Wald A (1945). Statistical decision functions which minimize the maximum risk. The Annals of Mathematics 46 265.
  67. Zaheer M, Kottur S, Ravanbhakhsh S, Póczos B, Salakhutdinov R and Smola AJ (2017). Deep sets. Advances in Neural Information Processing Systems 2017-December 3392–3402.
  68. Zhang Y, Lee JD and Jordan MI (2016). L1-regularized neural networks are improperly learnable in polynomial time. 33rd International Conference on Machine Learning, ICML 2016 3 1555–1563.
