Author manuscript; available in PMC: 2024 Mar 12.
Published in final edited form as: J Mach Learn Res. 2021;22:255.

Adversarial Monte Carlo Meta-Learning of Optimal Prediction Procedures

Alex Luedtke 1, Incheoul Chung 1, Oleg Sofrygin 2
PMCID: PMC10928557  NIHMSID: NIHMS1923282  PMID: 38476310

Abstract

We frame the meta-learning of prediction procedures as a search for an optimal strategy in a two-player game. In this game, Nature selects a prior over distributions that generate labeled data consisting of features and an associated outcome, and the Predictor observes data sampled from a distribution drawn from this prior. The Predictor’s objective is to learn a function that maps from a new feature to an estimate of the associated outcome. We establish that, under reasonable conditions, the Predictor has an optimal strategy that is equivariant to shifts and rescalings of the outcome and is invariant to permutations of the observations and to shifts, rescalings, and permutations of the features. We introduce a neural network architecture that satisfies these properties. The proposed strategy performs favorably compared to standard practice in both parametric and nonparametric experiments.

1. Introduction

1.1. Problem Formulation

Consider a dataset consisting of n ≥ 2 observations (X_1, Y_1), …, (X_n, Y_n) drawn independently from a distribution P belonging to some known model 𝒫, where each X_i is a continuously distributed feature with support contained in 𝒳 ⊆ ℝ^p and each Y_i is an outcome with support contained in 𝒴 ⊆ ℝ. This dataset can be written as D ≔ (X, Y), where X is the n×p matrix for which row i contains X_i and Y is the n-dimensional vector for which entry i contains Y_i. The support of D is contained in 𝒟 ≔ 𝒳^n × 𝒴^n. The objective is to develop an estimator of the regression function μ_P that maps from x_0 to E_P[Y | X = x_0]. An estimator T belongs to the collection 𝒯 of operators that take as input a dataset d ≔ (x, y) and output a prediction function T(d) : 𝒳 → ℝ, where here and throughout we use d = (x, y) to denote a possible realization of the random variable D = (X, Y). Examples of estimators include generalized linear models (Nelder and Wedderburn, 1972), random forests (Breiman, 2001), and gradient boosting machines (Friedman, 2001). We will also refer to estimators as prediction procedures. We focus on the case that the performance of an estimator is quantified via the standardized mean-squared error (MSE), namely

R(T, P) ≔ E_P ∫ [T(D)(x_0) − μ_P(x_0)]^2 / σ_P^2 dP_X(x_0),  (1)

where the expectation above is over the draw of D under sampling from P, P_X denotes the marginal distribution of X implied by P, and σ_P^2 denotes the variance of the error ϵ_P ≔ Y − μ_P(X) when (X, Y) ~ P. Note that ϵ_P may be heteroscedastic. Throughout we assume that, for all P ∈ 𝒫, E_P[Y^2] < ∞ and ϵ_P is a continuous random variable. Note that the continuity of ϵ_P implies that Y is continuous and that σ_P^2 > 0.
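As a concrete illustration of (1), the sketch below approximates the standardized MSE risk by Monte Carlo for a hypothetical toy distribution P (uniform feature, linear regression function, unit-variance Gaussian errors) and a deliberately naive estimator that predicts the sample mean of the outcomes. All names and choices here are illustrative, not from the paper.

```python
import random

def risk_mc(predict, draw_obs, mu, sigma2, n=20, reps=2000, seed=0):
    """Monte Carlo approximation of the standardized MSE risk (1): the
    average of [T(D)(X0) - mu_P(X0)]^2 / sigma_P^2 over draws of D and X0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        d = [draw_obs(rng) for _ in range(n)]  # dataset D of n i.i.d. draws from P
        x0, _ = draw_obs(rng)                  # evaluation point X0 ~ P_X
        total += (predict(d, x0) - mu(x0)) ** 2 / sigma2
    return total / reps

# Toy P: X ~ Uniform(0, 1) and Y = 2X + eps with eps ~ N(0, 1),
# so mu_P(x) = 2x and sigma_P^2 = 1.
def draw_obs(rng):
    x = rng.random()
    return x, 2.0 * x + rng.gauss(0.0, 1.0)

def sample_mean_predictor(d, x0):
    # Ignores x0 entirely; incurs risk from the unmodeled trend in x.
    return sum(y for _, y in d) / len(d)

r = risk_mc(sample_mean_predictor, draw_obs, mu=lambda x: 2.0 * x, sigma2=1.0)
```

An oracle that returns μ_P(x_0) exactly would have risk zero under this scheme, which gives a quick sanity check on the Monte Carlo approximation.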

In practice, the distribution P is not known, and therefore the risk R(T, P) of a given estimator T is also not known. We now describe three existing criteria for judging the performance of T that do not rely on knowledge of P. The first criterion is the maximal risk sup_{P∈𝒫} R(T, P). If T minimizes the maximal risk over 𝒯, then T is referred to as a minimax estimator (Wald, 1945). Minimax estimators optimize for the worst-case scenario wherein the distribution P is chosen adversarially in such a way that the selected estimator performs as poorly as possible. The second criterion is Bayesian in nature, namely the average of the risk R(T, P) over draws of P from a given prior Π on 𝒫. Specifically, this Bayes risk is defined as r(T, Π) ≔ E_Π[R(T, P)] (Robert, 2007). A Π-Bayes estimator optimally incorporates the prior beliefs encoded in Π with respect to the Bayes risk r(·, Π) — more concretely, an estimator T is referred to as a Π-Bayes estimator if it minimizes the Bayes risk over 𝒯. Though the optimality property of Bayes estimators is useful in settings where Π only encodes substantive prior knowledge, its utility is less clear otherwise. Indeed, as the function r(·, Π) generally depends on the choice of Π, it is possible that a Π-Bayes estimator T is meaningfully suboptimal with respect to some other prior Π′, that is, that r(T, Π′) > inf_{T′∈𝒯} r(T′, Π′). This phenomenon can be especially common when the sample size is small or the model is nonparametric. In fact, in the latter case, Bayes estimators against particular priors Π can easily be inconsistent even though consistent frequentist estimators are available (Ghosal and Van der Vaart, 2017) — for such priors, Bayes estimators perform poorly even when the sample size is large. Therefore, in settings where there is no substantive reason to favor a particular choice of Π, it is sensible to seek another approach for judging the performance of T.
A natural criterion is the worst-case Bayes risk of T over some user-specified collection Γ of priors, namely sup_{Π∈Γ} r(T, Π). This criterion is referred to as the Γ-maximal Bayes risk of T. The collection Γ may be restricted to contain all priors that are compatible with available prior information, such as knowledge about the smoothness of a regression function, while being left large enough to acknowledge that prior knowledge may be too vague to encode within a single prior distribution (see Section 3.6 of Robert, 2007, for more possible forms of vague prior information). If T is a minimizer of the Γ-maximal Bayes risk, then T is referred to as a Γ-minimax estimator (Berger, 1985). Such estimators can be viewed as the optimal strategy in a sequential two-player game between a Predictor and Nature, where the Predictor selects an estimator and Nature then selects a prior in Γ at which the Predictor’s chosen estimator performs as poorly as possible in terms of Bayes risk. Notably, in settings where Γ contains all distributions with support in 𝒫, the Γ-maximal Bayes risk is equivalent to the maximal risk. Consequently, in this special case, an estimator is Γ-minimax if and only if it is minimax. In settings where Γ = {Π}, an estimator is Γ-minimax if and only if it is Π-Bayes. Therefore, by allowing for a choice of Γ as large as the unrestricted set of all possible distributions or as small as a singleton set, Γ-minimaxity provides a means of interpolating between the minimax and Bayesian criteria.
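For intuition, the Γ-maximal Bayes risk can be computed exactly in very small problems. The sketch below uses a hypothetical one-observation Gaussian location toy (an estimation problem, simpler than the paper's prediction setting): θ ~ N(0, τ²), Y = θ + ε with ε ~ N(0, 1), linear rules T_t(y) = ty with closed-form Bayes risk r(T_t, Π_τ) = (t − 1)²τ² + t², and Γ indexed by τ² ∈ {1, 4}.

```python
def bayes_risk(t, tau2):
    # Closed-form Bayes risk of the linear rule T_t(y) = t*y under the
    # prior theta ~ N(0, tau2) with unit-variance Gaussian noise.
    return (t - 1.0) ** 2 * tau2 + t ** 2

Gamma = [1.0, 4.0]                      # Γ, indexed by the prior variance tau2

def gamma_max_risk(t):
    # Γ-maximal Bayes risk: the worst Bayes risk over priors in Γ.
    return max(bayes_risk(t, tau2) for tau2 in Gamma)

grid = [i / 1000 for i in range(1001)]  # candidate rules t in [0, 1]
t_star = min(grid, key=gamma_max_risk)  # the Γ-minimax rule on the grid
```

In this toy the worst-case prior is always τ² = 4, so the Γ-minimax rule coincides with the Bayes rule against that prior, t⋆ = 4/5; shrinking Γ to a singleton recovers a Π-Bayes rule, as described above.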

Though Γ-minimax estimators represent an appealing compromise between the Bayesian and minimax paradigms, they have seen limited use in practice because they are rarely available in closed form. In this work, we aim to overcome this challenge in the context of prediction by providing an iterative strategy for learning Γ-minimax prediction procedures. Due to the potentially high computational cost of this iterative scheme, a key focus of our work involves identifying conditions under which we can identify a small subclass of 𝒯 that still contains a Γ-minimax estimator. This then makes it possible to optimize over this subclass, which we show in our experiments can dramatically improve the performance of our iterative scheme given a fixed computational budget.

Hereafter we refer to Γ-minimax estimators as ‘optimal’, where it is to be understood that this notion of optimality relies on the choice of Γ.

1.2. Overview of Our Strategy and Our Contributions

Our strategy builds on two key results, each of which will be established later in this work. First, under conditions on 𝒯 and Γ, there exists a Γ-minimax estimator in the subclass 𝒯_e ⊆ 𝒯 of estimators that are equivariant to shifts and rescalings of the outcome and are invariant to permutations of the observations and to shifts, rescalings, and permutations of the features. Second, under further conditions, there is an equilibrium point (T⋆, Π⋆) ∈ 𝒯_e × Γ such that

sup_{Π∈Γ} r(T⋆, Π) = r(T⋆, Π⋆) = inf_{T∈𝒯_e} r(T, Π⋆).  (2)

Upper bounding the right-hand side by sup_{Π∈Γ} inf_{T∈𝒯_e} r(T, Π) and applying the max-min inequality shows that T⋆ is Γ-minimax. To find an equilibrium numerically, we propose to use adversarial Monte Carlo meta-learning (AMC) (Luedtke et al., 2020) to iteratively update an estimator in 𝒯_e and a prior in Γ. AMC is a form of stochastic gradient descent ascent (e.g., Lin et al., 2019) that can be used to learn optimal statistical procedures in general decision problems.

We make the following contributions:

  • In Section 2, we characterize several equivariance properties of optimal estimators for a wide range of (𝒯,Γ).

  • In Section 3, we present a general framework for adversarially learning optimal prediction procedures.

  • In Section 4, we present a novel neural network architecture for parameterizing estimators that satisfy the equivariance properties established in Section 2.

  • In Section 5, we apply our algorithm in two settings and learn estimators that outperform standard approaches in numerical experiments. In Section 6, we also evaluate the performance of these learned estimators in data experiments.

All proofs for the results in the above sections can be found in Section 7. Section 8 describes possible extensions and provides concluding remarks.

To maximize the accessibility of our main theoretical results, we do not use group theoretic notation when presenting them in Sections 2 through 4. However, when proving these results, we will heavily rely on tools from group theory; consequently, we adopt this notation in Section 7.

1.3. Related Works

The approach proposed in this work is a form of meta-learning (Schmidhuber, 1987; Thrun and Pratt, 1998; Vilalta and Drissi, 2002), where here each task is a regression problem. Most existing works in this area pursue a task-distribution strategy to meta-learning (Hospedales et al., 2020), where the objective is to minimize the average loss (risk) across draws of tasks from some specified distribution. As we will now show, the objective function employed in such strategies in fact corresponds to a Bayes risk. In regression problems, each task is a tuple containing a dataset d and a task-dependent loss ℓ : 𝒟 × 𝒯 → ℝ. For a given prior Π, a draw from the task distribution can be obtained by first sampling P ~ Π, next sampling a dataset D of independent observations from P, drawing an evaluation point X_0 ~ P_X, and finally defining the loss by ℓ(d, T) = [T(d)(X_0) − μ_P(X_0)]^2 / σ_P^2 or some related loss, such as a squared error loss that does not standardize by σ_P^2. The objective function is then equal to T ↦ E[ℓ(D, T)], where the expectation is over the draw of (D, ℓ) from the task distribution. This objective function is exactly equal to the Bayes risk function T ↦ r(T, Π). Hence, existing meta-learning approaches for regression problems whose objective functions take this form can be viewed as optimizing a Bayes risk.
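The correspondence between task sampling and Bayes risk can be made concrete. In the hypothetical sketch below, the prior Π places a standard normal distribution over the slope of a one-dimensional linear regression with σ_P² = 1; each task draw yields a dataset, an evaluation point, and a loss, and averaging losses over tasks is a Monte Carlo estimate of r(T, Π). The predictors are illustrative stand-ins, not the paper's procedures.

```python
import random

def draw_task(rng, n=20):
    """One draw from the task distribution: P ~ Pi (a random slope),
    then a dataset D ~ P^n and an evaluation point X0 ~ P_X."""
    slope = rng.gauss(0.0, 1.0)              # P ~ Pi
    data = []
    for _ in range(n):
        x = rng.random()                     # X ~ Uniform(0, 1)
        data.append((x, slope * x + rng.gauss(0.0, 1.0)))
    x0 = rng.random()
    return data, x0, slope * x0              # mu_P(x0); here sigma_P^2 = 1

def bayes_risk_mc(predict, reps=2000, seed=1):
    """Average task loss = Monte Carlo estimate of the Bayes risk r(T, Pi)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        d, x0, mu_x0 = draw_task(rng)
        total += (predict(d, x0) - mu_x0) ** 2
    return total / reps

def mean_predictor(d, x0):
    return sum(y for _, y in d) / len(d)

def ls_predictor(d, x0):
    # Least squares through the origin, matching the form of the tasks.
    num = sum(x * y for x, y in d)
    den = sum(x * x for x, y in d)
    return (num / den) * x0
```

Under this prior, the least-squares rule attains a visibly smaller Monte Carlo Bayes risk than the outcome-mean rule, illustrating how the task-distribution objective ranks procedures.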

We now review existing meta-learning strategies, starting with those that parameterize 𝒯 as a neural network class. Hochreiter et al. (2001) advocated parameterizing 𝒯 as a collection of long short-term memory (LSTM) networks (Hochreiter and Schmidhuber, 1997). More recent works have advocated using memory-augmented neural networks (Santoro et al., 2016) or conditional neural processes (CNPs) (Garnelo et al., 2018) rather than LSTMs in meta-learning tasks. There have also been other works on the meta-learning of supervised learning procedures that are parameterized as neural networks (Bosc, 2016; Vinyals et al., 2016; Ravi and Larochelle, 2017). Compared to these works, we adversarially learn a prior Π from a collection Γ of priors, and we also formally characterize equivariance properties that will be satisfied by any optimal prediction procedure in a wide variety of problems. This characterization leads us to develop a neural network architecture designed for the prediction settings that we consider.

Model-agnostic meta-learning (MAML) is another popular meta-learning approach (Finn et al., 2017). In our setting, MAML aims to initialize the weights of a regression function estimate (parameterized as a neural network, for example) in such a way that, on any new task, only a limited number of gradient updates are needed. More recent approaches leverage the fact that, in certain settings, the initial estimate can instead be updated using a convex optimization algorithm (Bertinetto et al., 2018; Lee et al., 2019). To run any of these approaches, a prespecified prior over tasks is required. In our setting, these tasks take the form of data-generating distributions P. In contrast, our approach adversarially selects a prior from Γ.

Two recent works (Yin et al., 2018; Goldblum et al., 2019) developed meta-learning procedures that are trained under a different adversarial regime than that studied in the current work, namely under adversarial manipulation of one or both of the dataset d and evaluation point x0 (Dalvi et al., 2004). This adversarial framework appears to be most useful when there truly is a malicious agent that aims to contaminate the data, which is not the case that we consider. In contrast, in our setting, the adversarial nature of our framework allows us to ensure that our procedure will perform well regardless of the true value of P, while also taking into account prior knowledge that we may have.

Our approach is also related to existing works in the statistics and econometrics literatures on the numerical learning of minimax and Γ-minimax statistical decision rules. In finite-dimensional models, early works showed that it is possible to numerically learn minimax rules (Nelson, 1966; Kempthorne, 1987) and, in settings where Γ consists of all priors that satisfy a finite number of generalized moment conditions, Γ-minimax rules (Noubiap and Seidel, 2001). Other works have studied the Γ-minimax case where Γ consists of priors that only place mass on a pre-specified finite set of distributions in 𝒫, both for general decision problems (Chamberlain, 2000) and for constructing confidence intervals (Schafer and Stark, 2009). Defining Γ in this fashion modifies the statistical model 𝒫 to only consist of finitely many distributions, which can be restrictive. A recent work introduced a new approach, termed AMC, for learning minimax procedures for general models 𝒫 (Luedtke et al., 2020). In contrast to earlier works, AMC does not require the explicit computation of a Bayes estimator under any given prior, thereby improving the feasibility of this approach in moderate-to-high dimensional models. In their experiments, Luedtke et al. (2020) used neural network classes to define the sets of allowable statistical procedures. Unlike the current work, none of the aforementioned studies identified or leveraged the equivariance properties that characterize optimal procedures. As we will see in our experiments, leveraging these properties can dramatically improve performance.

1.4. Notation

We now introduce the notation and conventions that we use. For a function f : 𝒫 → 𝒫, we let Π ∘ f^{−1} denote the pushforward measure that is defined as the distribution of f(P) when P ~ Π. For any dataset d = (x, y) and mapping f with domain 𝒟, we let f(x, y) ≔ f(d). We take all vectors to be column vectors when they are involved in matrix operations. We write ⊙ to mean the entrywise product and a^{⊙2} to mean a ⊙ a. For an m_1 × m_2 matrix a, we let a_{i*} denote the ith row, a_{*j} denote the jth column, ā ≔ (1/m_1) Σ_{i=1}^{m_1} a_{i*}, and s(a)^{⊙2} ≔ (1/m_1) Σ_{i=1}^{m_1} (a_{i*} − ā)^{⊙2}. When we standardize a vector a as [a − ā]/s(a), we always use the convention that 0/0 = 0. We write [a b] to denote the column concatenation of two matrices. For an m_1 × m_2 × m_3 array a, we let a_{i**} denote the m_2 × m_3 matrix with entry (j, k) equal to a_{ijk}, a_{i*k} denote the m_2-dimensional vector with entry j equal to a_{ijk}, etc. For a ∈ ℝ and b ∈ ℝ^k, we write a + b to mean a1_k + b.

2. Characterization of Optimal Procedures

2.1. Optimality of Equivariant Estimators

We start by presenting conditions that we impose on the collection of priors Γ. Let 𝒜 denote the collection of all n×n permutation matrices, and let ℬ denote the collection of all p×p permutation matrices. We suppose that Γ is preserved under the following transformations:

  • P1.

    Permutations of features: Π ∈ Γ and B ∈ ℬ implies that Π ∘ f_1^{−1} ∈ Γ, where f_1(P) is the distribution of (BX, Y) when (X, Y) ~ P.

  • P2.

    Shifts and rescalings of features: Π ∈ Γ, a ∈ ℝ^p, and b ∈ ℝ_+^p implies that Π ∘ f_2^{−1} ∈ Γ, where f_2(P) is the distribution of (a + b ⊙ X, Y) when (X, Y) ~ P.

  • P3.

    Shift and rescaling of outcome: Π ∈ Γ, ã ∈ ℝ, and b̃ > 0 implies that Π ∘ f_3^{−1} ∈ Γ, where f_3(P) is the distribution of (X, ã + b̃Y) when (X, Y) ~ P.

The above conditions implicitly encode that f_1(P), f_2(P), and f_3(P) all belong to 𝒫 whenever P ∈ 𝒫. Section 7.1 provides an alternative characterization of P1, P2, and P3 in terms of the preservation of Γ under a certain group action.

Condition P1 ensures that permuting the features during preprocessing will not impact the collection of priors considered. This condition is reasonable in settings where there is only a limited prior understanding of each individual feature under consideration or, if such information is available, there is little anticipated benefit from including it in the analysis. Most commonly used supervised machine learning algorithms similarly do not incorporate specific prior information about individual features, and are instead designed to work across a variety of settings — this is the case, for example, for commonly used implementations of random forests, extreme gradient boosting, and penalized linear models (Pedregosa et al., 2011; Chen and Guestrin, 2016). It is worth noting, however, that P1 still allows information on the features to be incorporated should it be available — for example, prior beliefs on the multivariate feature distribution, such as the number of modes that it has, or the regression function, such as its level of sparsity, can be imposed in the collection Γ of prior distributions. Conditions P2 and P3 are imposed to ensure that the Γ-maximal risk criterion captures the possibility that the data may be preprocessed via affine transformations, such as prestandardization or a change of the unit of measure (Fahrenheit to Celsius, say), before being supplied to the prediction algorithm. By having Γ be large enough to ensure that P2 and P3 are satisfied, the Γ-minimax risk reflects performance in an adversarial setting wherein affine transformations are applied to the features and outcome in such a way as to make the (Bayes) risk as large as possible for a given prediction algorithm. Because it minimizes this adversarial criterion, a Γ-minimax estimator should be robust to such adversarial transformations, thereby ensuring satisfactory performance regardless of the chosen unit of measure or prestandardization scheme.

We also assume that the signal-to-noise ratio (SNR) is finite — this condition is important in light of the fact that the MSE risk that we consider standardizes by σP2.

  • P4.

    Finite SNR: sup_{P∈𝒫} var_P(μ_P(X)) / σ_P^2 < ∞.

We now present conditions that we impose on the class of estimators 𝒯. In what follows we let 𝒟_0 ≔ {(d, x_0) ∈ 𝒟 × 𝒳 : s(y) ≠ 0 and every entry of s(x) is nonzero}. For (d, x_0) ∈ 𝒟_0, we let

z(d, x_0) ≔ ( (x − x̄)/s(x), (y − ȳ)/s(y), (x_0 − x̄)/s(x), x̄/s(x), ȳ/s(y), log s(x), log s(y) ),

where log s(x) is the vector where log is applied entrywise and where we abuse notation and let (x − x̄)/s(x) represent the n×p matrix for which row i is equal to (x_{i*} − x̄)/s(x), and similarly for (x_0 − x̄)/s(x). We let 𝒵 ≔ {z(d, x_0) : (d, x_0) ∈ 𝒟_0}. When it will not cause confusion, we will write z ≔ z(d, x_0). Fix T ∈ 𝒯. Let S_T : 𝒵 → ℝ denote the unique function that satisfies

T(d)(x_0) = ȳ + s(y) S_T(z)  for all (d, x_0) ∈ 𝒟_0.  (3)

The uniqueness arises because s(y) ≠ 0 on 𝒟_0. Because we have assumed that X and Y are continuous random variables under sampling from any P ∈ 𝒫, it follows that, for all P ∈ 𝒫, the class 𝒮 ≔ {S_T : T ∈ 𝒯} uniquely characterizes the functions in 𝒯 up to their behavior on subsets of 𝒟 × 𝒳 of P-probability zero. In what follows, we will impose smoothness constraints on 𝒮, which in turn imposes constraints on 𝒯. The first three conditions suffice to show that 𝒮 is compact in the space C(𝒵, ℝ) of continuous 𝒵 → ℝ functions equipped with the compact-open topology.
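To make the summary z(d, x_0) concrete, the sketch below computes its components for p = 1, so that s(x) is a scalar; the helper names are ours, not the paper's. The standardized blocks are unchanged by shifts and rescalings of the raw data, while the location and scale information is carried separately by x̄/s(x), ȳ/s(y), log s(x), and log s(y).

```python
import math

def mean(v):
    return sum(v) / len(v)

def sd(v):
    # Population standard deviation, matching the definition of s(a).
    m = mean(v)
    return (sum((a - m) ** 2 for a in v) / len(v)) ** 0.5

def z_features(x, y, x0):
    """Components of z(d, x0) for a single feature (p = 1)."""
    xbar, ybar, sx, sy = mean(x), mean(y), sd(x), sd(y)
    return (
        tuple((a - xbar) / sx for a in x),   # (x - xbar)/s(x)
        tuple((b - ybar) / sy for b in y),   # (y - ybar)/s(y)
        (x0 - xbar) / sx,                    # (x0 - xbar)/s(x)
        xbar / sx,                           # xbar/s(x)
        ybar / sy,                           # ybar/s(y)
        math.log(sx),                        # log s(x)
        math.log(sy),                        # log s(y)
    )

x, y, x0 = [0.0, 1.0, 2.0, 4.0], [1.0, 2.0, 3.0, 7.0], 2.5
z1 = z_features(x, y, x0)
# Shift and rescale the feature: the standardized blocks are unchanged,
# while the location/scale block records the transformation.
z2 = z_features([5.0 + 2.0 * a for a in x], y, 5.0 + 2.0 * x0)
```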

  • T1.

    𝒮 is pointwise bounded: For all z ∈ 𝒵, sup_{S∈𝒮} |S(z)| < ∞.

  • T2.

    𝒮 is locally Hölder: For all compact sets 𝒦 ⊆ 𝒵, there exists an α ∈ (0, 1) such that

sup_{S∈𝒮, z≠z′∈𝒦} |S(z) − S(z′)| / ‖z − z′‖_2^α < ∞,

where ‖·‖_2 denotes the Euclidean norm. We take the supremum to be zero if 𝒦 is a singleton or is empty.

  • T3.

    𝒮 is sequentially closed in the topology of compact convergence: If (S_j)_{j=1}^∞ is a sequence in 𝒮 and S_j → S compactly in the sense that, for all compact 𝒦 ⊆ 𝒵, sup_{z∈𝒦} |S_j(z) − S(z)| → 0, then S ∈ 𝒮.

The following conditions ensure that 𝒮 is invariant to certain preprocessings of the data, in the sense that, for any function S ∈ 𝒮, the function that first preprocesses the data in an appropriate fashion and then applies S to this data is itself in 𝒮. When formulating these conditions, we write z(d, x_0) to mean an element of 𝒵. Because z is a bijection between 𝒟_0 and 𝒵, it is possible to recover (d, x_0) from z(d, x_0). Below we use this fact to abuse notation and define functions with domain 𝒵 like z(d, x_0) ↦ g(d, x_0) for functions g with domain 𝒟_0, without explicitly introducing notation for the inverse of z.

  • T4.

    Permutations: For all S ∈ 𝒮, A ∈ 𝒜, and B ∈ ℬ, the function z(d, x_0) ↦ S(z((AxB, Ay), Bx_0)) is in 𝒮.

  • T5.

    Shifts and rescalings: For all S ∈ 𝒮, a ∈ ℝ^p, b ∈ ℝ_+^p, ã ∈ ℝ, and b̃ > 0, the function z(d, x_0) ↦ S(z((x_{a,b}, ã + b̃y), a + b ⊙ x_0)) is in 𝒮, where x_{a,b} is the n×p matrix with row i equal to a + b ⊙ x_{i*}.

In Appendix B, we provide two examples of classes 𝒮 that satisfy Conditions T1-T5. One of these classes is finite-dimensional and the other is infinite-dimensional. The infinite-dimensional class takes a particularly simple form. In particular, for some c, α > 0 and some function F : 𝒵 → ℝ_+ that is invariant to permutations, shifts, and rescalings, we consider the class 𝒮 to be the collection of all S : 𝒵 → ℝ such that |S(z)| ≤ F(z) and |S(z) − S(z′)| ≤ c‖z − z′‖_2^α for all z, z′ ∈ 𝒵.

Let 𝒯_e ⊆ 𝒯 denote the class of estimators that are equivariant to shifts and rescalings of the outcome and are invariant to permutations of the observations and to shifts, rescalings, and permutations of the features. Specifically, 𝒯_e consists of functions in 𝒯 satisfying the following properties for all pairs (d, x_0) of datasets and features in 𝒟_0, permutation matrices A ∈ 𝒜 and B ∈ ℬ, shifts a ∈ ℝ^p and ã ∈ ℝ, and rescalings b ∈ ℝ_+^p and b̃ > 0:

T(AxB, Ay)(Bx_0) = T(d)(x_0),  (4)
T(x_{a,b}, ã + b̃y)(a + b ⊙ x_0) = ã + b̃T(d)(x_0).  (5)
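Properties (4) and (5) can be checked numerically for a candidate estimator. The sketch below builds a toy p = 1 estimator of the form (3): prestandardize, apply a 1-nearest-neighbour rule S in the standardized coordinates, then poststandardize. This construction is ours, chosen for illustration; by design it satisfies both the observation-permutation invariance in (4) and the affine equivariance in (5).

```python
def _mean(v):
    return sum(v) / len(v)

def _sd(v):
    m = _mean(v)
    return (sum((a - m) ** 2 for a in v) / len(v)) ** 0.5

def T(x, y, x0):
    """Toy equivariant estimator of the form (3): ybar + s(y) * S(z)."""
    xbar, ybar, sx, sy = _mean(x), _mean(y), _sd(x), _sd(y)
    xs = [(a - xbar) / sx for a in x]
    ys = [(b - ybar) / sy for b in y]
    x0s = (x0 - xbar) / sx
    # S(z): the standardized outcome of the nearest standardized feature.
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x0s))
    return ybar + sy * ys[i]

x, y, x0 = [0.0, 1.0, 2.0, 4.0], [1.0, 2.0, 3.0, 7.0], 2.2
base = T(x, y, x0)
# (4): permuting the observations leaves the prediction unchanged.
perm = T([x[2], x[0], x[3], x[1]], [y[2], y[0], y[3], y[1]], x0)
# (5): affinely transforming features and outcome transforms the prediction.
a, b, at, bt = 5.0, 2.0, -1.0, 0.5
equiv = T([a + b * v for v in x], [at + bt * w for w in y], a + b * x0)
```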

The following result shows that the Γ-maximal risk is the same over 𝒯 and 𝒯_e.

Theorem 1. Under P1-P4 and T1-T5,

inf_{T∈𝒯} sup_{Π∈Γ} r(T, Π) = inf_{T∈𝒯_e} sup_{Π∈Γ} r(T, Π).

The above does not rule out the possibility that there exists a non-equivariant Γ-minimax estimator, that is, a Γ-minimax estimator that belongs to 𝒯 \ 𝒯_e. Rather, when paired with additional conditions that ensure that the infimum over 𝒯_e above is achieved (see Theorem 3), the above implies that 𝒯_e contains at least one Γ-minimax estimator.

Theorem 1 is a variant of the Hunt-Stein theorem (Hunt and Stein, 1946). Our proof, which draws inspiration from Le Cam (2012), consists in showing that our prediction problem is invariant to the action of an amenable group and subsequently applying Day’s fixed-point theorem (Day, 1961) to show that, for all T ∈ 𝒯, the collection of T′ for which sup_{Π∈Γ} r(T′, Π) ≤ sup_{Π∈Γ} r(T, Π) has nonempty intersection with 𝒯_e.

This theorem has a natural analogy to the translation equivariance that is enjoyed by convolutional neural networks in object detection problems, where the goal is to classify and draw a bounding box around objects in an image (Russakovsky et al., 2015). To simplify the discussion, here we focus on the special case where there is only one object class of interest (e.g., humans), so that the goal is simply to draw a bounding box around each object that is contained in the image. In object detection settings, a key insight is that an object's class does not change even if its position is shifted. Given this insight, it seems reasonable to expect that any sufficiently rich collection of candidate detectors will be such that, given any object detector V, the collection will contain a translation equivariant detector with equal or superior performance to that of V. For this to be true, certain requirements are also generally needed of the loss function used to measure performance. In particular, the error accrued by incorrectly bounding or failing to bound an object should not depend on the position of that object in the image — this condition is satisfied by many loss functions that are commonly used in this setting.
In our setting, conditions P1-P3, which say that a prior still belongs to Γ even after certain transformations are applied to the distributions drawn from that prior, are the analogues of the translation invariance property of an object's class (“a human remains a human if they are shifted to the left, and the pushforward of a prior in Γ remains in Γ even if features and outcomes are permuted, shifted, or rescaled”); conditions T4 and T5 are the analogues of the requirement that the collection of detectors be sufficiently rich; and the fact that the standardized squared error [T(d)(x_0) − μ_P(x_0)]^2/σ_P^2 does not depend on the particular ordering of the features or the centering or scaling of the features or outcomes is analogous to the translation invariance of the loss functions used in object detection.

2.2. Focusing Only on Distributions with Standardized Predictors and Outcome

Theorem 1 suggests restricting attention to estimators in 𝒯_e when trying to learn a Γ-minimax estimator. We now show that, once this restriction has been made, it also suffices to restrict attention to a smaller collection of priors Γ_1 when identifying a least favorable prior. In fact, we show something slightly stronger, namely that the restriction to Γ_1 can be made even if optimal estimators are sought over the richer class 𝒯̃_e ⊇ 𝒯_e of estimators that satisfy the equivariance property (5) but do not necessarily satisfy (4).

We now define Γ_1. Let h(P) denote the distribution of

( ((X_j − E_P[X_j]) / var_P(X_j)^{1/2})_{j=1}^p, (Y − E_P[Y]) / σ_P )

when (X, Y) ~ P. Note that here, and here only, we have written X_j to denote the jth feature rather than the jth observation. Also let Γ_1 ≔ {Π ∘ h^{−1} : Π ∈ Γ}, which is a collection of priors on 𝒫_1 ≔ {h(P) : P ∈ 𝒫}.

Theorem 2. If P2 and P3 hold and all T ∈ 𝒯 satisfy (5), then T is Γ-minimax if and only if it is Γ_1-minimax.

We conclude by noting that, under P2 and P3, 𝒫_1 consists precisely of those P ∈ 𝒫 that satisfy:

E_P[X] = 0_p,  E_P[X^{⊙2}] = 1_p,  E_P[Y] = 0,  σ_P^2 = 1.  (6)

2.3. Existence of an Equilibrium Point

We also make the following additional assumption on 𝒮.

  • T6.

    𝒮 is convex: S_1, S_2 ∈ 𝒮 and δ ∈ (0, 1) implies that z ↦ δS_1(z) + (1 − δ)S_2(z) is in 𝒮.

The two examples in Appendix B also satisfy T6.

We also impose the following condition on the size of the collection of distributions 𝒫_1 and the collection of priors Γ_1, which in turn imposes restrictions on 𝒫 and Γ.

  • P5.

    There exists a metric ρ on 𝒫_1 such that (i) (𝒫_1, ρ) is a complete separable metric space, (ii) Γ_1 is tight in the sense that, for all ε > 0, there exists a compact set 𝒦 in (𝒫_1, ρ) such that Π(𝒦) ≥ 1 − ε for all Π ∈ Γ_1, and (iii) for all T ∈ 𝒯_e, P ↦ R(T, P) is upper semi-continuous and bounded from above on (𝒫_1, ρ).

In Appendix C, we give examples of parametric and nonparametric settings where P5 is applicable.

So far, the only conditions that we have required on the σ-algebra 𝒜 of 𝒫 are that h and R(T, ·), T ∈ 𝒯, are measurable. In this subsection, and in this subsection only, we add the assumptions that P5 holds and that 𝒜 is such that {A ∩ 𝒫_1 : A ∈ 𝒜} equals the collection of Borel sets on (𝒫_1, ρ).

We will also assume the following two conditions on Γ1.

  • P6.

    Γ_1 is closed in the topology of weak convergence: if (Π_j)_{j=1}^∞ is a sequence in Γ_1 that converges weakly to Π, then Π ∈ Γ_1.

  • P7.

    Γ_1 is convex: for all Π_1, Π_2 ∈ Γ_1 and α ∈ (0, 1), the mixture distribution αΠ_1 + (1 − α)Π_2 is in Γ_1.

Under Conditions P5 and P6, Prokhorov’s theorem (Billingsley, 1999) can be used to establish that Γ_1 is compact in the topology of weak convergence. This compactness will be useful for proving the following result, which shows that there is an equilibrium point under our conditions.

Theorem 3. If T1-T3, T6, and P2-P7 hold, then there exist T⋆ ∈ 𝒯_e and Π⋆ ∈ Γ_1 such that, for all T ∈ 𝒯_e and Π ∈ Γ_1, it is true that r(T⋆, Π) ≤ r(T⋆, Π⋆) ≤ r(T, Π⋆).

Combining the above with Lemma 10 in Section 7.2.3 establishes (2), that is, that the conclusion of Theorem 3 remains valid if Π varies over Γ rather than over Γ_1.

3. AMC Meta-Learning Algorithm

We now present an AMC meta-learning strategy for obtaining a Γ-minimax estimator within some class 𝒯. Here we suppose that 𝒯 = {T_t : t ∈ τ}, where each T_t is an estimator indexed by a finite-dimensional parameter t that belongs to some set τ. We note that this framework encapsulates: model-based approaches (e.g., Hochreiter et al., 2001), where T_t can be evaluated by a single pass of (d, x_0) through a neural network with weights t; optimization-based approaches, where t are the initial weights of some estimate that are subsequently optimized based on d (e.g., Finn et al., 2017); and metric-based approaches, where t indexes a measure of similarity α_t that is used to obtain an estimate of the form Σ_{i=1}^n α_t(x_i, x_0) y_i (e.g., Vinyals et al., 2016).

We suppose that all estimators in 𝒯 satisfy the equivariance property (5), which can be arranged by prestandardizing the outcome and features and then poststandardizing the final prediction — see Algorithm 2 for an example. Since all T ∈ 𝒯 satisfy (5), Theorem 2 shows that it suffices to consider a collection Γ_1 of priors with support on 𝒫_1, that is, so that, for all Π ∈ Γ_1, P ~ Π satisfies (6) almost surely. To ensure that the priors are easy to sample from, we parameterize them via generator functions G_g (Goodfellow et al., 2014) that are indexed by a finite-dimensional g that belongs to some set γ. Each G_g takes as input a source of noise U drawn from a user-specified distribution ν_u and outputs the parameters indexing a distribution in 𝒫 (Luedtke et al., 2020). Though this form of sampling limits attention to parametric families 𝒫, the number of parameters indexing this family may be much larger than the sample size n, which can, for all practical purposes, lead to a nonparametric estimation problem. For each g, we let Π_g denote the distribution of G_g(U) when U ~ ν_u. We then let Γ_1 = {Π_g : g ∈ γ}. It is worth noting that classes Γ_1 that are defined in this way will not generally satisfy the conditions P5-P7 used in Theorem 3. To iteratively improve the performance of the prior, we require the ability to differentiate realized datasets through the parameters indexing the prior. To do this, we assume that, for each P ∈ 𝒫, the user has access to a generator function H_P : 𝒱 → 𝒳 × 𝒴 such that H_P(V) has the same distribution as (X, Y) ~ P when noise V is drawn from a user-specified distribution ν_v. We suppose that, for all realizations of the noise u in the support of ν_u and v in the support of ν_v, the function g ↦ H_{G_g(u)}(v) is differentiable at each parameter value g indexing the prior.
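The gradient descent-ascent idea behind AMC can be conveyed with a standalone sketch. The code below runs simultaneous noisy descent steps on the Predictor's parameter t and ascent steps on Nature's parameter g, mimicking the Monte Carlo gradient estimates used in AMC; the payoff is a hypothetical smooth surrogate, not the actual Bayes risk, and the learning rates and noise scale are illustrative choices rather than values from the paper.

```python
import random

def gradient_descent_ascent(grad_t, grad_g, t, g,
                            eta_t=0.05, eta_g=0.2, steps=2000, seed=0):
    """Simultaneous stochastic gradient descent (in t) ascent (in g)."""
    rng = random.Random(seed)
    for _ in range(steps):
        gt = grad_t(t, g) + 0.01 * rng.gauss(0.0, 1.0)  # noisy gradient in t
        gg = grad_g(t, g) + 0.01 * rng.gauss(0.0, 1.0)  # noisy gradient in g
        t, g = t - eta_t * gt, g + eta_g * gg           # simultaneous update
    return t, g

# Surrogate payoff r(t, g) = (t - 1)^2 - (g - t)^2, strongly concave in g,
# with equilibrium point (t, g) = (1, 1).
grad_t = lambda t, g: 2.0 * (t - 1.0) + 2.0 * (g - t)
grad_g = lambda t, g: -2.0 * (g - t)
t_star, g_star = gradient_descent_ascent(grad_t, grad_g, t=0.0, g=0.0)
```

A faster learning rate for the ascent player lets g track the inner maximization while t descends, which is the same two-timescale intuition invoked in the convergence discussion below.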

[Algorithm 1 is presented as a figure in the original manuscript.]

The AMC learning strategy is presented in Algorithm 1. The algorithm takes stochastic gradient steps on the parameters indexing an estimator and prior generator to iteratively reduce and increase the Bayes risk, respectively. All gradients in the algorithm can be computed via backpropagation using standard software — in our experiments, we used Pytorch for this purpose (Paszke et al., 2019). Note that, when computing ∇_g Loss, the dependence of Loss on g is tracked through the dependence of P on g on line 5, the dependence of X_0 and D = (X_i, Y_i)_{i=1}^n on P on lines 6 and 7, and the dependence of Loss on P, X_0, and D on line 8. We caution that, when the outcome or some of the features are discrete, ∇_g Loss will not generally represent an unbiased estimate of the gradient of g ↦ r(T_t, Π_g), which can cause Algorithm 1 to perform poorly. To handle these cases, the algorithm can be modified to instead obtain an unbiased gradient estimate using the likelihood ratio method (Glynn, 1987).

Though studying the convergence properties of the minimax optimization in Algorithm 1 is not the main focus of this work, we now provide an overview of how results from Lin et al. (2019) can be used to provide some guarantees for this algorithm. When doing so, we focus on the special case where there exists some ℓ < ∞ such that, for all g, t ↦ r(T_t, Π_g) is differentiable with ℓ-Lipschitz gradient and, for some finite (but potentially large) collection 𝒫_D ≔ {P_1, …, P_D} ⊆ 𝒫, Γ is the collection of all mixtures of distributions in 𝒫_D. We also suppose that the parameter g indexing the generator G_g takes values on the (D − 1)-simplex and that this generator is parameterized in such a way that ν_u ∘ G_g^{−1} equals the mixture of distributions in 𝒫_D that places mass g_j on distribution P_j, j = 1, …, D. In this case, provided the learning rates η_1 and η_2 are chosen appropriately, Theorem 4.5 in Lin et al. (2019) gives guarantees on the number of iterations required to return an ϵ-stationary point T_{t_K} (idem, Definition 3.7) — this stationary point is such that there exists a t near t_K at which the function t ↦ sup_{Π∈Γ} r(T_t, Π) has at least one small subgradient (idem, Lemma 3.8, for details). If, also, t ↦ T_t(d) is convex for all d, then this also implies that T_{t_K} is nearly Γ-minimax. If, alternatively, the prior update step in Algorithm 1 (line 13) is replaced by an oracle optimizer such that, at each iteration, g is defined as a true maximizer of the Bayes risk g ↦ r(T_t, Π_g), then Theorem E.4 of Lin et al. (2019) similarly guarantees that an ϵ-stationary point will be reached within a specified number of iterations.

Alternatives to Algorithm 1 are possible. As one example, the stochastic gradient descent ascent optimization scheme could be replaced by an extragradient method (Korpelevich, 1976), which has been shown to perform well in generative adversarial network settings (Gidel et al., 2018). As another example, the prior distribution could, in principle, be specified via its density rather than as the pushforward distribution νu∘Gg⁻¹ defined by the generator. While this density-based parameterization may make it easier to relate the specified priors to commonly used probability distributions, it may also lead to challenges since sampling from a distribution specified by its density is generally a hard problem that necessitates the use of numerical approaches such as Markov chain Monte Carlo methods (Hastings, 1970; Geman and Geman, 1984). Because the prior is updated at each of the K iterations, it seems that many instances of these numerical sampling schemes would need to be run before the termination of the AMC algorithm. Identifying a means to expedite the convergence of this density-based approach is an interesting area for future work.

4. Proposed Class of Estimators

4.1. Equivariant Estimator Architecture

Algorithm 2 presents our proposed estimator architecture, which relies on four modules. Each module k can be represented as a function mk belonging to a collection ℳk of functions mapping from R^{a_k} to R^{b_k}, where the values of a_k and b_k can be deduced from Algorithm 2. For given data d, a prediction at a feature x0 can be obtained by sequentially calling the modules and, between calls, either mean pooling across one of the dimensions of the output or concatenating the evaluation point as a new column in the output matrix.

We let 𝒯 represent the collection of all prediction procedures described by Algorithm 2, where {mk}_{k=1}^4 varies over ∏_{k=1}^4 ℳk. We now give conditions under which the proposed architecture yields an equivariant estimator.

  • M1)

    m1(AvB)··ℓ = A m1(v)··ℓ B for all m1 ∈ ℳ1, A ∈ 𝒜, B ∈ ℬ, v ∈ R^{n×p×2}, and ℓ ∈ {1, …, o1}.

  • M2)

    m2(Bv) = B m2(v) for all m2 ∈ ℳ2, B ∈ ℬ, and v ∈ R^{p×o1}.

  • M3)

    m3(Bv) = B m3(v) for all m3 ∈ ℳ3, B ∈ ℬ, and v ∈ R^{p×o2}.


Theorem 4. If M1–M3 hold, then all T ∈ 𝒯 satisfy (4) and (5).

4.2. Neural Network Parameterization

In our experiments, we choose the four module classes ℳk, k = 1, 2, 3, 4, indexing our estimator architecture to be collections of neural networks. For each k, we let ℳk contain the neural networks consisting of hk hidden layers of widths w_{k1}, w_{k2}, …, w_{k,hk}, where the types of layers used depend on the module k. When k = 1, multi-input-output channel equivariant layers as defined in Hartford et al. (2018) are used. In particular, for j = 1, …, h1+1, we let ℒ1j denote the collection of all such layers that map from R^{n×p×w_{1,j−1}} to R^{n×p×w_{1j}}, where we let w_{10} = 2 and w_{1,h1+1} = o1. For each j, each member L1j of ℒ1j is equivariant in the sense that, for all A ∈ 𝒜, B ∈ ℬ, and v ∈ R^{n×p×w_{1,j−1}}, L1j(AvB)··ℓ = A L1j(v)··ℓ B for all ℓ = 1, …, w_{1j}. When k = 2, 3, multi-input-output channel equivariant layers as described in Eq. 22 of Zaheer et al. (2017) are used, except that we replace the sum-pool term in that equation with a mean-pool term (see the next subsection for the rationale). In particular, for j = 1, …, hk+1, we let ℒkj denote the collection of all such equivariant layers that map from R^{p×w_{k,j−1}} to R^{p×w_{kj}}. For each j, each member Lkj of ℒkj is equivariant in the sense that, for all B ∈ ℬ and v ∈ R^{p×w_{k,j−1}}, Lkj(Bv) = B Lkj(v). When k = 4, standard linear layers mapping from R^{w_{4,j−1}} to R^{w_{4j}} are used for each j = 1, …, h4+1, where w_{40} = o3 and w_{4,h4+1} = 1. For each j, we let ℒ4j denote the collection of all such layers. For a user-specified activation function q, we then define the module classes as follows for k = 1, 2, 3, 4:

ℳk ≡ { v ↦ q∘L_{k,hk+1}∘q∘L_{k,hk}∘⋯∘q∘L_{k1}(v) : Lkj ∈ ℒkj, j = 1, 2, …, hk+1 }.

Notably, ℳ1 satisfies M1 (Ravanbakhsh et al., 2017; Hartford et al., 2018), and ℳ2 and ℳ3 satisfy M2 and M3, respectively (Ravanbakhsh et al., 2016; Zaheer et al., 2017). Each element of ℳ4 is a multilayer perceptron.
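As a concrete illustration of the k = 2, 3 building block, the following NumPy sketch implements a single permutation-equivariant layer in the spirit of Eq. 22 of Zaheer et al. (2017), with the sum-pool term replaced by the mean-pool term used here. The function and variable names are ours, and the actual experiments use PyTorch layers; this is only a sketch of the equivariance property.

```python
import numpy as np

rng = np.random.default_rng(0)

def equivariant_layer(v, lam, gamma, bias):
    """Permutation-equivariant layer in the spirit of Eq. 22 of Zaheer et al.
    (2017), with the sum-pool term replaced by a mean-pool term.
    v: (p, w_in) input; lam, gamma: (w_in, w_out) weights; bias: (w_out,)."""
    pooled = v.mean(axis=0, keepdims=True)   # mean-pool over the p rows
    return v @ lam + pooled @ gamma + bias   # same map applied to every row

p, w_in, w_out = 10, 4, 6
lam = rng.normal(size=(w_in, w_out))
gamma = rng.normal(size=(w_in, w_out))
bias = rng.normal(size=w_out)
v = rng.normal(size=(p, w_in))

# Equivariance check: permuting the rows of the input permutes the rows of
# the output in the same way, as required by M2 and M3.
perm = rng.permutation(p)
out = equivariant_layer(v, lam, gamma, bias)
out_perm = equivariant_layer(v[perm], lam, gamma, bias)
assert np.allclose(out[perm], out_perm)

# Because the weights do not depend on p, the same layer also accepts inputs
# with a different number of rows (here 7 instead of 10).
out_small = equivariant_layer(rng.normal(size=(7, w_in)), lam, gamma, bias)
assert out_small.shape == (7, w_out)
```

The same weight-sharing structure is what later allows a trained network to be evaluated on datasets whose dimensions differ from those used during meta-training.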

The proposed architecture bears some resemblance to CNPs (Garnelo et al., 2018). Like our proposed architecture, CNPs are invariant to permutations of the observations. Nevertheless, CNPs fail to satisfy the other properties imposed on 𝒯e, namely invariance to shifts, rescalings, and permutations of the features and equivariance to shifts and rescalings of the outcome. Moreover, a decision-theoretic rationale for making CNPs invariant to permutations of the observations has not yet been provided in the literature, for example, via a Hunt-Stein-type theorem.

4.3. Pros and Cons of Proposed Architecture

A benefit of using the proposed architecture in Algorithm 2 is that Modules 1 and 2 can be evaluated without knowing the feature x0 at which a prediction is desired. As a consequence, these modules can be precomputed before making predictions at new feature values, which can lead to substantial computational savings when the number of values at which predictions will be made is large. Another advantage of the proposed architecture is that it can be evaluated on a dataset that has a different sample size n than did the datasets used during meta-training. In the notation of Eq. 4 from Hartford et al. (2018), this corresponds to noting that the weights from a multi-input-output channel layer mapping R^{N×M×K} to R^{N×M×O} can be used to define a layer mapping R^{N′×M×K} to R^{N′×M×O} whose output Y^o_{n,m} is given by the same symbolic expression as that displayed in Eq. 4 of that work, but now with n ranging over 1, …, N′. We will show in our upcoming experiments that procedures trained using 500 observations can perform well even when evaluated on datasets containing only 100 observations. It is similarly possible to evaluate the proposed architecture on datasets containing a different number of features than did the datasets used during meta-training; again see Eq. 4 in Hartford et al. (2018), and also see Eq. 22 in Zaheer et al. (2017), but with the sum-pool term replaced by a mean-pool term. The rationale for replacing the sum-pool term by a mean-pool term is that this will ensure that the scale of the hidden layers will remain fairly stable when the number of testing features differs somewhat from the number of training features.
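The last point can be checked numerically. In the NumPy sketch below (illustrative sizes and a hypothetical `hidden` representation of our own construction, not the paper's network), the scale of a sum-pooled term grows with the number of pooled rows while the mean-pooled term stays essentially constant.

```python
import numpy as np

rng = np.random.default_rng(1)

def hidden(n, rng):
    """Illustrative hidden representation: each of the n rows shares a common
    per-channel signal, plus independent noise (a hypothetical construction)."""
    signal = np.linspace(-1.0, 1.0, 200)
    return signal + rng.normal(scale=0.5, size=(n, 200))

small, large = hidden(100, rng), hidden(500, rng)

def scale(v):
    # Typical magnitude of a pooled vector's entries.
    return np.abs(v).mean()

# Sum pooling: the pooled term grows roughly linearly in the number of rows.
sum_ratio = scale(large.sum(axis=0)) / scale(small.sum(axis=0))
# Mean pooling: the pooled term stays on essentially the same scale.
mean_ratio = scale(large.mean(axis=0)) / scale(small.mean(axis=0))
assert sum_ratio > 3.0
assert 0.8 < mean_ratio < 1.2
```

This is the instability that motivates the mean-pool substitution when a trained network is evaluated on a different number of features or observations than it saw during meta-training.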

A disadvantage of the proposed architecture is that it currently has no established universality guarantees. Such guarantees have been long available for standard multilayer perceptrons (e.g., Cybenko, 1989; Hornik, 1991), and have recently also become available for certain invariant architectures (Maron et al., 2019). In future work, it would be interesting to see if the arguments in Maron et al. (2019) can be modified to provide universality guarantees for our architecture. Establishing such results may also help us to overcome a second disadvantage of our architecture, namely that the resulting neural network classes will not generally satisfy the convexity condition T6 used in Theorem 3. If a network class 𝒯 that we have proposed can be shown to satisfy a universality result for some appropriate convex class 𝒯c, and if 𝒯 is itself a subset of 𝒯c, then perhaps it will be possible to invoke Theorem 3 to establish an equilibrium result over the class of estimators 𝒯c, and then to use this result to establish an (approximate) equilibrium result for 𝒯. To ensure that conditions T1-T3 are satisfied, such an argument will likely require that the weights of the networks in 𝒯 be restricted to belong to some compact set.

5. Numerical Experiments

5.1. Overview

In this section, we present the results from two sets of numerical experiments: the first corresponds to benchmarks from the meta-learning literature, and the second consists of settings designed to evaluate the performance of our method relative to that of analytically derived estimators that are commonly used in practice and for which theoretical performance guarantees are available. In each example, the collection of estimators 𝒯 is parameterized as the network architecture introduced in Section 4.2 with o1 = o2 = 50, o3 = 10, h1 = h3 = 10, h2 = h4 = 3, and, for k = 1, 2, 3, 4, all hidden-layer widths w_{kj} = 100. For each module, we use the leaky ReLU activation q(z) ≡ max{z, 0} + 0.01 min{z, 0}. At the end of this section, we report the results of an ablation study that evaluates the extent to which imposing invariance to permutations of the observations and features improves performance.

All experiments were run in PyTorch 1.0.1 on Tesla V100 GPUs using Amazon Web Services. The code used to conduct the experiments can be found at https://github.com/alexluedtke12/amc-meta-learning-of-optimal-prediction-procedures. Further experimental details can be found in Appendix D.

5.2. Meta-Learning Benchmarks

5.2.1. Preliminaries

We now evaluate the performance of AMC on widely used meta-learning benchmarks. As described in the Introduction, existing meta-learning algorithms tend to be Bayesian in nature, where the goal during meta-training is to learn an estimator with small Bayes risk under a specified prior Π. Consequently, when adjudicating performance in this study, we will primarily evaluate each learned estimator T in terms of its Bayes MSE against this fixed prior Π, defined as ∬ EP[(T(D)(x0) − μP(x0))²] dPX(x0) dΠ(P).

Because our method is designed to learn adversarially over a collection of priors Γ that satisfies the invariance properties P1, P2, and P3, we define the collection Γ used when training our method as the smallest collection of priors that satisfies these three properties and contains Π. It can be verified that Γ1 is a singleton in this case, so that the generator is a constant function and is never updated in these benchmark settings. Though this simplified meta-training may make it appear that AMC will not be robust to an adversarial choice of prior, the learned estimator is in fact robust to such a choice in the sense that its Bayes risk is invariant under permutations of the features and also under shifts and rescalings of the outcomes and features. The main motivation for using a small Γ when comparing to these benchmarks is that doing so helps characterize the performance of the estimator architecture that we proposed in Section 4 even in Bayesian settings for which existing meta-learning approaches are tailor-made.

We compare the performance of AMC to that of two popular meta-learning methods for which code is readily available: MAML (Finn et al., 2017) and CNPs (Garnelo et al., 2018). Because these algorithms do not prestandardize the features and outcomes, they may have large standardized Bayes MSEs (the Bayes risk derived from Eq. 1) if these quantities are simply shifted or rescaled. To ensure that possible discrepancies in performance between AMC and MAML or CNPs are not solely due to prestandardization, we also compare our method to natural variants of MAML and CNPs that, like AMC, are robust to such shifts and rescalings. For each method, these variants prestandardize the features and outcomes, and then, in an analogous fashion to line 9 of Algorithm 2, scale the final output by the sample standard deviation of the original training outcomes and shift by their sample mean. These algorithms, which we refer to as MAML-Eq and CNP-Eq, are invariant to shifts and rescalings of the features and equivariant to shifts and rescalings of the outcomes. Details on the MAML and CNP implementations used can be found in Appendix D.1.
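The MAML-Eq/CNP-Eq construction described above can be sketched as a wrapper around any base prediction procedure. The function name `standardized_predict` and the trivial mean base learner below are our own illustrative choices; the wrapper prestandardizes the inputs and then undoes the outcome standardization on the way out, analogously to line 9 of Algorithm 2.

```python
import numpy as np

def standardized_predict(base_predict, X, y, x0):
    """Hypothetical wrapper making a base procedure invariant to shifts and
    rescalings of the features and equivariant to shifts and rescalings of
    the outcome (the MAML-Eq / CNP-Eq construction sketched in the text)."""
    x_mu, x_sd = X.mean(axis=0), X.std(axis=0)
    y_mu, y_sd = y.mean(), y.std()
    # Prestandardize features and outcome, predict, then undo the outcome
    # standardization (scale by the outcome sd and shift by its mean).
    z = base_predict((X - x_mu) / x_sd, (y - y_mu) / y_sd, (x0 - x_mu) / x_sd)
    return y_sd * z + y_mu

# Equivariance check with a trivial base learner (predict the training mean).
rng = np.random.default_rng(0)
X, y, x0 = rng.normal(size=(50, 3)), rng.normal(size=50), rng.normal(size=3)
base = lambda Xs, ys, x0s: ys.mean()
p1 = standardized_predict(base, X, y, x0)

# Affine transformations of the data shift/rescale the prediction accordingly.
a, b, at, bt = rng.normal(size=3), rng.uniform(0.5, 2.0, size=3), 1.7, 3.0
p2 = standardized_predict(base, a + b * X, at + bt * y, a + b * x0)
assert np.isclose(p2, at + bt * p1)
```

Any base procedure can be substituted for `base`; the wrapper guarantees the invariance and equivariance properties regardless of how the base procedure behaves.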

5.2.2. Sinusoidal Regression

We start with a benchmark few-shot regression setting that is commonly used in the meta-learning literature. The prior Π is defined as follows. The feature is 1-dimensional and is Unif(−5, 5) distributed, and the regression function μP takes the form x ↦ a sin(x − b), where the parameters a and b are drawn independently from Unif(0.1, 5.0) and Unif(0, π) distributions, respectively (Finn et al., 2017). Following related meta-learning benchmarks (Finn et al., 2018; Vuorio et al., 2018), the error ϵP added to the signal μP(X) is distributed as N(0, 0.3²). We use the same sample sizes as were used in Finn et al. (2017), namely n = 5, 10, and 20.
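This prior is fully specified in the text, so sampling a task from it is straightforward. The function name `sample_sinusoid_task` below is ours; the distributions match those stated above.

```python
import numpy as np

def sample_sinusoid_task(n, rng):
    """Sample a dataset from the sinusoid prior described in the text:
    amplitude a ~ Unif(0.1, 5.0), phase b ~ Unif(0, pi), X ~ Unif(-5, 5),
    and N(0, 0.3^2) noise added to the signal a*sin(x - b)."""
    a = rng.uniform(0.1, 5.0)
    b = rng.uniform(0.0, np.pi)
    X = rng.uniform(-5.0, 5.0, size=n)
    mu = a * np.sin(X - b)                    # regression function mu_P
    Y = mu + rng.normal(0.0, 0.3, size=n)     # noisy outcomes
    return X, Y, (a, b)

rng = np.random.default_rng(0)
X, Y, (a, b) = sample_sinusoid_task(20, rng)
assert X.shape == Y.shape == (20,)
assert 0.1 <= a <= 5.0 and 0.0 <= b <= np.pi
```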

We now report on the performance of the various meta-learning approaches in this setting. In Table 1a, we can see that MAML and CNPs consistently outperform their equivariant counterparts, namely MAML-Eq and CNP-Eq, in this setting. Nevertheless, as we noted earlier, MAML and CNPs are non-robust in that their standardized MSE can be made large by simply shifting or rescaling the outcomes or features. In Figure S5 in the appendix we provide evidence that this is indeed the case. As a particularly striking example, when n=5, scaling the feature down by a factor of 5 leads to 24-fold and 149-fold increases in the MSEs of MAML and CNPs, respectively. The degradation of performance worsens with sample size. Indeed, when n=20, the same rescaling leads to 144-fold and 487-fold increases in the MSEs of these two methods. Consequently, even seemingly innocuous preprocessings of the data, such as applying an affine transformation to change the unit of measurement, can have a dramatic impact on the performance of MAML and CNPs. In contrast, the standardized MSE performance of MAML-Eq and CNP-Eq is invariant to such preprocessings of the data.

Table 1:

Bayes MSEs of meta-learning approaches in the meta-learning benchmark experiments, where the Bayes MSE is defined as the squared difference between the predictions and true underlying regression function, averaged across draws of the data-generating distribution from the prior and the feature from the feature distribution. Standard errors all <0.005 in the sinusoid experiment and < 0.001 in the Gaussian process experiments.

(a) Sinusoid

            n=5    10     20
MAML*       0.22   0.10   0.03
CNP*        0.05   0.02   0.01
MAML-Eq     2.06   0.47   0.07
CNP-Eq      1.13   0.13   0.04
AMC (ours)  0.89   0.09   0.03

(b) Gaussian process

            1d feature    5d feature
            n=5    50     n=5    50
MAML*       0.85   0.13   1.00   1.00
CNP*        0.47   0.04   0.95   0.73
MAML-Eq     0.93   0.13   1.22   1.02
CNP-Eq      0.56   0.04   1.12   0.73
AMC (ours)  0.56   0.03   1.11   0.66
* As these two algorithms do not prestandardize the features or outcomes, their standardized MSEs can be made large by simply shifting or rescaling the features and outcomes. See Figure S5 for more information.

Table 1a also displays results for AMC. AMC consistently outperforms the robust versions of existing algorithms, namely MAML-Eq and CNP-Eq. When compared with the non-robust variants, AMC is outperformed by MAML when n=5, outperforms MAML when n=10, and has about the same performance as MAML when n=20. CNPs perform better than MAML and AMC, though this difference begins to diminish as the sample size increases.

5.2.3. Gaussian Process Regression

We next consider a benchmark Gaussian process regression setting. We consider two cases for the prior. The first is the same as that considered in Garnelo et al. (2018), except that they considered the noise-free case where ϵP = 0 almost surely, whereas we consider the noisy case where the errors ϵP are homoscedastic and distributed as N(0, 0.3²). Considering a noisy case where ϵP is non-degenerate is necessary for the standardized MSE that we consider to be well-defined, and also better reflects real-world regression scenarios where observed outcomes are rarely, if ever, deterministic functions of the features considered. Following Garnelo et al. (2018), the feature is 1-dimensional and follows a Unif(−2, 2) distribution, and the regression function μP is drawn from a mean-zero Gaussian process with a squared exponential kernel with lengthscale 0.4 and variance 1. We also use the same sample sizes as were used in that work, namely n = 5 and 50. The second case that we consider is the same as the first except that the feature X is 5-dimensional, where the entries of X are independent Unif(−2, 2) random variables, and the lengthscale is taken to be equal to 1.2.
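A dataset from this prior can be drawn by sampling the regression function at the observed features. The sketch below uses the common parameterization of the squared exponential kernel, k(x, x′) = exp(−‖x − x′‖²/(2λ²)); the exact kernel convention in the original experiments may differ, and the small jitter term is a standard numerical stabilizer rather than part of the model.

```python
import numpy as np

def sample_gp_dataset(n, p, lengthscale, rng):
    """Draw (X, Y) from the Gaussian process prior described in the text:
    X has independent Unif(-2, 2) entries, the regression function is a
    mean-zero GP with a squared exponential kernel (variance 1), and
    N(0, 0.3^2) noise is added to the signal."""
    X = rng.uniform(-2.0, 2.0, size=(n, p))
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-0.5 * sq_dists / lengthscale**2)
    # Sample the GP at the observed features (jitter for numerical stability).
    mu = rng.multivariate_normal(np.zeros(n), K + 1e-8 * np.eye(n))
    Y = mu + rng.normal(0.0, 0.3, size=n)
    return X, Y

rng = np.random.default_rng(0)
X, Y = sample_gp_dataset(50, 1, 0.4, rng)
assert X.shape == (50, 1) and Y.shape == (50,)
```

The 5-dimensional case corresponds to calling `sample_gp_dataset(n, 5, 1.2, rng)`.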

Table 1b displays the performance of the various methods in this setting. Adversarial Monte Carlo noticeably outperforms MAML and MAML-Eq across all settings except the 5-dimensional, n=5 case, where MAML performs slightly better than does AMC. The ordering between AMC and the CNP-based methods varies by sample size. At the smaller sample size considered (n=5), AMC outperforms the robust CNP-based method, namely CNP-Eq, but is outperformed by the non-robust method, namely CNP. In the larger sample size considered (n=50), AMC outperforms both CNP and CNP-Eq. The fact that AMC outperforms CNP in this setting is notable given that CNPs are designed to mimic the desirable properties of Gaussian process regression procedures (Garnelo et al., 2018).

5.3. Comparing to (Regularized) Empirical Risk Minimizers

5.3.1. Preliminaries

We now compare the performance of our approach to that of existing estimators that are commonly used in practice and for which theoretical performance guarantees are available. The examples differ in the definitions of the model 𝒫 and the collection Γ of priors on 𝒫. In each case, Γ satisfies the invariance properties P1, P2, and P3. By the equivariance of the estimators in 𝒯, Theorem 2 shows that it suffices to consider a collection of priors Γ1 with support on 𝒫1. Hence, it suffices to define the collection 𝒫1 ⊆ 𝒫 of distributions P satisfying (6). By P2 and P3, we see that 𝒫 = ∪_{P∈𝒫1} 𝒫(P), where 𝒫(P) consists of the distributions of (a + b⊙X, ã + b̃Y) when (X, Y) ~ P; here, a, b, ã, and b̃ vary over R^p, R₊^p, R, and R₊, respectively. In each setting, the submodel 𝒫1 takes the form

𝒫1 ≡ {P : μP ∈ ℱ, PX ∈ 𝒫X, ϵP ∣ X ~P N(0, 1)}

and the p = 10 dimensional feature X is known to be drawn from a distribution in the set 𝒫X of N(0p, Σ) distributions, where Σ varies over all positive-definite p×p covariance matrices with all diagonal entries equal to 1. The collections ℱ of regression functions differ across the examples and are detailed in the coming subsections. These collections are indexed by a sparsity parameter s that specifies the number of features that may contribute to the regression function μP. In each setting, we considered all four combinations of s ∈ {1, 5} and n ∈ {100, 500}, where n denotes the number of observations in the datasets d used to evaluate the performance of the final learned estimators. For each n, we evaluated the performance of AMC meta-trained with datasets of size nmt = 100 observations (AMC100) and nmt = 500 observations (AMC500).

5.3.2. Sparse Linear Regression

We next considered the setting where μP belongs to a sparse linear model and the feature is p=10 dimensional. In this setting,

ℱ ≡ {x ↦ β⊤x : ‖β‖0 ≤ s, ‖β‖1 ≤ 5}, (7)

where ‖a‖0 ≡ #{j : aj ≠ 0} and ‖a‖1 ≡ Σ_{j=1}^p |aj|. The collection Γ is described in Appendix D.

For each sparsity level s ∈ {1, 5}, we evaluated the performance of the prediction procedure trained at sparsity level s using two priors. Both priors sample the covariance matrix of the feature distribution PX from the Wishart prior ΠX described in Appendix D.2.1 and let β = (α, 0) for a random α satisfying ‖α‖1 ≤ 5. They differ in how α is drawn. Both make use of a uniform draw Z from the boundary of the ℓ1 ball, {a ∈ R^s : ‖a‖1 = 5}. The first sets α = Z, whereas the second sets α = UZ for U ~ Unif(0, 1) drawn independently of Z. We will refer to the two priors as 'boundary' and 'interior', respectively. We refer to the s = 1 and s = 5 cases as the 'sparse' and 'dense' settings, respectively. Further details can be found in Appendix D.2.2.
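A draw of β under either prior can be sketched as follows. The uniform draw from the boundary of the ℓ1 ball is implemented via the standard Dirichlet-magnitudes-with-random-signs construction; the sampler actually used in Appendix D.2.2 may differ, and the function name is ours.

```python
import numpy as np

def sample_beta(s, p, interior, rng):
    """Sample beta = (alpha, 0) as in the two priors described in the text:
    Z is uniform on the boundary of the l1 ball {a in R^s : ||a||_1 = 5},
    and alpha = Z (boundary prior) or alpha = U*Z, U ~ Unif(0, 1) (interior)."""
    # Uniform draw from the l1 sphere: Dirichlet magnitudes plus random signs.
    mags = 5.0 * rng.dirichlet(np.ones(s))
    Z = mags * rng.choice([-1.0, 1.0], size=s)
    alpha = Z * (rng.uniform() if interior else 1.0)
    return np.concatenate([alpha, np.zeros(p - s)])

rng = np.random.default_rng(0)
beta = sample_beta(5, 10, interior=False, rng=rng)
assert np.isclose(np.abs(beta).sum(), 5.0)   # boundary prior: ||beta||_1 = 5
assert np.count_nonzero(beta) <= 5           # sparsity pattern respected

beta_int = sample_beta(5, 10, interior=True, rng=rng)
assert np.abs(beta_int).sum() <= 5.0         # interior prior: ||beta||_1 <= 5
```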

In this example, AMC leverages knowledge of the underlying sparse linear regression model by generating synthetic training data from distributions P for which EP[Y ∣ X = ·] belongs to the class defined in Eq. 7 (see line 5 of Algorithm 1). Therefore, we aimed to compare AMC's performance to that of estimators that also take advantage of this linearity. Ideally, we would compare AMC's performance to that of the true Γ-minimax estimator. Unfortunately, as is the case in most problems, the form of this estimator is not known in this sparse linear regression setting. Therefore, we instead compared AMC's performance to ordinary least squares (OLS) and lasso (Tibshirani, 1996) with tuning parameter selected by 10-fold cross-validation, as implemented in scikit-learn (Pedregosa et al., 2011).

Table 2a displays performance for the sparse setting. We see that AMC outperformed OLS and lasso for the boundary priors, and was outperformed for the interior priors. Surprisingly, AMC500 outperformed AMC100 for the interior prior when n=100 observations were used to evaluate performance. The fact that AMC100 was trained specifically for the n=100 case suggests that a suboptimal equilibrium may have been reached in this setting. Table 2b displays performance for the dense setting. Here AMC always performed at least as well as OLS and lasso when nmt=n, and performed comparably even when nmtn.

Table 2:

MSEs based on datasets of size n in the linear regression settings. Standard errors all < 0.001.

(a) Sparse signal
Boundary Interior
n=100 500 100 500
OLS 0.12 0.02 0.12 0.02
Lasso 0.06 0.01 0.06 0.01
AMC100 (ours) 0.02 <0.01 0.11 0.09
AMC500 (ours) 0.02 <0.01 0.07 0.04
(b) Dense signal
Boundary Interior
n=100 500 100 500
OLS 0.13 0.02 0.13 0.02
Lasso 0.11 0.02 0.09 0.02
AMC100 (ours) 0.10 0.04 0.08 0.02
AMC500 (ours) 0.09 0.02 0.09 0.02

5.3.3. Fused Lasso Additive Model

We next considered the setting where P belongs to a variant of the fused lasso additive model (FLAM) (Petersen et al., 2016) and the feature is p = 10 dimensional. This model enforces that μP belong to a generalized additive model, that only a certain number of the components can differ from the zero function, and that the sum of the total variations of the remaining components is not too large. We recall that the total variation V(f) of f : R → R is equal to the supremum of Σ_{ℓ=1}^k |f(a_{ℓ+1}) − f(a_ℓ)| over all {a_ℓ}_{ℓ=1}^{k+1} such that k ∈ N and a1 < a2 < ⋯ < a_{k+1} (Cohn, 2013). Let v(μ) ≡ (V(μj))_{j=1}^p. Writing xj to denote feature j, the model we considered imposes that μP fall in

ℱ ≡ {x ↦ Σ_{j=1}^p μj(xj) : ‖v(μ)‖1 ≤ M, ‖v(μ)‖0 ≤ s}.

We take M=10 in the experiments in this section. The collection Γ is described in Appendix D.
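The total variation constraint above can be computed numerically on a grid. The sketch below (function names ours) uses the fact that, for a function evaluated on a fine increasing grid, the sum of absolute successive differences approximates V(f), and is exact for monotone functions and for functions whose turning points lie on the grid.

```python
import numpy as np

def total_variation(f_vals):
    """Grid-based total variation of a function evaluated on an increasing
    grid: the sum of |f(a_{l+1}) - f(a_l)| over successive grid points."""
    return np.abs(np.diff(f_vals)).sum()

grid = np.linspace(0.0, 1.0, 1001)

# For a monotone function, the total variation is just the overall range.
assert np.isclose(total_variation(grid**2), 1.0)

# v(mu) collects the componentwise total variations of an additive function;
# the model constrains its l1 norm (by M) and l0 norm (by s).
mu_component_vals = [grid**2, np.zeros_like(grid), np.sin(2 * np.pi * grid)]
v = np.array([total_variation(m) for m in mu_component_vals])
assert (v != 0).sum() <= 2          # only two components are nonzero
assert v.sum() <= 10.0              # within the bound M = 10 used here
```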

In this example, we preprocessed the features before supplying them to the estimator. In particular, we replaced each entry with its rank statistic among the n observations so that, for each i ∈ {1, …, n} and j ∈ {1, …, p}, we replaced xij by Σ_{k=1}^n 1(xkj ≤ xij) and x0j by Σ_{k=1}^n 1(xkj ≤ x0j). This preprocessing step is natural given that the FLAM estimator (Petersen et al., 2016) also only depends on the features through their ranks. An advantage of making this restriction is that, by the homoscedasticity of the errors and the invariance of the rank statistics and total variation to strictly increasing transformations, the learned estimators should perform well even if the feature distributions do not belong to a Gaussian model, but instead belong to a much richer Gaussian copula model.
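The rank preprocessing, together with its invariance to strictly increasing transformations, can be sketched as follows (assuming the usual rank convention of counting training values no larger than the given value; the function name is ours).

```python
import numpy as np

def rank_preprocess(x, x0):
    """Replace each feature value by its rank among the n training
    observations: x_ij -> sum_k 1(x_kj <= x_ij), and likewise for the
    prediction point x0."""
    x_ranks = (x[None, :, :] <= x[:, None, :]).sum(axis=1)   # (n, p) ranks
    x0_ranks = (x <= x0[None, :]).sum(axis=0)                # (p,) ranks
    return x_ranks, x0_ranks

rng = np.random.default_rng(0)
x, x0 = rng.normal(size=(100, 10)), rng.normal(size=10)
xr, x0r = rank_preprocess(x, x0)
assert xr.shape == (100, 10) and x0r.shape == (10,)
assert xr.min() >= 1 and xr.max() <= 100   # ranks lie in {1, ..., n}

# Invariance to strictly increasing transformations of the features: applying
# exp to every feature leaves all ranks unchanged.
xr2, x0r2 = rank_preprocess(np.exp(x), np.exp(x0))
assert np.array_equal(xr, xr2) and np.array_equal(x0r, x0r2)
```

This invariance is what extends the learned estimators' good performance from the Gaussian feature model to the richer Gaussian copula model mentioned above.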

We evaluated the performance of the learned estimators using variants of simulation scenarios 1–4 from Petersen et al. (2016). The level of smoothness varies across the settings (see Fig. 2 in that work). In the variants we considered, the true regression function either contains s0=1 (‘sparse’) or s0=4 (‘dense’) nonzero components. In the sparse setting, we evaluated the performance of the estimators that were meta-trained at sparsity level s=1, and, in the dense setting, we evaluated the performance of the estimators that were meta-trained at s=5. Further details can be found in Appendix D.2.3.

Figure 2:


Improvement of AMC estimators over existing estimators, in terms of differences of cross-validated MSEs of FLAM and AMC FLAM (x-axis) and Lasso and AMC Linear (y-axis). Positive values indicate that AMC outperformed the comparator. AMC performed similarly to or better than existing estimators in settings where the number of features in the dataset was the same as were used in meta-training. As expected, the performance was somewhat worse for datasets that had fewer features than were used during meta-training, though, surprisingly, it was still sometimes better than that of existing methods.

Similarly to the previous example, AMC leverages the knowledge of the possible forms of the regression function that is imposed by ℱ: in this case, the model for the regression function is nonparametric but does impose that this function belongs to a particular sparse generalized additive model. Though there does not exist a competing estimator that is designed to optimize over ℱ, the FLAM estimator (Petersen et al., 2016) optimizes over the somewhat larger, non-sparse model where s = p. We therefore compared the performance of AMC to this estimator as a benchmark, with the understanding that AMC is slightly advantaged in that it has knowledge of the underlying sparsity pattern. Nevertheless, we view this experiment as an important proof-of-concept, as it is the first, to our knowledge, to evaluate whether it is feasible to adversarially meta-learn a prediction procedure within a nonparametric regression model.

To illustrate the kinds of functions that AMC can approximate, Fig. 1 displays examples of AMC500 fits from scenario 3 when (n,s)=(500,1). Table 3 provides a more comprehensive view of the performance of AMC and compares it to that of FLAM. Table 3a displays performance for the sparse setting. The AMC procedures meta-trained with nmt=n observations outperformed FLAM for all of these settings. Interestingly, AMC procedures meta-trained with nmtn also outperformed FLAM in a majority of these settings, suggesting that learned procedures can perform well even at different sample sizes from those at which they were meta-trained. In the dense setting (Table 3b), AMC500 outperformed both AMC100 and FLAM in all but one setting (scenario 4, n=100), and in this setting both AMC100 and AMC500 dramatically outperformed FLAM. The fact that AMC500 also sometimes outperformed AMC100 when n=100 in the linear regression setting suggests that there may be some benefit to training a procedure at a larger sample size than that at which it will be evaluated. We leave an investigation of the generality of this phenomenon to future work.

Figure 1:


Examples of AMC500 fits (thin blue lines) based on n=500 observations drawn from distributions at sparsity level s=1 with four possible signal components (thick black lines). Predictions obtained at different signal feature values with all 9 other features set to zero.

Table 3:

MSEs based on datasets of size n in the FLAM settings. Standard errors for FLAM all < 0.04 and for AMC all < 0.01.

(a) Sparse signal
Scenario 1 Scenario 2 Scenario 3 Scenario 4
n=100 500 100 500 100 500 100 500
FLAM 0.44 0.12 0.47 0.17 0.38 0.11 0.51 0.19
AMC100 (ours) 0.34 0.20 0.18 0.08 0.27 0.14 0.17 0.08
AMC500 (ours) 0.48 0.12 0.19 0.06 0.35 0.10 0.23 0.08
(b) Dense signal
Scenario 1 Scenario 2 Scenario 3 Scenario 4
n=100 500 100 500 100 500 100 500
FLAM 0.59 0.17 0.65 0.24 0.53 0.16 0.76 0.36
AMC100 (ours) 1.20 0.91 0.47 0.39 0.87 0.57 0.30 0.30
AMC500 (ours) 0.58 0.15 0.37 0.08 0.46 0.12 0.36 0.09

5.4. Ablation Study to Evaluate the Performance of Permutation Invariance

We numerically evaluated the utility of imposing invariance in the architecture in Algorithm 2. To do this, we repeated the n = nmt = 100 and n = nmt = 500 FLAM settings, separately modifying the architecture to remove invariance to permutations of the observations and of the features. In the case where the architecture was not invariant to permutations of the observations, we weakened M1 to the condition that m1(vB)··ℓ = m1(v)··ℓ B for all m1 ∈ ℳ1, B ∈ ℬ, v ∈ R^{n×p×2}, and ℓ = 1, …, o1. We used the same architecture as was used in our earlier experiment, except that each layer in Module 1 was replaced by a multi-input-output channel layer that is equivariant to permutations of the p features (Zaheer et al., 2017), and the output of the final layer belonged to R^{p×o1} so that the subsequent mean pooling layer could be removed. In the case where the architecture was not invariant to permutations of the features, we removed conditions M2 and M3 and also weakened M1 to the condition that m1(Av)··ℓ = A m1(v)··ℓ for all m1 ∈ ℳ1, A ∈ 𝒜, v ∈ R^{n×p×2}, and ℓ = 1, …, o1. We used the same architecture as in our earlier experiment except that Modules 2 and 3 were replaced by multilayer perceptrons and each layer in Module 1 was replaced by a multi-input-output channel layer that is equivariant to permutations of the n observations.

Table 4 displays the results. In every setting considered, removing invariance to permutations of the observations led to a marked increase in the MSE of the estimator, with the degradation of performance tending to be worse at the larger sample size. In the most extreme scenario, the MSE of the non-invariant estimator was 38 times higher than that of the invariant estimator. Removing invariance to permutations of the features also tended to worsen performance, sometimes by a factor of 2 or 3, though there were a few settings where performance improved slightly (no more than 5%). Taken together, these results suggest that a priori enforcing that the estimator is invariant to permutations of the features and observations can dramatically improve performance.

Table 4:

Fold-change in MSEs for modifications of AMC in the FLAM settings with n = nmt, as compared to the performances of the invariant AMC estimators listed in Table 3. Standard errors all ≤ 0.03 times the fold-change in the MSE.

(a) Sparse signal

                                           Scenario 1    Scenario 2    Scenario 3    Scenario 4
                                           n=100  500    100    500    100    500    100    500
Not invariant to perms. of observations    6.98   38.29  5.82   29.93  5.03   27.58  4.29   13.08
Not invariant to perms. of features        1.01   0.95   1.16   1.09   1.02   0.98   1.01   0.99

(b) Dense signal

                                           Scenario 1    Scenario 2    Scenario 3    Scenario 4
                                           n=100  500    100    500    100    500    100    500
Not invariant to perms. of observations    1.86   14.68  1.69   8.60   1.97   14.20  1.51   4.70
Not invariant to perms. of features        1.05   2.55   0.99   1.98   1.09   3.02   1.04   1.67

6. Data Experiments

We also used real datasets to evaluate the performance of AMC100 estimators meta-trained in sparse linear regression settings (Section 5.3.2) or fused lasso additive model settings (Section 5.3.3). We compared the performance of our estimators to the estimators from our numerical experiments, namely, the OLS, lasso, and FLAM estimators. These estimators are natural comparators because they assume the same or similar models as do our AMC estimators; consequently, comparing to these estimators allows us to focus our discussion on differences in the performance of existing estimation strategies as compared to that of new meta-learned strategies, rather than on differences in underlying assumptions that could potentially be resolved by training a new AMC estimator in a different model.

Because the implementations of lasso and FLAM that we compared to both use 10-fold cross-validation to select tuning parameters, we also used 10-fold cross-validation to select tuning parameters for the AMC100 estimators. The first of these estimators, which we refer to as "AMC Linear", selects a tuning parameter s ∈ {1, 2, …, 10} by finding the value of s for which the cross-validated MSE of an AMC100 estimator trained in the sparse linear regression setting with sparsity level s is minimal. The final prediction then corresponds to that returned by the AMC100 estimator trained in the model with this selected value of s. The second, which we refer to as "AMC FLAM", selects two tuning parameters, one of which reflects the sparsity level s of the problem and the other of which corresponds to the bound M on the sum of the variation norms of the μj components in the fused lasso additive model. In particular, the tuning parameters (s, M) ∈ {1, 2, …, 10} × {5, 10, 20} are chosen to be those that minimize the cross-validated MSE of an AMC100 estimator trained in the fused lasso additive model with parameters (s, M). Notably, each candidate estimator considered by AMC Linear and AMC FLAM only has access to 90, rather than 100, observations when selecting tuning parameter values using 10-fold cross-validation on a dataset of size n = 100. This does not pose a problem because, as was noted in Section 4.3, the trained estimators can be evaluated at different sample sizes than those at which they were trained.

In settings where both AMC-trained estimators and other estimators are available, it is natural to wonder whether there is a way to capitalize on the availability of both types of methods. Ensemble algorithms provide a natural means to do this, with stacked ensembles representing an especially appealing option given theoretical guarantees that adding base learners will not typically degrade performance (Van der Vaart et al., 2006; Van der Laan et al., 2007) and existing experiments showing that they often outperform all included base learners (e.g., Polley and Van der Laan, 2010). We, therefore, evaluate the performance of three stacked ensembles in these experiments. The first includes only the AMC Linear and AMC FLAM estimators as base learners. The second only includes the OLS, lasso, and FLAM estimators. The third includes all five of these estimators. Predictions of the base learners were combined using 10-fold cross-validation. Following the recommendation of Breiman (1996), we employed a non-negative least squares estimator for this combination step.
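The non-negative least squares combination step can be sketched as follows. A library routine such as `scipy.optimize.nnls` could be used for the constrained fit; the simple projected-gradient solver below avoids that dependency, and the function name and toy base learners are ours.

```python
import numpy as np

def nnls_weights(cv_preds, y, n_iter=5000):
    """Combine base-learner cross-validated predictions via non-negative
    least squares (per Breiman, 1996), using a projected-gradient solver.
    cv_preds: (n, k) matrix of cross-validated predictions; y: (n,) outcomes."""
    n, k = cv_preds.shape
    w = np.ones(k) / k
    step = 1.0 / (np.linalg.norm(cv_preds, 2) ** 2 + 1e-12)  # 1/Lipschitz
    for _ in range(n_iter):
        grad = cv_preds.T @ (cv_preds @ w - y)
        w = np.maximum(w - step * grad, 0.0)  # project onto nonneg. orthant
    return w

# Two toy base learners: one nearly matches the truth, one is pure noise.
rng = np.random.default_rng(0)
y = rng.normal(size=2000)
cv_preds = np.column_stack([y + 0.01 * rng.normal(size=2000),
                            rng.normal(size=2000)])
w = nnls_weights(cv_preds, y)
assert w.min() >= 0.0                 # non-negativity of ensemble weights
assert w[0] > 0.9 and w[1] < 0.1      # weight concentrates on the good learner
```

In the actual stacking procedure, the columns of `cv_preds` would hold the 10-fold cross-validated predictions of the AMC and/or classical base learners.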

Our experiments make use of ten datasets. Six of these datasets are available through the University of California, Irvine (UCI) Machine Learning Repository (Dua and Graff, 2017), three were used to illustrate supervised learning machines in popular statistical learning textbooks (Friedman et al., 2001; James et al., 2013), and one was used as an illustrative example in the paper that introduced FLAM (Petersen et al., 2016). All of these datasets contain more than 100 observations. Five of them have at least 10 features and the others have fewer (5, 6, 6, 7, and 9). All outcomes are standardized to have empirical variance 1 so that, for each dataset, the cross-validated MSE performance of a sample mean for predicting the outcome is approximately 1. Further details on these datasets can be found in Appendix E.1.
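As a quick check of the standardization step described above: after rescaling the outcome to have empirical variance 1, the in-sample MSE of the sample-mean predictor is exactly 1 (its cross-validated counterpart is then approximately 1). A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
y_raw = 3.0 + 5.0 * rng.normal(size=500)      # outcome on its original scale
y = (y_raw - y_raw.mean()) / y_raw.std()      # standardize to empirical variance 1

# MSE of predicting every outcome by the sample mean
baseline_mse = np.mean((y - y.mean()) ** 2)   # equals the empirical variance, i.e. 1
```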

We evaluated our learned estimators in three settings. First, we considered the case where the number of features in the datasets matched the number that they saw during training, namely 10. In particular, we evaluated the performance of AMC Linear and AMC FLAM in the 5 datasets that have 10 or more features by randomly selecting 100 observations and 10 features from each dataset and evaluating MSE on the held out observations. This and all other Monte Carlo evaluations of MSE described in what follows were repeated 200 times and averaged across the replications. Second, we evaluated the robustness of our learned estimators to a key assumption used during training. In particular, we evaluated the performance of our estimators on the 5 datasets that have fewer features than the 10 used during meta-training, again sampling 100 observations and evaluating MSE on the held out observations. Third, we evaluated the relative performance of our estimators at varying levels of signal sparsity for each of the ten datasets. In particular, for each training-test split of the data, we selected s total features from the dataset, removed the remaining features, and then included (10−s) Gaussian noise features so that the dimension of the feature was always p=10.
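The feature construction used in the third evaluation can be sketched as follows; the function name is ours, not the authors'.

```python
import numpy as np

def sparsify_features(X, s, p_total=10, rng=None):
    """Keep s randomly chosen columns of X and append Gaussian noise
    columns so that the result always has p_total features."""
    rng = rng if rng is not None else np.random.default_rng()
    n, p = X.shape
    keep = rng.choice(p, size=s, replace=False)   # s signal-bearing features
    noise = rng.normal(size=(n, p_total - s))     # (p_total - s) pure-noise features
    return np.column_stack([X[:, keep], noise])

X = np.random.default_rng(0).normal(size=(100, 6))
X_new = sparsify_features(X, s=4, rng=np.random.default_rng(1))  # shape (100, 10)
```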

We first discuss performance on datasets with the same number of features as were used during meta-training. Complete numerical results for estimator performance can be found in Table S5 in Appendix E.2. Here, we focus on graphical summaries of performance to communicate the key trends that we saw. Figure 2a shows that AMC FLAM performed similarly to or better than FLAM across all settings, and AMC Linear performed similarly to lasso across all settings. We have compared AMC Linear to lasso as a baseline in this figure because lasso performed similarly to or better than OLS across all settings. Figure 3a shows that stacking all available base learners consistently yielded better performance than did stacking only the existing estimators or only the AMC estimators. This stacked ensemble also outperformed all base learners considered. These results suggest that incorporating AMC estimators into regression pipelines can reliably lead to improved predictions even in settings where performant learners are already available.

Figure 3:

Improvement of the stacked ensemble algorithm that includes all base learners over those which only include a subset (existing learners or AMC learners), in terms of differences of cross-validated MSEs. Including both AMC and existing estimators as base learners always outperformed only including a subset when the dataset contained the same number of features as were used during training. Adding AMC base learners did not tend to improve performance when the dataset had fewer features than were used during meta-training, though any degradation in performance was minimal.

We now discuss performance on datasets with fewer features than were used during meta-training, which Figure 2b displays. Unsurprisingly, performance was somewhat less desirable than it was for datasets with the same number of features as were used during meta-training. AMC FLAM tended to be somewhat outperformed by FLAM, though it did outperform FLAM in one setting. AMC Linear continued to perform similarly to lasso across all settings. Figure 3 shows that stacking all available base learners outperformed stacking only AMC estimators, and performed similarly to stacking only existing estimators.

We conclude by discussing the performance of the estimators when we induce varying levels of signal sparsity. Figure 4 shows that AMC FLAM outperformed FLAM for the vast majority of datasets and sparsity patterns. The only exception to this trend occurred for the yacht dataset and the LAozone dataset for denser signals (7, 8, or 9 signal features), where AMC FLAM was slightly outperformed by FLAM. Figure S6 in the appendix shows that AMC Linear consistently outperformed OLS and performed comparably to or slightly better than lasso in most settings.

Figure 4:

Performance of FLAM and AMC FLAM at different sparsity levels. For each training-validation split of the data, between 1 and q features are selected at random from the original dataset (x-axis), where q is the minimum of 10 and the total number of features in the dataset, and Gaussian noise features are then added so that there are 10 total features. Therefore, the signal is expected to become denser and stronger as the x-axis value increases. AMC FLAM outperforms FLAM in most settings.

Figure S7 shows that there was not a major difference between the cross-validated MSE of the three stacking algorithms. Nevertheless, it is worth noting that stacking all available base learners did outperform the other two stacking schemes in 53% of the 83 dataset-sparsity settings considered, with the stacking scheme that only included AMC algorithms performing best in 39% of the settings and the scheme that only included existing algorithms performing best in only 8% of these settings. Thus, we again see evidence that including AMC base learners in a stacked ensemble can improve performance, even when other learners are already available.

7. Proofs

7.1. A Study of Group Actions that are Useful for Our Setting

To prove Theorem 1, it will be convenient to use tools from group theory to describe and study the behavior of our estimation problem under the shifts, rescalings, and permutations that we consider. For $k\in\mathbb{N}$, let $\mathrm{Sym}(k)$ be the symmetric group on $k$ symbols. Let $\mathbb{R}\rtimes\mathbb{R}_+$ be the semidirect product of the real numbers with the positive real numbers, with the group multiplication

$$\left(a_1,b_1\right)\left(a_2,b_2\right)=\left(a_1+b_1a_2,\ b_1b_2\right).$$

Define $\mathcal{G}_0\equiv\left[\left(\mathbb{R}\rtimes\mathbb{R}_+\right)\times\left(\mathbb{R}\rtimes\mathbb{R}_+\right)^p\right]\rtimes\left[\mathrm{Sym}(p)\times\mathrm{Sym}(n)\right]$. Let $\mathcal{O}_n\equiv\left\{a\in\mathbb{R}^n:\bar{a}=0,\ s(a)=1\right\}$. Throughout we equip $\mathcal{G}_0$ with the product topology.
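As a quick sanity check on the group law above (this snippet is ours and is not part of the proof): $\mathbb{R}\rtimes\mathbb{R}_+$ is exactly the group of orientation-preserving affine maps $y\mapsto a+by$ under composition, and the stated multiplication is associative with identity $(0,1)$.

```python
import numpy as np

def mult(g1, g2):
    """Group law of R ⋊ R_+: (a1,b1)(a2,b2) = (a1 + b1*a2, b1*b2)."""
    a1, b1 = g1; a2, b2 = g2
    return (a1 + b1 * a2, b1 * b2)

def act(g, y):
    """The affine action (a, b)·y = a + b*y."""
    a, b = g
    return a + b * y

rng = np.random.default_rng(0)
g1 = (rng.normal(), rng.uniform(0.1, 2.0))
g2 = (rng.normal(), rng.uniform(0.1, 2.0))
g3 = (rng.normal(), rng.uniform(0.1, 2.0))
y = rng.normal()
e = (0.0, 1.0)                     # identity element

lhs = act(mult(g1, g2), y)         # (g1 g2)·y
rhs = act(g1, act(g2, y))          # g1·(g2·y): the group law is composition
assoc_l = mult(mult(g1, g2), g3)   # associativity check
assoc_r = mult(g1, mult(g2, g3))
```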

We note that the set $\mathcal{Z}$ defined in Section 2.1 can be written as

$$\mathcal{Z}=\mathcal{O}_n^p\times\mathcal{O}_n\times\mathbb{R}^p\times\mathbb{R}^p\times\mathbb{R}\times\mathbb{R}^p\times\mathbb{R}. \quad (8)$$

Denote the generic group element by $g=\big(\big(g_j^+,g_j^\times\big)_{j=0}^p,\ \tau_g,\ \eta_g\big)$, where $\big(g_j^+,g_j^\times\big)\in\mathbb{R}\rtimes\mathbb{R}_+$, $\tau_g\in\mathrm{Sym}(p)$, and $\eta_g\in\mathrm{Sym}(n)$. Denote the generic element $z\in\mathcal{Z}$ by

$$z=\Big(\big(z_{x,1,j},\ldots,z_{x,n,j}\big)_{j=1}^p,\ \big(z_{y,1},\ldots,z_{y,n}\big),\ \big(z_{x_0,j}\big)_{j=1}^p,\ \big(z_{\bar{x},j}\big)_{j=1}^p,\ z_{\bar{y}},\ \big(z_{s(x),j}\big)_{j=1}^p,\ z_{s(y)}\Big).$$

For two arbitrary elements $g_1=\big(\big(g_{1,j}^+,g_{1,j}^\times\big)_{j=0}^p,\tau_1,\eta_1\big)$ and $g_2=\big(\big(g_{2,j}^+,g_{2,j}^\times\big)_{j=0}^p,\tau_2,\eta_2\big)$ in $\mathcal{G}_0$, define the group multiplication as

$$g_1g_2=\Big(\big(g_{1,0}^++g_{1,0}^\times g_{2,0}^+,\ g_{1,0}^\times g_{2,0}^\times\big),\ \big(g_{1,j}^++g_{1,j}^\times g_{2,\tau_1^{-1}(j)}^+,\ g_{1,j}^\times g_{2,\tau_1^{-1}(j)}^\times\big)_{j=1}^p,\ \tau_1\tau_2,\ \eta_1\eta_2\Big).$$

Define the group action $\mathcal{G}_0\times\mathcal{Z}\to\mathcal{Z}$ by

$$\begin{aligned}
(gz)_{x,i,j}&=z_{x,\eta_g^{-1}(i),\tau_g^{-1}(j)}, & (gz)_{y,i}&=z_{y,\eta_g^{-1}(i)}, & (gz)_{x_0,j}&=z_{x_0,\tau_g^{-1}(j)},\\
(gz)_{\bar{x},j}&=g_j^++g_j^\times z_{\bar{x},\tau_g^{-1}(j)}, & (gz)_{\bar{y}}&=g_0^++g_0^\times z_{\bar{y}},\\
(gz)_{s(x),j}&=\log g_j^\times+z_{s(x),\tau_g^{-1}(j)}, & (gz)_{s(y)}&=\log g_0^\times+z_{s(y)},
\end{aligned}$$

where $i\in\{1,2,\ldots,n\}$ and $j\in\{1,2,\ldots,p\}$.

We make use of the following result, without explicit statement, throughout the remainder of this section.

Lemma 1. The map defined above is a left group action.

Proof. The identity axiom, namely that $ez=z$ when $e$ is the identity element of $\mathcal{G}_0$, is straightforward to verify, and so we omit the arguments. Fix $g_1,g_2\in\mathcal{G}_0$ and $z\in\mathcal{Z}$. We establish compatibility by showing that $(g_1g_2)z=g_1(g_2z)$. To see that this is indeed the case, note that, for all $i\in\{1,\ldots,n\}$ and $j\in\{1,\ldots,p\}$:

$$\begin{aligned}
\big((g_1g_2)z\big)_{y,i}&=z_{y,(\eta_1\eta_2)^{-1}(i)}=z_{y,\eta_2^{-1}\eta_1^{-1}(i)}=(g_2z)_{y,\eta_1^{-1}(i)}=\big(g_1(g_2z)\big)_{y,i},\\
\big((g_1g_2)z\big)_{x,i,j}&=z_{x,\eta_2^{-1}\eta_1^{-1}(i),\tau_2^{-1}\tau_1^{-1}(j)}=(g_2z)_{x,\eta_1^{-1}(i),\tau_1^{-1}(j)}=\big(g_1(g_2z)\big)_{x,i,j},\\
\big((g_1g_2)z\big)_{x_0,j}&=z_{x_0,\tau_2^{-1}\tau_1^{-1}(j)}=(g_2z)_{x_0,\tau_1^{-1}(j)}=\big(g_1(g_2z)\big)_{x_0,j},\\
\big((g_1g_2)z\big)_{\bar{x},j}&=g_{1,j}^++g_{1,j}^\times g_{2,\tau_1^{-1}(j)}^++g_{1,j}^\times g_{2,\tau_1^{-1}(j)}^\times z_{\bar{x},\tau_2^{-1}\tau_1^{-1}(j)}=g_{1,j}^++g_{1,j}^\times(g_2z)_{\bar{x},\tau_1^{-1}(j)}=\big(g_1(g_2z)\big)_{\bar{x},j},\\
\big((g_1g_2)z\big)_{\bar{y}}&=g_{1,0}^++g_{1,0}^\times g_{2,0}^++g_{1,0}^\times g_{2,0}^\times z_{\bar{y}}=g_{1,0}^++g_{1,0}^\times(g_2z)_{\bar{y}}=\big(g_1(g_2z)\big)_{\bar{y}},\\
\big((g_1g_2)z\big)_{s(x),j}&=\log\big(g_{1,j}^\times g_{2,\tau_1^{-1}(j)}^\times\big)+z_{s(x),\tau_2^{-1}\tau_1^{-1}(j)}=\log g_{1,j}^\times+(g_2z)_{s(x),\tau_1^{-1}(j)}=\big(g_1(g_2z)\big)_{s(x),j},\\
\big((g_1g_2)z\big)_{s(y)}&=\log\big(g_{1,0}^\times g_{2,0}^\times\big)+z_{s(y)}=\log g_{1,0}^\times+(g_2z)_{s(y)}=\big(g_1(g_2z)\big)_{s(y)}. \qquad\square
\end{aligned}$$

We now introduce several group actions that we will make heavy use of in our proof of Theorem 1 and in the lemmas that precede it. We first define $\mathcal{G}_0\times\mathcal{S}\to\mathcal{S}$. For $S\in\mathcal{S}$ and $g\in\mathcal{G}_0$, define $gS$ by $(gS)(z)=S(gz)$. Conditions T4 and T5 can be restated as $gS\in\mathcal{S}$ for all $g\in\mathcal{G}_0$ and $S\in\mathcal{S}$. It can then readily be shown that, under these conditions, the defined map is a left group action. For $T\in\mathcal{T}$, we will write $gT$ to denote the $\mathcal{D}\to(\mathcal{X}\to\mathbb{R})$ operator defined so that

$$(gT)(d):x_0\mapsto\begin{cases}\bar{y}+s(y)\,\big(gS_T\big)\big(z_{d,x_0}\big), & \text{if }(d,x_0)\in\mathcal{D}_0,\\ 0, & \text{otherwise.}\end{cases}$$

It is possible that $gT$ does not belong to $\mathcal{T}$ due to its behavior when $(d,x_0)\notin\mathcal{D}_0$, and therefore that the defined map is not a group action. Nonetheless, because $\mathcal{D}_0$ has $P$-probability one for any $P\in\mathcal{P}$, this fact will not pose any difficulties in our arguments.

We now define the group action $\mathcal{G}_0\times(\mathcal{Y}\times\mathcal{X})\to(\mathcal{Y}\times\mathcal{X})$. For $(y,x)\in\mathbb{R}\times\mathbb{R}^p$, define $g(y,x)$ as

$$g(y,x)=\Big(g_0^++g_0^\times y,\ \big(g_j^++g_j^\times x_{\tau_g^{-1}(j)}\big)_{j=1}^p\Big).$$

Similar arguments to those used to prove Lemma 1 show that the map defined above is a left group action. We now define the group action $\mathcal{G}_0\times\mathcal{P}\to\mathcal{P}$. For $P\in\mathcal{P}$ and $g\in\mathcal{G}_0$, define $gP=P\circ g^{-1}$ by $(gP)(U)=P\big(g^{-1}(U)\big)$, where

$$g^{-1}(U)=\big\{(y,x)\in\mathbb{R}^{p+1}:g(y,x)\in U\big\}.$$

Under P1, P2, and P3, which, as noted in Section 2.1, implicitly encode that $P\circ g^{-1}\in\mathcal{P}$, it can readily be shown that the defined map is a left group action. Finally, we define the group action $\mathcal{G}_0\times\Gamma\to\Gamma$. For $\Pi\in\Gamma$ and $g\in\mathcal{G}_0$, define $g\Pi=\Pi\circ g^{-1}$ by $(g\Pi)(U)=\Pi\big(g^{-1}(U)\big)$, where

$$g^{-1}(U)=\{P\in\mathcal{P}:gP\in U\}.$$

We can restate P1, P2, and P3 as requiring that $\Pi\circ g^{-1}\in\Gamma$ for all $\Pi\in\Gamma$ and $g\in\mathcal{G}_0$. Under these conditions, it can be shown that the defined map is a left group action.

We now show that 𝒢0 is amenable — see Appendix A for a review of this concept. Establishing this fact will allow us to apply Day’s fixed point theorem (Theorem S1 in Appendix A) in the upcoming proof of Theorem 1.

Lemma 2. 𝒢0 is amenable.

Proof. Because Sym(p) and Sym(n) are finite groups, they are compact, and therefore amenable. Because $\mathbb{R}$ and $\mathbb{R}_+$ are Abelian, they are also amenable. By Theorem S19, group extensions of amenable groups are amenable; because $\mathcal{G}_0$ is built from these amenable groups via direct products and semidirect products, which are group extensions, $\mathcal{G}_0$ is amenable. □

7.2. Proofs of Theorems 1 through 4

This section is organized as follows. Section 7.2.1 introduces three general lemmas that will be useful in proving the results from the main text. Section 7.2.2 proves several lemmas, proves the variant of the Hunt-Stein theorem from the main text (Theorem 1), and concludes with a discussion of the relation of this result to those in Le Cam (2012). Section 7.2.3 establishes a preliminary lemma and then proves that, when the class of estimators is equivariant, it suffices to restrict attention to priors in Γ1 when aiming to learn a Γ-minimax estimator (Theorem 2). Section 7.2.4 establishes several lemmas, including a minimax theorem for our setting, before proving the existence of an equilibrium point (Theorem 3). Section 7.2.5 establishes the equivariance of our proposed neural network architecture (Theorem 4).

In this section, we always equip $C(\mathcal{Z},\mathbb{R})$ with the topology of compact convergence and, whenever T2 holds so that $\mathcal{S}\subseteq C(\mathcal{Z},\mathbb{R})$, we equip $\mathcal{S}$ with the subspace topology. For a fixed compact $\mathcal{K}\subseteq\mathcal{Z}$ and a function $h\in C(\mathcal{Z},\mathbb{R})$, we also let $\|h\|_{\infty,\mathcal{K}}\equiv\sup_{z\in\mathcal{K}}|h(z)|$.

7.2.1. Preliminary lemmas

We now prove three lemmas that will be used in our proofs of Theorems 1 and 3.

Lemma 3. C(Ƶ,R) with the compact-open topology is metrizable.

Proof. See Example IV.2.2 in Conway (2010). □

As a consequence of the above, we can show that a subset of $C(\mathcal{Z},\mathbb{R})$ is closed by showing that it is sequentially closed, and we can show that a function defined on $C(\mathcal{Z},\mathbb{R})$ is continuous by showing that it is sequentially continuous.

Lemma 4. If T1, T2, and T3 hold, then 𝒮 is a compact subset of C(Ƶ,R).

Proof. By T1, $\mathcal{S}$ is pointwise bounded. Moreover, the local Hölder condition T2 implies that $\mathcal{S}$ is equicontinuous, in the sense that, for every $\epsilon>0$ and every $z\in\mathcal{Z}$, there exists an open neighborhood $\mathcal{U}\subseteq\mathcal{Z}$ of $z$ such that, for all $S\in\mathcal{S}$ and all $z'\in\mathcal{U}$, it holds that $|S(z)-S(z')|<\epsilon$. Hence, by the Arzelà-Ascoli theorem (see Theorem 47.1 in Munkres, 2000 for a convenient version), $\mathcal{S}$ is a relatively compact subset of $C(\mathcal{Z},\mathbb{R})$. By T3, $\mathcal{S}$ is closed, and therefore $\mathcal{S}$ is compact. □

We now show that the group action 𝒢0×𝒮𝒮 is continuous under conditions that we assume in Theorem 1. Establishing this continuity condition is necessary for our use of Day’s fixed point theorem in the upcoming proof of that result.

Lemma 5. If T2, T4, and T5 hold, then the group action $\mathcal{G}_0\times\mathcal{S}\to\mathcal{S}$ is continuous.

Proof. By T4 and T5, $\mathcal{G}_0\times\mathcal{S}\to\mathcal{S}$ is indeed a group action. Also, by T2 and Lemma 3, $\mathcal{S}$ is metrizable. Recall the expression for $\mathcal{Z}$ given in (8) and that

$$\mathcal{G}_0\equiv\left[\left(\mathbb{R}\rtimes\mathbb{R}_+\right)\times\left(\mathbb{R}\rtimes\mathbb{R}_+\right)^p\right]\rtimes\left[\mathrm{Sym}(p)\times\mathrm{Sym}(n)\right].$$

The product topology is compatible with semidirect products, and so the fact that each factor is a metric space implies that $\mathcal{G}_0$ is a metric space. Hence, it suffices to show sequential continuity. Let $\{(g_k,S_k)\}_{k=1}^\infty$ be a sequence in $\mathcal{G}_0\times\mathcal{S}$ such that $(g_k,S_k)\to(g,S)$, where $(g,S)\in\mathcal{G}_0\times\mathcal{S}$. By the definition of the product metric, $g_k\to g$ and $S_k\to S$. Let $\mathcal{K}_1\subseteq\mathcal{O}_n^p$, $\mathcal{K}_2\subseteq\mathcal{O}_n$, $\mathcal{K}_3\subseteq\mathbb{R}^p$, $\mathcal{K}_4\subseteq\mathbb{R}^p$, $\mathcal{K}_5\subseteq\mathbb{R}$, $\mathcal{K}_6\subseteq\mathbb{R}^p$, and $\mathcal{K}_7\subseteq\mathbb{R}$ be compact sets. Since each compact set $\mathcal{K}\subseteq\mathcal{Z}$ is contained in such a product $\prod_{i=1}^7\mathcal{K}_i$, it suffices to show that

$$\sup_{z\in\prod_{i=1}^7\mathcal{K}_i}\big|\big(g_kS_k\big)(z)-(gS)(z)\big|=\big\|g_kS_k-gS\big\|_{\infty,\prod_{i=1}^7\mathcal{K}_i}\to 0$$

for arbitrary compact sets $\mathcal{K}_1,\ldots,\mathcal{K}_7$. To show this, we will use the decomposition $g_k=\big(g_{k,1},g_{k,2},g_{k,3},g_{k,4}\big)$, where $g_{k,1}\in\mathbb{R}\rtimes\mathbb{R}_+$, $g_{k,2}\in\left(\mathbb{R}\rtimes\mathbb{R}_+\right)^p$, $g_{k,3}\in\mathrm{Sym}(p)$, and $g_{k,4}\in\mathrm{Sym}(n)$. We similarly use the decomposition $g=\big(g_1,g_2,g_3,g_4\big)$. For $N$ large enough, the following statements are true for all $k>N$: $g_{k,3}=g_3$, $g_{k,4}=g_4$, $g_{k,1}$ is contained in a compact neighbourhood $C_1$ of $g_1$, and $g_{k,2}$ is contained in a compact neighbourhood $C_2$ of $g_2$.

Since permutations are continuous, $g_4\mathcal{K}_1g_3\equiv\{g_4wg_3:w\in\mathcal{K}_1\}$, $g_4\mathcal{K}_2\equiv\{g_4w:w\in\mathcal{K}_2\}$, and $\mathcal{K}_jg_3\equiv\{wg_3:w\in\mathcal{K}_j\}$, $j=3,4,6$, are compact. In the following we use the decomposition $g'\equiv(g_1',g_2',g_3',g_4')$ for an arbitrary element $g'\in\mathcal{G}_0$. Since addition and multiplication are continuous, $C_2\mathcal{K}_3g_3\equiv\{g_2'w:g_2'\in C_2,\,w\in\mathcal{K}_3g_3\}$, $C_2\mathcal{K}_4g_3\equiv\{g_2'w:g_2'\in C_2,\,w\in\mathcal{K}_4g_3\}$, $C_1\mathcal{K}_5\equiv\{g_1'w:g_1'\in C_1,\,w\in\mathcal{K}_5\}$, $C_2\mathcal{K}_6g_3\equiv\{g_2'w:g_2'\in C_2,\,w\in\mathcal{K}_6g_3\}$, and $C_1\mathcal{K}_7\equiv\{g_1'w:g_1'\in C_1,\,w\in\mathcal{K}_7\}$ are compact. Define $\mathcal{K}$ to be the compact set

$$\mathcal{K}=g_4\mathcal{K}_1g_3\times g_4\mathcal{K}_2\times C_2\mathcal{K}_3g_3\times C_2\mathcal{K}_4g_3\times C_1\mathcal{K}_5\times C_2\mathcal{K}_6g_3\times C_1\mathcal{K}_7.$$

Then,

$$\big\|g_kS_k-gS\big\|_{\infty,\prod_{i=1}^7\mathcal{K}_i}\le\big\|S_k-S\big\|_{\infty,\mathcal{K}}\to 0. \qquad\square$$

7.2.2. Proof of Theorem 1

We begin this subsection with four lemmas and then we prove Theorem 1. Following this proof, we briefly describe how the argument relates to that given in Le Cam (2012). In the proof of Theorem 1, we will use notation that we established about the group 𝒢0 in Section 7.1. We refer the reader to that section for details.

Lemma 6. For any $g\in\mathcal{G}_0$, $T\in\mathcal{T}$, and $P\in\mathcal{P}$, $R(gT,P)=R(T,gP)$.

Proof. Fix $T\in\mathcal{T}$ and $P\in\mathcal{P}$, and let $S\equiv S_T$, where $S_T$ is defined in (3). By the change-of-variables formula,

$$R(gT,P)=E_P\!\left[\sigma_P^{-2}\int\Big(\bar{Y}+s(Y)S(gZ)-\mu_P(x_0)\Big)^2dP_X(x_0)\right]=E_{Pg^{-1}}\!\left[\sigma_P^{-2}\int\Big(\overline{g^{-1}Y}+s\big(g^{-1}Y\big)S(Z)-\mu_P\big(g^{-1}x_0\big)\Big)^2dP_X\big(g^{-1}x_0\big)\right].$$

Plugging in the fact that $g^{-1}y=\big(y-g_0^+\big)/g_0^\times$ and that

$$\mu_P\big(g^{-1}x_0\big)=E_P\big[Y\mid X=g^{-1}x_0\big]=E_P[Y\mid gX=x_0]=\frac{E_P[gY\mid gX=x_0]-g_0^+}{g_0^\times}=\frac{\mu_{Pg^{-1}}(x_0)-g_0^+}{g_0^\times}$$

into the right-hand side of the preceding display yields that

$$R(gT,P)=E_{Pg^{-1}}\!\left[\sigma_P^{-2}\int\left(\frac{\bar{Y}-g_0^+}{g_0^\times}+s\!\left(\frac{Y-g_0^+}{g_0^\times}\right)S(Z)-\frac{\mu_{Pg^{-1}}(x_0)-g_0^+}{g_0^\times}\right)^2dP_X\big(g^{-1}x_0\big)\right]=E_{Pg^{-1}}\!\left[\sigma_P^{-2}\int\left(\frac{\bar{Y}}{g_0^\times}+s\!\left(\frac{Y-g_0^+}{g_0^\times}\right)S(Z)-\frac{\mu_{Pg^{-1}}(x_0)}{g_0^\times}\right)^2dP_X\big(g^{-1}x_0\big)\right].$$

By the shift and scale properties of the standard deviation and variance, the above continues as

$$=E_{Pg^{-1}}\!\left[\sigma_P^{-2}\int\left(\frac{\bar{Y}}{g_0^\times}+\frac{s(Y)}{g_0^\times}S(Z)-\frac{\mu_{Pg^{-1}}(x_0)}{g_0^\times}\right)^2dP_X\big(g^{-1}x_0\big)\right]=E_{Pg^{-1}}\!\left[\sigma_{Pg^{-1}}^{-2}\int\Big(\bar{Y}+s(Y)S(Z)-\mu_{Pg^{-1}}(x_0)\Big)^2dP_X\big(g^{-1}x_0\big)\right]=R(T,gP). \qquad\square$$

Lemma 7. For any $g\in\mathcal{G}_0$, $T\in\mathcal{T}$, and $\Pi\in\Gamma$, it holds that $r(gT,\Pi)=r(T,g\Pi)$.

Proof. This result follows quickly from Lemma 6. Indeed, for any $g\in\mathcal{G}_0$, $T\in\mathcal{T}$, and $\Pi\in\Gamma$,

$$r(gT,\Pi)=\int R(gT,P)\,d\Pi(P)=\int R(T,gP)\,d\Pi(P)=\int R(T,P)\,d\big(\Pi\circ g^{-1}\big)(P)=r(T,g\Pi). \qquad\square$$

Let $\mathcal{S}_e\equiv\{S\in\mathcal{S}:gS=S\text{ for all }g\in\mathcal{G}_0\}$ denote the set of $\mathcal{G}_0$-invariant elements of $\mathcal{S}$. The following fact will be useful when proving Theorem 1, and also when proving results in the upcoming Section 7.2.3.

Lemma 8. It holds that $\mathcal{S}_e=\{S_T:T\in\mathcal{T}_e\}$.

Proof. Fix $S\in\mathcal{S}_e$ and $g\in\mathcal{G}_0$. By the definition of $\mathcal{S}\equiv\{S_T:T\in\mathcal{T}\}$, there exists a $T\in\mathcal{T}$ such that $S=S_T$. For this $T$, the fact that $S_T(z)=S_T(gz)$ implies that

$$T(gz)=g_0^++g_0^\times\bar{y}+g_0^\times s(y)S_T(gz)=g_0^++g_0^\times\bar{y}+g_0^\times s(y)S_T(z)=g_0^++g_0^\times\big(\bar{y}+s(y)S_T(z)\big)=g_0^++g_0^\times T(z).$$

As $g$ was arbitrary, $T\in\mathcal{T}_e$. Hence, $\mathcal{S}_e\subseteq\{S_T:T\in\mathcal{T}_e\}$.

Now fix $T\in\mathcal{T}_e$ and $g\in\mathcal{G}_0$. Note that $S_T(z)=[T(z)-\bar{y}]/s(y)$. Using that $T\in\mathcal{T}_e$ implies that $T(gz)=g_0^++g_0^\times T(z)$, we see that

$$S_T(gz)=\frac{T(gz)-g_0^+-g_0^\times\bar{y}}{s(gy)}=\frac{T(gz)-g_0^+-g_0^\times\bar{y}}{g_0^\times s(y)}=\frac{g_0^++g_0^\times T(z)-g_0^+-g_0^\times\bar{y}}{g_0^\times s(y)}=\frac{T(z)-\bar{y}}{s(y)}=S_T(z).$$

As $g$ was arbitrary, $S_T\in\mathcal{S}_e$, and so $\mathcal{S}_e\supseteq\{S_T:T\in\mathcal{T}_e\}$. □

We define $r_0:\mathcal{S}\times\Gamma\to[0,\infty)$ as follows:

$$r_0(S,\Pi)\equiv\int E_P\!\left[\int_{\{x_0:(D,x_0)\in\mathcal{D}_0\}}\frac{\big(\bar{Y}+s(Y)S\big(z_{D,x_0}\big)-\mu_P(x_0)\big)^2}{\sigma_P^2}\,dP_X(x_0)\right]d\Pi(P). \quad (9)$$

Because $\mathcal{D}_0$ occurs with $P$-probability one for any $P\in\mathcal{P}$, it holds that $r(T,\Pi)=r_0\big(S_T,\Pi\big)$ for any $T\in\mathcal{T}$.

Lemma 9. Fix $\Pi\in\Gamma$. If T1, T2, and P4 hold, then $r_0(\cdot,\Pi):\mathcal{S}\to\mathbb{R}$ is lower semicontinuous.

Proof. Fix $\Pi\in\Gamma$. For any compact $\mathcal{K}\subseteq\mathcal{Z}$, we define $f_\mathcal{K}:\mathcal{S}\to\mathbb{R}$ by

$$f_\mathcal{K}(S)\equiv\int E_P\!\left[\int_{\mathcal{X}_{D,\mathcal{K}}}\sigma_P^{-2}\big(\bar{Y}+s(Y)S(Z)-\mu_P(x_0)\big)^2dP_X(x_0)\right]d\Pi(P),$$

where here and throughout this proof we let $Z\equiv z_{D,x_0}$ and $\mathcal{X}_{D,\mathcal{K}}\equiv\{x_0:(D,x_0)\in\mathcal{D}_0,\ z_{D,x_0}\in\mathcal{K}\}\subseteq\mathcal{X}$. Recalling that there exists an increasing sequence of compact subsets $\mathcal{K}_1\subseteq\mathcal{K}_2\subseteq\cdots$ such that $\bigcup_{j=1}^\infty\mathcal{K}_j=\mathcal{Z}$, we see that $\sup_{j\in\mathbb{N}}f_{\mathcal{K}_j}(\cdot)=r_0(\cdot,\Pi)$ by the monotone convergence theorem. Moreover, as suprema of collections of continuous functions are lower semicontinuous, we see that $r_0(\cdot,\Pi)$ is lower semicontinuous if $f_\mathcal{K}$ is continuous for every $\mathcal{K}$. In the remainder of this proof, we will show that this is indeed the case.

By Lemma 3, it suffices to show that $f_\mathcal{K}$ is sequentially continuous. Fix $S_1,S_2\in\mathcal{S}$. By Jensen's inequality,

$$\big|f_\mathcal{K}(S_1)-f_\mathcal{K}(S_2)\big|=\left|\int E_P\!\left[\int_{\mathcal{X}_{D,\mathcal{K}}}\sigma_P^{-2}\Big\{\big(\bar{Y}+s(Y)S_1(Z)-\mu_P(x_0)\big)^2-\big(\bar{Y}+s(Y)S_2(Z)-\mu_P(x_0)\big)^2\Big\}dP_X(x_0)\right]d\Pi(P)\right|\le\int\sigma_P^{-2}E_P\!\left[\int_{\mathcal{X}_{D,\mathcal{K}}}\Big|\big(\bar{Y}+s(Y)S_1(Z)-\mu_P(x_0)\big)^2-\big(\bar{Y}+s(Y)S_2(Z)-\mu_P(x_0)\big)^2\Big|dP_X(x_0)\right]d\Pi(P). \quad (10)$$

In what follows, we will bound the right-hand side above by some finite constant times $\|S_1-S_2\|_{\infty,\mathcal{K}}$. We start by noting that, for any $(d,x_0)\in\mathcal{D}_0$ such that $z_{d,x_0}\in\mathcal{K}$,

$$\begin{aligned}
&\Big|\big(\bar{y}+s(y)S_1(z)-\mu_P(x_0)\big)^2-\big(\bar{y}+s(y)S_2(z)-\mu_P(x_0)\big)^2\Big|\\
&\quad=s(y)\Big|2\bar{y}+s(y)\big(S_1(z)+S_2(z)\big)-2\mu_P(x_0)\Big|\,\big|S_1(z)-S_2(z)\big|\\
&\quad\le\|S_1-S_2\|_{\infty,\mathcal{K}}\,s(y)\Big|2\bar{y}+s(y)\big(S_1(z)+S_2(z)\big)-2\mu_P(x_0)\Big|\\
&\quad\le\|S_1-S_2\|_{\infty,\mathcal{K}}\Big[s(y)^2\big(\|S_1\|_{\infty,\mathcal{K}}+\|S_2\|_{\infty,\mathcal{K}}\big)+2s(y)\big|\bar{y}-E_P[Y]\big|+2s(y)\big|\mu_P(x_0)-E_P[Y]\big|\Big]\\
&\quad\le 2\|S_1-S_2\|_{\infty,\mathcal{K}}\Big[C_1s(y)^2+s(y)\big|\bar{y}-E_P[Y]\big|+s(y)\big|\mu_P(x_0)-E_P[Y]\big|\Big],
\end{aligned}$$

where $C_1\equiv\sup_{S\in\mathcal{S}}\|S\|_{\infty,\mathcal{K}}$ is finite by T1 and T2. Integrating both sides shows that

$$E_P\!\left[\int_{\mathcal{X}_{D,\mathcal{K}}}\Big|\big(\bar{Y}+s(Y)S_1(Z)-\mu_P(x_0)\big)^2-\big(\bar{Y}+s(Y)S_2(Z)-\mu_P(x_0)\big)^2\Big|dP_X(x_0)\right]\le 2\|S_1-S_2\|_{\infty,\mathcal{K}}\left(C_1E_P\big[s(Y)^2\big]+E_P\Big[s(Y)\big|\bar{Y}-E_P[Y]\big|\Big]+E_P\!\left[s(Y)\int\big|\mu_P(x_0)-E_P[Y]\big|dP_X(x_0)\right]\right). \quad (11)$$

We now bound the three expectations on the right-hand side by finite constants that do not depend on $S_1$ or $S_2$. All three bounds make use of the bound on the first expectation, namely $E_P\big[s(Y)^2\big]=\frac{n-1}{n}\mathrm{Var}_P(Y)\le\frac{n-1}{n}C_2\sigma_P^2$, where $C_2\equiv\sup_{P\in\mathcal{P}}\mathrm{Var}_P(Y)/\sigma_P^2$. We note that P4 can be used to show that $C_2<\infty$. Indeed,

$$E_P\big[\mathrm{Var}_P(Y\mid X)\big]=E_P\big[\mathrm{Var}_P(\epsilon_P\mid X)\big]=E_P\big[\epsilon_P^2\big]=\sigma_P^2,$$

and so, by the law of total variance and P4, $C_2=1+\sup_{P\in\mathcal{P}}\mathrm{Var}_P\big(\mu_P(X)\big)/\sigma_P^2<\infty$. By the Cauchy-Schwarz inequality, the second expectation on the right-hand side of (11) bounds as

$$E_P\Big[s(Y)\big|\bar{Y}-E_P[Y]\big|\Big]\le E_P\big[s(Y)^2\big]^{1/2}E_P\Big[\big(\bar{Y}-E_P[Y]\big)^2\Big]^{1/2}\le E_P\big[s(Y)^2\big]^{1/2}\mathrm{Var}_P(Y)^{1/2}\le\Big(\tfrac{n-1}{n}\Big)^{1/2}C_2\sigma_P^2,$$

and the third expectation bounds as

$$E_P\!\left[s(Y)\int\big|\mu_P(x_0)-E_P[Y]\big|dP_X(x_0)\right]\le E_P\big[s(Y)^2\big]^{1/2}\left(\int\big(\mu_P(x_0)-E_P[Y]\big)^2dP_X(x_0)\right)^{1/2}\le E_P\big[s(Y)^2\big]^{1/2}\mathrm{Var}_P(Y)^{1/2}\le\Big(\tfrac{n-1}{n}C_2\Big)^{1/2}\sigma_P\,\mathrm{Var}_P(Y)^{1/2}\le\Big(\tfrac{n-1}{n}\Big)^{1/2}C_2\sigma_P^2.$$

Plugging these bounds into (11), we see that

$$E_P\!\left[\int_{\mathcal{X}_{D,\mathcal{K}}}\Big|\big(\bar{Y}+s(Y)S_1(Z)-\mu_P(x_0)\big)^2-\big(\bar{Y}+s(Y)S_2(Z)-\mu_P(x_0)\big)^2\Big|dP_X(x_0)\right]\le 2\|S_1-S_2\|_{\infty,\mathcal{K}}\,\sigma_P^2\Big(\tfrac{n-1}{n}C_2\Big)^{1/2}\left[C_1C_2^{1/2}\Big(\tfrac{n-1}{n}\Big)^{1/2}+2C_2^{1/2}\right].$$

Plugging this into (10), we have shown that

$$\big|f_\mathcal{K}(S_1)-f_\mathcal{K}(S_2)\big|\le 2\|S_1-S_2\|_{\infty,\mathcal{K}}\Big(\tfrac{n-1}{n}C_2\Big)^{1/2}\left[C_1C_2^{1/2}\Big(\tfrac{n-1}{n}\Big)^{1/2}+2C_2^{1/2}\right].$$

We now conclude the proof by showing that the above implies that $f_\mathcal{K}$ is sequentially continuous at every $S\in\mathcal{S}$, and therefore is sequentially continuous on $\mathcal{S}$. Fix $S$ and a sequence $\{S_j\}$ such that $S_j\to S$ compactly. This implies that $\|S_j-S\|_{\infty,\mathcal{K}}\to 0$, and so the above display implies that $f_\mathcal{K}(S_j)\to f_\mathcal{K}(S)$, as desired. □

We now prove Theorem 1.

Proof of Theorem 1. Fix $T_0\in\mathcal{T}$ and let $S_0\equiv S_{T_0}\in\mathcal{S}$. Let $\mathcal{K}$ be the set of all elements $S\in\mathcal{S}$ that satisfy

$$\sup_{\Pi\in\Gamma}r_0(S,\Pi)\le\sup_{\Pi\in\Gamma}r_0(S_0,\Pi).$$

For fixed $\Pi_0\in\Gamma$, the set of $S\in\mathcal{S}$ that satisfy $r_0(S,\Pi_0)\le\sup_{\Pi\in\Gamma}r_0(S_0,\Pi)$ is closed due to the lower semicontinuity of the risk function (Lemma 9) and contains $S_0$. The intersection of such sets over $\Pi_0\in\Gamma$ is closed and contains $S_0$, so that $\mathcal{K}$ is a nonempty closed subset of the compact Hausdorff set $\mathcal{S}$, implying that $\mathcal{K}$ is compact. By the convexity of $x\mapsto(ax-b)^2$, the risk function $S\mapsto r_0(S,\Pi)$ is convex. Hence, $\mathcal{K}$ is convex. If $S\in\mathcal{K}$, then Lemma 7 shows that, for any $g\in\mathcal{G}_0$,

$$r_0(gS,\Pi_0)=r_0(S,g\Pi_0)\le\sup_{\Pi\in\Gamma}r_0(S_0,\Pi).$$

Thus, $gS\in\mathcal{K}$, and $\mathcal{G}_0\times\mathcal{K}\to\mathcal{K}$ is an affine group action on a nonempty, convex, compact subset of a locally convex topological vector space. Combining this with the fact that $\mathcal{G}_0$ is amenable (Lemma 2) shows that we may apply Day's fixed point theorem (Theorem S1) to see that there exists an $S_e\in\mathcal{S}$ such that, for all $g\in\mathcal{G}_0$, $gS_e=S_e$ and

$$\sup_{\Pi\in\Gamma}r_0(S_e,\Pi)\le\sup_{\Pi\in\Gamma}r_0(S_0,\Pi).$$

The conclusion is at hand. By Lemma 8, there exists a $T_e\in\mathcal{T}_e$ such that $S_e=S_{T_e}$. Furthermore, as noted below (9), $r_0\big(S_{T_e},\Pi\big)=r(T_e,\Pi)$ and $r_0\big(S_{T_0},\Pi\big)=r(T_0,\Pi)$ for all $\Pi\in\Gamma$. Recalling that $S_0\equiv S_{T_0}$, the above shows that $\sup_{\Pi\in\Gamma}r(T_e,\Pi)\le\sup_{\Pi\in\Gamma}r(T_0,\Pi)$. As $T_0\in\mathcal{T}$ was arbitrary and $T_e\in\mathcal{T}_e$, we have shown that $\inf_{T\in\mathcal{T}_e}\sup_{\Pi\in\Gamma}r(T,\Pi)\le\inf_{T\in\mathcal{T}}\sup_{\Pi\in\Gamma}r(T,\Pi)$. □

The proof of Theorem 1 is inspired by that of the Hunt-Stein theorem given in Le Cam (2012). Establishing this result in our context required making meaningful modifications to these earlier arguments. Indeed, Le Cam (2012) uses transitions, linear maps between L-spaces, to characterize the space of decision procedures. This more complicated machinery makes it possible to broaden the set of procedures under consideration. Indeed, with this characterization, it is possible to describe decision procedures that cannot even be represented as randomized decision procedures via a Markov kernel, but instead come about as limits of such decision procedures. Despite the richness of the space of decision procedures considered, Le Cam is still able to show that this space is compact by using a coarse topology, namely the topology of pointwise convergence. Unfortunately, this topology appears to generally be too coarse for our Bayes risk function r0(,Π) to be lower semi-continuous, which is a fact that we used at the beginning of our proof of Theorem 1. Another disadvantage to this formulation is that it makes it difficult to enforce any natural conditions or structure, such as continuity, on the set of estimators. It is unclear whether it would be possible to implement a numerical strategy optimizing over a class of estimators that lacks such structure. In contrast, we showed that, under appropriate conditions, it is indeed possible to prove a variant of the Hunt-Stein theorem in our setting even once natural structure is imposed on the class of estimators. To show the compactness of the space of estimators that we consider, we applied the Arzelà-Ascoli theorem.

7.2.3. Proof of Theorem 2

We provide one additional lemma before proving Theorem 2. The lemma relates to the class $\tilde{\mathcal{T}}_e$ of estimators in $\mathcal{T}$ that satisfy the equivariance property (5) but do not necessarily satisfy (4). Note that $\mathcal{T}_e\subseteq\tilde{\mathcal{T}}_e\subseteq\mathcal{T}$.

Lemma 10. If P2 and P3 hold, then, for all $T\in\tilde{\mathcal{T}}_e$,

$$r(T,\Pi)=r\big(T,\Pi\circ h^{-1}\big)\quad\text{for all }\Pi\in\Gamma,$$

and so $\sup_{\Pi\in\Gamma}r(T,\Pi)=\sup_{\Pi\in\Gamma_1}r(T,\Pi)$.

Proof of Lemma 10. Let $e$ be the identity element in $\mathrm{Sym}(n)\times\mathrm{Sym}(p)$. For each $P\in\mathcal{P}$, define $g_P\in\mathcal{G}_0$ to be

$$g_P\equiv\left(\left(-\frac{E_P[Y]}{\sigma_P},\frac{1}{\sigma_P}\right),\left(-\frac{E_P[X_j]}{\mathrm{Var}_P(X_j)^{1/2}},\frac{1}{\mathrm{Var}_P(X_j)^{1/2}}\right)_{j=1}^p,\ e\right).$$

It holds that

$$r\big(T,\Pi\circ h^{-1}\big)=\int R(T,P)\,d\big(\Pi\circ h^{-1}\big)(P)=\int R\big(T,Pg_P^{-1}\big)\,d\Pi(P)\ \text{(by the definition of }h\text{)}=\int R\big(g_PT,P\big)\,d\Pi(P)\ \text{(by Lemma 6)}=\int R(T,P)\,d\Pi(P)=r(T,\Pi)\ \text{(since }T\in\tilde{\mathcal{T}}_e\text{)}. \qquad\square$$

We conclude by proving Theorem 2.

Proof of Theorem 2. Under the conditions of the theorem, $\tilde{\mathcal{T}}_e=\mathcal{T}$. Recalling that $\Gamma_1\equiv\{\Pi\circ h^{-1}:\Pi\in\Gamma\}$, Lemma 10 yields that, for any $T\in\mathcal{T}$, $\sup_{\Pi\in\Gamma}r(T,\Pi)=\sup_{\Pi\in\Gamma}r\big(T,\Pi\circ h^{-1}\big)=\sup_{\Pi\in\Gamma_1}r(T,\Pi)$. Hence, an estimator $T\in\mathcal{T}$ is $\Gamma$-minimax if and only if it is $\Gamma_1$-minimax. □

7.2.4. Proof of Theorem 3

In this subsection, we assume (without statement) that all $\Pi\in\Gamma$ are defined on the measurable space $(\mathcal{P},\mathcal{A})$, where $\mathcal{A}$ is such that $\{A\cap\mathcal{P}_1:A\in\mathcal{A}\}$ equals $\mathcal{B}_1$, the collection of Borel sets on the metric space $(\mathcal{P}_1,\rho)$ described in P5. Under P2 and P3, which we also assume without statement throughout this subsection, it then follows that each $\Pi_1\in\Gamma_1$ is defined on the measurable space $(\mathcal{P}_1,\mathcal{B}_1)$. Let $\Gamma_0$ denote the collection of all distributions on $(\mathcal{P}_1,\mathcal{B}_1)$. For each $A\in\mathcal{B}_1$, define the $\epsilon$-enlargement of $A$ by $A^\epsilon\equiv\{P\in\mathcal{P}_1:\text{there exists }P'\in A\text{ such that }\rho(P,P')<\epsilon\}$. Further let $\xi$ denote the Lévy-Prokhorov metric on $\Gamma_0$, namely

$$\xi\big(\Pi,\Pi'\big)\equiv\inf\Big\{\epsilon>0:\Pi(A)\le\Pi'\big(A^\epsilon\big)+\epsilon\ \text{and}\ \Pi'(A)\le\Pi\big(A^\epsilon\big)+\epsilon\ \text{for all }A\in\mathcal{B}_1\Big\}.$$

Lemma 11. If P5 and P6 hold, then $(\Gamma_1,\xi)$ is a compact metric space.

Proof of Lemma 11. By Prokhorov's theorem (see Theorem 5.2 in van Gaans, 2003 for a convenient version, or see Theorems 1.5.1 and 1.6.8 in Billingsley, 1999), P5 implies that $\Gamma_1$ is relatively compact in $(\Gamma_0,\xi)$. The fact that $\Gamma_1$ is closed (P6) implies the result. □

We now define $r_1:\mathcal{S}_e\times\Gamma_1\to[0,\infty)$, which is the analogue of $r_0:\mathcal{S}\times\Gamma\to[0,\infty)$ from Section 7.2.2:

$$r_1(S,\Pi)\equiv\int E_P\!\left[\int_{\{x_0:(D,x_0)\in\mathcal{D}_0\}}\big(\bar{Y}+s(Y)S\big(z_{D,x_0}\big)-\mu_P(x_0)\big)^2dP_X(x_0)\right]d\Pi(P). \quad (12)$$

Note that, because each distribution in $\mathcal{P}$ is continuous, each distribution in $\mathcal{P}_1$ is also continuous. Hence, $\mathcal{D}_0$ occurs with $P$-probability one for all $P\in\mathcal{P}_1$, and so the definition of $r_1$ combined with Lemma 8 shows that $r(T,\Pi)=r_1\big(S_T,\Pi\big)$ for any $T\in\mathcal{T}_e$ and $\Pi\in\Gamma_1$.

Lemma 12. If P5 holds, then, for each $S\in\mathcal{S}_e$, $r_1(S,\cdot)$ is upper semicontinuous on $(\Gamma_1,\xi)$.

Proof of Lemma 12. Fix $S\in\mathcal{S}_e$ and note that, by Lemma 8, there exists a $T\in\mathcal{T}_e$ such that $S=S_T$. Let $\{\Pi_k\}_{k=1}^\infty$ be such that $\Pi_k\to\Pi$ in $(\Gamma_1,\xi)$ for some $\Pi\in\Gamma_1$. Because $\xi$ metrizes weak convergence (Theorem 1.6.8 in Billingsley, 1999), the Portmanteau theorem shows that $\limsup_k E_{\Pi_k}[f(P)]\le E_\Pi[f(P)]$ for every $f:\mathcal{P}_1\to\mathbb{R}$ that is upper semicontinuous and bounded from above on $(\mathcal{P}_1,\rho)$. By part (iii) of P5, we can apply this result at $f:P\mapsto R(T,P)$ to see that $\limsup_k r(T,\Pi_k)\le r(T,\Pi)$. As $\{\Pi_k\}_{k=1}^\infty$ was arbitrary, $r(T,\cdot)$ is upper semicontinuous on $(\Gamma_1,\xi)$. Because $r(T,\cdot)=r_1\big(S_T,\cdot\big)$ and $S=S_T$, we have thus shown that $r_1(S,\cdot)$ is upper semicontinuous on $(\Gamma_1,\xi)$. □

Lemma 13. Under the conditions of Lemma 4, $\mathcal{S}_e$ is a compact subset of $C(\mathcal{Z},\mathbb{R})$.

Proof. By Lemma 4, $\mathcal{S}_e\subseteq\mathcal{S}$ is relatively compact. Hence, it suffices to show that $\mathcal{S}_e$ is closed. By Lemma 3, a subset of $C(\mathcal{Z},\mathbb{R})$ is closed in the topology of compact convergence if it is sequentially closed. Let $\{S_j\}_{j=1}^\infty$ be a sequence in $\mathcal{S}_e$ such that $S_j\to S$ compactly. Because $\mathcal{S}_e\subseteq\mathcal{S}$ and $\mathcal{S}$ is closed by T3, we see that $S\in\mathcal{S}$. We now wish to show that $S\in\mathcal{S}_e$. Fix $z\in\mathcal{Z}$ and $g\in\mathcal{G}_0$. Because the doubleton set $\{z,gz\}$ is compact, $S_j(z)\to S(z)$ and $S_j(gz)\to S(gz)$, and thus $S_j(z)-S_j(gz)\to S(z)-S(gz)$. Moreover, because $S_j\in\mathcal{S}_e$, $S_j(gz)=S_j(z)$ for all $j$, and so $S_j(z)-S_j(gz)\to 0$. As these two limits must be equal, we see that $S(z)=S(gz)$. Because $z\in\mathcal{Z}$ and $g\in\mathcal{G}_0$ were arbitrary, $S\in\mathcal{S}_e$. □

Lemma 14. Fix $\Pi\in\Gamma_1$. If T1, T2, and P4 hold, then $r_1(\cdot,\Pi):\mathcal{S}_e\to\mathbb{R}$ is lower semicontinuous.

Proof. The proof is similar to that of Lemma 9 and is therefore omitted. □

Lemma 15. If T6 holds, then $\mathcal{S}_e$ is convex.

Proof. Fix $S_1,S_2\in\mathcal{S}_e$ and $\delta\in(0,1)$. For any $z\in\mathcal{Z}$ and $g\in\mathcal{G}_0$,

$$\big(g\big(\delta S_1+[1-\delta]S_2\big)\big)(z)=\delta S_1(gz)+[1-\delta]S_2(gz)=\delta S_1(z)+[1-\delta]S_2(z),$$

where the latter equality holds since $S_1,S_2\in\mathcal{S}_e$. Hence, $g\big(\delta S_1+[1-\delta]S_2\big)=\delta S_1+[1-\delta]S_2$ for all $g\in\mathcal{G}_0$. By T6, $\delta S_1+[1-\delta]S_2\in\mathcal{S}$. Hence, $\delta S_1+[1-\delta]S_2\in\mathcal{S}_e\equiv\{S\in\mathcal{S}:gS=S\text{ for all }g\in\mathcal{G}_0\}$. □

Lemma 16 (Minimax theorem). Under the conditions of Theorem 3,

$$\min_{S\in\mathcal{S}_e}\max_{\Pi\in\Gamma_1}r_1(S,\Pi)=\max_{\Pi\in\Gamma_1}\min_{S\in\mathcal{S}_e}r_1(S,\Pi). \quad (13)$$

Proof of Lemma 16. We will show that the conditions of Theorem 1 in Fan (1953) are satisfied. By Lemma 3, $C(\mathcal{Z},\mathbb{R})$ is metrizable by some metric $\rho_0$. By Lemma 13, $(\mathcal{S}_e,\rho_0)$ is a compact metric space. Moreover, by Lemma 11, $(\Gamma_1,\xi)$ is a compact metric space. As all metric spaces are Hausdorff, $(\mathcal{S}_e,\rho_0)$ and $(\Gamma_1,\xi)$ are Hausdorff. By Lemma 12, for each $S\in\mathcal{S}_e$, $r_1(S,\cdot)$ is upper semicontinuous on $(\Gamma_1,\xi)$. By Lemma 14, for each $\Pi\in\Gamma_1$, $r_1(\cdot,\Pi)$ is lower semicontinuous on $(\mathcal{S}_e,\rho_0)$. It remains to show that $r_1$ is concavelike on $\Gamma_1$ (called "concave on $\Gamma_1$" by Fan) and that $r_1$ is convexlike on $\mathcal{S}_e$ (called "convex on $\mathcal{S}_e$" by Fan). To see that $r_1$ is concavelike on $\Gamma_1$, note that $\Gamma_1$ is convex (P7), and also that, for all $S\in\mathcal{S}_e$, $r_1(S,\cdot)$ is linear, and therefore concave, on $\Gamma_1$. Hence, $r_1$ is concavelike on $\Gamma_1$ (page 409 of Terkelsen, 1973). To see that $r_1$ is convexlike on $\mathcal{S}_e$, note that $\mathcal{S}_e$ is convex (Lemma 15), and also that, for all $\Pi\in\Gamma_1$, $r_1(\cdot,\Pi)$ is convex on $\mathcal{S}_e$. Hence, $r_1$ is convexlike on $\mathcal{S}_e$ (ibid.). Thus, by Theorem 1 in Fan (1953), (13) holds. □

We conclude by proving Theorem 3.

Proof of Theorem 3. We follow arguments given on page 93 of Chang (2006) to show that, under the conditions of this theorem, (13) implies that there exist an $S^\star\in\mathcal{S}_e$ and a $\Pi^\star\in\Gamma_1$ such that

$$\max_{\Pi\in\Gamma_1}r_1\big(S^\star,\Pi\big)=r_1\big(S^\star,\Pi^\star\big)=\min_{S\in\mathcal{S}_e}r_1\big(S,\Pi^\star\big). \quad (14)$$

Noting that pointwise maxima of lower semicontinuous functions are themselves lower semicontinuous, Lemma 14 implies that $\max_{\Pi\in\Gamma_1}r_1(\cdot,\Pi)$ is lower semicontinuous. Because $\mathcal{S}_e$ is compact (Lemma 13), there exists an $S^\star\in\mathcal{S}_e$ such that

$$\max_{\Pi\in\Gamma_1}r_1\big(S^\star,\Pi\big)=\min_{S\in\mathcal{S}_e}\max_{\Pi\in\Gamma_1}r_1(S,\Pi).$$

Similarly, Lemma 12 implies that $\min_{S\in\mathcal{S}_e}r_1(S,\cdot)$ is upper semicontinuous on $(\Gamma_1,\xi)$. Because $(\Gamma_1,\xi)$ is compact (Lemma 11), there exists a $\Pi^\star\in\Gamma_1$ such that

$$\min_{S\in\mathcal{S}_e}r_1\big(S,\Pi^\star\big)=\max_{\Pi\in\Gamma_1}\min_{S\in\mathcal{S}_e}r_1(S,\Pi).$$

By Lemma 16, the above two displays show that $\max_{\Pi\in\Gamma_1}r_1\big(S^\star,\Pi\big)=\min_{S\in\mathcal{S}_e}r_1\big(S,\Pi^\star\big)$. Combining this result with the elementary fact that $\min_{S\in\mathcal{S}_e}r_1\big(S,\Pi^\star\big)\le r_1\big(S^\star,\Pi^\star\big)\le\max_{\Pi\in\Gamma_1}r_1\big(S^\star,\Pi\big)$ shows that (14) holds.

Recall from below (12) that $r_1\big(S_T,\Pi\big)=r(T,\Pi)$ for all $\Pi\in\Gamma_1$ and $T\in\mathcal{T}_e$. Moreover, since $\mathcal{S}_e=\{S_T:T\in\mathcal{T}_e\}$ (Lemma 8), there exists a $T^\star\in\mathcal{T}_e$ such that $S^\star=S_{T^\star}$. Combining these observations shows that (i) $\max_{\Pi\in\Gamma_1}r_1\big(S^\star,\Pi\big)=\max_{\Pi\in\Gamma_1}r_1\big(S_{T^\star},\Pi\big)=\max_{\Pi\in\Gamma_1}r\big(T^\star,\Pi\big)$; (ii) $r_1\big(S^\star,\Pi^\star\big)=r_1\big(S_{T^\star},\Pi^\star\big)=r\big(T^\star,\Pi^\star\big)$; and (iii) $\min_{S\in\mathcal{S}_e}r_1\big(S,\Pi^\star\big)=\min_{T\in\mathcal{T}_e}r_1\big(S_T,\Pi^\star\big)=\min_{T\in\mathcal{T}_e}r\big(T,\Pi^\star\big)$. Hence, by (14), $\max_{\Pi\in\Gamma_1}r\big(T^\star,\Pi\big)=r\big(T^\star,\Pi^\star\big)=\min_{T\in\mathcal{T}_e}r\big(T,\Pi^\star\big)$. Equivalently, for all $T\in\mathcal{T}_e$ and $\Pi\in\Gamma_1$, $r\big(T^\star,\Pi\big)\le r\big(T^\star,\Pi^\star\big)\le r\big(T,\Pi^\star\big)$. □

7.2.5. Proof of Theorem 4

Proof of Theorem 4. Fix $T$, and let $(m_1,m_2,m_3,m_4)\in\prod_{k=1}^4\mathcal{M}_k$ be the corresponding modules. Recall from Algorithm 2 that, for a given $(d,x_0)$, $x_{00}\equiv\big(x_0-\bar{x}\big)/s(x)$ and $d^0\in\mathbb{R}^{n\times p\times 2}$ is defined so that $d^0_{i*1}=\big(x_i-\bar{x}\big)/s(x)$ for all $i=1,\ldots,n$ and $d^0_{*j2}=\big(y-\bar{y}\big)/s(y)$ for all $j=1,\ldots,p$. Now, for any $(d,x_0)\in\mathcal{D}_0$,

$$T(d)(x_0)=\bar{y}+s(y)\,m_4\!\left(\frac{1}{p}\sum_{j=1}^pm_3\!\left(m_2\!\left(\frac{1}{n}\sum_{i=1}^nm_1\big(d^0_{i**}\big)\right),x_{00}\right)_{j*}\right),$$

and so $S_T$ takes the form

$$S_T\big(z_{d,x_0}\big)=m_4\!\left(\frac{1}{p}\sum_{j=1}^pm_3\!\left(m_2\!\left(\frac{1}{n}\sum_{i=1}^nm_1\big(d^0_{i**}\big)\right),x_{00}\right)_{j*}\right).$$

Because $S_T$ does not depend on the last four arguments of $z_{d,x_0}$, we know that $T$ satisfies (5), that is, it is invariant to shifts and rescalings of the features and is equivariant to shifts and rescalings of the outcome. It remains to show permutation invariance, namely (4). By the permutation invariance of the sample mean and sample standard deviation, it suffices to establish the analogue of this property for $S_T$, namely that $S_T\big(z_{AdB,Bx_0}\big)=S_T\big(z_{d,x_0}\big)$ for all $(d,x_0)\in\mathcal{D}_0$, $A\in\mathcal{A}$, and all $p\times p$ permutation matrices $B$. For an array $M$ of size $n\times p\times o$, we will write $AMB$ to mean the $n\times p\times o$ array for which $(AMB)_{**\ell}=AM_{**\ell}B$ for all $\ell=1,2,\ldots,o$. Note that

$$\begin{aligned}
S_T\big(z_{AdB,Bx_0}\big)&=m_4\!\left(\frac{1}{p}\sum_{j=1}^pm_3\!\left(m_2\!\left(\frac{1}{n}\sum_{i=1}^nm_1\big(Ad^0B\big)_{i**}\right),Bx_{00}\right)_{j*}\right)\\
&=m_4\!\left(\frac{1}{p}\sum_{j=1}^pm_3\!\left(m_2\!\left(\frac{1}{n}\sum_{i=1}^n\big(A\,m_1\big(d^0\big)B\big)_{i**}\right),Bx_{00}\right)_{j*}\right)&&\text{(by M1)}\\
&=m_4\!\left(\frac{1}{p}\sum_{j=1}^pm_3\!\left(m_2\!\left(\left[\frac{1}{n}\sum_{i=1}^nm_1\big(d^0\big)_{i**}\right]B\right),Bx_{00}\right)_{j*}\right)\\
&=m_4\!\left(\frac{1}{p}\sum_{j=1}^pm_3\!\left(m_2\!\left(\frac{1}{n}\sum_{i=1}^nm_1\big(d^0\big)_{i**}\right)B,Bx_{00}\right)_{j*}\right)&&\text{(by M2)}\\
&=m_4\!\left(\frac{1}{p}\sum_{j=1}^p\left[B\,m_3\!\left(m_2\!\left(\frac{1}{n}\sum_{i=1}^nm_1\big(d^0\big)_{i**}\right),x_{00}\right)\right]_{j*}\right)&&\text{(by M3)}\\
&=m_4\!\left(\frac{1}{p}\sum_{j=1}^pm_3\!\left(m_2\!\left(\frac{1}{n}\sum_{i=1}^nm_1\big(d^0\big)_{i**}\right),x_{00}\right)_{j*}\right)=S_T\big(z_{d,x_0}\big).
\end{aligned}$$

Hence, T satisfies (4). □
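The crux of the chain of equalities above is that applying a shared module to each observation and then averaging over observations is invariant to permuting the observations. A toy numerical check of that single step (the module here is a stand-in for $m_1$, not the trained network):

```python
import numpy as np

def m1(row):
    """Hypothetical per-observation module applied to each row."""
    return np.tanh(row)

def pooled_representation(d0):
    """Row-wise shared map followed by mean pooling over observations."""
    return np.mean([m1(d0[i]) for i in range(d0.shape[0])], axis=0)

rng = np.random.default_rng(0)
d0 = rng.normal(size=(8, 5))
perm = rng.permutation(8)

rep_orig = pooled_representation(d0)
rep_perm = pooled_representation(d0[perm])  # rows permuted, as by the matrix A
# the two representations agree, so the pooled summary is permutation invariant
```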

8. Extensions and Discussion

We have focused on a particular set of invariance properties on the collection of priors Γ, namely P1-P3. Our arguments can be generalized to handle other properties. As a simple example, suppose P3 is strengthened so that Γ is invariant to nonzero (rather than only nonnegative) rescalings $\tilde{b}$ of the outcome; this property is in fact satisfied in all of our experiments. Under this new condition, the results in Section 2 remain valid with the definition of the class of equivariant estimators $\mathcal{T}_e$ defined in (4) and (5) modified so that $\tilde{b}$ may range over $\mathbb{R}\setminus\{0\}$. Moreover, for any $T$, Jensen's inequality shows that the Γ-maximal risk of the symmetrized estimator that averages $T(x,y)(x_0)$ and $-T(x,-y)(x_0)$ is no worse than that of $T$. To assess the practical utility of this observation, we numerically evaluated the performance of symmetrizations of the estimators learned in our experiments. Symmetrizing improved performance across most settings (see Appendix F). We, therefore, recommend carefully characterizing the invariance properties of a given problem when setting out to meta-learn an estimator.
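The symmetrization step can be sketched as follows: average the original prediction with the negated prediction computed on sign-flipped outcomes. The toy estimator below is our own deliberately non-equivariant stand-in, used only to show that its symmetrization becomes equivariant to negating the outcome.

```python
import numpy as np

def toy_T(X, y, x0):
    """A deliberately asymmetric toy estimator (not equivariant to y -> -y)."""
    return y.mean() + 0.1 * max(y.max(), 0.0)

def symmetrized_T(X, y, x0):
    """Average of T(x, y)(x0) and -T(x, -y)(x0)."""
    return 0.5 * (toy_T(X, y, x0) - toy_T(X, -y, x0))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
x0 = rng.normal(size=3)

# equivariance check: negating the outcomes negates the prediction
pred = symmetrized_T(X, y, x0)
pred_flipped = symmetrized_T(X, -y, x0)
```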

Much of this work has focused on developing and studying a framework for meta-learning a Γ-minimax estimator for a single, prespecified collection of priors Γ. In some settings, it may be difficult to a priori specify a single such collection that is both small enough so that the Γ-minimax estimator is not too conservative while also being rich enough so that the priors in this collection actually place mass in a neighborhood of the true data-generating distribution. Two approaches for overcoming this challenge seem to warrant further consideration. The first would be to employ an empirical Bayes approach (Efron and Morris, 1972), wherein a large dataset from a parallel situation can be used to inform about the possible forms that the prior might take; this, in turn, would also inform about the form that the collection Γ should take. Recent advances also make it possible to incorporate knowledge about the existence of qualitatively different categories of features when performing empirical Bayes prediction (Nabi et al., 2020). The second approach involves using AMC to approximate Γ-minimax estimators over various choices of Γ, and then to use a stacked ensemble to combine the predictions from these various base estimators. In our data experiments, we saw that a simple version of this ensemble that combined four base AMC estimators consistently performed at least as well as the best of these base estimators.

In this work, we have focused on the case where the problem of interest is a supervised learning problem and the objective is to predict a continuous outcome based on iid data. While the AMC algorithm generalizes naturally to a variety of other sampling schemes and loss functions (see Luedtke et al., 2020), our characterization of the equivariance properties of an optimal estimator was specific to the iid regression setting that we considered. In future work, it would be interesting to characterize these properties in greater generality, including in classification settings and inverse reinforcement learning settings (e.g., Russell, 1998; Geng et al., 2020).

Acknowledgments

The authors thank Devin Didericksen for help in the early stages of this project. Generous support was provided by Amazon through an AWS Machine Learning Research Award and the NIH under award number DP2-LM013340. The content is solely the responsibility of the authors and does not necessarily represent the official views of Amazon or the NIH.

Appendices

A. Review of amenability

In this appendix, we review the definition of an amenable group, an important implication of amenability, and also some sufficient conditions for establishing that a group is amenable. This material will prove useful in our proof of Theorem 1 (see Section 7.2.2). We refer the reader to Pier (1984) for a thorough coverage of amenability.

Definition 1 (Amenability). Let 𝒢 be a locally compact, Hausdorff group and let L∞(𝒢) be the space of Borel measurable functions that are essentially bounded with respect to the Haar measure. A mean on L∞(𝒢) is defined as a linear functional M ∈ L∞(𝒢)* such that M(λ) ≥ 0 whenever λ ≥ 0 and M(1_𝒢) = 1. A mean M is said to be left invariant for a group 𝒢 if and only if M(δ_g ⋆ λ) = M(λ) for all g ∈ 𝒢 and λ ∈ L∞(𝒢), where (δ_g ⋆ λ)(h) = λ(g^{−1}h). The group 𝒢 is said to be amenable if and only if there is a left invariant mean on L∞(𝒢).

We now introduce the fixed point property, and subsequently present a result showing its close connection to the definition given above. Throughout this work, we equip all group actions 𝒢×𝒲 → 𝒲 with the product topology.

Definition 2 (Fixed point property). We say that a locally compact, Hausdorff group 𝒢 has the fixed point property if, whenever 𝒢 acts affinely on a compact convex set 𝒦 in a locally convex topological vector space E with the map 𝒢×𝒦 → 𝒦 continuous, there is a point x0 ∈ 𝒦 fixed under the action of 𝒢.

Theorem S1 (Day’s Fixed Point Theorem). A locally compact, Hausdorff group 𝒢 has the fixed point property if and only if 𝒢 is amenable.

Proof. See the proof of Theorem 5.4 in Pier (1984). □

The following results are useful for establishing amenability.

Lemma S17. Any compact group is amenable.

Proof. Take the normalized Haar measure as an invariant mean. □

Lemma S18. Any locally compact Abelian group is amenable.

Proof. See the proof of Proposition 12.2 in Pier (1984). □

Lemma S19. Let 𝒢 be a locally compact group and 𝒩 a closed normal subgroup of 𝒢. If 𝒩 and 𝒢/𝒩 are amenable, then 𝒢 is amenable.

Proof. Assume that a continuous affine action of 𝒢 on a nonempty compact convex set 𝒦 is given. Let 𝒦^𝒩 be the set of all fixed points of 𝒩 in 𝒦. Since 𝒩 is amenable, Theorem S1 implies that 𝒦^𝒩 is nonempty. Since the group action is continuous, 𝒦^𝒩 is a closed subset of 𝒦 and hence is compact. Since the action is affine, 𝒦^𝒩 is convex. Now, note that, for all x ∈ 𝒦^𝒩, g ∈ 𝒢, and n ∈ 𝒩, the fact that g^{−1}ng ∈ 𝒩 implies that g^{−1}ngx = x, which implies ngx = gx. Hence, 𝒦^𝒩 is preserved by the action of 𝒢. The action of 𝒢 on 𝒦^𝒩 factors to an action of 𝒢/𝒩 on 𝒦^𝒩, which has a fixed point x0 since 𝒢/𝒩 is amenable. But then x0 is fixed by each g ∈ 𝒢. Hence, 𝒢 is amenable. □

B. Examples of collections 𝒮 where T1-T6 hold

B.1. Infinite-dimensional class

We start by presenting an infinite-dimensional class 𝒮 that satisfies T1-T6, and we subsequently present a finite-dimensional class. To define this class, we fix c, α > 0 and a function F : 𝒵 → [0, ∞) that is invariant to permutations, shifts, and rescalings, in the sense that both of the following hold:

  • F1.
    Permutations: For all ((x, y), x0) ∈ 𝒟0, A ∈ 𝒜, and B ∈ ℬ, it holds that
    F(z((AxB, Ay), Bx0)) = F(z((x, y), x0)).
  • F2.

    Shifts and rescalings: For all ((x, y), x0) ∈ 𝒟0, a ∈ R^p, b ∈ (0, ∞)^p, a˜ ∈ R, and b˜ > 0, it holds that F(z((x_{a,b}, a˜ + b˜y), a + b∘x0)) = F(z((x, y), x0)), where x_{a,b} is the n×p matrix with row i equal to a + b∘x_i and ∘ denotes the entrywise product.

These conditions bear some resemblance to T4 and T5. One example of a function F that satisfies the above conditions is a constant function.

The infinite-dimensional class of 𝒵 → R functions that we consider is defined as

𝒮_{F,α,c} ≔ { S : |S(z)| ≤ F(z) for all z ∈ 𝒵, and sup_{z≠z′∈𝒵} |S(z) − S(z′)| / ‖z − z′‖^α ≤ c }.

We will now show that this class satisfies T1-T6. Conditions T1 and T2 follow immediately from the definition of 𝒮_{F,α,c}. We now show that T3 holds. Because C(𝒵, R) is complete, it suffices to show that, if S_n → S compactly and S_n ∈ 𝒮_{F,α,c} for all n, then S ∈ 𝒮_{F,α,c}. Let S_n → S compactly. To see that |S(z)| ≤ F(z), note that

|S(z)| ≤ |S_n(z) − S(z)| + |S_n(z)| ≤ F(z) + |S_n(z) − S(z)|

and then take the limit as n → ∞. To see that S satisfies the Hölder condition, note that, for any z ≠ z′ ∈ 𝒵,

|S(z) − S(z′)| / ‖z − z′‖^α ≤ |S(z) − S_n(z)| / ‖z − z′‖^α + |S_n(z) − S_n(z′)| / ‖z − z′‖^α + |S_n(z′) − S(z′)| / ‖z − z′‖^α

and again take the limit as n → ∞. Hence, |S(z) − S(z′)| / ‖z − z′‖^α ≤ c for each z ≠ z′, and so sup_{z≠z′∈𝒵} |S(z) − S(z′)| / ‖z − z′‖^α ≤ c. Hence S ∈ 𝒮_{F,α,c}, and thus T3 holds. We now show that T4 and T5 hold. To do this, we will use the group theoretic notation defined in Section 7.1. As noted in that section, T4 and T5 are equivalent to the condition that gS ∈ 𝒮_{F,α,c} for all g ∈ 𝒢0 and S ∈ 𝒮_{F,α,c}. We will therefore fix S ∈ 𝒮_{F,α,c} and g ∈ 𝒢0 and show that gS ∈ 𝒮_{F,α,c}. For z ∈ 𝒵, we have that

|(gS)(z)| = |S(gz)| ≤ F(gz) = F(z),

where the inequality holds since S ∈ 𝒮_{F,α,c}. Note that, for any z, z′ ∈ 𝒵, ‖gz − gz′‖ = ‖z − z′‖. Hence,

sup_{z≠z′∈𝒵} |(gS)(z) − (gS)(z′)| / ‖z − z′‖^α = sup_{z≠z′∈𝒵} |(gS)(z) − (gS)(z′)| / ‖gz − gz′‖^α = sup_{z≠z′∈𝒵} |S(z) − S(z′)| / ‖z − z′‖^α ≤ c,

where the inequality holds since S ∈ 𝒮_{F,α,c}. Hence, gS ∈ 𝒮_{F,α,c}, and so T4 and T5 hold. It remains to show T6. To see that this holds, fix S1, S2 ∈ 𝒮_{F,α,c} and δ ∈ (0, 1), and let S = δS1 + (1−δ)S2. By the triangle inequality and the fact that S1, S2 ∈ 𝒮_{F,α,c}, we have the following two displays for any z ≠ z′ ∈ 𝒵:

|S(z)| = |δS1(z) + (1−δ)S2(z)| ≤ δ|S1(z)| + (1−δ)|S2(z)| ≤ F(z),
sup_{z≠z′∈𝒵} |S(z) − S(z′)| / ‖z − z′‖^α ≤ δ sup_{z≠z′∈𝒵} |S1(z) − S1(z′)| / ‖z − z′‖^α + (1−δ) sup_{z≠z′∈𝒵} |S2(z) − S2(z′)| / ‖z − z′‖^α ≤ c.

Hence, S ∈ 𝒮_{F,α,c}, and so T6 holds.

B.2. Finite-dimensional class

B.2.1. Overview

For an explicit representation of 𝒵, we have

𝒵 = 𝒪n^p × 𝒪n × R^p × R^p × R × R^p × R,

where 𝒪n = { a ∈ R^n : ā = 0, s(a) = 1 }. For ease of communication, we will abbreviate

z_a = ((x − x̄)/s(x), (y − ȳ)/s(y)) ∈ 𝒪n^p × 𝒪n,
z_t = (x0 − x̄)/s(x) ∈ R^p,
z_m = (x̄/s(x), ȳ/s(y)) ∈ R^p × R,
z_s = (log s(x), log s(y)) ∈ R^p × R,

where the centering and scaling operations act columnwise, so that z = (z_a, z_t, z_m, z_s). Here, z_a stands for the angular component, z_t for the test point, z_m for the mean, and z_s for the standard deviation.

To define our parametric example for 𝒮, we use separation of variables and consider the coordinates of z separately. We will consider estimators belonging to the class 𝒮 of all S such that

S(z) = S_a(z_a) S_t(z_t) S_g(z_m, z_s),  S_a ∈ 𝒮_a, S_t ∈ 𝒮_t, S_g ∈ 𝒮_g.

We refer to 𝒮a,𝒮t, and 𝒮g as the angular part, test point part, and group part of 𝒮, respectively. In what follows, we will describe conditions on 𝒮a,𝒮t, and 𝒮g that make it so that T1-T6 hold. We will then describe interesting collections 𝒮a,𝒮t, and 𝒮g that satisfy these conditions.

First note that we have the following inequality:

|S(z) − S(z′)| = |S_a(z_a)S_t(z_t)S_g(z_m, z_s) − S_a(z_a′)S_t(z_t′)S_g(z_m′, z_s′)| ≤ |S_t(z_t)S_g(z_m, z_s)| |S_a(z_a) − S_a(z_a′)| + |S_a(z_a′)S_g(z_m, z_s)| |S_t(z_t) − S_t(z_t′)| + |S_a(z_a′)S_t(z_t′)| |S_g(z_m, z_s) − S_g(z_m′, z_s′)|.

Thus, if S_a, S_t, and S_g are uniformly bounded by M^{1/3} and each of their global Hölder constants is no greater than c/(3M^{2/3}), then sup_{z∈𝒵} |S(z)| ≤ M and sup_{z≠z′∈𝒵} |S(z) − S(z′)| / ‖z − z′‖^α ≤ c. Hence, if 𝒮_a, 𝒮_t, and 𝒮_g are such that functions in these collections are uniformly bounded by M^{1/3} and are c/(3M^{2/3})-Hölder, then 𝒮 ⊆ 𝒮_{M,α,c}. In that case, conditions T1 and T2 hold. Since every compact subset of 𝒵 can be written as a subset of a product of compact sets K = K1 × K2 × K3, with K1 ⊆ 𝒪n^{p+1}, K2 ⊆ R^p, and K3 ⊆ R^{2p+2}, for condition T3 to hold it suffices to show that 𝒮_a, 𝒮_t, and 𝒮_g are closed. Condition T4 holds if 𝒮_a is closed under rotations with respect to the n observations and if 𝒮_a, 𝒮_t, and 𝒮_g are closed under permutations with respect to the p features. The latter can be arranged by letting 𝒮_a, 𝒮_t, and 𝒮_g be p-fold tensor products of an identical space of functions. Condition T5 is satisfied when 𝒮_g is closed under shifts. Finally, condition T6 holds when 𝒮_a, 𝒮_t, and 𝒮_g are convex, since the projected tensor product of convex sets is convex.

B.2.2. Angular Part 𝒮a

We define 𝒮_a by truncating an orthonormal basis for the tensor product space L2(𝒪n)^{⊗(p+1)} at a specified finite number of terms and then taking the subset of the span of those basis vectors that is contained in 𝒮_{M^{1/3}, α, c/(3M^{2/3})} for some c and M. Note that 𝒪n ≅ S^{n−2}, where “≅” denotes an isomorphic relation and S^{n−2} is the (n−2)-dimensional unit sphere. Let 1 be the n-dimensional vector of 1's, and note that 𝒪n can be expressed in the following form:

𝒪n = { w ∈ R^n : n^{−1/2} w^T 1 = 0, n^{−1} w^T w = 1 }.

Let U ∈ O(n), the orthogonal group, be such that n^{−1/2} U 1 = e_n, the nth elementary basis vector. Such a U exists because ‖n^{−1/2} 1‖_2 = 1. Then,

𝒪n = { √n U^T v : v ∈ R^n, v_n = 0, ‖v‖_2 = 1 }.

We have the isomorphism ζ : L2(𝒪n) → L2(S^{n−2}) given by ζ(f)(v) = f(√n U^T v). Thus, if we have an orthonormal basis for L2(S^{n−2}), we may use the operator ζ^{−1} to obtain an orthonormal basis for L2(𝒪n). Let H_ℓ be the space of harmonic polynomials of degree ℓ in n − 1 variables. By the Stone-Weierstrass theorem, the direct sum ⊕_{ℓ=0}^{∞} H_ℓ is dense in L2(S^{n−2}). We can truncate the series at a prespecified degree q_a, so that

𝒮_a = ( ⊕_{ℓ=0}^{q_a} H_ℓ(√n U^T ·) )^{⊗(p+1)} ∩ 𝒮_{M^{1/3}, α, c/(3M^{2/3})}(𝒪n^{p+1}). (S1)

We use the orthonormal basis {𝒴_{l1,l2,…,l_{n−2}} : l1 ≥ l2 ≥ ⋯ ≥ |l_{n−2}|} for the spherical harmonics introduced in Higuchi (1987) (replacing “Y” in their notation by “𝒴” to avoid notational overload), where an explicit expression for this basis is provided in that work. Let N(n, p, q) ≔ (Σ_{ℓ=0}^{q} dim H_ℓ)^{p+1} and

𝒞_a = { A ∈ R^{N(n,p,q_a)} : Π_{j=0}^{p} Σ_{l1 ≥ l2 ≥ ⋯ ≥ |l_{n−2}|, l1 ≤ q_a} A_{l1,…,l_{n−2}} 𝒴_{l1,…,l_{n−2}}(√n U^T z_{x,·,j}) ∈ 𝒮_a },

where z_{x,i,0} ≔ z_{y,i}. The set 𝒞_a is the coefficient space of the basis expansion in 𝒮_a and is convex and compact if and only if 𝒮_a is convex and compact. The set 𝒮_a is closed under rotations in the n observations since the space of spherical harmonics of any given degree is closed under rotations. It is also closed under permutations due to the (p+1)-fold tensor product form. As an intersection of closed convex sets, it is closed and convex.
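The parameterization of 𝒪n given above can be checked numerically. The sketch below (our illustration; the Householder reflection is one convenient way to construct a valid U) builds a U ∈ O(n) with n^{−1/2} U 1 = e_n and verifies that √n U^T v lands in 𝒪n, i.e. has mean zero and standard deviation one.

```python
import numpy as np

n = 6
rng = np.random.default_rng(0)

# Householder reflection U in O(n) mapping n^{-1/2} * 1 to e_n.
u = np.ones(n) / np.sqrt(n)
e_n = np.zeros(n)
e_n[-1] = 1.0
d = u - e_n
U = np.eye(n) - 2.0 * np.outer(d, d) / (d @ d)

# A unit vector v with last coordinate v_n = 0 ...
v = rng.normal(size=n)
v[-1] = 0.0
v /= np.linalg.norm(v)

# ... maps to w = sqrt(n) U^T v, a point with mean 0 and sd 1.
w = np.sqrt(n) * U.T @ v
```

Since U is orthogonal and symmetric here, w^T 1 = n v^T e_n = 0 and w^T w = n, matching the defining constraints of 𝒪n.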

B.2.3. Test Point Part 𝒮t

Similarly to 𝒮_a, 𝒮_t is defined by truncating an orthonormal basis for L2(R^p). Let {ψ_k}_{k=0}^{∞} be the normalized Hermite functions. They form an orthonormal basis of L2(R), and so their p-fold tensor product is an orthonormal basis of L2(R^p). We can take

𝒮_t = span{ψ_k : k ∈ {0, 1, …, q_t}}^{⊗p} ∩ 𝒮_{M^{1/3}, α, c/(3M^{2/3})}(R^p).

We can similarly define the coefficient space 𝒞t :

𝒞_t = { A ∈ R^{(q_t+1)×p} : z_t ↦ Π_{j=1}^{p} Σ_{k=0}^{q_t} A_{jk} ψ_k(z_{t,j}) ∈ 𝒮_t }.

Similarly to 𝒮_a, the p-fold tensor product form, together with 𝒮_t being an intersection of closed and convex sets, shows that all of the necessary conditions are satisfied.
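The normalized Hermite functions underlying 𝒮_t can be sketched numerically (our illustration; the quadrature grid and the truncation at four basis functions are arbitrary choices). Orthonormality of the first few ψ_k in L2(R) is confirmed by the Gram matrix being close to the identity.

```python
import numpy as np
from math import factorial, pi, sqrt
from numpy.polynomial.hermite import hermval

def psi(k, x):
    """k-th normalized Hermite function; {psi_k} is orthonormal in L2(R)."""
    coef = np.zeros(k + 1)
    coef[k] = 1.0  # select the physicists' Hermite polynomial H_k
    norm = sqrt(2.0 ** k * factorial(k) * sqrt(pi))
    return hermval(x, coef) * np.exp(-x ** 2 / 2.0) / norm

# Approximate the L2(R) inner products of the first few basis functions.
x = np.linspace(-12.0, 12.0, 20001)
dx = x[1] - x[0]
funcs = [psi(k, x) for k in range(4)]
gram = np.array([[np.sum(f * g) * dx for g in funcs] for f in funcs])
```

Because each ψ_k decays like e^{−x²/2}, the Riemann sum over this grid approximates the integrals essentially to machine precision.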

B.2.4. Group Part 𝒮g

The 𝒮_g that we will define imposes that the functions are periodic in each dimension, in the sense that, if S_g ∈ 𝒮_g and z_g − z_g′ = ±e_i for some elementary basis vector e_i, then S_g(z_g) = S_g(z_g′). In other words, we will be dealing with functions on the (2p+2)-dimensional torus, T^{2p+2} = (S^1)^{2p+2}. Since the torus is a product of 1-spheres, we can use the same process as described when defining the angular part 𝒮_a, namely letting

𝒮_g = ( ⊕_{ℓ=0}^{q_g} H_ℓ )^{⊗(2p+2)} ∩ 𝒮_{M^{1/3}, α, c/(3M^{2/3})}(T^{2p+2}). (S2)

In this case, H_ℓ = span{cos(2πℓx), sin(2πℓx)}, and translations can be dealt with by the sum and difference formulas for sine and cosine. Translations under periodicity are the same as rotations, and since the space of spherical harmonics of a given degree is rotationally invariant, 𝒮_g is closed under translations. Similarly, the tensor product form of 𝒮_g and its being an intersection of closed and convex sets imply that the rest of the sufficient conditions described at the end of Section B.2.1 are satisfied.

C. Examples of collections Γ where P5 holds

We now describe settings where P5 is often applicable. We will specify 𝒫1 in each of these settings, and the model 𝒫 is then defined by expanding 𝒫1 to contain the distributions of all possible shifts and rescalings of a random variate drawn from some P1 ∈ 𝒫1. The first class of models for which P5 is often satisfied is parametric in nature, with each distribution P_θ ∈ 𝒫1 indexed smoothly by a finite-dimensional parameter θ belonging to a subset Θ of R^k. We note here that, because the sample size n is fixed in our setting, we can obtain an essentially unrestricted model by allowing k to be large relative to n. In parametric settings, ρ can often be defined as ρ(P_θ, P_θ′) = ‖θ − θ′‖_2, where we recall that ‖·‖_2 denotes the Euclidean norm. If Γ1 is uniformly tight, which certainly holds if Θ is bounded, then P5 holds provided θ ↦ R(T, P_θ) is upper-semicontinuous for all T ∈ 𝒯e. For a concrete example where the conditions of P5 are satisfied, consider the case that Θ = {θ : ‖θ‖_0 ≤ s0, ‖θ‖_1 ≤ s1} for sparsity parameters s0 and s1, where ‖θ‖_0 ≔ #{j : θ_j ≠ 0} and ‖θ‖_1 ≔ Σ_j |θ_j|, and P_θ is the distribution for which X ~ N(0_p, Id_p) and Y|X ~ N(θ^T X, 1). This setting is closely related to the sparse linear regression example that we study numerically in Section 5.3.2.
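A draw from the sparse linear model just described can be sketched as follows (our illustration; the particular rule used here for drawing θ inside Θ is arbitrary and hypothetical, chosen only so that the sparsity constraints hold by construction).

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, s0, s1 = 10, 50, 2, 3.0

# Draw theta with ||theta||_0 <= s0 and ||theta||_1 <= s1.
theta = np.zeros(p)
support = rng.choice(p, size=s0, replace=False)
raw = rng.normal(size=s0)
theta[support] = raw * min(1.0, s1 / np.abs(raw).sum())  # shrink into the l1 ball

# Sample a dataset from P_theta: X ~ N(0_p, Id_p), Y | X ~ N(theta' X, 1).
X = rng.normal(size=(n, p))
Y = X @ theta + rng.normal(size=n)
```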

Condition P5 also allows for nonparametric regression functions. Define ϕ_p to be the p-dimensional standard Gaussian measure, and define L0^2(ϕ_p) ≔ { f ∈ L2(ϕ_p) : ∫ f(x) dϕ_p(x) = 0 }. Let ℱ ⊆ L0^2(ϕ_p) satisfy the following conditions:

  1. ℱ is bounded: sup_{f∈ℱ} ‖f‖_{L2(ϕ_p)} < ∞.

  2. ℱ is uniformly equivanishing: lim_{N→∞} sup_{f∈ℱ} ‖f 1_{B(0,N)^c}‖_{L2(ϕ_p)} = 0.

  3. ℱ is uniformly equicontinuous: lim_{r→0} sup_{f∈ℱ} sup_{y∈B(0,r)} ‖τ_y f − f‖_{L2(ϕ_p)} = 0, where τ_y is the translation-by-y operator.

  4. ℱ is closed in L2(ϕ_p).

  5. There exists q > 2 such that ℱ ⊆ Lq(ϕ_p).

By a generalization of the Riesz-Kolmogorov theorem, as seen in Guo and Zhao (2019), ℱ is compact under conditions (1) through (4). Let c > 0 and α ∈ (0, 1]. We suppose that 𝒮 = 𝒮0, where 𝒮0 is the set of all functions S : 𝒵 → R such that |S(z)| ≤ F(z) and |S(z) − S(z′)| ≤ c ‖z − z′‖_2^α for all z, z′ ∈ 𝒵. Assume further that F is bounded, i.e.

sup_{z∈𝒵} |F(z)| = B_{𝒮0} < ∞, (S3)

and also that F is constant on the orbits induced by the group action on 𝒵 defined in Section 7.1.

For each f ∈ ℱ, let P_f denote the distribution for which X ~ N(0, Id_p) and Y|X ~ N(f(X), 1). Suppose that 𝒫1 = {P_f : f ∈ ℱ}. With the metric ρ(f, g) = ‖f − g‖_{L2(ϕ_p)}, (𝒫1, ρ) is a complete, separable, compact metric space. We also see that P ↦ R(T, P) is continuous.

Lemma S20. For all T ∈ 𝒯e, P ↦ R(T, P) is continuous in this example.

Proof. To ease presentation, we introduce some notation. For f ∈ ℱ, let f(x) ≔ (f(x_i))_{i=1}^n, f̄(x) ≔ n^{−1} Σ_{i=1}^n f(x_i), s_f(d) ≔ s(y + f(x)), and ȳ(y) ≔ ȳ. Let S_{T,f} denote the map (d, x0) ↦ S_T(z_f(d, x0)), where z_f(d, x0) takes the same value as z(d, x0) except that the entry (y − ȳ)/s(y) is replaced with (y + f(x) − ȳ − f̄(x))/s_f(d). Also let ϕ ≔ ϕ_{p(n+1)+n}. For q ∈ [1, ∞) and a function f : 𝒟 × 𝒳 → R, we let ‖f‖_{Lq(ϕ)} ≔ (∫ |f(x, y, x0)|^q ϕ(dx, dy, dx0))^{1/q}. We let ‖f‖_{L∞(ϕ)} ≔ inf{ c ≥ 0 : |f(x, y, x0)| ≤ c ϕ-a.s. }. For f : 𝒟 → R, we write ‖f‖_{Lq(ϕ)} to mean ‖(d, x0) ↦ f(d)‖_{Lq(ϕ)}, and follow a similar convention for functions that only take as input x, x_i, y, or x0. We will write ≲ to mean inequality up to a positive multiplicative constant that may depend only on 𝒮 or ℱ.

Fix ε ∈ (0, 1) and T ∈ 𝒯e. Now, for any f ∈ ℱ, a change of variables shows that

R(T, P_f) = E_{P_f}[ ∫ (T(X, Y)(x0) − f(x0))^2 dϕ_p(x0) ]
= ∫ (T(x, y)(x0) − f(x0))^2 (2π)^{−n/2} exp( −(1/2) Σ_{i=1}^n (y_i − f(x_i))^2 ) dϕ_{p(n+1)}(x, x0) dy
= ∫ (T(x, y + f(x))(x0) − f(x0))^2 ϕ(dx, dx0, dy)
= ∫ (ȳ + s(y + f(x)) S_{T,f}(d, x0) + f̄(x) − f(x0))^2 ϕ(dx, dx0, dy).

Hereafter we write dϕ to denote ϕ(dx, dx0, dy).

Fix f, g ∈ ℱ. Most of the remainder of this proof will involve establishing that R(T, P_f) − R(T, P_g) ≲ ε^{−2}(‖f − g‖_{L2(ϕ_p)}^{α/2} + ‖f − g‖_{L2(ϕ_p)}^{1/4}) + ε. By symmetry, it will follow that |R(T, P_f) − R(T, P_g)| ≲ ε^{−2}(‖f − g‖_{L2(ϕ_p)}^{α/2} + ‖f − g‖_{L2(ϕ_p)}^{1/4}) + ε.

In what follows we will use the notation (g − f)(x0) to mean g(x0) − f(x0), (ḡ − f̄)(x) to mean ḡ(x) − f̄(x), etc. The above yields that

R(T, P_f) − R(T, P_g) = ∫ [ (f̄(x) − f(x0))^2 − (ḡ(x) − g(x0))^2 ] dϕ (S4)
+ 2 ∫ ȳ [ (g − f)(x0) − (ḡ − f̄)(x) ] dϕ (S5)
+ 2 ∫ ȳ [ s_f(d) S_{T,f}(d, x0) − s_g(d) S_{T,g}(d, x0) ] dϕ (S6)
+ ∫ [ s_f^2(d) S_{T,f}(d, x0)^2 − s_g^2(d) S_{T,g}(d, x0)^2 ] dϕ (S7)
+ 2 ∫ [ (f̄(x) − f(x0)) s_f(d) S_{T,f}(d, x0) − (ḡ(x) − g(x0)) s_g(d) S_{T,g}(d, x0) ] dϕ. (S8)

We bound the labeled terms on the right-hand side separately. After some calculations, it can be seen that (S4) and (S5) are bounded by a constant multiple of ‖f − g‖_{L2(ϕ_p)}. These calculations, which are omitted, involve several applications of the triangle inequality, the Cauchy-Schwarz inequality, and condition (1).

The integral in (S6) bounds as follows:

∫ ȳ [ s_f(d) S_{T,f}(d, x0) − s_g(d) S_{T,g}(d, x0) ] dϕ = ∫ ȳ S_{T,f}(d, x0) [ s_f(d) − s_g(d) ] dϕ + ∫ ȳ s_g(d) [ S_{T,f}(d, x0) − S_{T,g}(d, x0) ] dϕ ≤ ‖ȳ S_{T,f}(s_f − s_g)‖_{L1(ϕ)} + ‖ȳ s_g(S_{T,f} − S_{T,g})‖_{L1(ϕ)}. (S9)

We start by studying the first term of the right-hand side above. Note that, by (S3) and the assumption that |S(z)| ≤ F(z) for all z ∈ 𝒵 and S ∈ 𝒮, we have that |S_{T,f}(d, x0)| ≤ B_{𝒮0}. Combining this with the Cauchy-Schwarz inequality, the first term on the right-hand side above bounds as

‖ȳ S_{T,f}(s_f − s_g)‖_{L1(ϕ)} ≤ B_{𝒮0} ‖ȳ‖_{L2(ϕ)} ‖s_f − s_g‖_{L2(ϕ)}. (S10)

To continue the above bound, we will show that ‖s_f − s_g‖_{L2(ϕ)} ≲ ‖f − g‖_{L2(ϕ_p)}^{1/2}. Noting that

s_f^2(d) − s_g^2(d) = (1/n) Σ_{i=1}^n [ f(x_i)^2 − g(x_i)^2 + 2(y_i − ȳ)(f(x_i) − g(x_i)) ] − [ f̄(x)^2 − ḡ(x)^2 ],

we see that, by the triangle inequality and the Cauchy-Schwarz inequality,

‖s_f^2 − s_g^2‖_{L1(ϕ)} ≲ ‖f − g‖_{L2(ϕ_p)}.

For a, b > 0, |√a − √b| ≤ |a − b|^{1/2}, and so |s_f(d) − s_g(d)| ≤ |s_f^2(d) − s_g^2(d)|^{1/2}, which implies that (s_f(d) − s_g(d))^2 ≤ |s_f^2(d) − s_g^2(d)|, which in turn implies that ‖s_f − s_g‖_{L2(ϕ)}^2 ≤ ‖s_f^2 − s_g^2‖_{L1(ϕ)}. Combining this with the above and taking square roots of both sides gives the desired bound, namely

‖s_f − s_g‖_{L2(ϕ)} ≲ ‖f − g‖_{L2(ϕ_p)}^{1/2}. (S11)

Recalling (S10), we then see that the first term on the right-hand side of (S9) satisfies

‖ȳ S_{T,f}(s_f − s_g)‖_{L1(ϕ)} ≲ ‖f − g‖_{L2(ϕ_p)}^{1/2}.

We now study the second term in (S9). Before beginning our analysis, we note that, for all d,

1 ≤ 1{s_g(d) ≤ ε} + 1{s_g(d) > ε, |s_g(d) − s_f(d)| < ε/2} + 1{|s_g(d) − s_f(d)| ≥ ε/2}. (S12)

Combining the above with the triangle inequality, the second term in (S9) bounds as:

‖ȳ s_g(S_{T,f} − S_{T,g})‖_{L1(ϕ)} ≤ ‖ȳ s_g(S_{T,f} − S_{T,g}) 1{s_g ≤ ε}‖_{L1(ϕ)} + ‖ȳ s_g(S_{T,f} − S_{T,g}) 1{s_g > ε, |s_f − s_g| < ε/2}‖_{L1(ϕ)} + ‖ȳ s_g(S_{T,f} − S_{T,g}) 1{|s_g − s_f| ≥ ε/2}‖_{L1(ϕ)}. (S13)

In the above normed quantities, expressions like 1{s_g ≤ ε} should be interpreted as functions, e.g. d ↦ 1{s_g(d) ≤ ε}. By (S3), the first term on the right-hand side bounds as

‖ȳ s_g(S_{T,f} − S_{T,g}) 1{s_g ≤ ε}‖_{L1(ϕ)} ≲ ε.

For the second term, we start by noting that

‖z_f(d) − z_g(d)‖_2 = ‖ [ (s_g − s_f)(d) / (s_g(d) s_f(d)) ] (y − ȳ) + (1/s_g(d)) (f − g + ḡ − f̄)(x) + [ (s_g − s_f)(d) / (s_f(d) s_g(d)) ] (f − f̄)(x) ‖_2.

Using that (a + b + c)^κ ≤ a^κ + b^κ + c^κ whenever a, b, c > 0 and κ ∈ (0, 1], this then implies that

‖z_f(d) − z_g(d)‖_2^α ≤ ‖ [ (s_g − s_f)(d) / (s_g(d) s_f(d)) ] (y − ȳ) ‖_2^α + ‖ (f − g + ḡ − f̄)(x) / s_g(d) ‖_2^α + ‖ [ (s_g − s_f)(d) / (s_f(d) s_g(d)) ] (f − f̄)(x) ‖_2^α,

where above α is the exponent from the Hölder condition satisfied by 𝒮0. Combining the Hölder condition with the above, we then see that

|S_{T,f}(d, x0) − S_{T,g}(d, x0)| ≲ ‖ [ (s_g − s_f)(d) / (s_g(d) s_f(d)) ] (y − ȳ) ‖_2^α + ‖ (f − g + ḡ − f̄)(x) / s_g(d) ‖_2^α + ‖ [ (s_g − s_f)(d) / (s_f(d) s_g(d)) ] (f − f̄)(x) ‖_2^α.

Multiplying both sides by |ȳ| s_g(d) 1{s_g(d) > ε, |s_f − s_g|(d) < ε/2}, we then see that

|ȳ| s_g(d) |S_{T,f}(d, x0) − S_{T,g}(d, x0)| 1{s_g(d) > ε, |s_f − s_g|(d) < ε/2}
≲ |ȳ| s_g(d) ‖ [ (s_g − s_f)(d) / (s_g(d) s_f(d)) ] (y − ȳ) ‖_2^α 1{s_g(d) > ε, |s_f − s_g|(d) < ε/2}
+ |ȳ| s_g(d) ‖ (f − g + ḡ − f̄)(x) / s_g(d) ‖_2^α 1{s_g(d) > ε, |s_f − s_g|(d) < ε/2}
+ |ȳ| s_g(d) ‖ [ (s_g − s_f)(d) / (s_f(d) s_g(d)) ] (f − f̄)(x) ‖_2^α 1{s_g(d) > ε, |s_f − s_g|(d) < ε/2}
≲ ε^{−α} |ȳ| s_g(d)^{1−α} ‖y − ȳ‖_2^α |s_g − s_f|(d)^α + |ȳ| s_g(d)^{1−α} ‖(f − g + ḡ − f̄)(x)‖_2^α + ε^{−α} |ȳ| s_g(d)^{1−α} ‖(f − f̄)(x)‖_2^α |s_g − s_f|(d)^α.

The inequality above remains true if we integrate both sides against ϕ. The resulting three terms on the right-hand side can be bounded using Hölder’s inequality. In particular, we have that

ε^{−α} ‖ |ȳ|^α ‖y − ȳ‖_2^α |s_g − s_f|^α |ȳ|^{1−α} s_g^{1−α} ‖_{L1(ϕ)} ≤ ε^{−α} ‖ ȳ ‖y − ȳ‖_2 (s_g − s_f) ‖_{L1(ϕ)}^α ‖ ȳ s_g ‖_{L1(ϕ)}^{1−α} ≲ ε^{−α} ‖f − g‖_{L2(ϕ_p)}^{α/2},
‖ ȳ s_g^{1−α} ‖(f − g + ḡ − f̄)(x)‖_2^α ‖_{L1(ϕ)} ≤ ‖ ȳ s_g ‖_{L1(ϕ)}^{1−α} ‖ ȳ ‖(f − g + ḡ − f̄)(x)‖_2 ‖_{L1(ϕ)}^α ≲ ‖f − g‖_{L2(ϕ_p)}^{α/2},
ε^{−α} ‖ ȳ s_g^{1−α} ‖(f − f̄)(x)‖_2^α |s_g − s_f|^α ‖_{L1(ϕ)} ≤ ε^{−α} ‖ ȳ s_g ‖_{L1(ϕ)}^{1−α} ‖ ȳ ‖(f − f̄)(x)‖_2 |s_g − s_f| ‖_{L1(ϕ)}^α ≲ ε^{−α} ‖f − g‖_{L2(ϕ_p)}^{α/2}.

Hence, we have shown that the second term on the right-hand side of (S13) satisfies

‖ȳ s_g(S_{T,f} − S_{T,g}) 1{s_g > ε, |s_g − s_f| < ε/2}‖_{L1(ϕ)} ≲ ε^{−α} ‖f − g‖_{L2(ϕ_p)}^{α/2}.

We now study the third term on the right-hand side of (S13). We start by noting that, by Markov’s inequality and (S11),

P_ϕ(|s_g(D) − s_f(D)| ≥ ε/2) = P_ϕ(|s_g(D) − s_f(D)|^2 ≥ ε^2/4) ≤ (4/ε^2) ‖s_f − s_g‖_{L2(ϕ)}^2 ≲ ε^{−2} ‖f − g‖_{L2(ϕ_p)}.

Moreover, by the generalized Hölder’s inequality with parameters (4, 2, ∞, 4), we see that

‖ȳ s_g(S_{T,f} − S_{T,g}) 1{|s_g − s_f| ≥ ε/2}‖_{L1(ϕ)} ≤ ‖ȳ‖_{L4(ϕ)} ‖s_g‖_{L2(ϕ)} ‖S_{T,f} − S_{T,g}‖_{L∞(ϕ)} ‖1{|s_g − s_f| ≥ ε/2}‖_{L4(ϕ)} ≤ 2 ‖ȳ‖_{L4(ϕ)} ‖s_g‖_{L2(ϕ)} B_{𝒮0} P_ϕ(|s_g − s_f| ≥ ε/2)^{1/4} ≲ ε^{−1/2} ‖f − g‖_{L2(ϕ_p)}^{1/4}.

Combining our bounds for the three terms on the right-hand side of (S13), we have shown that

‖ȳ s_g(S_{T,f} − S_{T,g})‖_{L1(ϕ)} ≲ ε + ε^{−α} ‖f − g‖_{L2(ϕ_p)}^{α/2} + ε^{−1/2} ‖f − g‖_{L2(ϕ_p)}^{1/4}. (S14)

The above provides our bound for the (S6) term from the main expression.

We now study the (S7) term from the main expression. We start by decomposing this term as

∫ [ s_f^2 S_{T,f}^2 − s_g^2 S_{T,g}^2 ] dϕ = ∫ S_{T,f}^2 [ s_f^2 − s_g^2 ] dϕ + ∫ s_g^2 [ S_{T,f}^2 − S_{T,g}^2 ] dϕ,

where, for brevity, we have suppressed the dependence of s_f, s_g, S_{T,f}, and S_{T,g} on their arguments. By (S11), the first term is bounded by a constant times ‖f − g‖_{L2(ϕ_p)}. For the second term, we note that the uniform bound on S_{T,f} and S_{T,g} shows that

‖s_g^2(S_{T,f}^2 − S_{T,g}^2)‖_{L1(ϕ)} ≲ ‖s_g^2(S_{T,f} − S_{T,g})‖_{L1(ϕ)}.

Similarly to our study of (S6), we can use (S12) and the triangle inequality to write

‖s_g^2(S_{T,f} − S_{T,g})‖_{L1(ϕ)} ≤ ‖s_g^2(S_{T,f} − S_{T,g}) 1{s_g ≤ ε}‖_{L1(ϕ)} + ‖s_g^2(S_{T,f} − S_{T,g}) 1{s_g > ε, |s_f − s_g| < ε/2}‖_{L1(ϕ)} + ‖s_g^2(S_{T,f} − S_{T,g}) 1{|s_g − s_f| ≥ ε/2}‖_{L1(ϕ)}.

The first term on the right-hand side is bounded above by a constant times ε^2. The analyses of the second and third terms are similar to the analyses of the analogous terms from (S6). A minor difference between the study of these terms and that of (S6) is that, when applying Hölder's inequality to separate the terms in each normed expression, we use condition (5) to ensure that ‖s_g‖_{Lq(ϕ)} < ∞ for some q > 2. This helps us deal with the fact that s_g^2, rather than s_g, appears in the normed expressions above. Due to the similarity of the arguments to those given for (S6), the calculations for controlling the second and third terms are omitted. After the relevant calculations, we end up showing that, like (S6), (S7) is bounded by a constant times the right-hand side of (S14).

To study (S8) from the main expression, we rewrite the integral as

∫ [ (f̄(x) − f(x0)) s_f(d) S_{T,f}(d, x0) − (ḡ(x) − g(x0)) s_g(d) S_{T,g}(d, x0) ] dϕ = ∫ s_f(d) S_{T,f}(d, x0) [ (f̄ − ḡ)(x) − (f − g)(x0) ] dϕ + ∫ S_{T,f}(d, x0) (ḡ(x) − g(x0)) (s_f − s_g)(d) dϕ + ∫ s_g(d) (ḡ(x) − g(x0)) [ S_{T,f}(d, x0) − S_{T,g}(d, x0) ] dϕ.

Each of the terms in the expansion can be bounded using similar techniques to those used earlier in this proof. Combining our bounds on (S4) through (S8), we see that

R(T, P_f) − R(T, P_g) ≲ ε^{−2}(‖f − g‖_{L2(ϕ_p)}^{α/2} + ‖f − g‖_{L2(ϕ_p)}^{1/4}) + ε.

As f, g ∈ ℱ were arbitrary, we see that, for any sequence (f_k)_k in ℱ such that f_k → f in L2(ϕ_p) as k → ∞, it holds that limsup_{k→∞} |R(T, P_{f_k}) − R(T, P_f)| ≲ ε. As ε ∈ (0, 1) was arbitrary, this shows that R(T, P_{f_k}) → R(T, P_f) as k → ∞. Hence, P ↦ R(T, P) is continuous in this example. □

D. Further details on numerical experiments

D.1. Meta-Learning Benchmarks

We implemented MAML via the learn2learn python package (Arnold et al., 2020), which in turn makes use of the Torchmeta package (Deleu et al., 2019) when generating the sinusoid functions. We trained MAML on a total of 10⁶ datasets with a batch size of 25 datasets and used the same learning rates and number of adaptation steps as were used in learn2learn/examples/maml_sine.py. We tried two network architectures, namely the same two-hidden-layer perceptron architecture that was used in the sinusoid experiments in Finn et al. (2017) and a larger network whose hidden layers contained the same number of nodes (40) but that used a total of five hidden layers. For each of the three regression settings considered (sinusoid, Gaussian process with a 1-dimensional feature, and Gaussian process with a 5-dimensional feature), we reported results for the architecture that performed best across the sample sizes considered. This ended up corresponding to reporting results for the smaller network architecture across all three settings.

For the Gaussian process example with a 1-dimensional feature, we used the implementation of CNPs provided by Jiang (2021), which corresponds to a Pytorch implementation of the code from Garnelo et al. (2018). We also modified this code so that it could apply to the sinusoidal regression example and the Gaussian process example where the feature is 5-dimensional. The CNPs were updated over the same number of iterations and using the same batch size as AMC, namely 10⁶ and 25, respectively. We tried two network architectures for the CNPs, namely the same architecture as was used in Garnelo et al. (2018), with the input size modified in one of the Gaussian process settings to account for the 5-dimensional feature, and also a deeper architecture with a similar number of hidden layers to the architecture used for AMC. In particular, the encoder and decoder in this larger architecture each had nine hidden layers consisting of 100 nodes. As we did for MAML, for each of the three regression settings considered, we reported results for the architecture that performed best across the sample sizes considered. This corresponded to reporting CNP results for the smaller architecture for the Gaussian process with a 5-dimensional feature, and the larger architecture for the Gaussian process with a 1-dimensional feature and the sinusoidal regression.

D.2. Comparing to Analytically-Derived Estimators with Known Theoretical Performance Guarantees

D.2.1. Preliminaries

We now introduce notation that will be useful for defining Γ1 in the two examples. In both examples, all priors in Γ1 imply the same prior Π_X over the distribution P_X of the features. This prior Π_X imposes that the Σ indexing P_X is equal in distribution to diag(W^{−1})^{−1/2} W^{−1} diag(W^{−1})^{−1/2}, where W is a p×p matrix drawn from a Wishart distribution with scale matrix 2 Id_p and 20 degrees of freedom, and diag(W^{−1}) denotes a matrix with the same diagonal as W^{−1} and zero in all other entries. The expression for Σ normalizes by diag(W^{−1})^{−1/2} to ensure that the diagonal of Σ is equal to 1_p, which we require of distributions in 𝒫_X. We let Γ_μ be a collection of Markov kernels κ, so that, for each κ and P_X ∈ 𝒫_X, κ(·, P_X) is a distribution over regression functions. The collections Γ_μ differ in the two examples, and will be presented in the coming subsections. Let Unif(ℬ) denote a uniform distribution over the permutation matrices in ℬ. For each κ ∈ Γ_μ, we let Π_κ represent a prior on 𝒫1 from which a draw P can be generated by sampling P_X ~ Π_X, μ_{P_X} ~ κ(·, P_X), and B | (P_X, μ_{P_X}) ~ Unif(ℬ), and subsequently returning the distribution of (X, μ_{P_X}(BX) + ϵ_P), where X ~ P_X and ϵ_P ~ N(0, 1) are independent. We let Γ1 ≔ {Π_κ : κ ∈ Γ_μ}. For a general class of estimators 𝒯, enforcing that each draw P has a regression function μ_P of the form x ↦ μ(Bx) for some permutation B is useful because it allows us to restrict the class Γ_μ so that each function in this class only depends on the first s coordinates of the input, while yielding a regression function μ_P that may depend on an arbitrary collection of s out of the p total coordinates. For the equivariant class that we consider (Algorithm 2), enforcing this turns out to be unnecessary: the invariance of functions in 𝒯 to permutations of the features implies that the Bayes risk of each T ∈ 𝒯 remains unchanged if the random variable B defining Π_κ ∈ Γ1 is replaced by a degenerate random variable that is always equal to the identity matrix.
Nonetheless, allowing B to be a random draw from Unif(ℬ) allows us to ensure that our implied collection of priors Γ satisfies P1, P2, and P3, thereby making the implied Γ compatible with the preservation conditions imposed in Section 2.
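The prior Π_X on Σ described above can be simulated directly (our sketch, assuming the standard representation of a Wishart draw as a sum of Gaussian outer products; all variable names are ours). The normalization by diag(W^{−1})^{−1/2} makes the diagonal of Σ equal to 1_p, as required.

```python
import numpy as np

rng = np.random.default_rng(0)
p, df, scale = 4, 20, 2.0

# W ~ Wishart(2 * Id_p, df): sum of outer products of N(0, 2 * Id_p) draws.
Z = rng.normal(scale=np.sqrt(scale), size=(df, p))
W = Z.T @ Z

# Sigma = diag(W^{-1})^{-1/2} W^{-1} diag(W^{-1})^{-1/2}: unit diagonal.
W_inv = np.linalg.inv(W)
d = np.sqrt(np.diag(W_inv))
Sigma = W_inv / np.outer(d, d)
```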

Figure S5:

Bayesian standardized MSE E_Π[R(T, P)], where R is defined in Eq. (1), of the five meta-learning algorithms considered in the sinusoidal regression example when the feature x or the outcome y is scaled down by a multiplicative factor (left two columns) or when x or y is shifted by an additive factor (right two columns). For reference, the numbers reported in Table 1 in the main text are equal to the standardized MSE reported on the far-left side of each facet times the variance of the error (0.09). The three equivariant procedures (MAML-Eq, CNP-Eq, and AMC) have constant standardized MSE under the shifts and rescalings considered. The non-equivariant procedures, namely MAML and CNPs, are sensitive even to small shifts or rescalings of x, and CNPs are also sensitive to small shifts in y.

We now use the notation of Kingma and Ba (2014) to detail the hyperparameters that we used. In all settings, we set (β₂, ϵ) = (0.999, 10⁻⁸). Whenever we were updating the prior network, we set the momentum parameter β₁ to 0, and whenever we were updating the estimator network, we set the momentum parameter to 0.25. The parameter α differed across settings. In the sparse linear regression setting with s=1, we found that choosing α small helped to improve stability. Specifically, we let α=0.0002 when updating both the estimator and prior networks. In the sparse linear regression setting with s=5, we used the more commonly chosen parameter setting of α=0.001 for both networks. In the FLAM example, we chose α=0.001 and α=0.005 for the estimator and prior networks, respectively.

The learning rates of the estimator and prior networks were decayed at rates t^{−0.15} and t^{−0.25}, respectively. Such two-timescale learning rate strategies have proven effective in stabilizing the optimization problem pursued by generative adversarial networks (Heusel et al., 2017). As noted in Fiez et al. (2019), using two-timescale strategies can cause the optimization problem to converge to a differential Stackelberg, rather than a differential Nash, equilibrium. Indeed, under some conditions, the two-timescale strategy that we use is expected to converge to a differential Stackelberg equilibrium in the hierarchical two-player game where a prior Π is first selected from Γ, and then an estimator T is selected from 𝒯 to perform well against Π. An optimal prior Π in this game is called Γ-least favorable, in the sense that this prior maximizes the mapping Π ↦ inf_{T∈𝒯} r(T, Π) over Γ. For a given Γ-least favorable prior Π, an optimal estimator T in this game is a Bayes estimator against Π, that is, an estimator that minimizes r(·, Π) over 𝒯. This T may not necessarily be a Γ-minimax strategy, that is, T may not minimize sup_{Π∈Γ} r(·, Π) over 𝒯. Nevertheless, we note that, under appropriate conditions, the two notions of optimality necessarily agree. Though such a theoretical guarantee is not likely to hold in our experiments given the neural network parameterizations that we use, we elected to use this two-timescale strategy because of the improvements in stability that we saw.
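The two-timescale schedule described above can be sketched in plain Python (our illustration; the base rate 0.001 is one of the settings reported above). Because the prior network's rate decays faster, the ratio of prior to estimator learning rates tends to zero as training proceeds.

```python
import numpy as np

def lr_schedule(alpha0, decay, t):
    """Learning rate alpha0 * t^(-decay) at iteration t >= 1."""
    return alpha0 * t ** (-decay)

t = np.arange(1, 10 ** 6 + 1, dtype=float)
lr_est = lr_schedule(0.001, 0.15, t)    # estimator network: slower decay
lr_prior = lr_schedule(0.001, 0.25, t)  # prior network: faster decay
```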

In all settings, the prior and estimator were updated over 10⁶ iterations using batches of 100 datasets. For each dataset, performance is evaluated at 100 values of x0.

D.2.2. Sparse linear regression

We now introduce notation that will be useful for presenting the collection Γ_μ in the sparse linear regression example. For a function G : R → R and a distribution P_X ∈ 𝒫_X, we let κ_G(·, P_X) be equal to the distribution of

x ↦ U0 + ( (e^{G(U1)}, …, e^{G(Us)}, 0, …, 0) / Σ_{j=1}^{s} e^{G(Uj)} )^T x,

where U0 ~ Unif(−5, 5) and (U1, …, Us) ~ N(0_s, Id_s) are drawn independently. Notably, here κ_G(·, P_X) does not depend on P_X. We let Γ_μ ≔ {κ_G : G ∈ 𝒢}, where 𝒢 takes different values when s=1 and when s=5. When s=1, 𝒢 consists of all four-hidden-layer perceptrons with identity output activation, where each hidden layer consists of forty leaky ReLU units. When s=5, 𝒢 consists of all four-hidden-layer neural networks with identity output activation, but in this case each layer is a multi-input-output channel equivariant layer as described in Eq. 22 of Zaheer et al. (2017). Each hidden layer is again equipped with a ReLU activation function. The output of each such network is equivariant to permutations of the s=5 inputs.
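A draw of the regression function from κ_G can be sketched as follows (our illustration; the stand-in G below is a fixed toy function, not one of the neural networks in 𝒢). The resulting weight vector is supported on the first s coordinates and its entries sum to one.

```python
import numpy as np

p, s = 10, 5

def draw_regression_fn(G, rng):
    """Draw x -> U0 + w'x, with w the normalized weights
    (e^{G(U_1)}, ..., e^{G(U_s)}, 0, ..., 0) / sum_j e^{G(U_j)}."""
    U0 = rng.uniform(-5.0, 5.0)
    U = rng.normal(size=s)
    w = np.zeros(p)
    w[:s] = np.exp([G(u) for u in U])
    w[:s] /= w[:s].sum()
    return (lambda x: U0 + w @ x), w

rng = np.random.default_rng(0)
mu, w = draw_regression_fn(lambda u: 0.5 * u, rng)  # toy stand-in for G
```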

In each sparse linear regression setting considered, we initialized the estimator network by pretraining for 5,000 iterations against the initial fixed prior network. After these 5,000 iterations, we then began to adversarially update the prior network against the estimator network.

Five thousand Monte Carlo replicates were used to obtain the performance estimates in Table 2.

D.2.3. Fused lasso additive model

When discussing the FLAM example, we will write x_j to denote the jth feature, that is, we denote a generic x ∈ 𝒳 by x = (x_1, …, x_p). We emphasize this to avoid any notational confusion with the fact that, elsewhere in the text, X_i ∈ 𝒳 is used to denote the random variable corresponding to the ith observation.

In the FLAM example, each prior κ_G in Γ_μ is indexed by a function G : R^{s+2} → [0, ∞)^s belonging to the collection of four-hidden-layer perceptrons with identity output activation, where each hidden layer consists of forty leaky ReLU units. Specifically, κ_G(·, P_X) is a distribution over generalized additive models x ↦ Σ_{j=1}^{p} μ_j(x_j) for which each component μ_j is piecewise-constant and changes values at most 500 times. To obtain a draw μ_P from κ_G(·, P_X), we can first draw 500 iid observations from P_X and store these observations in the matrix X̃. Each component μ_j can only have a jump at the 500 points in X̃_{·j}. The magnitude of each jump is defined using the function G, and the sign of the jump is chosen uniformly at random. More specifically, these increments are defined based on the independent sources of noise (H_{jk} : j = 1, …, p; k = 1, …, 500), which is an iid collection of Rademacher random variables, and (U_k : k = 1, …, 500), which is an iid collection of N(0_{s+2}, Id_{s+2}) random variables. The component μ_j is chosen to be proportional to the function f_j(x_j) = Σ_{k=1}^{500} H_{jk} G(U_k)_j 1{x_j ≥ X̃_{kj}}. The proportionality constant c ≔ Σ_{j=1}^{p} Σ_{k=1}^{500} G(U_k)_j is defined so that the function μ_P(x) = c^{−1} Σ_{j=1}^{p} f_j(x_j) saturates the constraint v(μ)_1 ≤ M that is imposed by the model. To recap, the random draw μ_P from κ_G(·, P_X) can be obtained by independently drawing X̃, (H_{jk} : j, k), and (U_k : k), and subsequently following the steps described above to define the corresponding proportionality constant c and components f_j, j = 1, …, p.
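The sampling scheme for μ_P just described can be sketched as follows (our illustration with hypothetical names; the jump magnitudes g stand in for G(U_k), the knots stand in for the rows of X̃, and the normalization shown is one way to make the summed total variation of the components equal a budget M).

```python
import numpy as np

rng = np.random.default_rng(0)
p, K, M = 3, 500, 10.0

knots = rng.uniform(-2.5, 2.5, size=(K, p))  # candidate jump locations per feature
H = rng.choice([-1.0, 1.0], size=(K, p))     # Rademacher jump signs
g = rng.uniform(0.0, 1.0, size=(K, p))       # nonnegative jump magnitudes

c = g.sum()  # normalizer: summed total variation of the scaled components is M

def mu(x):
    """Piecewise-constant additive regression function evaluated at x in R^p."""
    jumps = H * g * (x[None, :] >= knots)    # jumps active at x, per component
    return (M / c) * jumps.sum()
```

Each component j contributes total variation (M/c) Σ_k g_{kj}, so the contributions across components sum exactly to M.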

We evaluated the performance of the learned prediction procedures using a variant of simulation scenarios 1–4 from the paper that introduced FLAM (Fig. 2 in Petersen et al., 2016). As presented in that work, the four scenarios have p independent Unif(−2.5, 2.5) features, with the components corresponding to s_0 = 4 of these features being nonzero. These scenarios cover a range of smoothness settings, with scenarios 1–4 enforcing that the nonzero components be (1) piecewise constant, (2) smooth, (3) a mix of piecewise constant and smooth, and (4) constant in some areas of their domains and highly variable in others. To evaluate our procedures trained with v(μ_P)_0 ≤ 5, we used the R function sim.data in the flam package (Petersen, 2018) to generate training data from the scenarios in Petersen et al. (2016) with p = 10 features. We then generated new outcomes by rescaling the regression function by a positive multiplicative constant so that v(μ_P)_1 = 10, and subsequently added standard Gaussian noise. To evaluate our procedures trained at sparsity level s = 1 in a given scenario, we defined a prior over the regression function that first randomly selects one of the four signal components, then rescales this component so that it has total variation equal to 10, and then sets all other components equal to zero. Outcomes were generated by adding standard Gaussian noise to the sampled regression function. We compared our approach to the FLAM method as implemented in the flam package when, in the notation of Petersen et al. (2016), α = 1 and λ was chosen numerically to enforce that the resulting regression function estimate μ̂ satisfied v(μ̂)_1 ≤ 10. Choosing λ in this fashion is reasonable in light of the fact that v(μ_P)_1 = 10 in all settings considered.
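One way to choose λ numerically as described above is by bisection. The sketch below is not the procedure used by the authors or the flam package; it assumes only that the fitted total variation is nonincreasing in λ (an assumption, not a documented guarantee), and the function names are hypothetical.

```python
def smallest_lambda_with_tv_bound(fit_tv, bound, lo=1e-6, hi=1e6, iters=60):
    """Bisection sketch for choosing the penalty lambda so that the fitted
    estimate satisfies v(mu_hat)_1 <= bound.

    fit_tv(lam) returns the total variation of the fit at penalty lam and is
    assumed nonincreasing in lam; lo/hi bracket the answer.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if fit_tv(mid) <= bound:
            hi = mid   # constraint satisfied; try a smaller penalty
        else:
            lo = mid   # constraint violated; need a larger penalty
    return hi          # smallest penalty found that satisfies the constraint
```

The returned λ always satisfies the constraint, since the invariant fit_tv(hi) ≤ bound is maintained throughout the search.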

Two thousand Monte Carlo replicates were used to obtain the performance estimates in Table 3.

E. Additional details and results for data experiments

E.1. Datasets

We begin by describing the six datasets that are available through the UCI Machine Learning Repository (Dua and Graff, 2017). The first dataset (“abalone”) contains information on 4,177 abalones. The objective is to predict their age based on 7 features, namely length, diameter, height, whole weight, shucked weight, viscera weight, and shell weight (Nash et al., 1994). The second dataset (“airfoil”) is from the National Aeronautics and Space Administration (NASA) and contains information on 1,503 airfoils at various wind tunnel speeds and angles of attack (Brooks et al., 1989). The objective is to estimate the scaled sound level in decibels. Five features are available, namely frequency, angle of attack, chord length, free-stream velocity, and suction side displacement thickness. The third dataset (“fish”) was originally used to develop quantitative structure-activity relationship (QSAR) models to predict acute aquatic toxicity towards the fathead minnow. This dataset contains 908 observations, each corresponding to a distinct chemical. The outcome is the LC50 for that chemical, which represents the concentration of the chemical that is lethal for 50% of test fish over 96 hours. Six features that describe the molecular characteristics of the chemical are available — see the UCI Machine Learning Repository and Cassotti et al. (2015) for details. The fourth and fifth datasets contain information on 1,599 red wines (“wine-red”) and 4,898 white wines (“wine-white”) (Cortez et al., 2009). The objective is to predict wine quality score based on 11 available features — see the UCI Machine Learning Repository and Cassotti et al. (2015) for details. The sixth dataset (“yacht”) contains information on 308 sailing yachts. The objective is to predict a ship’s performance in terms of residuary resistance.
Six features describing a ship’s dimensions and velocity are available, namely: the longitudinal position of the center of buoyancy, the prismatic coefficient, the length-displacement ratio, the beam-draught ratio, the length-beam ratio, and the Froude number. See Gerritsma et al. (1981) for more information on these features.

The seventh and eighth datasets that we considered were used to illustrate regression procedures in James et al. (2013), and are available through the ISLR R package (James et al., 2017). The first of these (“college”) consists of information on 777 colleges in the United States. The objective is to predict out-of-state tuition based on 16 available continuous features. The second (“hitters”) contains information on 322 baseball players. The objective is to predict salary based on the 16 available continuous features. The ninth dataset (“LAozone”) was used to illustrate regression procedures in (Friedman, 2001). It consists of 330 daily meteorological measurements in the Los Angeles basin in 1976. The objective is to predict ozone levels based on 9 available features. The final dataset that we considered (“happiness”) was used to illustrate the performance of FLAM in the paper that introduced it (Petersen et al., 2016). This dataset consists of information about 109 countries. The objective is to predict the national happiness level via 12 country-level features.

E.2. Additional results for data experiments

Table S5 displays the cross-validated MSEs across the ten datasets in numerical form. Figure S6 shows the performance of the individual linear algorithms considered at different sparsity levels, and Figure S7 shows the same results but for the stacking algorithms.

Table S5:

Cross-validated MSEs on the ten datasets. The first five datasets had the same number of features (10) as were used during meta-training, whereas the others had fewer. For each of the three categories (linear estimators, FLAM estimators, and stacked estimators) and each dataset, the algorithm with the lowest Monte Carlo MSE is emphasized in bold. There was no clear ordering between the performance of AMC Linear and the existing estimators (OLS and lasso). AMC FLAM tended to outperform FLAM when the number of features matched the number used during meta-training, and to be slightly outperformed otherwise. When the number of features matched, stacking the existing and AMC estimators consistently outperformed all other approaches. When there were fewer features than were used during meta-training, stacking all available learners performed similarly to stacking only the existing algorithms and still outperformed all individual learners.

Features OLS Lasso AMC Linear (ours) FLAM AMC FLAM (ours) Stacked Existing Stacked AMC (ours) Stacked Both (ours)
college 10 0.414 0.397 0.377 0.392 0.395 0.358 0.354 0.348
happiness 10 0.270 0.277 0.275 0.315 0.311 0.280 0.261 0.256
hitters 10 0.667 0.660 0.662 0.626 0.619 0.602 0.615 0.585
wine-red 10 0.768 0.737 0.746 0.826 0.776 0.737 0.737 0.731
wine-white 10 0.833 0.814 0.824 0.899 0.860 0.809 0.815 0.802
LAozone 9 0.341 0.335 0.337 0.335 0.367 0.310 0.320 0.309
abalone 7 0.559 0.546 0.540 0.709 0.675 0.539 0.538 0.537
fish 6 0.471 0.475 0.480 0.544 0.554 0.464 0.476 0.468
yacht 6 0.381 0.372 0.350 0.019 0.035 0.015 0.029 0.015
airfoil 5 0.524 0.525 0.528 0.617 0.701 0.516 0.523 0.520

F. Performance of symmetrized estimators in experiments

We now present the additional experimental results that we alluded to in Section 8. These results were obtained by symmetrizing the meta-learned AMC100 and AMC500 estimators whose performance was reported in Section 5. In particular, we symmetrized a given AMC estimator T as

T_sym(x, y)(x_0) ≔ (1/2) [T(x, y)(x_0) − T(x, −y)(x_0)].

When reporting our experimental results, we refer to the symmetrized estimator derived from the meta-learned AMC100 and AMC500 estimators as ‘symmetrized AMC100' and ‘symmetrized AMC500', respectively. We emphasize that these symmetrized estimators are derived directly from the AMC100 and AMC500 fits that we reported in Section 5 – we did not rerun our AMC meta-learning algorithm to obtain these estimators.
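This symmetrization can be expressed as a simple wrapper around any prediction procedure. The sketch below is illustrative (not the authors' implementation) and treats an estimator as a function from a dataset to a prediction function:

```python
def symmetrize(T):
    """Wrap a prediction procedure T so that the result is equivariant to
    sign flips of the outcome: T_sym(x, -y)(x0) = -T_sym(x, y)(x0).

    T maps a dataset (x, y) to a prediction function; y must support
    negation (e.g., a numpy array).
    """
    def T_sym(x, y):
        f_plus = T(x, y)      # fit on the original outcomes
        f_minus = T(x, -y)    # fit on the sign-flipped outcomes
        return lambda x0: 0.5 * (f_plus(x0) - f_minus(x0))
    return T_sym
```

By construction, flipping the sign of y swaps the roles of the two fits, so the wrapped procedure's prediction flips sign as well, and no retraining of the underlying procedure is required.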

Figure S6:


Performance of OLS, lasso, and AMC Linear at different sparsity levels. For each training-validation split of the data, between 1 and q features are selected at random from the original dataset (x-axis), where q is the minimum of 10 and the total number of features in the dataset, and Gaussian noise features are then added so that there are 10 total features. Therefore, the signal is expected to become denser and stronger as the x-axis value increases. AMC Linear consistently outperformed OLS and performed similarly to or better than lasso in most settings (54% of all sparsity-dataset pairs).
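The feature-selection-and-padding scheme described in this caption can be sketched as follows; the function name and the hard-coded total of 10 features are illustrative assumptions, not the authors' code:

```python
import numpy as np

def pad_to_ten_features(x, n_signal, rng=None):
    """Select n_signal columns of x at random and append iid N(0, 1) noise
    columns so that the result has 10 total features."""
    rng = np.random.default_rng(rng)
    n, p = x.shape
    keep = rng.choice(p, size=n_signal, replace=False)   # random original features
    noise = rng.standard_normal(size=(n, 10 - n_signal)) # pure-noise features
    return np.hstack([x[:, keep], noise])
```

Larger n_signal keeps more true features and adds less noise, which is why the signal is expected to become denser and stronger as the x-axis value increases.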

Table S6 reports the results for the linear regression example. In many settings, the two approaches performed similarly. In the sparse setting, however, symmetrization sometimes cut the MSE in half. In one setting (dense, interior, n=100), AMC100 slightly outperformed symmetrized AMC100 – though not deducible from the table, we note that the difference in MSE in this case was less than 0.003, and this discrepancy is likely a result of Monte Carlo error. Table S7 reports the results for the fused lasso additive model example. Symmetrization led to a reduction in MSE in most settings; in all others, the MSE remained unchanged.

Figure S7:


Performance of the three stacking algorithms at different sparsity levels. For each training-validation split of the data, between 1 and q features are selected at random from the original dataset (x-axis), where q is the minimum of 10 and the total number of features in the dataset, and Gaussian noise features are then added so that there are 10 total features. Therefore, the signal is expected to become denser and stronger as the x-axis value increases. Though all algorithms performed similarly, the stacking algorithm that combined all available algorithms (Stacked Both) performed slightly better than the others in a majority of the settings (53% of all sparsity-dataset pairs), and Stacked AMC performed best in most other settings (39% of all sparsity-dataset pairs).

Table S6:

MSEs based on datasets of size n in the linear regression settings. All Monte Carlo standard errors are less than 0.001. Symmetrized AMC100 entries appear in bold when they had lower MSE (rounded to the nearest hundredth) than the corresponding AMC100 entry, and vice versa. Similarly, symmetrized AMC500 entries appear in bold when they had lower MSE than the corresponding AMC500 entry, and vice versa.

(a) Sparse signal
Boundary Interior
n=100 500 100 500
OLS 0.12 0.02 0.12 0.02
Lasso 0.06 0.01 0.06 0.01
AMC100 (ours) 0.02 <0.01 0.11 0.09
Symmetrized AMC100 (ours) 0.02 <0.01 0.06 0.04
AMC500 (ours) 0.02 <0.01 0.07 0.04
Symmetrized AMC500 (ours) 0.02 <0.01 0.06 0.03
(b) Dense signal
Boundary Interior
n=100 500 100 500
OLS 0.13 0.02 0.13 0.02
Lasso 0.11 0.02 0.09 0.02
AMC100 (ours) 0.10 0.04 0.08 0.02
Symmetrized AMC100 (ours) 0.09 0.03 0.09 0.02
AMC500 (ours) 0.09 0.02 0.09 0.02
Symmetrized AMC500 (ours) 0.09 0.02 0.09 0.02

Table S7:

MSEs based on datasets of size n in the FLAM settings. The Monte Carlo standard errors for the MSEs of FLAM and (symmetrized) AMC are all less than 0.04 and 0.01, respectively. Symmetrized AMC100 entries appear in bold when they had lower MSE (rounded to the nearest hundredth) than the corresponding AMC100 entry, and vice versa. Similarly, symmetrized AMC500 entries appear in bold when they had lower MSE than the corresponding AMC500 entry, and vice versa.

(a) Sparse signal
Scenario 1 Scenario 2 Scenario 3 Scenario 4
n=100 500 100 500 100 500 100 500
FLAM 0.44 0.12 0.47 0.17 0.38 0.11 0.51 0.19
AMC100 (ours) 0.34 0.20 0.18 0.08 0.27 0.14 0.17 0.08
Symmetrized AMC100 (ours) 0.32 0.18 0.18 0.08 0.26 0.13 0.16 0.08
AMC500 (ours) 0.48 0.12 0.19 0.06 0.35 0.10 0.23 0.08
Symmetrized AMC500 (ours) 0.43 0.12 0.17 0.05 0.32 0.09 0.21 0.07
(b) Dense signal
Scenario 1 Scenario 2 Scenario 3 Scenario 4
n=100 500 100 500 100 500 100 500
FLAM 0.59 0.17 0.65 0.24 0.53 0.16 0.76 0.36
AMC100 (ours) 1.20 0.91 0.47 0.39 0.87 0.57 0.30 0.30
Symmetrized AMC100 (ours) 1.16 0.84 0.45 0.37 0.83 0.52 0.29 0.30
AMC500 (ours) 0.58 0.15 0.37 0.08 0.46 0.12 0.36 0.09
Symmetrized AMC500 (ours) 0.55 0.15 0.36 0.08 0.43 0.11 0.34 0.09

References

  1. Arnold SM, Mahajan P, Datta D, Bunner I, and Zarkias KS. learn2learn: A library for meta-learning research. arXiv preprint arXiv:2008.12284, 2020. [Google Scholar]
  2. Berger JO. Statistical Decision Theory and Bayesian Analysis. Springer Science & Business Media, 1985. [Google Scholar]
  3. Bertinetto L, Henriques JF, Torr PH, and Vedaldi A. Meta-learning with differentiable closed-form solvers. arXiv preprint arXiv:1805.08136, 2018. [Google Scholar]
  4. Billingsley P. Convergence of probability measures. Wiley, 1999. [Google Scholar]
  5. Bosc T. Learning to learn neural networks. arXiv preprint arXiv:1610.06072, 2016. [Google Scholar]
  6. Breiman L. Stacked regressions. Machine learning, 24(1):49–64, 1996. [Google Scholar]
  7. Breiman L. Random forests. Machine learning, 45(1):5–32, 2001. [Google Scholar]
  8. Brooks TF, Pope DS, and Marcolini MA. Airfoil self-noise and prediction, volume 1218. National Aeronautics and Space Administration, 1989. [Google Scholar]
  9. Cassotti M, Ballabio D, Todeschini R, and Consonni V. A similarity-based qsar model for predicting acute toxicity towards the fathead minnow (pimephales promelas). SAR and QSAR in Environmental Research, 26(3):217–243, 2015. [DOI] [PubMed] [Google Scholar]
  10. Chamberlain G. Econometric applications of maxmin expected utility. Journal of Applied Econometrics, 15(6):625–644, 2000. [Google Scholar]
  11. Chang K-C. Methods in nonlinear analysis. Springer Science & Business Media, 2006. [Google Scholar]
  12. Chen T and Guestrin C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794. ACM, 2016. [Google Scholar]
  13. Cohn DL. Measure theory. Springer, 2013. [Google Scholar]
  14. Conway JB. A course in functional analysis, volume 96. Springer, 2010. [Google Scholar]
  15. Cortez P, Teixeira J, Cerdeira A, Almeida F, Matos T, and Reis J. Using data mining for wine quality assessment. In International Conference on Discovery Science, pages 66–79. Springer, 2009. [Google Scholar]
  16. Cybenko G. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314, 1989. [Google Scholar]
  17. Dalvi N, Domingos P, Sanghai S, and Verma D. Adversarial classification. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 99–108, 2004. [Google Scholar]
  18. Day MM. Fixed-point theorems for compact convex sets. Illinois Journal of Mathematics, 5 (4):585–590, 1961. [Google Scholar]
  19. Deleu T, Würfl T, Samiei M, Cohen JP, and Bengio Y. Torchmeta: A Meta-Learning library for PyTorch, 2019. URL https://arxiv.org/abs/1909.06576. Available at: https://github.com/tristandeleu/pytorch-meta.
  20. Dua D and Graff C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
  21. Efron B and Morris C. Limiting the risk of bayes and empirical bayes estimators—part ii: The empirical bayes case. Journal of the American Statistical Association, 67(337):130–139, 1972. [Google Scholar]
  22. Fan K. Minimax theorems. Proceedings of the National Academy of Sciences of the United States of America, 39(1):42, 1953. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Fiez T, Chasnov B, and Ratliff LJ. Convergence of learning dynamics in stackelberg games. arXiv preprint arXiv:1906.01217, 2019. [Google Scholar]
  24. Finn C, Abbeel P, and Levine S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1126–1135. JMLR. org, 2017. [Google Scholar]
  25. Finn C, Xu K, and Levine S. Probabilistic model-agnostic meta-learning. arXiv preprint arXiv:1806.02817, 2018. [Google Scholar]
  26. Friedman J, Hastie T, Tibshirani R, et al. The elements of statistical learning, volume 1. Springer series in statistics New York, 2001. [Google Scholar]
  27. Friedman JH. Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189–1232, 2001. [Google Scholar]
  28. Garnelo M, Rosenbaum D, Maddison C, Ramalho T, Saxton D, Shanahan M, Teh YW, Rezende D, and Eslami SA. Conditional neural processes. In International Conference on Machine Learning, pages 1704–1713. PMLR, 2018. [Google Scholar]
  29. Geman S and Geman D. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. IEEE Transactions on pattern analysis and machine intelligence, (6):721–741, 1984. [DOI] [PubMed] [Google Scholar]
  30. Geng S, Nassif H, Manzanares CA, Reppen AM, and Sircar R. Deep pqr: Solving inverse reinforcement learning using anchor actions. arXiv e-prints, pages arXiv-2007, 2020. [Google Scholar]
  31. Gerritsma J, Onnink R, and Versluis A. Geometry, resistance and stability of the delft systematic yacht hull series. International shipbuilding progress, 28(328):276–297, 1981. [Google Scholar]
  32. Ghosal S and Van der Vaart A. Fundamentals of nonparametric Bayesian inference, volume 44. Cambridge University Press, 2017. [Google Scholar]
  33. Gidel G, Berard H, Vignoud G, Vincent P, and Lacoste-Julien S. A variational inequality perspective on generative adversarial networks. arXiv preprint arXiv:1802.10551, 2018. [Google Scholar]
  34. Glynn PW. Likelihood ratio gradient estimation: an overview. In Proceedings of the 19th conference on Winter simulation, pages 366–375. ACM, 1987. [Google Scholar]
  35. Goldblum M, Fowl L, and Goldstein T. Adversarially robust few-shot learning: A meta-learning approach. arXiv preprint arXiv:1910.00982v2, 2019. [Google Scholar]
  36. Goodfellow IJ, Shlens J, and Szegedy C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014. [Google Scholar]
  37. Guo W and Zhao G. An improvement on the relatively compactness criteria. arXiv preprint arXiv:1904.03427, 2019. [Google Scholar]
  38. Hartford J, Graham DR, Leyton-Brown K, and Ravanbakhsh S. Deep models of interactions across sets. arXiv preprint arXiv:1803.02879, 2018. [Google Scholar]
  39. Hastings WK. Monte carlo sampling methods using markov chains and their applications. Biometrika, 57(1):97–109, 1970. [Google Scholar]
  40. Heusel M, Ramsauer H, Unterthiner T, Nessler B, and Hochreiter S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in neural information processing systems, pages 6626–6637, 2017. [Google Scholar]
  41. Higuchi A. Symmetric tensor spherical harmonics on the n-sphere and their application to the de sitter group so (n, 1). Journal of mathematical physics, 28(7):1553–1566, 1987. [Google Scholar]
  42. Hochreiter S and Schmidhuber J. Long short-term memory. Neural computation, 9(8):1735–1780, 1997. [DOI] [PubMed] [Google Scholar]
  43. Hochreiter S, Younger AS, and Conwell PR. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, pages 87–94. Springer, 2001. [Google Scholar]
  44. Hornik K. Approximation capabilities of multilayer feedforward networks. Neural networks, 4 (2):251–257, 1991. [Google Scholar]
  45. Hospedales T, Antoniou A, Micaelli P, and Storkey A. Meta-learning in neural networks: A survey. arXiv preprint arXiv:2004.05439, 2020. [DOI] [PubMed] [Google Scholar]
  46. Hunt G and Stein C. Most stringent tests of statistical hypotheses. Unpublished manuscript, 1946. [Google Scholar]
  47. James G, Witten D, Hastie T, and Tibshirani R. An introduction to statistical learning, volume 112. Springer, 2013. [Google Scholar]
  48. James G, Witten D, Hastie T, and Tibshirani R. ISLR: Data for an Introduction to Statistical Learning with Applications in R, 2017. URL https://CRAN.R-project.org/package=ISLR. R package version 1.2. [Google Scholar]
  49. Jiang S. Conditional neural process pytorch implementation, 2021. URL https://github.com/shalijiang/neural-process. [Google Scholar]
  50. Kempthorne PJ. Numerical specification of discrete least favorable prior distributions. SIAM Journal on Scientific and Statistical Computing, 8(2):171–184, 1987. [Google Scholar]
  51. Kingma DP and Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. [Google Scholar]
  52. Korpelevich GM. The extragradient method for finding saddle points and other problems. Matecon, 12:747–756, 1976. [Google Scholar]
  53. Le Cam L. Asymptotic methods in statistical decision theory. Springer Science & Business Media, 2012. [Google Scholar]
  54. Lee K, Maji S, Ravichandran A, and Soatto S. Meta-learning with differentiable convex optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10657–10665, 2019. [Google Scholar]
  55. Lin T, Jin C, and Jordan MI. On gradient descent ascent for nonconvex-concave minimax problems. arXiv preprint arXiv:1906.00331v6, 2019. [Google Scholar]
  56. Luedtke A, Carone M, Simon NR, and Sofrygin O. Learning to learn from data: using deep adversarial learning to construct optimal statistical procedures. Science Advances (in press; available online late Feb or Mar 2020), 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  57. Maron H, Fetaya E, Segol N, and Lipman Y. On the universality of invariant networks. arXiv preprint arXiv:1901.09342, 2019. [Google Scholar]
  58. Munkres J. Topology. Prentice Hall, 2000. ISBN 9780131816299. URL https://books.google.com/books?id=XjoZAQAAIAAJ. [Google Scholar]
  59. Nabi S, Nassif H, Hong J, Mamani H, and Imbens G. Decoupling learning rates using empirical bayes priors. arXiv preprint arXiv:2002.01129, 2020. [Google Scholar]
  60. Nash WJ, Sellers TL, Talbot SR, Cawthorn AJ, and Ford WB. The population biology of abalone (haliotis species) in tasmania. i. blacklip abalone (h. rubra) from the north coast and islands of bass strait. Sea Fisheries Division, Technical Report, 48:p411, 1994. [Google Scholar]
  61. Nelder JA and Wedderburn RW. Generalized linear models. Journal of the Royal Statistical Society: Series A (General), 135(3):370–384, 1972. [Google Scholar]
  62. Nelson W. Minimax solution of statistical decision problems by iteration. The Annals of Mathematical Statistics, pages 1643–1657, 1966. [Google Scholar]
  63. Noubiap RF and Seidel W. An algorithm for calculating γ-minimax decision rules under generalized moment conditions. The Annals of Statistics, 29(4):1094–1116, 2001. [Google Scholar]
  64. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, and Antiga L. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035, 2019. [Google Scholar]
  65. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, and Duchesnay E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011. [Google Scholar]
  66. Petersen A. flam: Fits Piecewise Constant Models with Data-Adaptive Knots, 2018. URL https://CRAN.R-project.org/package=flam. R package version 3.2. [Google Scholar]
  67. Petersen A, Witten D, and Simon N. Fused lasso additive model. Journal of Computational and Graphical Statistics, 25(4):1005–1025, 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  68. Pier J-P. Amenable locally compact groups. Wiley-Interscience, 1984. [Google Scholar]
  69. Polley EC and Van der Laan MJ. Super learner in prediction. Technical report, University of California, Berkeley, 2010. [Google Scholar]
  70. Ravanbakhsh S, Schneider J, and Poczos B. Deep learning with sets and point clouds. arXiv preprint arXiv:1611.04500, 2016. [Google Scholar]
  71. Ravanbakhsh S, Schneider J, and Poczos B. Equivariance through parameter-sharing. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2892–2901. JMLR. org, 2017. [Google Scholar]
  72. Ravi S and Larochelle H. Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR), 2017. [Google Scholar]
  73. Robert C. The Bayesian choice: from decision-theoretic foundations to computational implementation. Springer Science & Business Media, 2007. [Google Scholar]
  74. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015. [Google Scholar]
  75. Russell S. Learning agents for uncertain environments. In Proceedings of the eleventh annual conference on Computational learning theory, pages 101–103, 1998. [Google Scholar]
  76. Santoro A, Bartunov S, Botvinick M, Wierstra D, and Lillicrap T. Meta-learning with memory-augmented neural networks. In International conference on machine learning, pages 1842–1850, 2016. [Google Scholar]
  77. Schafer CM and Stark PB. Constructing confidence regions of optimal expected size. Journal of the American Statistical Association, 104(487):1080–1089, 2009. [Google Scholar]
  78. Schmidhuber J. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-… hook. PhD thesis, Technische Universität München, 1987. [Google Scholar]
  79. Terkelsen F. Some minimax theorems. Mathematica Scandinavica, 31(2):405–413, 1973. [Google Scholar]
  80. Thrun S and Pratt L. Learning to learn: Introduction and overview. In Learning to learn, pages 3–17. Springer, 1998. [Google Scholar]
  81. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996. [Google Scholar]
  82. Van der Laan MJ, Polley EC, and Hubbard AE. Super learner. Statistical applications in genetics and molecular biology, 6(1), 2007. [DOI] [PubMed] [Google Scholar]
  83. Van der Vaart AW, Dudoit S, and van der Laan MJ. Oracle inequalities for multi-fold cross validation. Statistics and Decisions, 24(3):351–371, 2006. [Google Scholar]
  84. van Gaans O. Probability measures on metric spaces. Technical report, Technical report, Delft University of Technology, 2003. [Google Scholar]
  85. Vilalta R and Drissi Y. A perspective view and survey of meta-learning. Artificial intelligence review, 18(2):77–95, 2002. [Google Scholar]
  86. Vinyals O, Blundell C, Lillicrap T, and Wierstra D. Matching networks for one shot learning. In Advances in neural information processing systems, pages 3630–3638, 2016. [Google Scholar]
  87. Vuorio R, Sun S-H, Hu H, and Lim JJ. Toward multimodal model-agnostic meta-learning. arXiv preprint arXiv:1812.07172, 2018. [Google Scholar]
  88. Wald A. Statistical decision functions which minimize the maximum risk. Annals of Mathematics, pages 265–280, 1945. [Google Scholar]
  89. Yin C, Tang J, Xu Z, and Wang Y. Adversarial meta-learning. arXiv preprint arXiv:1806.03316, 2018. [Google Scholar]
  90. Zaheer M, Kottur S, Ravanbakhsh S, Poczos B, Salakhutdinov RR, and Smola AJ. Deep sets. In Advances in neural information processing systems, pages 3391–3401, 2017. [Google Scholar]
