Abstract
We frame the meta-learning of prediction procedures as a search for an optimal strategy in a two-player game. In this game, Nature selects a prior over distributions that generate labeled data consisting of features and an associated outcome, and the Predictor observes data sampled from a distribution drawn from this prior. The Predictor’s objective is to learn a function that maps from a new feature to an estimate of the associated outcome. We establish that, under reasonable conditions, the Predictor has an optimal strategy that is equivariant to shifts and rescalings of the outcome and is invariant to permutations of the observations and to shifts, rescalings, and permutations of the features. We introduce a neural network architecture that satisfies these properties. The proposed strategy performs favorably compared to standard practice in both parametric and nonparametric experiments.
1. Introduction
1.1. Problem Formulation
Consider a dataset consisting of observations drawn independently from a distribution belonging to some known model , where each is a continuously distributed feature with support contained in and each is an outcome with support contained in . This dataset can be written as , where is the matrix for which row contains and is the -dimensional vector for which entry contains . The support of is contained in . The objective is to develop an estimator of the regression function that maps from to . An estimator belongs to the collection of operators that take as input a dataset and output a prediction function , where here and throughout we use to denote a possible realization of the random variable . Examples of estimators include generalized linear models (Nelder and Wedderburn, 1972), random forests (Breiman, 2001), and gradient boosting machines (Friedman, 2001). We will also refer to estimators as prediction procedures. We focus on the case in which the performance of an estimator is quantified via the standardized mean-squared error (MSE), namely
| (1) |
where the expectation above is over the draw of under sampling from denotes the marginal distribution of implied by , and denotes the variance of the error when . Note that may be heteroscedastic. Throughout we assume that, for all and is a continuous random variable. Note that the continuity of implies that is continuous and that .
In practice, the distribution is not known, and therefore the risk of a given estimator is also not known. We now describe three existing criteria for judging the performance of that do not rely on knowledge of . The first criterion is the maximal risk . If minimizes the maximal risk over , then is referred to as a minimax estimator (Wald, 1945). Minimax estimators optimize for the worst-case scenario wherein the distribution is chosen adversarially in such a way that the selected estimator performs as poorly as possible. The second criterion is Bayesian in nature, namely the average of the risk over draws of from a given prior on . Specifically, this Bayes risk is defined as (Robert, 2007). A -Bayes estimator optimally incorporates the prior beliefs encoded in with respect to the Bayes risk — more concretely, an estimator is referred to as a -Bayes estimator if it minimizes the Bayes risk over . Though the optimality property of Bayes estimators is useful in settings where only encodes substantive prior knowledge, its utility is less clear otherwise. Indeed, as the function generally depends on the choice of , it is possible that a -Bayes estimator is meaningfully suboptimal with respect to some other prior , that is, that . This phenomenon can be especially common when the sample size is small or the model is nonparametric. In fact, in the latter case, Bayes estimators against particular priors can easily be inconsistent even though consistent frequentist estimators are available (Ghosal and Van der Vaart, 2017) — for such priors, Bayes estimators perform poorly even when the sample size is large. Therefore, in settings where there is no substantive reason to favor a particular choice of , it is sensible to seek another approach for judging the performance of . A natural criterion is the worst-case Bayes risk of over some user-specified collection of priors, namely . This criterion is referred to as the -maximal Bayes risk of . 
The collection may be restricted to contain all priors that are compatible with available prior information, such as knowledge about the smoothness of a regression function, while being left large enough to acknowledge that prior knowledge may be too vague to encode within a single prior distribution (see Section 3.6 of Robert, 2007, for more possible forms of vague prior information). If is a minimizer of the -maximal Bayes risk, then is referred to as a -minimax estimator (Berger, 1985). Such estimators can be viewed as the optimal strategy in a sequential two-player game between a Predictor and Nature, where the Predictor selects an estimator and Nature then selects a prior in at which the Predictor’s chosen estimator performs as poorly as possible in terms of Bayes risk. Notably, in settings where contains all distributions with support in , the -maximal Bayes risk is equivalent to the maximal risk. Consequently, in this special case, an estimator is -minimax if and only if it is minimax. In settings where , an estimator is -minimax if and only if it is -Bayes. Therefore, by allowing for a choice of as large as the unrestricted set of all possible distributions or as small as a singleton set, -minimaxity provides a means of interpolating between the minimax and Bayesian criteria.
Though -minimax estimators represent an appealing compromise between the Bayesian and minimax paradigms, they have seen limited use in practice because they are rarely available in closed form. In this work, we aim to overcome this challenge in the context of prediction by providing an iterative strategy for learning -minimax prediction procedures. Due to the potentially high computational cost of this iterative scheme, a key focus of our work is identifying conditions under which a small subclass of still contains a -minimax estimator. This then makes it possible to optimize over this subclass, which we show in our experiments can dramatically improve the performance of our iterative scheme given a fixed computational budget.
Hereafter we refer to -minimax estimators as ‘optimal’, where it is to be understood that this notion of optimality relies on the choice of .
1.2. Overview of Our Strategy and Our Contributions
Our strategy builds on two key results, each of which will be established later in this work. First, under conditions on and , there exists a -minimax estimator in the subclass of estimators that are equivariant to shifts and rescalings of the outcome and are invariant to permutations of the observations and to shifts, rescalings, and permutations of the features. Second, under further conditions, there is an equilibrium point such that
| (2) |
Upper bounding the right-hand side by and applying the max-min inequality shows that is -minimax. To find an equilibrium numerically, we propose to use adversarial Monte Carlo meta-learning (AMC) (Luedtke et al., 2020) to iteratively update an estimator in and a prior in . AMC is a form of stochastic gradient descent-ascent (e.g., Lin et al., 2019) that can be used to learn optimal statistical procedures in general decision problems.
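Written out with placeholder symbols (we use r(pi, mu-hat) for the Bayes risk of an estimator mu-hat under a prior pi, Gamma for the collection of priors, and a generic class T of estimators; these stand in for the quantities appearing in (2)), the saddle-point argument is the following chain:

```latex
% Saddle-point argument in placeholder notation: if (\pi^*, \hat\mu^*) satisfies
%   r(\pi, \hat\mu^*) \le r(\pi^*, \hat\mu^*) \le r(\pi^*, \hat\mu)
% for all \pi \in \Gamma and \hat\mu \in \mathcal{T}, then
\sup_{\pi \in \Gamma} r(\pi, \hat\mu^*)
  \;=\; r(\pi^*, \hat\mu^*)
  \;=\; \inf_{\hat\mu \in \mathcal{T}} r(\pi^*, \hat\mu)
  \;\le\; \sup_{\pi \in \Gamma}\, \inf_{\hat\mu \in \mathcal{T}} r(\pi, \hat\mu)
  \;\le\; \inf_{\hat\mu \in \mathcal{T}}\, \sup_{\pi \in \Gamma} r(\pi, \hat\mu)
  \;\le\; \sup_{\pi \in \Gamma} r(\pi, \hat\mu^*).
% The chain closes on itself, so every term is equal; in particular
% \hat\mu^* attains \inf_{\hat\mu} \sup_{\pi} r(\pi, \hat\mu), i.e. it is \Gamma-minimax.
```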
We make the following contributions:
In Section 2, we characterize several equivariance properties of optimal estimators for a wide range of .
In Section 3, we present a general framework for adversarially learning optimal prediction procedures.
In Section 4, we present a novel neural network architecture for parameterizing estimators that satisfy the equivariance properties established in Section 2.
In Section 5, we apply our algorithm in two settings and learn estimators that outperform standard approaches in numerical experiments. In Section 6, we also evaluate the performance of these learned estimators in data experiments.
All proofs for the results in the above sections can be found in Section 7. Section 8 describes possible extensions and provides concluding remarks.
To maximize the accessibility of our main theoretical results, we do not use group theoretic notation when presenting them in Sections 2 through 4. However, when proving these results, we will heavily rely on tools from group theory; consequently, we adopt this notation in Section 7.
1.3. Related Works
The approach proposed in this work is a form of meta-learning (Schmidhuber, 1987; Thrun and Pratt, 1998; Vilalta and Drissi, 2002), where here each task is a regression problem. Most existing works in this area pursue a task-distribution strategy to meta-learning (Hospedales et al., 2020), where the objective is to minimize the average loss (risk) across draws of tasks from some specified distribution. As we will now show, the objective function employed in such strategies in fact corresponds to a Bayes risk. In regression problems, each task is a tuple containing a dataset and a task-dependent loss . For a given prior , a draw from the task distribution can be obtained by first sampling , next sampling a dataset of independent observations from , drawing an evaluation point , and finally defining the loss by or some related loss, such as a squared error loss that does not standardize by . The objective function is then equal to , where the expectation is over the draw of from the task distribution. This objective function is exactly equal to the Bayes risk function . Hence, existing meta-learning approaches for regression problems whose objective functions take this form can be viewed as optimizing a Bayes risk.
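The task-sampling scheme described above can be sketched in a few lines. The sketch below is illustrative only: the toy model (a one-feature linear model with unit-variance noise), the prior over its slope, and all function names are our own choices rather than objects from this paper.

```python
import random

def sample_task(prior_sample, n=50):
    """Draw one regression 'task' as described above: sample a distribution P
    from the prior, draw a dataset of n i.i.d. observations from P, and draw
    an evaluation point. Toy P: y = b*x + N(0, 1), with slope b ~ prior."""
    b = prior_sample()
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    ys = [b * x + random.gauss(0.0, 1.0) for x in xs]
    x_eval = random.gauss(0.0, 1.0)
    y_eval = b * x_eval + random.gauss(0.0, 1.0)
    return xs, ys, x_eval, y_eval

def monte_carlo_bayes_risk(estimator, prior_sample, n_tasks=2000):
    """Average squared-error loss across sampled tasks: a Monte Carlo estimate
    of the Bayes risk that task-distribution meta-learning minimizes."""
    total = 0.0
    for _ in range(n_tasks):
        xs, ys, x_eval, y_eval = sample_task(prior_sample)
        total += (estimator(xs, ys)(x_eval) - y_eval) ** 2
    return total / n_tasks

def ols_slope_estimator(xs, ys):
    """Least squares through the origin, returned as a prediction function."""
    b_hat = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return lambda x: b_hat * x

random.seed(0)
risk = monte_carlo_bayes_risk(ols_slope_estimator, lambda: random.uniform(-1, 1))
```

Here the noise variance is 1, so the Bayes risk of the toy estimator is only slightly above 1 (the excess reflects estimation error in the slope).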
We now review existing meta-learning strategies, starting with those that parameterize as a neural network class. Hochreiter et al. (2001) advocated parameterizing as a collection of long short-term memory (LSTM) networks (Hochreiter and Schmidhuber, 1997). More recent works have advocated using memory-augmented neural networks (Santoro et al., 2016) or conditional neural processes (CNPs) (Garnelo et al., 2018) rather than LSTMs in meta-learning tasks. There have also been other works on the meta-learning of supervised learning procedures that are parameterized as neural networks (Bosc, 2016; Vinyals et al., 2016; Ravi and Larochelle, 2017). Compared to these works, we adversarially learn a prior from a collection of priors, and we also formally characterize equivariance properties that will be satisfied by any optimal prediction procedure in a wide variety of problems. This characterization leads us to develop a neural network architecture designed for the prediction settings that we consider.
Model-agnostic meta-learning (MAML) is another popular meta-learning approach (Finn et al., 2017). In our setting, MAML aims to initialize the weights of a regression function estimate (parameterized as a neural network, for example) in such a way that, on any new task, only a limited number of gradient updates are needed. More recent approaches leverage the fact that, in certain settings, the initial estimate can instead be updated using a convex optimization algorithm (Bertinetto et al., 2018; Lee et al., 2019). To run any of these approaches, a prespecified prior over tasks is required. In our setting, these tasks take the form of data-generating distributions . In contrast, our approach adversarially selects a prior from .
Two recent works (Yin et al., 2018; Goldblum et al., 2019) developed meta-learning procedures that are trained under a different adversarial regime than that studied in the current work, namely under adversarial manipulation of one or both of the dataset and evaluation point (Dalvi et al., 2004). This adversarial framework appears to be most useful when there truly is a malicious agent that aims to contaminate the data, which is not the case that we consider. In contrast, in our setting, the adversarial nature of our framework allows us to ensure that our procedure will perform well regardless of the true value of , while also taking into account prior knowledge that we may have.
Our approach is also related to existing works in the statistics and econometrics literatures on the numerical learning of minimax and -minimax statistical decision rules. In finite-dimensional models, early works showed that it is possible to numerically learn minimax rules (Nelson, 1966; Kempthorne, 1987) and, in settings where consists of all priors that satisfy a finite number of generalized moment conditions, -minimax rules (Noubiap and Seidel, 2001). Other works have studied the -minimax case where consists of priors that only place mass on a pre-specified finite set of distributions in , both for general decision problems (Chamberlain, 2000) and for constructing confidence intervals (Schafer and Stark, 2009). Defining in this fashion modifies the statistical model to only consist of finitely many distributions, which can be restrictive. A recent work introduced a new approach, termed AMC, for learning minimax procedures for general models (Luedtke et al., 2020). In contrast to earlier works, AMC does not require the explicit computation of a Bayes estimator under any given prior, thereby improving the feasibility of this approach in moderate-to-high dimensional models. In their experiments, Luedtke et al. (2020) used neural network classes to define the sets of allowable statistical procedures. Unlike the current work, none of the aforementioned studies identified or leveraged the equivariance properties that characterize optimal procedures. As we will see in our experiments, leveraging these properties can dramatically improve performance.
1.4. Notation
We now introduce the notation and conventions that we use. For a function , we let denote the pushforward measure that is defined as the distribution of when . For any dataset and mapping with domain , we let . We take all vectors to be column vectors when they are involved in matrix operations. We write to mean the entrywise product and to mean . For an matrix , we let denote the row, denote the column, , and . When we standardize a vector as , we always use the convention that 0/0 = 0. We write to denote the column concatenation of two matrices. For an array , we let denote the matrix with entry equal to denote the -dimensional vector with entry equal to , etc. For and , we write to mean .
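As a small illustration of the standardization convention above (the helper name `standardize` is ours, chosen for illustration), the 0/0 = 0 rule simply maps a constant vector to the zero vector:

```python
import math

def standardize(v):
    """Center and scale a vector to mean 0 and standard deviation 1,
    using the convention 0/0 = 0 for a constant vector."""
    n = len(v)
    mean = sum(v) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in v) / n)
    if sd == 0.0:                  # constant vector: apply the 0/0 = 0 convention
        return [0.0] * n
    return [(x - mean) / sd for x in v]

z = standardize([2.0, 4.0, 6.0])       # mean 0, standard deviation 1
const = standardize([5.0, 5.0, 5.0])   # constant vector maps to all zeros
```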
2. Characterization of Optimal Procedures
2.1. Optimality of Equivariant Estimators
We start by presenting conditions that we impose on the collection of priors . Let denote the collection of all permutation matrices, and let denote the collection of all permutation matrices. We suppose that is preserved under the following transformations:
- P1. Permutations of features: and implies that , where is the distribution of when .
- P2. Shifts and rescalings of features: , and implies that , where is the distribution of when .
- P3. Shift and rescaling of outcome: and and implies that , where is the distribution of when .
The above conditions implicitly encode that , and all belong to whenever . Section 7.1 provides an alternative characterization of P1, P2, and P3 in terms of the preservation of under a certain group action.
Condition P1 ensures that permuting the features during preprocessing will not impact the collection of priors considered. This condition is reasonable in settings where there is only a limited prior understanding of each individual feature under consideration or, if such information is available, there is little anticipated benefit from including it in the analysis. Most commonly used supervised machine learning algorithms similarly do not incorporate specific prior information about individual features, and are instead designed to work across a variety of settings — this is the case, for example, for commonly used implementations of random forests, extreme gradient boosting, and penalized linear models (Pedregosa et al., 2011; Chen and Guestrin, 2016). It is worth noting, however, that P1 still allows information on the features to be incorporated should it be available — for example, prior beliefs on the multivariate feature distribution, such as the number of modes that it has, or the regression function, such as its level of sparsity, can be imposed in the collection of prior distributions. Conditions P2 and P3 are imposed to ensure that the -maximal risk criterion captures the possibility that the data may be preprocessed via affine transformations, such as prestandardization or a change of the unit of measure (Fahrenheit to Celsius, say), before being supplied to the prediction algorithm. By having be large enough to ensure that P2 and P3 are satisfied, the -minimax risk reflects performance in an adversarial setting wherein affine transformations are applied to the features and outcome in such a way as to make the (Bayes) risk as large as possible for a given prediction algorithm. Because it minimizes this adversarial criterion, a -minimax estimator should be robust to such adversarial transformations, thereby ensuring satisfactory performance regardless of the chosen unit of measure or prestandardization scheme.
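As a small concrete illustration of why P2 and P3 are natural requirements (ordinary least squares is used here purely as an example of an affine-equivariant procedure; the code and names are ours): its predictions are unchanged by affine re-expressions of the features, such as a Fahrenheit-to-Celsius conversion, and transform equivariantly under affine re-expressions of the outcome.

```python
import random

def ols_fit_predict(xs, ys, x_new):
    """One-feature least squares with intercept; returns the prediction at x_new."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a + b * x_new

random.seed(1)
xs = [random.gauss(0.0, 1.0) for _ in range(30)]
ys = [2 * x + random.gauss(0.0, 0.5) for x in xs]
x_new = 0.7

pred = ols_fit_predict(xs, ys, x_new)

# Re-express the feature (e.g., Fahrenheit -> Celsius: x_c = (x_f - 32) * 5/9):
# the prediction at the correspondingly transformed evaluation point is unchanged.
xs_c = [(x - 32) * 5 / 9 for x in xs]
pred_feature_shift = ols_fit_predict(xs_c, ys, (x_new - 32) * 5 / 9)

# Shift and rescale the outcome: the prediction transforms equivariantly.
ys_t = [10 + 3 * y for y in ys]
pred_outcome_shift = ols_fit_predict(xs, ys_t, x_new)
```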
We also assume that the signal-to-noise ratio (SNR) is finite — this condition is important in light of the fact that the MSE risk that we consider standardizes by .
- P4. Finite SNR: .
We now present conditions that we impose on the class of estimators . In what follows we let . For , we let
where is the vector where log is applied entrywise and where we abuse notation and let represent the matrix for which row is equal to , and similarly for . We let . When it will not cause confusion, we will write . Fix . Let denote the unique function that satisfies
| (3) |
The uniqueness arises because on . Because we have assumed that and are continuous random variables under sampling from any , it follows that, for all , the class uniquely characterizes the functions in up to their behavior on subsets of of -probability zero. In what follows, we will impose smoothness constraints on , which in turn imposes constraints on . The first three conditions suffice to show that is compact in the space of continuous functions equipped with the compact-open topology.
- T1. is pointwise bounded: For all .
- T2. is locally Hölder: For all compact sets , there exists an such that
where denotes the Euclidean norm. We take the supremum to be zero if is a singleton or is empty.
- T3. is sequentially closed in the topology of compact convergence: If is a sequence in and compactly in the sense that, for all compact , then .
The following conditions ensure that is invariant to certain preprocessings of the data, in the sense that, for any function , the function that first preprocesses the data in an appropriate fashion and then applies to this data is itself in . When formulating these conditions, we write to mean an element of . Because is a bijection between and , it is possible to recover from . Below we use this fact to abuse notation and define functions with domain like for functions with domain , without explicitly introducing notation for the inverse of .
- T4. Permutations: For all , and is in .
- T5. Shifts and rescalings: For all , and , the function is in , where is the matrix with row equal to .
In Appendix B, we provide two examples of classes that satisfy Conditions T1-T5. One of these classes is finite-dimensional and the other is infinite-dimensional. The infinite-dimensional class takes a particularly simple form. In particular, for some and some function that is invariant to permutations, shifts, and rescalings, we consider the class to be the collection of all such that and for all .
Let denote the class of estimators that are equivariant to shifts and rescalings of the outcome and are invariant to permutations of the observations and to shifts, rescalings, and permutations of the features. Specifically, consists of functions in satisfying the following properties for all pairs of datasets and features in , permutation matrices and , shifts and , and rescalings and :
| (4) |
| (5) |
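To make properties (4) and (5) concrete, the sketch below constructs a toy estimator (a 1-nearest-neighbor rule applied after columnwise standardization, chosen by us purely for illustration) and checks numerically that its predictions are invariant to permutations of the observations and to shifts, rescalings, and permutations of the features, and equivariant to shifts and rescalings of the outcome:

```python
import numpy as np

def equivariant_1nn(X, y):
    """Toy estimator built to satisfy the invariance/equivariance properties:
    1-nearest-neighbor prediction after standardizing each feature column and
    the outcome, mapped back to the original outcome scale."""
    def col_std(M):
        mu, sd = M.mean(axis=0), M.std(axis=0)
        sd = np.where(sd == 0, 1.0, sd)      # constant column: 0/0 = 0 convention
        return (M - mu) / sd, mu, sd
    Z, x_mu, x_sd = col_std(X)
    y_mu, y_sd = y.mean(), y.std()
    z_y = (y - y_mu) / (y_sd if y_sd > 0 else 1.0)
    def predict(x_new):
        z_new = (x_new - x_mu) / x_sd
        i = np.argmin(((Z - z_new) ** 2).sum(axis=1))   # nearest standardized neighbor
        return y_mu + y_sd * z_y[i]
    return predict

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
x_new = rng.normal(size=3)
base = equivariant_1nn(X, y)(x_new)

# Invariance (4): permute observations, and permute/shift/rescale the features.
perm_obs, perm_feat = rng.permutation(20), rng.permutation(3)
a, b = rng.normal(size=3), rng.uniform(0.5, 2.0, size=3)
X2 = (a + b * X)[perm_obs][:, perm_feat]
same = equivariant_1nn(X2, y[perm_obs])((a + b * x_new)[perm_feat])

# Equivariance (5): shift and rescale the outcome.
c, d = 3.0, 2.5
shifted = equivariant_1nn(X, c + d * y)(x_new)
```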
The following result shows that the -maximal risk is the same over and .
Theorem 1. Under P1-P4 and T1-T5,
The above does not rule out the possibility that there exists a non-equivariant -minimax estimator, that is, a -minimax estimator that belongs to . Rather, when paired with additional conditions that ensure that the infimum over above is achieved (see Theorem 3), the above implies that contains at least one -minimax estimator.
Theorem 1 is a variant of the Hunt-Stein theorem (Hunt and Stein, 1946). Our proof, which draws inspiration from Le Cam (2012), consists in showing that our prediction problem is invariant to the action of an amenable group and subsequently applying Day’s fixed-point theorem (Day, 1961) to show that, for all , the collection of for which has nonempty intersection with .
This theorem has a natural analogy to the translation equivariance that is enjoyed by convolutional neural networks in object detection problems, where the goal is to classify and draw a bounding box around objects in an image (Russakovsky et al., 2015). To simplify the discussion, here we focus on the special case where there is only one object class of interest (e.g., humans), so that the goal is simply to draw a bounding box around each object that is contained in the image. In object detection settings, a key insight is that an object's class does not change even if its position is shifted. Given this insight, it seems reasonable to expect that any sufficiently rich collection of candidate detectors will be such that, given any object detector , the collection will contain a translation equivariant detector with equal or superior performance to that of . For this to be true, certain requirements are also generally needed of the loss function used to measure performance. In particular, the error accrued by incorrectly bounding or failing to bound an object should not depend on the position of that object in the image — this condition is satisfied by many loss functions that are commonly used in this setting.
In our setting, conditions P1-P3, which say that a prior still belongs to even after certain transformations are applied to the distributions drawn from that prior, are the analogues of the translation invariance property of an object's class (“a human remains a human if they are shifted to the left, and the pushforward of a prior in remains in even if features and outcomes are permuted, shifted, or rescaled”); conditions T4 and T5 are the analogues of the requirement that the collection of detectors be sufficiently rich; and the fact that the standardized squared error does not depend on the particular ordering of the features or the centering or scaling of the features or outcomes is analogous to the translation invariance of the loss functions used in object detection.
2.2. Focusing Only on Distributions with Standardized Predictors and Outcome
Theorem 1 suggests restricting attention to estimators in when trying to learn a -minimax estimator. We now show that, once this restriction has been made, it also suffices to restrict attention to a smaller collection of priors when identifying a least favorable prior. In fact, we show something slightly stronger, namely that the restriction to can be made even if optimal estimators are sought over the richer class of estimators that satisfy the equivariance property (5) but do not necessarily satisfy (4).
We now define . Let denote the distribution of
when . Note that here, and here only, we have written to denote the feature rather than the observation. Also let , which is a collection of priors on .
Theorem 2. If and hold and all satisfy (5), then is -minimax if and only if it is -minimax.
We conclude by noting that, under P2 and P3, consists precisely of those that satisfy:
| (6) |
2.3. Existence of an Equilibrium Point
We also make the following additional assumption on .
- T6. is convex: and implies that is in .
The two examples in Appendix B also satisfy T6.
We also impose the following condition on the size of the collection of distributions and the collection of priors , which in turn imposes restrictions on and .
- P5. There exists a metric on such that (i) is a complete separable metric space, (ii) is tight in the sense that, for all , there exists a compact set in such that for all , and (iii) for all is upper semi-continuous and bounded from above on .
In Appendix C, we give examples of parametric and nonparametric settings where P5 is applicable.
So far, the only conditions that we have required on the -algebra of are that and , are measurable. In this subsection, and in this subsection only, we add the assumptions that P5 holds and that is such that equals , where is the collection of Borel sets on .
We will also assume the following two conditions on .
- P6. is closed in the topology of weak convergence: if is a sequence in that converges weakly to , then .
- P7. is convex: for all and , the mixture distribution is in .
Under Conditions P5 and P6, Prokhorov’s theorem (Billingsley, 1999) can be used to establish that is compact in the topology of weak convergence. This compactness will be useful for proving the following result, which shows that there is an equilibrium point under our conditions.
Theorem 3. If T1-T3, T6, and P2-P7 hold, then there exists and such that, for all and , it is true that .
Combining the above with Lemma 10 in Section 7.2.3 establishes (2), that is, that the conclusion of Theorem 3 remains valid if varies over rather than over .
3. AMC Meta-Learning Algorithm
We now present an AMC meta-learning strategy for obtaining a -minimax estimator within some class . Here we suppose that , where each is an estimator indexed by a finite-dimensional parameter that belongs to some set . We note that this framework encapsulates: model-based approaches (e.g., Hochreiter et al., 2001), where can be evaluated by a single pass of through a neural network with weights ; optimization-based approaches, where are the initial weights of some estimate that are subsequently optimized based on (e.g., Finn et al., 2017); and metric-based approaches, where indexes a measure of similarity that is used to obtain an estimate of the form (e.g., Vinyals et al., 2016).
We suppose that all estimators in satisfy the equivariance property (5), which can be arranged by prestandardizing the outcome and features and then poststandardizing the final prediction — see Algorithm 2 for an example. Since all satisfy (5), Theorem 2 shows that it suffices to consider a collection of priors with support on , that is, so that, for all satisfies (6) almost surely. To ensure that the priors are easy to sample from, we parameterize them via generator functions (Goodfellow et al., 2014) that are indexed by a finite-dimensional that belongs to some set . Each takes as input a source of noise drawn from a user-specified distribution and outputs the parameters indexing a distribution in (Luedtke et al., 2020). Though this form of sampling limits to parametric families , the number of parameters indexing this family may be much larger than the sample size , which can, for all practical purposes, lead to a nonparametric estimation problem. For each , we let denote the distribution of when . We then let . It is worth noting that classes that are defined in this way will not generally satisfy the conditions P5-P7 used in Theorem 3. To iteratively improve the performance of the prior, we require the ability to differentiate realized datasets through the parameters indexing the prior. To do this, we assume that, for each , the user has access to a generator function such that has the same distribution as when noise is drawn from a user-specified distribution . We suppose that, for all realizations of the noise in the support of and in the support of , the function is differentiable at each parameter value indexing the prior.
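A minimal sketch of the generator-based prior parameterization described above (the toy model, the affine generator, and all names here are illustrative assumptions, not the parameterization used in our experiments): one generator maps noise to the parameter of a distribution in the model, and a second maps pre-drawn noise to a dataset in a way that is differentiable in that parameter, i.e., the reparameterization trick.

```python
import random

def prior_generator(gamma, z):
    """Map noise z through a generator indexed by gamma to the parameter of a
    distribution in the model. Toy model: y = beta * x + N(0, 1), with the
    generator outputting beta = gamma[0] + gamma[1] * z (differentiable in gamma)."""
    return gamma[0] + gamma[1] * z

def dataset_generator(beta, noise, n):
    """Differentiable (in beta) map from pre-drawn noise to a dataset sampled
    from the distribution indexed by beta: features are the first n noise
    draws, and outcomes add the remaining draws as errors."""
    xs = noise[:n]
    eps = noise[n:]
    ys = [beta * x + e for x, e in zip(xs, eps)]
    return xs, ys

random.seed(2)
gamma = (0.0, 1.0)
z = random.gauss(0.0, 1.0)
beta = prior_generator(gamma, z)                 # one draw of a distribution from the prior
noise = [random.gauss(0.0, 1.0) for _ in range(20)]
xs, ys = dataset_generator(beta, noise, 10)      # one dataset from that distribution
```

Because the noise is drawn before the dataset is formed, gradients of a downstream loss with respect to the prior parameters can flow through `beta` and into the realized dataset, which is the differentiability requirement stated above.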

The AMC learning strategy is presented in Algorithm 1. The algorithm takes stochastic gradient steps on the parameters indexing an estimator and prior generator to iteratively reduce and increase the Bayes risk, respectively. All gradients in the algorithm can be computed via backpropagation using standard software — in our experiments, we used PyTorch for this purpose (Paszke et al., 2019). Note that, when computing Loss, the dependence of Loss on is tracked through the dependence of on on line 5, the dependence of and on on lines 6 and 7, and the dependence of Loss on , and on line 8. We caution that, when the outcome or some of the features are discrete, Loss will not generally represent an unbiased estimate of the gradient of , which can cause Algorithm 1 to perform poorly. To handle these cases, the algorithm can be modified to instead obtain an unbiased gradient estimate using the likelihood ratio method (Glynn, 1987).
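Algorithm 1 operates on neural-network classes; as a deliberately miniature stand-in, the sketch below runs gradient descent-ascent on a toy problem of our own construction (estimating a normal mean with the estimator theta times the sample mean, under a point-mass prior at gamma restricted to [-c, c]), where the Bayes risk is available in closed form, so exact gradients replace the Monte Carlo gradient estimates of Algorithm 1:

```python
def bayes_risk_grads(theta, gamma, n):
    """Gradients of the closed-form Bayes risk of the estimator theta * sample_mean
    under a point-mass prior at mu = gamma, for the toy model y_i ~ N(mu, 1):
    r(theta, gamma) = theta**2 / n + (theta - 1)**2 * gamma**2."""
    d_theta = 2 * theta / n + 2 * (theta - 1) * gamma ** 2
    d_gamma = 2 * (theta - 1) ** 2 * gamma
    return d_theta, d_gamma

def amc(n=10, c=1.0, steps=5000, lr_est=0.05, lr_prior=0.05):
    """Gradient descent on the estimator parameter and ascent on the prior
    parameter (projected onto [-c, c]), mirroring the descent-ascent updates
    of Algorithm 1."""
    theta, gamma = 0.5, 0.1
    for _ in range(steps):
        d_theta, d_gamma = bayes_risk_grads(theta, gamma, n)
        theta -= lr_est * d_theta          # Predictor: reduce the Bayes risk
        gamma += lr_prior * d_gamma        # Nature: increase the Bayes risk
        gamma = max(-c, min(c, gamma))     # keep the prior inside the allowed set
    return theta, gamma

theta, gamma = amc()
# For this toy problem the equilibrium is known in closed form:
# gamma is driven to +/- c, and theta converges to c**2 / (c**2 + 1/n).
```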
Though studying the convergence properties of the minimax optimization in Algorithm 1 is not the main focus of this work, we now provide an overview of how results from Lin et al. (2019) can be used to provide some guarantees for this algorithm. When doing so, we focus on the special case where there exists some such that, for all is differentiable with -Lipschitz gradient and, for some finite (but potentially large) collection is the collection of all mixtures of distributions in . We also suppose that the parameter indexing the generator takes values on the simplex and that this generator is parameterized in such a way that has the same distribution as the mixture of distributions in that places mass on distribution . In this case, provided the learning rates and are chosen appropriately, Theorem 4.5 in Lin et al. (2019) gives guarantees on the number of iterations required to return an -stationary point (idem, Definition 3.7) — this stationary point is such that there exists a near at which the function has at least one small subgradient (idem, Lemma 3.8). If, also, is convex for all , then this also implies that is nearly -minimax. If, alternatively, the prior update step in Algorithm 1 (line 13) is replaced by an oracle optimizer such that, at each iteration, is defined as a true maximizer of the Bayes risk , then Theorem E.4 of Lin et al. (2019) similarly guarantees that an -stationary point will be reached within a specified number of iterations.
Alternatives to Algorithm 1 are possible. As one example, the stochastic gradient descent-ascent optimization scheme could be replaced by an extragradient method (Korpelevich, 1976), which has been shown to perform well in generative adversarial network settings (Gidel et al., 2018). As another example, the prior distribution could, in principle, be specified via its density rather than as the pushforward distribution defined by the generator. While this density-based parameterization may make it easier to relate the specified priors to commonly used probability distributions, it may also lead to challenges since sampling from a distribution specified by its density is generally a hard problem that necessitates the use of numerical approaches such as Markov chain Monte Carlo methods (Hastings, 1970; Geman and Geman, 1984). Because the prior is updated at each of the iterations, it seems that many instances of these numerical sampling schemes would need to be run before the termination of the AMC algorithm. Identifying a means to expedite the convergence of this density-based approach is an interesting area for future work.
4. Proposed Class of Estimators
4.1. Equivariant Estimator Architecture
Algorithm 2 presents our proposed estimator architecture, which relies on four modules. Each module can be represented as a function belonging to a collection of functions mapping from to , where the values of and can be deduced from Algorithm 2. For given data , a prediction at a feature can be obtained by sequentially calling the modules and, between calls, either mean pooling across one of the dimensions of the output or concatenating the evaluation point as a new column in the output matrix.
We let represent the collection of all prediction procedures described by Algorithm 2, where here varies over . We now give conditions under which the proposed architecture yields an equivariant estimator.
- M1) for all , and .
- M2) for all , and .
- M3) for all , and .
Theorem 4. If M1-M3 hold, then all satisfy (4) and (5).
4.2. Neural Network Parameterization
In our experiments, we choose the four module classes , indexing our estimator architecture to be collections of neural networks. For each , we let contain the neural networks consisting of hidden layers of widths , where the types of layers used depend on the module . When , multi-input-output channel equivariant layers as defined in Hartford et al. (2018) are used. In particular, for , we let denote the collection of all such layers that map from to , where we let and . For each , each member of is equivariant in the sense that, for all , and for all . When , multi-input-output channel equivariant layers as described in Eq. 22 of Zaheer et al. (2017) are used, except that we replace the sum-pool term in that equation with a mean-pool term (see the next subsection for the rationale). In particular, for , we let denote the collection of all such equivariant layers that map from to . For each , each member of is equivariant in the sense that, for all and , . When , standard linear layers mapping from to are used for each , where and . For each , we let denote the collection of all such layers. For a user-specified activation function , we then define the module classes as follows for :
Notably, satisfies M1 (Ravanbakhsh et al., 2017; Hartford et al., 2018), and and satisfy M2 and M3, respectively (Ravanbakhsh et al., 2016; Zaheer et al., 2017). Each element of is a multilayer perceptron.
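As a concrete illustration of the mean-pool substitution, the following sketch implements a single permutation-equivariant layer in the style of Eq. 22 of Zaheer et al. (2017), with the sum-pool term replaced by a mean-pool term; the weight shapes and activation are illustrative choices, not the paper's exact parameterization.

```python
import numpy as np

def meanpool_equivariant_layer(X, Lam, Gam, b, act=np.tanh):
    """Permutation-equivariant layer (Zaheer et al., 2017, Eq. 22) with the
    sum-pool term replaced by mean-pool. X: (n, k_in); Lam, Gam: (k_in, k_out);
    b: (k_out,). Satisfies f(X[perm]) == f(X)[perm] for any permutation."""
    pooled = X.mean(axis=0) @ Gam        # pooled term, shared across rows
    return act(X @ Lam + pooled + b)     # broadcast across the n rows
```

Because the pooled term is identical for every row, permuting the rows of the input simply permutes the rows of the output, which is exactly the equivariance property stated above.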
The proposed architecture bears some resemblance to CNPs (Garnelo et al., 2018). Like our proposed architecture, CNPs are invariant to permutations of the observations. Nevertheless, CNPs fail to satisfy the other properties imposed on , namely invariance to shifts, rescalings, and permutations of the features and equivariance to shifts and rescalings of the outcome. Moreover, a decision-theoretic rationale for making CNPs invariant to permutations of the observations has not yet been provided in the literature, for example, via a Hunt-Stein-type theorem.
4.3. Pros and Cons of Proposed Architecture
A benefit of using the proposed architecture in Algorithm 2 is that Modules 1 and 2 can be evaluated without knowing the feature at which a prediction is desired. As a consequence, these modules can be precomputed before making predictions at new feature values, which can lead to substantial computational savings when the number of values at which predictions will be made is large. Another advantage of the proposed architecture is that it can be evaluated on a dataset that has a different sample size than did the datasets used during meta-training. In the notation of Eq. 4 from Hartford et al., this corresponds to noting that the weights from a multi-input-output channel layer can be used to define an layer for which the output is given by the same symbolic expression as that displayed in Eq. 4 from that work, but now with ranging over . We will show in our upcoming experiments that procedures trained using 500 observations can perform well even when evaluated on datasets containing only 100 observations. It is similarly possible to evaluate the proposed architecture on datasets containing a different number of features than did the datasets used during meta-training (again see Eq. 4 in Hartford et al. (2018), and also Eq. 22 in Zaheer et al. (2017), but with the sum-pool term replaced by a mean-pool term). The rationale for replacing the sum-pool term with a mean-pool term is that doing so ensures that the scale of the hidden layers remains fairly stable when the number of testing features differs somewhat from the number of training features.
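The scale-stability rationale can be checked numerically. In the toy computation below (with arbitrary random weights; this makes no claim about the trained networks), mean-pooling keeps the magnitude of a hidden layer roughly constant as the pooled dimension grows from 100 to 500, whereas sum-pooling inflates it:

```python
import numpy as np

rng = np.random.default_rng(1)
Lam = rng.normal(size=(32, 32))
Gam = rng.normal(size=(32, 32))

def hidden_scale(n, pool):
    """Average magnitude of one equivariant hidden layer with n pooled rows."""
    X = rng.normal(size=(n, 32))
    pooled = pool(X, axis=0) @ Gam       # mean- or sum-pool across rows
    return float(np.abs(X @ Lam + pooled).mean())

s_mean_100, s_mean_500 = hidden_scale(100, np.mean), hidden_scale(500, np.mean)
s_sum_100, s_sum_500 = hidden_scale(100, np.sum), hidden_scale(500, np.sum)
# With mean-pooling the two scales are comparable; with sum-pooling the
# hidden-layer magnitude grows with the pooled dimension.
```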
A disadvantage of the proposed architecture is that it currently has no established universality guarantees. Such guarantees have been long available for standard multilayer perceptrons (e.g., Cybenko, 1989; Hornik, 1991), and have recently also become available for certain invariant architectures (Maron et al., 2019). In future work, it would be interesting to see if the arguments in Maron et al. (2019) can be modified to provide universality guarantees for our architecture. Establishing such results may also help us to overcome a second disadvantage of our architecture, namely that the resulting neural network classes will not generally satisfy the convexity condition T6 used in Theorem 3. If a network class that we have proposed can be shown to satisfy a universality result for some appropriate convex class , and if is itself a subset of , then perhaps it will be possible to invoke Theorem 3 to establish an equilibrium result over the class of estimators , and then to use this result to establish an (approximate) equilibrium result for . To ensure that conditions T1-T3 are satisfied, such an argument will likely require that the weights of the networks in be restricted to belong to some compact set.
5. Numerical Experiments
5.1. Overview
In this section, we present the results from two sets of numerical experiments, with the first corresponding to benchmarks from the meta-learning literature and the second consisting of settings designed to evaluate the performance of our method relative to that of analytically-derived estimators that are commonly used in practice for which theoretical performance guarantees are available. In each example, the collection of estimators is parameterized as the network architecture introduced in Section 4.2 with , and, for . For each module, we use the leaky ReLU activation . At the end of this section, we report the results of an ablation study that evaluates the extent to which imposing invariance to permutations of the observations and features improves performance.
All experiments were run in PyTorch 1.0.1 on Tesla V100 GPUs using Amazon Web Services. The code used to conduct the experiments can be found at https://github.com/alexluedtke12/amc-meta-learning-of-optimal-prediction-procedures. Further experimental details can be found in Appendix D.
5.2. Meta-Learning Benchmarks
5.2.1. Preliminaries
We now evaluate the performance of AMC on widely used meta-learning benchmarks. As described in the Introduction, existing meta-learning algorithms tend to be Bayesian in nature, where the goal during meta-training is to learn an estimator with small Bayes risk under a specified prior . Consequently, when adjudicating performance in this study, we will primarily focus on the evaluation of each learned estimator in terms of its Bayes MSE against this fixed prior , defined as .
Because our method is designed to learn adversarially over a collection of priors that satisfies the invariance properties P1, P2, and P3, we define the collection used when training our method as the smallest collection of priors that satisfies these three properties and contains . It can be verified that is a singleton in this case, so that the generator is a constant function and is never updated in these benchmark settings. Though this simplified meta-training may make it appear that AMC will not be robust to an adversarial choice of prior, it is worth noting that the learned estimator in fact is robust to such a choice in the sense that the Bayes risk of the learned estimator will be invariant under permutations of the features and also under shifts and rescalings of the outcomes and features. The main motivation for using a small when comparing to these benchmarks is that doing so will help shed light on the performance of the estimator architecture that we proposed in Section 4 even in Bayesian settings for which existing meta-learning approaches are tailor-made.
We compare the performance of AMC to that of two popular meta-learning methods for which code is readily available: MAML (Finn et al., 2017) and CNPs (Garnelo et al., 2018). Because these algorithms do not prestandardize the features and outcomes, they may have large standardized Bayes MSEs (the Bayes risk derived from Eq. 1) if these quantities are simply shifted or rescaled. To ensure that possible discrepancies in performance between AMC and MAML or CNPs are not solely due to prestandardization, we also compare our method to natural variants of MAML and CNPs that, like AMC, are robust to such shifts and rescalings. For each method, these variants prestandardize the features and outcomes, and then, in an analogous fashion to line 9 of Algorithm 2, scale the final output by the sample standard deviation of the original training outcomes and shift by their sample mean. These algorithms, which we refer to as MAML-Eq and CNP-Eq, are invariant to shifts and rescalings of the features and equivariant to shifts and rescalings of the outcomes. Details on the MAML and CNP implementations used can be found in Appendix D.1.
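The prestandardization wrapper is simple to state in code. The sketch below is a generic version (the actual MAML-Eq and CNP-Eq implementations are described in Appendix D.1): it standardizes the features and outcomes, calls an arbitrary base predictor, and then maps the prediction back to the original outcome scale.

```python
import numpy as np

def eq_wrap(base_predict, X, y, x_new):
    """Generic shift/rescale-robust wrapper in the spirit of MAML-Eq and
    CNP-Eq: prestandardize, predict, then scale by the sample standard
    deviation of the original outcomes and shift by their sample mean
    (cf. line 9 of Algorithm 2). `base_predict(X, y, x)` is any predictor."""
    mu_x, sd_x = X.mean(axis=0), X.std(axis=0)
    mu_y, sd_y = y.mean(), y.std()
    z = base_predict((X - mu_x) / sd_x, (y - mu_y) / sd_y, (x_new - mu_x) / sd_x)
    return mu_y + sd_y * z
```

By construction, the wrapped prediction is invariant to shifts and rescalings of the features and equivariant to shifts and rescalings of the outcome, regardless of the base predictor.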
5.2.2. Sinusoidal Regression
We start with a benchmark few-shot regression setting that is commonly used in the meta-learning literature. The prior is defined as follows. The feature is 1-dimensional and is Unif(−5, 5) distributed, and the regression function takes the form , where the parameters and are drawn independently from a Unif(0.1, 5.0) and Unif distribution, respectively (Finn et al., 2017). Following related meta-learning benchmarks (Finn et al., 2018; Vuorio et al., 2018), the error added to the signal is distributed as . We use the same sample sizes as were used in Finn et al. (2017), namely , and 20.
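A generator for this benchmark can be sketched as follows. The phase range and the noise standard deviation are stated here as assumptions; the paper follows Finn et al. (2017, 2018) and Vuorio et al. (2018) for these choices.

```python
import numpy as np

def sample_sinusoid_task(n, noise_sd=0.3, rng=None):
    """Draw one few-shot sinusoid regression task in the style of Finn et al.
    (2017): amplitude ~ Unif(0.1, 5.0), phase ~ Unif(0, pi) (assumed range),
    x ~ Unif(-5, 5). `noise_sd` is a placeholder for the error scale."""
    rng = rng if rng is not None else np.random.default_rng()
    A = rng.uniform(0.1, 5.0)
    phase = rng.uniform(0.0, np.pi)
    x = rng.uniform(-5.0, 5.0, size=n)
    f = A * np.sin(x + phase)                 # true regression function
    y = f + noise_sd * rng.normal(size=n)     # noisy outcomes
    return x, y, f
```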
We now report on the performance of the various meta-learning approaches in this setting. In Table 1a, we can see that MAML and CNPs consistently outperform their equivariant counterparts, namely MAML-Eq and CNP-Eq, in this setting. Nevertheless, as we noted earlier, MAML and CNPs are non-robust in that their standardized MSE can be made large by simply shifting or rescaling the outcomes or features. In Figure S5 in the appendix we provide evidence that this is indeed the case. As a particularly striking example, when , scaling the feature down by a factor of 5 leads to 24-fold and 149-fold increases in the MSEs of MAML and CNPs, respectively. The degradation of performance worsens with sample size. Indeed, when , the same rescaling leads to 144-fold and 487-fold increases in the MSEs of these two methods. Consequently, even seemingly innocuous preprocessings of the data, such as applying an affine transformation to change the unit of measurement, can have a dramatic impact on the performance of MAML and CNPs. In contrast, the standardized MSE performance of MAML-Eq and CNP-Eq is invariant to such preprocessings of the data.
Table 1:
Bayes MSEs of meta-learning approaches in the meta-learning benchmark experiments, where the Bayes MSE is defined as the squared difference between the predictions and true underlying regression function, averaged across draws of the data-generating distribution from the prior and the feature from the feature distribution. Standard errors all < 0.005 in the sinusoid experiment and < 0.001 in the Gaussian process experiments.
| (a) Sinusoid | | | | (b) Gaussian process | | | | |
|---|---|---|---|---|---|---|---|---|
| | | | | | 1d feature | | 5d feature | |
| | n=5 | 10 | 20 | | n=5 | 50 | n=5 | 50 |
| MAML* | 0.22 | 0.10 | 0.03 | MAML* | 0.85 | 0.13 | 1.00 | 1.00 |
| CNP* | 0.05 | 0.02 | 0.01 | CNP* | 0.47 | 0.04 | 0.95 | 0.73 |
| MAML-Eq | 2.06 | 0.47 | 0.07 | MAML-Eq | 0.93 | 0.13 | 1.22 | 1.02 |
| CNP-Eq | 1.13 | 0.13 | 0.04 | CNP-Eq | 0.56 | 0.04 | 1.12 | 0.73 |
| AMC (ours) | 0.89 | 0.09 | 0.03 | AMC (ours) | 0.56 | 0.03 | 1.11 | 0.66 |
As these two algorithms do not prestandardize the features or outcomes, their standardized MSEs can be made large by simply shifting or rescaling the features and outcomes. See Figure S5 for more information.
Table 1a also displays results for AMC. AMC consistently outperforms the robust versions of existing algorithms, namely MAML-Eq and CNP-Eq. When compared with the non-robust variants, AMC is outperformed by MAML when , outperforms MAML when , and has about the same performance as MAML when . CNPs perform better than MAML and AMC, though this difference begins to diminish as the sample size increases.
5.2.3. Gaussian Process Regression
We next consider a benchmark Gaussian process regression setting. We consider two cases for the prior. The first is the same as that considered in Garnelo et al. (2018), except that they considered the noise-free case where almost surely, whereas we consider the noisy case where the errors are homoscedastic and distributed as . Considering a noisy case where is non-degenerate is necessary for the standardized MSE that we consider to be well-defined, and also better reflects real-world regression scenarios where observed outcomes are rarely, if ever, deterministic functions of the features considered. Following Garnelo et al. (2018), the feature is 1-dimensional and follows a Unif (−2, 2) distribution, and the regression function is drawn from a mean-zero Gaussian process with a squared exponential kernel with lengthscale 0.4 and variance 1. We also use the same sample sizes as were used in that work, namely and 50. The second case that we consider is the same as the first except that the feature is 5-dimensional, where the entries of are independent Unif(−2, 2) random variables, and the lengthscale is taken to be equal to 1.2.
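For concreteness, one draw from the one-dimensional version of this prior can be generated as below; the error standard deviation is an assumed placeholder, since the exact noise level is specified in the text and appendix.

```python
import numpy as np

def sample_gp_task(n, lengthscale=0.4, noise_sd=0.2, rng=None):
    """Draw a noisy GP regression task as in Section 5.2.3: x ~ Unif(-2, 2),
    f ~ GP(0, k) with a squared-exponential kernel of variance 1, plus
    homoscedastic Gaussian errors. `noise_sd` is an assumed value."""
    rng = rng if rng is not None else np.random.default_rng()
    x = rng.uniform(-2.0, 2.0, size=n)
    d2 = (x[:, None] - x[None, :]) ** 2
    K = np.exp(-0.5 * d2 / lengthscale ** 2)      # squared-exponential kernel
    L = np.linalg.cholesky(K + 1e-6 * np.eye(n))  # jitter for numerical stability
    f = L @ rng.normal(size=n)                    # one draw of the regression fn
    y = f + noise_sd * rng.normal(size=n)
    return x, y, f
```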
Table 1b displays the performance of the various methods in this setting. Adversarial Monte Carlo noticeably outperforms MAML and MAML-Eq across all settings except the 5-dimensional, case, where MAML performs slightly better than does AMC. The ordering between AMC and the CNP-based methods varies by sample size. At the smaller sample size considered , AMC outperforms the robust CNP-based method, namely CNP-Eq, but is outperformed by the non-robust method, namely CNP. In the larger sample size considered , AMC outperforms both CNP and CNP-Eq. The fact that AMC outperforms CNP in this setting is notable given that CNPs are designed to mimic the desirable properties of Gaussian process regression procedures (Garnelo et al., 2018).
5.3. Comparing to (Regularized) Empirical Risk Minimizers
5.3.1. Preliminaries
We now compare the performance of our approach to that of existing estimators that are commonly used in practice for which theoretical performance guarantees are available. The examples differ in the definitions of the model and the collection of priors on . In each case, satisfies the invariance properties P1, P2, and P3. By the equivariance of the estimators in , Theorem 2 shows that it suffices to consider a collection of priors with support on . Hence, it suffices to define the collection of distributions satisfying (6). By P2 and P3, we see that , where consists of the distributions of when ; here, , and vary over , and , respectively. In each setting, the submodel takes the form
and the dimensional feature is known to be drawn from a distribution in the set of distributions, where varies over all positive-definite covariance matrices with diagonal equal to . The collections of regression functions differ in the examples and are detailed in the coming subsections. These collections are indexed by a sparsity parameter that specifies the number of features that may contribute to the regression function . In each setting, we considered all four combinations of and , where denotes the number of observations in the datasets used to evaluate the performance of the final learned estimators. For each , we evaluated the performance of AMC meta-trained with datasets of size observations (AMC100) and observations (AMC500).
5.3.2. Sparse Linear Regression
We next considered the setting where belongs to a sparse linear model and the feature is dimensional. In this setting,
(7)
where and . The collection is described in Appendix D.
For each sparsity level , we evaluated the performance of the prediction procedure trained at sparsity level using two priors. Both priors sample the covariance matrix of the feature distribution from the Wishart prior described in Appendix D.2.1 and let for a random satisfying . They differ in how is drawn. Both make use of a uniform draw from ball . The first sets , whereas the second sets for drawn independently of . We will refer to the two settings as ‘boundary’ and ‘interior’, respectively. We refer to the and cases as the ‘sparse’ and ‘dense’ settings, respectively. Further details can be found in Appendix D.2.2.
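One plausible implementation of the two coefficient priors (the precise construction is given in Appendix D.2.2) draws a uniform direction and then either fixes the norm at the ball's radius ('boundary') or scales it so that the draw is uniform over the ball ('interior'):

```python
import numpy as np

def draw_beta(p, radius=1.0, boundary=True, rng=None):
    """Illustrative draw of the regression coefficient for the 'boundary' and
    'interior' priors of Section 5.3.2 (the exact construction is in the
    appendix). A uniform direction is drawn; the norm is either fixed at
    `radius` (boundary) or scaled by U**(1/p), U ~ Unif(0,1), which makes the
    draw uniform over the radius-`radius` ball (interior)."""
    rng = rng if rng is not None else np.random.default_rng()
    u = rng.normal(size=p)
    u /= np.linalg.norm(u)                 # uniform direction on the sphere
    if boundary:
        return radius * u
    return radius * rng.uniform() ** (1.0 / p) * u
```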
In this example, AMC leverages knowledge of the underlying sparse linear regression model by generating synthetic training data from distributions for which belongs to the class defined in Eq. 7 (see line 5 of Algorithm 1). Therefore, we aimed to compare AMC’s performance to that of estimators that also take advantage of this linearity. Ideally, we would compare AMC’s performance to that of the true -minimax estimator. Unfortunately, as is the case in most problems, the form of this estimator is not known in this sparse linear regression setting. Therefore, we instead compared AMC’s performance to ordinary least squares (OLS) and lasso (Tibshirani, 1996) with tuning parameter selected by 10-fold cross-validation, as implemented in scikit-learn (Pedregosa et al., 2011).
Table 2a displays performance for the sparse setting. We see that AMC outperformed OLS and lasso for the boundary priors, and was outperformed for the interior priors. Surprisingly, AMC500 outperformed AMC100 for the interior prior when observations were used to evaluate performance. The fact that AMC100 was trained specifically for the case suggests that a suboptimal equilibrium may have been reached in this setting. Table 2b displays performance for the dense setting. Here AMC always performed at least as well as OLS and lasso when , and performed comparably even when .
Table 2:
MSEs based on datasets of size in the linear regression settings. Standard errors all < 0.001.
| (a) Sparse signal | ||||
|---|---|---|---|---|
| | Boundary | | Interior | |
| | n=100 | 500 | 100 | 500 |
| OLS | 0.12 | 0.02 | 0.12 | 0.02 |
| Lasso | 0.06 | 0.01 | 0.06 | 0.01 |
| AMC100 (ours) | 0.02 | <0.01 | 0.11 | 0.09 |
| AMC500 (ours) | 0.02 | <0.01 | 0.07 | 0.04 |
| (b) Dense signal | ||||
| | Boundary | | Interior | |
| | n=100 | 500 | 100 | 500 |
| OLS | 0.13 | 0.02 | 0.13 | 0.02 |
| Lasso | 0.11 | 0.02 | 0.09 | 0.02 |
| AMC100 (ours) | 0.10 | 0.04 | 0.08 | 0.02 |
| AMC500 (ours) | 0.09 | 0.02 | 0.09 | 0.02 |
5.3.3. Fused Lasso Additive Model
We next considered the setting where belongs to a variant of the fused lasso additive model (FLAM) (Petersen et al., 2016) and the feature is dimensional. This model enforces that belong to a generalized additive model, that only a certain number of the components can be different from the zero function, and that the sum of the total variations of the remaining components is not too large. We recall that the total variation of is equal to the supremum of over all such that and (Cohn, 2013). Let . Writing to denote feature , the model we considered imposes that falls in
We take in the experiments in this section. The collection is described in Appendix D.
In this example, we preprocessed the features before supplying them to the estimator. In particular, we replaced each entry with its rank statistic among the observations so that, for each and , we replaced by and by . This preprocessing step is natural given that the FLAM estimator (Petersen et al., 2016) also only depends on the features through their ranks. An advantage of making this restriction is that, by the homoscedasticity of the errors and the invariance of the rank statistics and total variation to strictly increasing transformations, the learned estimators should perform well even if the feature distributions do not belong to a Gaussian model, but instead belong to a much richer Gaussian copula model.
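The rank preprocessing can be sketched in a few lines; the exact normalization of the ranks (here, ranks divided by n) is an assumption.

```python
import numpy as np

def rank_transform(X):
    """Replace each feature entry by its within-column rank divided by n, as
    in the FLAM preprocessing of Section 5.3.3 (normalization assumed).
    Assumes continuously distributed features, so ties occur with
    probability zero."""
    n = X.shape[0]
    # argsort of argsort gives 0-based ranks within each column
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1
    return ranks / n
```

Because ranks are invariant to strictly increasing transformations, `rank_transform(np.exp(X))` equals `rank_transform(X)`, which underlies the Gaussian-copula robustness noted above.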
We evaluated the performance of the learned estimators using variants of simulation scenarios 1–4 from Petersen et al. (2016). The level of smoothness varies across the settings (see Fig. 2 in that work). In the variants we considered, the true regression function either contains (‘sparse’) or (‘dense’) nonzero components. In the sparse setting, we evaluated the performance of the estimators that were meta-trained at sparsity level , and, in the dense setting, we evaluated the performance of the estimators that were meta-trained at . Further details can be found in Appendix D.2.3.
Figure 2:

Improvement of AMC estimators over existing estimators, in terms of differences of cross-validated MSEs of FLAM and AMC FLAM (x-axis) and Lasso and AMC Linear (y-axis). Positive values indicate that AMC outperformed the comparator. AMC performed similarly to or better than existing estimators in settings where the number of features in the dataset was the same as were used in meta-training. As expected, the performance was somewhat worse for datasets that had fewer features than were used during meta-training, though, surprisingly, it was still sometimes better than that of existing methods.
As in the previous example, AMC leverages knowledge of the possible forms of the regression function that is imposed by — in this case, the model for the regression function is nonparametric but does impose that this function belongs to a particular sparse generalized additive model. Though there does not exist a competing estimator that is designed to optimize over , the FLAM estimator (Petersen et al., 2016) optimizes over the somewhat larger, non-sparse model where . We therefore compared the performance of AMC to this estimator as a benchmark, with the understanding that AMC is slightly advantaged in that it has knowledge of the underlying sparsity pattern. Nevertheless, we view this experiment as an important proof of concept, as it is the first, to our knowledge, to evaluate whether it is feasible to adversarially meta-learn a prediction procedure within a nonparametric regression model.
To illustrate the kinds of functions that AMC can approximate, Fig. 1 displays examples of AMC500 fits from scenario 3 when . Table 3 provides a more comprehensive view of the performance of AMC and compares it to that of FLAM. Table 3a displays performance for the sparse setting. The AMC procedures meta-trained with observations outperformed FLAM for all of these settings. Interestingly, AMC procedures meta-trained with also outperformed FLAM in a majority of these settings, suggesting that learned procedures can perform well even at different sample sizes from those at which they were meta-trained. In the dense setting (Table 3b), AMC500 outperformed both AMC100 and FLAM in all but one setting (scenario 4, ), and in this setting both AMC100 and AMC500 dramatically outperformed FLAM. The fact that AMC500 also sometimes outperformed AMC100 when in the linear regression setting suggests that there may be some benefit to training a procedure at a larger sample size than that at which it will be evaluated. We leave an investigation of the generality of this phenomenon to future work.
Figure 1:

Examples of AMC500 fits (thin blue lines) based on observations drawn from distributions at sparsity level with four possible signal components (thick black lines). Predictions obtained at different signal feature values with all 9 other features set to zero.
Table 3:
MSEs based on datasets of size in the FLAM settings. Standard errors for FLAM all < 0.04 and for AMC all < 0.01.
| (a) Sparse signal | ||||||||
|---|---|---|---|---|---|---|---|---|
| | Scenario 1 | | Scenario 2 | | Scenario 3 | | Scenario 4 | |
| | n=100 | 500 | 100 | 500 | 100 | 500 | 100 | 500 |
| FLAM | 0.44 | 0.12 | 0.47 | 0.17 | 0.38 | 0.11 | 0.51 | 0.19 |
| AMC100 (ours) | 0.34 | 0.20 | 0.18 | 0.08 | 0.27 | 0.14 | 0.17 | 0.08 |
| AMC500 (ours) | 0.48 | 0.12 | 0.19 | 0.06 | 0.35 | 0.10 | 0.23 | 0.08 |
| (b) Dense signal | ||||||||
| | Scenario 1 | | Scenario 2 | | Scenario 3 | | Scenario 4 | |
| | n=100 | 500 | 100 | 500 | 100 | 500 | 100 | 500 |
| FLAM | 0.59 | 0.17 | 0.65 | 0.24 | 0.53 | 0.16 | 0.76 | 0.36 |
| AMC100 (ours) | 1.20 | 0.91 | 0.47 | 0.39 | 0.87 | 0.57 | 0.30 | 0.30 |
| AMC500 (ours) | 0.58 | 0.15 | 0.37 | 0.08 | 0.46 | 0.12 | 0.36 | 0.09 |
5.4. Ablation Study to Evaluate the Performance of Permutation Invariance
We numerically evaluated the utility of imposing invariance in the architecture in Algorithm 2. To do this, we repeated the and FLAM settings, separately modifying the architecture to remove invariance to permutations of the observations and the features. In the case where the architecture was not invariant to permutations of the observations, we weakened M1 to the condition that for all , and . We used the same architecture as was used in our earlier experiment, except that each layer in Module 1 was replaced by a multi-input-output channel layer that is equivariant to permutations of the features (Zaheer et al., 2017), and the output of the final layer was of dimension so that the subsequent mean pooling layer could be removed. In the case where the architecture was not invariant to permutations of the features, we removed conditions M2 and M3 and also weakened M1 to the condition that for all , , and . We used the same architecture as in our earlier experiment except that Modules 2 and 3 were replaced by multilayer perceptrons and each layer in Module 1 was replaced by a multi-input-output channel layer that is equivariant to permutations of the observations.
Table 4 displays the results. In every setting considered, removing invariance to permutations of the observations led to a marked increase in the MSE of the estimator, with the degradation of performance tending to be worse at the larger sample size. In the most extreme scenario, the MSE of the non-invariant estimator was 38 times higher than that of the invariant estimator. Removing invariance to permutations of the features also tended to worsen performance, sometimes by a factor of 2 or 3, though there were a few settings where performance improved slightly (no more than 5%). Taken together, these results suggest that a priori enforcing that the estimator is invariant to permutations of the features and observations can dramatically improve performance.
Table 4:
Fold-change in MSEs for modifications of AMC in the FLAM settings with , as compared to the performances of FLAM listed in Table 3. Standard errors all ≤ 0.03 times the fold-change in the MSE.
| (a) Sparse signal | |||||||||
|---|---|---|---|---|---|---|---|---|---|
| | | Scenario 1 | | Scenario 2 | | Scenario 3 | | Scenario 4 | |
| | | n=100 | 500 | 100 | 500 | 100 | 500 | 100 | 500 |
| Not invariant to | observations | 6.98 | 38.29 | 5.82 | 29.93 | 5.03 | 27.58 | 4.29 | 13.08 |
| permutations of: | features | 1.01 | 0.95 | 1.16 | 1.09 | 1.02 | 0.98 | 1.01 | 0.99 |
| (b) Dense signal | |||||||||
| | | Scenario 1 | | Scenario 2 | | Scenario 3 | | Scenario 4 | |
| | | n=100 | 500 | 100 | 500 | 100 | 500 | 100 | 500 |
| Not invariant to | observations | 1.86 | 14.68 | 1.69 | 8.60 | 1.97 | 14.20 | 1.51 | 4.70 |
| permutations of: | features | 1.05 | 2.55 | 0.99 | 1.98 | 1.09 | 3.02 | 1.04 | 1.67 |
6. Data Experiments
We also used real datasets to evaluate the performance of AMC100 estimators meta-trained in sparse linear regression settings (Section 5.3.2) or fused lasso additive model settings (Section 5.3.3). We compared the performance of our estimators to the estimators from our numerical experiments, namely, the OLS, lasso, and FLAM estimators. These estimators are natural comparators because they assume the same or similar models as do our AMC estimators; consequently, comparing to these estimators allows us to focus our discussion on differences in the performance of existing estimation strategies as compared to that of new meta-learned strategies, rather than on differences in underlying assumptions that could potentially be resolved by training a new AMC estimator in a different model.
Because the implementations of lasso and FLAM that we compared to both use 10-fold cross-validation to select tuning parameters, we also used 10-fold cross-validation to select tuning parameters for the AMC100 estimators. The first of these estimators, which we refer to as “AMC Linear”, selects a tuning parameter by finding the value of for which the cross-validated MSE of an AMC100 estimator trained in the sparse linear regression setting with sparsity level is minimal. The final prediction then corresponds to that returned by the AMC100 estimator trained in the model with this selected value of . The second, which we refer to as “AMC FLAM”, selects two tuning parameters, one of which reflects the sparsity level of the problem and the other of which corresponds to the bound on the sum of the variation norms of the components in the fused lasso additive model. In particular, the tuning parameters are chosen to be those that minimize the cross-validated MSE of an AMC100 estimator trained in the fused lasso additive model with parameters . Notably, each candidate estimator considered by AMC Linear and AMC FLAM only has access to 90, rather than 100, observations when selecting tuning parameter values using 10-fold cross-validation on a dataset of size . This does not pose a problem because, as was noted in Section 4.3, the trained estimators can be evaluated at different sample sizes than those at which they were trained.
In settings where both AMC-trained estimators and other estimators are available, it is natural to wonder whether there is a way to capitalize on the availability of both types of methods. Ensemble algorithms provide a natural means to do this, with stacked ensembles representing an especially appealing option given theoretical guarantees that adding base learners will not typically degrade performance (Van der Vaart et al., 2006; Van der Laan et al., 2007) and existing experiments showing that they often outperform all included base learners (e.g., Polley and Van der Laan, 2010). We therefore evaluate the performance of three stacked ensembles in these experiments. The first includes only the AMC Linear and AMC FLAM estimators as base learners. The second only includes the OLS, lasso, and FLAM estimators. The third includes all five of these estimators. Predictions of the base learners were combined using 10-fold cross-validation. Following the recommendation of Breiman (1996), we employed a non-negative least squares estimator for this combination step.
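The combination step can be sketched with `scipy.optimize.nnls`; the interface below is illustrative, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import nnls

def stack_nnls(cv_preds, y):
    """Fit non-negative least squares weights (Breiman, 1996) that combine
    out-of-fold base-learner predictions. cv_preds: (n, m) matrix whose
    columns hold the cross-validated predictions of the m base learners;
    y: length-n outcome vector."""
    weights, _ = nnls(cv_preds, y)
    return weights

def ensemble_predict(weights, preds):
    """Combine base-learner predictions on new data with the fitted weights."""
    return preds @ weights
```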
Our experiments make use of ten datasets. Six of these datasets are available through the University of California, Irvine (UCI) Machine Learning Repository (Dua and Graff, 2017), three were used to illustrate supervised learning machines in popular statistical learning textbooks (Friedman et al., 2001; James et al., 2013), and one was used as an illustrative example in the paper that introduced FLAM (Petersen et al., 2016). All of these datasets contain more than 100 observations. Five of them have at least 10 features and the others have fewer (5, 6, 6, 7, and 9). All outcomes are standardized to have empirical variance 1 so that, for each dataset, the cross-validated MSE performance of a sample mean for predicting the outcome is approximately 1. Further details on these datasets can be found in Appendix E.1.
We evaluated our learned estimators in three settings. First, we considered the case where the number of features in the datasets matched the number that they saw during training, namely 10. In particular, we evaluated the performance of AMC Linear and AMC FLAM in the 5 datasets that have 10 or more features by randomly selecting 100 observations and 10 features from each dataset and evaluating MSE on the held out observations. This and all other Monte Carlo evaluations of MSE described in what follows were repeated 200 times and averaged across the replications. Second, we evaluated the robustness of our learned estimators to a key assumption used during training. In particular, we evaluated the performance of our estimators on the 5 datasets that have fewer features than the 10 used during meta-training, again sampling 100 observations and evaluating MSE on the held out observations. Third, we evaluated the relative performance of our estimators at varying levels of signal sparsity for each of the ten datasets. In particular, for each training-test split of the data, we selected total features from the dataset, removed the remaining features, and then included Gaussian noise features so that the dimension of the feature was always .
We first discuss performance on datasets with the same number of features as were used during meta-training. Complete numerical results for estimator performance can be found in Table S5 in Appendix E.2. Here, we focus on graphical summaries of performance to communicate the key trends that we saw. Figure 2a shows that AMC FLAM performed similarly to or better than FLAM across all settings, and AMC Linear performed similarly to lasso across all settings. We have compared AMC Linear to lasso as a baseline in this figure because lasso performed similarly to or better than OLS across all settings. Figure 3a shows that stacking all available base learners consistently yielded better performance than did stacking only the existing estimators or only the AMC estimators. This stacked ensemble also outperformed all base learners considered. These results suggest that incorporating AMC estimators into regression pipelines can reliably lead to improved predictions even in settings where performant learners are already available.
Figure 3: Improvement of the stacked ensemble algorithm that includes all base learners over those which only include a subset (existing learners or AMC learners), in terms of differences of cross-validated MSEs. Including both AMC and existing estimators as base learners always outperformed only including a subset when the dataset contained the same number of features as were used during training. Adding AMC base learners did not tend to improve performance when the dataset had fewer features than were used during meta-training, though any degradation in performance was minimal.
We now discuss performance on datasets with fewer features than were used during meta-training, which is displayed in Figure 2b. Unsurprisingly, performance was somewhat less desirable than it was for datasets with the same number of features as were used during meta-training. AMC FLAM tended to be somewhat outperformed by FLAM, though it did outperform FLAM in one setting. AMC Linear continued to perform similarly to lasso across all settings. Figure 3 shows that stacking all available base learners outperformed stacking only AMC estimators, and performed similarly to stacking existing estimators.
We conclude by discussing the performance of the estimators when we induce varying levels of signal sparsity. Figure 4 shows that AMC FLAM outperformed FLAM for the vast majority of datasets and sparsity patterns. The only exception to this trend occurred for the yacht dataset and the LAozone dataset for denser signals (7, 8, or 9 signal features), where AMC FLAM was slightly outperformed by FLAM. Figure S6 in the appendix shows that AMC Linear consistently outperformed OLS and performed comparably to or slightly better than lasso in most settings.
Figure 4: Performance of FLAM and AMC FLAM at different sparsity levels. For each training-validation split of the data, between 1 and features are selected at random from the original dataset (x-axis), where is the minimum of 10 and the total number of features in the dataset, and Gaussian noise features are then added so that there are 10 total features. Therefore, the signal is expected to become denser and stronger as the x-axis value increases. AMC FLAM outperforms FLAM in most settings.
Figure S7 shows that there was not a major difference between the cross-validated MSE of the three stacking algorithms. Nevertheless, it is worth noting that stacking all available base learners did outperform the other two stacking schemes in 53% of the 83 dataset-sparsity settings considered, with the stacking scheme that only included AMC algorithms performing best in 39% of the settings and the scheme that only included existing algorithms performing best in only 8% of these settings. Thus, we again see evidence that including AMC base learners in a stacked ensemble can improve performance, even when other learners are already available.
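For intuition, a stacked ensemble of the kind compared here can be fit by regressing the outcome on held-out base-learner predictions with nonnegative weights. The following is a simplified sketch (clipped least squares rather than a full constrained solver), not the authors' stacking procedure:

```python
import numpy as np

def stack_weights(preds, y):
    """Simplified stacking: least-squares weights on held-out
    predictions, clipped to be nonnegative and renormalized to sum to 1."""
    w, *_ = np.linalg.lstsq(preds, y, rcond=None)
    w = np.clip(w, 0.0, None)
    return w / w.sum()

rng = np.random.default_rng(0)
y = rng.normal(size=200)
good = y + 0.1 * rng.normal(size=200)   # accurate base learner
bad = rng.normal(size=200)              # uninformative base learner
w = stack_weights(np.column_stack([good, bad]), y)
print(w)  # nearly all weight lands on the accurate learner
```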
7. Proofs
7.1. A Study of Group Actions that are Useful for Our Setting
To prove Theorem 1, it will be convenient to use tools from group theory to describe and study the behavior of our estimation problem under the shifts, rescalings, and permutations that we consider. For , let be the symmetric group on symbols. Let be the semidirect product of the real numbers with the positive real numbers with the group multiplication
Define . Let . Throughout we equip with the product topology.
We note that the quantity defined in Section 2.1 can be written as
| (8) |
Denote the generic group element where , and . Denote the generic element by
For , two arbitrary elements in , define the group multiplication as
Define the group action by
where and .
We make implicit use of the following result in the remainder of this section.
Lemma 1. The map defined above is a left group action.
Proof. The identity axiom, namely that when is the identity element of , is straightforward to verify and so we omit the arguments. Fix and . We establish compatibility by showing that . To see that this is indeed the case, note that, for all and :
□
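The compatibility computation above can be illustrated numerically for the shift-and-rescale component of the group. This sketch is our own; we assume the standard affine group law (a, b) · (c, d) = (a + bc, bd) on the semidirect product of the reals with the positive reals, which is consistent with the action x ↦ a + bx used throughout:

```python
import numpy as np

def affine_mul(g, h):
    """Composition in the shift-scale group R x| R_+:
    (a, b) * (c, d) = (a + b * c, b * d), so acting by h and then
    by g is the same as acting by g * h."""
    (a, b), (c, d) = g, h
    return (a + b * c, b * d)

def act(g, x):
    """Left action of (a, b) on a point: x -> a + b * x."""
    a, b = g
    return a + b * x

g, h, x = (1.0, 2.0), (-3.0, 0.5), 4.0
# compatibility axiom: g . (h . x) == (g * h) . x
left = act(g, act(h, x))
right = act(affine_mul(g, h), x)
print(left, right)  # both equal -1.0
```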
We now introduce several group actions that we will make heavy use of in our proof of Theorem 1 and in the lemmas that precede it. We first define . For and , define to be . Conditions T4 and T5 can be restated as for all and . It can then readily be shown that, under these conditions, the defined map is a left group action. For , we will write to denote the operator defined so that
It is possible that does not belong to due to its behavior when , and therefore that the defined map is not a group action. Nonetheless, because has -probability one for any , this fact will not pose any difficulties in our arguments.
We now define the group action . For , define as
Similar arguments to those used to prove Lemma 1 show that the map defined above is a left group action. We now define the group action . For , define by , where
Under P1, P2, and P3, which, as noted in Section 2.1, implicitly encode that , it can readily be shown that the defined map is a left group action. Finally, we define the group action . For , define by where
We can restate P1, P2, and P3 as for all . Under these conditions, it can be shown that the defined map is a left group action.
We now show that is amenable — see Appendix A for a review of this concept. Establishing this fact will allow us to apply Day’s fixed point theorem (Theorem S1 in Appendix A) in the upcoming proof of Theorem 1.
Lemma 2. is amenable.
Proof. Because and are finite groups, they are compact, and therefore amenable (Lemma S17). Because and are Abelian, they are also amenable (Lemma S18). By Lemma S19, group extensions of amenable groups are amenable. □
7.2. Proofs of Theorems 1 through 4
This section is organized as follows. Section 7.2.1 introduces three general lemmas that will be useful in proving the results from the main text. Section 7.2.2 proves several lemmas, proves the variant of the Hunt-Stein theorem from the main text (Theorem 1), and concludes with a discussion of the relation of this result to those in Le Cam (2012). Section 7.2.3 establishes a preliminary lemma and then proves that, when the class of estimators is equivariant, it suffices to restrict attention to priors in when aiming to learn a -minimax estimator (Theorem 2). Section 7.2.4 establishes several lemmas, including a minimax theorem for our setting, before proving the existence of an equilibrium point (Theorem 3). Section 7.2.5 establishes the equivariance of our proposed neural network architecture (Theorem 4).
In this section, we always equip with the topology of compact convergence and, whenever T2 holds so that , we equip with the subspace topology. For a fixed compact and a function , we also let .
7.2.1. Preliminary lemmas
We now prove three lemmas that will be used in our proofs of Theorems 1 and 3.
Lemma 3. with the compact-open topology is metrizable.
Proof. See Example IV.2.2 in Conway (2010). □
As a consequence of the above, we can show that a subset of is closed by showing that it is sequentially closed, and we can show that a function on is continuous by showing that it is sequentially continuous.
Lemma 4. If T1, T2, and T3 hold, then is a compact subset of .
Proof. By T1, is pointwise bounded. Moreover, the local Hölder condition T2 implies that is equicontinuous, in the sense that, for every and every there exists an open neighborhood of such that, for all and all , it holds that . Hence, by the Arzelà-Ascoli theorem (see Theorem 47.1 in Munkres, 2000 for a convenient version), is a relatively compact subset of . By T3, is closed, and therefore is compact. □
We now show that the group action is continuous under conditions that we assume in Theorem 1. Establishing this continuity condition is necessary for our use of Day’s fixed point theorem in the upcoming proof of that result.
Lemma 5. If T2, T4, and T5 hold, then the group action is continuous.
Proof. By T4 and T5, is indeed a group action. Also, by T2 and Lemma 3, is metrizable. Recall the expression for given in (8) and that
The product topology is compatible with semidirect products, and so the fact that each multiplicand is a metric space implies that is a metric space. Hence, it suffices to show sequential continuity. Let be a sequence in such that , where . By the definition of the product metric, and . Let , and be compact spaces. Since each compact space is contained in such a , it suffices to show that
for arbitrary compact sets . To show this, we will use the decomposition , where , and . We similarly use the decomposition . For all large enough , both of the following statements hold: is contained in a compact neighbourhood of , and is contained in a compact neighbourhood of .
Since permutations are continuous, , and , are compact. In the following we use the decomposition for an arbitrary element . Since addition and multiplication are continuous, , and are compact. Define to be the compact set
Then,
7.2.2. Proof of Theorem 1
We begin this subsection with four lemmas and then we prove Theorem 1. Following this proof, we briefly describe how the argument relates to that given in Le Cam (2012). In the proof of Theorem 1, we will use notation that we established about the group in Section 7.1. We refer the reader to that section for details.
Lemma 6. For any , and
Proof. Fix and , and let , where is defined in (3). By the change-of-variables formula,
Plugging the fact that and that
into the right-hand side of the preceding display yields that
By the shift and scale properties of the standard deviation and variance, the above continues as
□
Lemma 7. For any , and , it holds that .
Proof. This result follows quickly from Lemma 6. Indeed, for any , and ,
Let consist of the -invariant elements of , that is, those for which for all . The following fact will be useful when proving Theorem 1, and also when proving results in the upcoming Section 7.2.3.
Lemma 8. It holds that .
Proof. Fix and . By the definition of , there exists a such that . For this , the fact that implies that
As was arbitrary, . Hence, .
Now fix and . Note that . Using that implies that , we see that
As was arbitrary, , and so . □
We define as follows:
| (9) |
Because occurs with -probability one (for any ), it holds that for any .
Lemma 9. Fix . If T1, T2, and P4 hold, then is lower semicontinuous.
Proof. Fix . For any compact , we define by
where here and throughout in this proof we let and . Recalling that there exists an increasing sequence of compact subsets such that , we see that by the monotone convergence theorem. Moreover, as suprema of collections of continuous functions are lower semicontinuous, we see that is lower semicontinuous if is continuous for every . In the remainder of this proof, we will show that this is indeed the case.
By Lemma 3, it suffices to show that is sequentially continuous. Fix . By Jensen’s inequality,
| (10) |
In what follows, we will bound the right-hand side above by some finite constant times . We start by noting that, for any such that ,
where is finite by T1 and T2. Integrating both sides shows that
| (11) |
We now bound the three expectations on the right-hand side by finite constants that do not depend on or . All three bounds make use of the bound on the first expectation, namely , where . We note that (P4) can be used to show that . Indeed,
and so, by the law of total variance and (P4), . By Cauchy-Schwarz, the second expectation on the right-hand side of (11) bounds as
and the third expectation bounds as
Plugging these bounds into (11), we see that
Plugging this into (10), we have shown that
We now conclude the proof by showing that the above implies that is sequentially continuous at every , and therefore is sequentially continuous on . Fix and a sequence such that compactly. This implies that , and so the above display implies that , as desired. □
We now prove Theorem 1.
Proof of Theorem 1. Fix and let . Let be the set of all elements that satisfy
For fixed , the set of that satisfy is closed due to the lower semicontinuity of the risk function (Lemma 9) and contains . The intersection of such sets is closed and contains so that is a nonempty closed subset of the compact Hausdorff set , implying that is compact. By the convexity of , the risk function is convex. Hence, is convex. If , then Lemma 7 shows that, for any ,
Thus, and is an affine group action on a nonempty, convex, compact subset of a locally convex topological vector space. Combining this with the fact that is amenable (Lemma 2) shows that we may apply Day’s fixed point theorem (Theorem S1) to see that there exists an such that, for all and
The conclusion is at hand. By Lemma 8, there exists a such that . Furthermore, as noted below (9), and for all . Recalling that , the above shows that . As was arbitrary and , we have shown that . □
The proof of Theorem 1 is inspired by that of the Hunt-Stein theorem given in Le Cam (2012). Establishing this result in our context required making meaningful modifications to these earlier arguments. Indeed, Le Cam (2012) uses transitions, linear maps between L-spaces, to characterize the space of decision procedures. This more complicated machinery makes it possible to broaden the set of procedures under consideration. Indeed, with this characterization, it is possible to describe decision procedures that cannot even be represented as randomized decision procedures via a Markov kernel, but instead come about as limits of such decision procedures. Despite the richness of the space of decision procedures considered, Le Cam is still able to show that this space is compact by using a coarse topology, namely the topology of pointwise convergence. Unfortunately, this topology appears to generally be too coarse for our Bayes risk function to be lower semi-continuous, which is a fact that we used at the beginning of our proof of Theorem 1. Another disadvantage to this formulation is that it makes it difficult to enforce any natural conditions or structure, such as continuity, on the set of estimators. It is unclear whether it would be possible to implement a numerical strategy optimizing over a class of estimators that lacks such structure. In contrast, we showed that, under appropriate conditions, it is indeed possible to prove a variant of the Hunt-Stein theorem in our setting even once natural structure is imposed on the class of estimators. To show the compactness of the space of estimators that we consider, we applied the Arzelà-Ascoli theorem.
7.2.3. Proof of Theorem 2
We provide one additional lemma before proving Theorem 2. The lemma relates to the class of estimators in that satisfy the equivariance property (5) but do not necessarily satisfy (4). Note that
Lemma 10. If and hold, then, for all ,
and so .
Proof of Lemma 10. Let be the identity element in . For each , define to be
It holds that
□
We conclude by proving Theorem 2.
Proof of Theorem 2. Under the conditions of the theorem, . Recalling that , Lemma 10 yields that, for any . Hence, an estimator is -minimax if and only if it is -minimax. □
7.2.4. Proof of Theorem 3
In this subsection, we assume (without statement) that all are defined on the measurable space , where is such that equals , where is the collection of Borel sets on the metric space described in P5. Under P2 and P3, which we also assume without statement throughout this subsection, it then follows that each is defined on the measurable space , where is the collection of Borel sets on . Let denote the collection of all distributions on . For each , define the -enlargement of by such that . Further let denote the Lévy-Prokhorov metric on , namely
Lemma 11. If and , then is a compact metric space.
Proof of Lemma 11. By Prokhorov’s theorem (see Theorem 5.2 in van Gaans, 2003 for a convenient version, or see Theorems 1.5.1 and 1.6.8 in Billingsley, 1999), P5 implies that is relatively compact in . The fact that is closed (P6) implies the result. □
We now define , which is the analogue of from Section 7.2.2:
| (12) |
Note that, because each distribution in is continuous, each distribution in is also continuous. Hence, occurs with -probability one for all , and so the definition of combined with Lemma 8 shows that for any and .
Lemma 12. If , then, for each , is upper semicontinuous on .
Proof of Lemma 12. Fix , and note that, by Lemma 8, there exists a such that . Let be such that in for some . Because metrizes weak convergence (Theorem 1.6.8 in Billingsley, 1999), the Portmanteau theorem shows that for every that is upper semicontinuous and bounded from above on . By part (iii) of P5, we can apply this result at to see that . As was arbitrary, is upper semicontinuous on . Because and , we have thus shown that is upper semicontinuous on . □
Lemma 13. Under the conditions of Lemma 4, is a compact subset of .
Proof. By Lemma 4 is relatively compact. Hence, it suffices to show that is closed. By Lemma 3, a subset of is closed in the topology of compact convergence if it is sequentially closed. Let be a sequence on such that compactly. Because and is closed by T3, we see that . We now wish to show that . Fix and . Because the doubleton set is compact, and , and thus . Moreover, because for all . Hence, . As these two limits must be equal, we see that . Because and were arbitrary, . □
Lemma 14. Fix . If T1, T2, and P4 hold, then is lower semicontinuous.
Proof. The proof is similar to that of Lemma 9 and is therefore omitted. □
Lemma 15. If , then is convex.
Proof. Fix and . For any and ,
where the latter equality holds since . Hence, for all . By T6, . Hence, for all . □
Lemma 16 (Minimax theorem). Under the conditions of Theorem 3,
| (13) |
Proof of Lemma 16. We will show that the conditions of Theorem 1 in Fan (1953) are satisfied. By Lemma 3, is metrizable by some metric . By Lemma 13, is a compact metric space. Moreover, by Lemma 11, is a compact metric space. As all metric spaces are Hausdorff, and are Hausdorff. By Lemma 12, for each , is upper semicontinuous on . By Lemma 14, for each is lower semicontinuous on . It remains to show that is concavelike on (called “concave on” by Fan) and that is convexlike on (called “convex on” by Fan). To see that is concavelike on , note that is convex (P7), and also that, for all is linear, and therefore concave, on . Hence, is concavelike on (page 409 of Terkelsen, 1973). To see that is convexlike on , note that is convex (Lemma 15), and also that, for all is convex on . Hence, is convexlike on (ibid.). Thus, by Theorem 1 in Fan (1953), (13) holds. □
We conclude by proving Theorem 3.
Proof of Theorem 3. We follow arguments given on page 93 of Chang (2006) to show that, under the conditions of this theorem, (13) implies that there exists an and a such that
| (14) |
Noting that pointwise maxima of lower semicontinuous functions are themselves lower semicontinuous, Lemma 14 implies that is lower semicontinuous. Because is compact (Lemma 13), there exists an such that
Similarly, Lemma 12 implies that is upper semicontinuous on . Because is compact (Lemma 11), there exists a such that
By Lemma 16, the above two displays show that . Combining this result with the elementary fact that shows that (14) holds.
Recall from below (12) that for all and . Moreover, since (Lemma 8), there exists a such that . Combining these observations shows that (i) ; (ii) ; and (iii) . Hence, by (14), . Equivalently, for all and . □
7.2.5. Proof of Theorem 4
Proof of Theorem 4. Fix , and let be the corresponding modules. Recall from Algorithm 2 that, for a given and is defined so that for all and for all . Now, for any ,
and so takes the form
Because does not depend on the last four arguments of , we know that satisfies (5), that is, is invariant to shifts and rescalings of the features and is equivariant to shifts and rescalings of the outcome. It remains to show permutation invariance, namely (4). By the permutation invariance of the sample mean and sample standard deviation, it suffices to establish the analogue of this property for , namely that for all , and . For an array of size , we will write to mean the array for which for all . Note that
Hence, satisfies (4). □
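The equivariance properties established in Theorem 4 can be illustrated with a toy version of such an architecture. The pooling and kernel choices below are illustrative only, not the paper's actual modules: standardization of the inputs and a permutation-invariant pooling over observations together yield invariance to permutations and to shifts/rescalings of the features, and equivariance to shifts/rescalings of the outcome:

```python
import numpy as np

def toy_equivariant_estimator(X, y, x_new):
    """Toy estimator: standardize features and outcome, pool over
    observations with a permutation-invariant kernel average, then
    un-standardize the prediction (shift/scale equivariance in y)."""
    mu_x, sd_x = X.mean(axis=0), X.std(axis=0)
    mu_y, sd_y = y.mean(), y.std()
    Xs = (X - mu_x) / sd_x
    xs = (x_new - mu_x) / sd_x
    ys = (y - mu_y) / sd_y
    # permutation-invariant pooling: kernel-weighted average of outcomes
    w = np.exp(-((Xs - xs) ** 2).sum(axis=1))
    pred_std = (w * ys).sum() / w.sum()
    return mu_y + sd_y * pred_std

rng = np.random.default_rng(0)
X, y, x = rng.normal(size=(50, 3)), rng.normal(size=50), rng.normal(size=3)
base = toy_equivariant_estimator(X, y, x)
perm = rng.permutation(50)
p1 = toy_equivariant_estimator(X[perm], y[perm], x)       # permute observations
p2 = toy_equivariant_estimator(X, 2.0 * y + 3.0, x)       # shift/rescale outcome
p3 = toy_equivariant_estimator(5.0 * X - 2.0, y, 5.0 * x - 2.0)  # affine features
print(np.isclose(base, p1), np.isclose(2.0 * base + 3.0, p2), np.isclose(base, p3))
```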
8. Extensions and Discussion
We have focused on a particular set of invariance properties on the collection of priors , namely P1-P3. Our arguments can be generalized to handle other properties. As a simple example, suppose P3 is strengthened so that is invariant to nonzero (rather than only nonnegative) rescalings of the outcome – this property is in fact satisfied in all of our experiments. Under this new condition, the results in Section 2 remain valid with the definition of the class of equivariant estimators defined in (4) and (5) modified so that may range over . Moreover, for any , Jensen’s inequality shows that the -maximal risk of the symmetrized estimator that averages and negative is no worse than that of . To assess the practical utility of this observation, we numerically evaluated the performance of symmetrizations of the estimators learned in our experiments. Symmetrizing improved performance across most settings (see Appendix F). We therefore recommend carefully characterizing the invariance properties of a given problem when setting out to meta-learn an estimator.
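A minimal sketch of this symmetrization, assuming (as the surrounding text suggests) that it averages the estimator applied to the original data with the negation of the estimator applied to the sign-flipped outcome; the base estimator here is deliberately non-equivariant for illustration:

```python
import numpy as np

def symmetrize(estimator):
    """Return the symmetrized estimator averaging f(X, y)(x) and
    -f(X, -y)(x); the result is equivariant to sign flips of y."""
    def sym(X, y, x_new):
        return 0.5 * (estimator(X, y, x_new) - estimator(X, -y, x_new))
    return sym

# a deliberately non-equivariant base estimator: OLS plus a constant offset
def ols_plus_offset(X, y, x_new):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(x_new @ beta + 1.0)

rng = np.random.default_rng(0)
X, y, x = rng.normal(size=(30, 2)), rng.normal(size=30), rng.normal(size=2)
f = symmetrize(ols_plus_offset)
# sign-flip equivariance holds for the symmetrized estimator but not the base one
print(np.isclose(f(X, -y, x), -f(X, y, x)))
```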
Much of this work has focused on developing and studying a framework for meta-learning a -minimax estimator for a single, prespecified collection of priors . In some settings, it may be difficult to specify a priori a single such collection that is small enough that the -minimax estimator is not too conservative while also being rich enough that the priors in this collection actually place mass in a neighborhood of the true data-generating distribution. Two approaches for overcoming this challenge seem to warrant further consideration. The first would be to employ an empirical Bayes approach (Efron and Morris, 1972), wherein a large dataset from a parallel situation can be used to inform about the possible forms that the prior might take; this, in turn, would also inform about the form that the collection should take. Recent advances also make it possible to incorporate knowledge about the existence of qualitatively different categories of features when performing empirical Bayes prediction (Nabi et al., 2020). The second approach involves using AMC to approximate -minimax estimators over various choices of , and then using a stacked ensemble to combine the predictions from these various base estimators. In our data experiments, we saw that a simple version of this ensemble that combined four base AMC estimators consistently performed at least as well as the best of these base estimators.
In this work, we have focused on the case where the problem of interest is a supervised learning problem and the objective is to predict a continuous outcome based on iid data. While the AMC algorithm generalizes naturally to a variety of other sampling schemes and loss functions (see Luedtke et al., 2020), our characterization of the equivariance properties of an optimal estimator was specific to the iid regression setting that we considered. In future work, it would be interesting to characterize these properties in greater generality, including in classification settings and inverse reinforcement learning settings (e.g., Russell, 1998; Geng et al., 2020).
Acknowledgments
The authors thank Devin Didericksen for help in the early stages of this project. Generous support was provided by Amazon through an AWS Machine Learning Research Award and the NIH under award number DP2-LM013340. The content is solely the responsibility of the authors and does not necessarily represent the official views of Amazon or the NIH.
Appendices
A. Review of amenability
In this appendix, we review the definition of an amenable group, an important implication of amenability, and also some sufficient conditions for establishing that a group is amenable. This material will prove useful in our proof of Theorem 1 (see Section 7.2.2). We refer the reader to Pier (1984) for a thorough coverage of amenability.
Definition 1 (Amenability). Let be a locally compact, Hausdorff group and let be the space of Borel measurable functions that are essentially bounded with respect to the Haar measure. A mean on is defined as a linear functional such that whenever and . A mean is said to be left invariant for a group if and only if for all , where . The group is said to be amenable if and only if there is a left invariant mean on .
We now introduce the fixed point property, and subsequently present a result showing its close connection to the definition given above. Throughout this work, we equip all group actions with the product topology.
Definition 2 (Fixed point property). We say that a locally compact, Hausdorff group has the fixed point property if, whenever acts affinely on a compact convex set in a locally convex topological vector space with the map continuous, there is a point in fixed under the action of .
Theorem S1 (Day’s Fixed Point Theorem). A locally compact, Hausdorff group has the fixed point property if and only if is amenable.
Proof. See the proof of Theorem 5.4 in Pier (1984). □
The following results are useful for establishing amenability.
Lemma S17. Any compact group is amenable.
Proof. Take the normalized Haar measure as an invariant mean. □
Lemma S18. Any locally compact Abelian group is amenable.
Proof. See the proof of Proposition 12.2 in Pier (1984). □
Lemma S19. Let be a locally compact group and a closed normal subgroup of . If and are amenable, then is amenable.
Proof. Assume that a continuous affine action of on a nonempty compact convex set is given. Let be the set of all fixed points of in . Since is amenable, Theorem S1 implies that is nonempty. Since the group action is continuous, is a closed subset of and hence is compact. Since the action is affine, is convex. Now, note that, for all , , and , the fact that implies that which implies . Hence, is preserved by the action of . The action of on factors to an action of on , which has a fixed point since is amenable. But then is fixed by each . Hence, is amenable. □
B. Examples of collections where T1-T6 hold
B.1. Infinite-dimensional class
We start by presenting an infinite-dimensional class that satisfies T1-T6, and then we subsequently present a finite-dimensional class. To define this class, we fix and a function that is invariant to permutations, shifts, and rescalings, in the sense that both of the following hold:
- F1. Permutations: For all and , it holds that
- F2. Shifts and rescalings: For all , and , it holds that , where is the matrix with row equal to .
These conditions bear some resemblance to T4 and T5. One example of a function that satisfies the above conditions is a constant function.
The infinite-dimensional class of functions that we consider is defined as
We will now show that this class satisfies T1-T6. Conditions T1 and T2 follow immediately from the definition of . We now show that T3 holds. Because is complete, it suffices to show that, if converges compactly and , then . Let compactly. To see that , note that
and then take the limit as . To see that satisfies the Hölder condition, note that, for any ,
and again take the limit as . Hence, for each , and so . Hence , and thus T3 holds. We now show that T4 and T5 hold. To do this, we will use the group theoretic notation defined in Section 7.1. As noted in that section, T4 and T5 are equivalent to the condition that for all and . We will therefore fix and and show that . For , we have that
where the inequality holds since . Note that for any . Hence,
where the inequality holds since . Hence, , and so T4 and T5 hold. It remains to show T6. To see that this holds, fix and and let . By the triangle inequality and the fact that , we have the following two displays for any :
Hence, , and so T6 holds.
B.2. Finite-dimensional class
B.2.1. Overview
For an explicit representation of , we have
where . For ease of communication, we will abbreviate
so that . Here, stands for the angular component, stands for the test point, stands for the mean, and stands for the standard deviation.
To define our parametric example for , we can use separation of variables to consider the coordinates of separately. We will consider estimators belonging to the class of all such that
We refer to , and as the angular part, test point part, and group part of , respectively. In what follows, we will describe conditions on , and that make it so that T1-T6 hold. We will then describe interesting collections , and that satisfy these conditions.
First note that we have the following inequality:
Thus if , and were uniformly bounded by and each of their global Hölder constants were less than or equal to , then and . Hence, if , and are such that functions in these collections are uniformly bounded by and are -Hölder, then . In that case, conditions T1 and T2 hold. Since every compact subset of can be written as a subset of a product of compact sets , , for condition T3 to hold, it suffices to show that , and are closed. Condition T4 holds if is closed under rotations with respect to the observations and if , and are closed under permutations with respect to the features. The latter can be done by letting , and be -fold tensor products of an identical space of functions. Condition T5 is satisfied when is closed under shifts. Finally, condition T6 holds when , , and are convex since the projected tensor product of convex sets is convex.
B.2.2. Angular Part
We define by truncating an orthonormal basis for the tensor product space to a specified finite number of terms and then taking the subset of the span of those basis vectors that are contained in for some and . Note that , where “≅” denotes an isomorphic relation and is the -dimensional unit sphere. Let 1 be the -dimensional vector of 1's, and note that can be expressed in the following form:
Let , the orthogonal group, be such that , the th elementary basis vector. Such a exists because . Then,
We have the isomorphism . Thus, if we have an orthonormal basis for , we may use the operator to obtain an orthonormal basis for . Let be the space of harmonic polynomials of degree in -dimensions. By the Stone-Weierstrass theorem, the direct sum is dense in . We can truncate the series and stop at a prespecified point , so that
| (S1) |
We use the orthonormal basis for the spherical harmonics introduced in Higuchi (1987) (replacing “” in their notation by “” to avoid notational overload), where an explicit expression for this basis is provided in that work. Let and
where . The set is the coefficient space of the basis expansion in and is convex and compact if and only if is convex and compact. The set is closed under rotations in the observations since the space of spherical harmonics of any given degree is closed under rotations. It is also closed under permutations due to the -fold tensor product form. As an intersection of closed convex sets, it is closed and convex.
B.2.3. Test Point Part
Similarly to the angular part, is defined by truncating an orthonormal basis for . Let be the normalized Hermite functions. They form an orthonormal basis of , and so their -fold tensor product is an orthonormal basis of . We can take
We can similarly define the coefficient space :
As for , the -fold tensor product form, together with the fact that the set is an intersection of closed and convex sets, shows that all of the necessary conditions are satisfied.
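The normalized Hermite functions invoked above can be generated from the physicists' Hermite three-term recurrence; the sketch below (illustrative only, using the standard normalization 2^n n! √π for the physicists' convention) verifies their orthonormality numerically:

```python
import math

def hermite_function(n, x):
    # Normalized Hermite function psi_n(x) = H_n(x) e^{-x^2/2} / sqrt(2^n n! sqrt(pi)),
    # with H_n the physicists' Hermite polynomial built via the three-term
    # recurrence H_{k+1}(x) = 2x H_k(x) - 2k H_{k-1}(x).
    h_prev, h = 1.0, 2.0 * x
    if n == 0:
        h = h_prev
    else:
        for k in range(1, n):
            h_prev, h = h, 2.0 * x * h - 2.0 * k * h_prev
    norm = math.sqrt(2.0 ** n * math.factorial(n) * math.sqrt(math.pi))
    return h * math.exp(-x * x / 2.0) / norm

def inner_product(m, n, lo=-12.0, hi=12.0, steps=4000):
    # Trapezoid-rule approximation of the L2 inner product of psi_m and psi_n;
    # the integrand decays like e^{-x^2}, so truncating at |x| = 12 is harmless.
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * dx
        w = 0.5 if i in (0, steps) else 1.0
        total += w * hermite_function(m, x) * hermite_function(n, x)
    return total * dx

gram = [[inner_product(m, n) for n in range(4)] for m in range(4)]
```

The Gram matrix of the first few functions is numerically the identity, as an orthonormal basis requires.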
B.2.4. Group Part
The that we will define imposes that the functions are periodic in each dimension, in the sense that, if and for some elementary basis vector , then . In other words, we will be dealing with functions on the -dimensional torus, . Since the torus is a product of 1-spheres, we can use the same process as described when defining the angular part , namely letting
| (S2) |
In this case, , and translations can be handled via the angle-addition formulas for sine and cosine. Under periodicity, translations act as rotations, and since the span of the spherical harmonics of a given degree is closed under rotations, is closed under translations. Similarly, the tensor product form of , together with its being an intersection of closed and convex sets, implies that the remaining sufficient conditions described at the end of Section B.2.1 are satisfied.
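The coefficient-level argument for closure under translations can be checked numerically. The sketch below builds a random truncated trigonometric series, translates it by t, and confirms that the translated function lies in the same span, with each degree-k coefficient pair rotated by the angle kt via the angle-addition formulas:

```python
import math, random

random.seed(0)
K = 4
a = [random.uniform(-1, 1) for _ in range(K)]  # cosine coefficients, degrees 1..K
b = [random.uniform(-1, 1) for _ in range(K)]  # sine coefficients, degrees 1..K

def series(a, b, x):
    return sum(a[k - 1] * math.cos(k * x) + b[k - 1] * math.sin(k * x)
               for k in range(1, K + 1))

t = 0.7  # translation amount
# cos(k(x+t)) = cos(kx)cos(kt) - sin(kx)sin(kt) and
# sin(k(x+t)) = sin(kx)cos(kt) + cos(kx)sin(kt), so translating by t rotates
# each degree-k coefficient pair (a_k, b_k) by the angle k*t.
a_t = [a[k - 1] * math.cos(k * t) + b[k - 1] * math.sin(k * t) for k in range(1, K + 1)]
b_t = [-a[k - 1] * math.sin(k * t) + b[k - 1] * math.cos(k * t) for k in range(1, K + 1)]

max_gap = max(abs(series(a, b, x + t) - series(a_t, b_t, x))
              for x in [2 * math.pi * i / 100 for i in range(100)])
```

The gap is zero up to floating-point error, so the truncated span is indeed closed under translations.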
C. Examples of collections where P5 holds
We now describe settings where P5 is often applicable. We will specify in each of these settings, and the model is then defined by expanding to contain the distributions of all possible shifts and rescalings of a random variate drawn from some . The first class of models for which P5 is often satisfied is parametric in nature, with each distribution indexed smoothly by a finite-dimensional parameter belonging to a subset of . We note here that, because the sample size is fixed in our setting, we can obtain an essentially unrestricted model by allowing to be large relative to . In parametric settings, can often be defined as , where we recall that denotes the Euclidean norm. If is uniformly tight, which certainly holds if is bounded, then P5 holds provided is upper-semicontinuous for all . For a concrete example where the conditions of P5 are satisfied, consider the case that for sparsity parameters and on and , and is the distribution for which , and . This setting is closely related to the sparse linear regression example that we study numerically in Section 5.3.2.
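As a toy illustration of drawing a dataset from such a sparse linear prior (the dimensions, noise level, and coefficient distribution below are arbitrary illustrative choices, not those used in our experiments):

```python
import random

def draw_sparse_linear_dataset(n, d, s, rng):
    # Hypothetical sketch: first draw a sparse regression coefficient from the
    # prior (active set of size s, Gaussian coefficients on the support), then
    # sample a dataset from the implied distribution.
    support = rng.sample(range(d), s)           # active coordinates
    beta = [0.0] * d
    for j in support:
        beta[j] = rng.gauss(0.0, 1.0)           # coefficients on the support
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    y = [sum(beta[j] * row[j] for j in range(d)) + 0.3 * rng.gauss(0.0, 1.0)
         for row in X]
    return X, y, beta

rng = random.Random(1)
X, y, beta = draw_sparse_linear_dataset(n=50, d=10, s=2, rng=rng)
```

A draw from the full model would additionally apply a random shift and rescaling to the sampled variates.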
Condition P5 also allows for nonparametric regression functions. Define to be the -dimensional standard Gaussian measure. Define . Let satisfy the following conditions:
(i) is bounded: .
(ii) is uniformly equivanishing: .
(iii) is uniformly equicontinuous: , where denotes the translation operator.
(iv) is closed in .
(v) There exists such that .
By a generalization of the Riesz-Kolmogorov theorem as seen in Guo and Zhao (2019), is compact under assumptions (i) through (iv). Let . We suppose that where is the set of all functions such that for all . Assume further that is bounded, i.e.
| (S3) |
and also that is constant on the orbits induced by the group action on defined in Section 7.1.
For each , let denote the distribution of . Suppose that . With this metric, is a complete, separable, and compact metric space. We also see that is continuous.
Lemma S20. For all is continuous in this example.
Proof. To ease presentation, we introduce some notation. For , let , , and . Let denote the map , where takes the same value as except that the entry is replaced with . Also let . For and a function , we let . We let -a.s. . For , we write to mean , and follow a similar convention for functions that only take as input , or . We will write ≲ to mean inequality up to a positive multiplicative constant that may depend only on or .
Fix and . Now, for any , a change of variables shows that
Hereafter we write to denote .
Fix . Most of the remainder of this proof will involve establishing that . By symmetry, it will follow that .
In what follows, we will use the notation to mean , etc. The above yields that
| (S4) |
| (S5) |
| (S6) |
| (S7) |
| (S8) |
We bound the labeled terms on the right-hand side separately. After some calculations, it can be seen that (S4) and (S5) are bounded by a constant multiplied by . These calculations, which are omitted, involve several applications of the triangle inequality, the Cauchy-Schwarz inequality, and condition (i).
The integral in (S6) bounds as follows:
| (S9) |
We start by studying the first term of the right-hand side above. Note that, by (S3) and the assumption that for all and , we have that . Combining this with the Cauchy-Schwarz inequality, the first term on the right-hand side above bounds as
| (S10) |
To continue the above bound, we will show that . Noting that
we see that, by the triangle inequality and the Cauchy-Schwarz inequality,
For , and so , which implies that , which in turn implies that . Combining this with the above and taking square roots of both sides gives the desired bound, namely
| (S11) |
Recalling (S10), we then see that the first term on the right-hand side of (S9) satisfies
We now study the second term in (S9). Before beginning our analysis, we note that, for all ,
| (S12) |
Combining the above with the triangle inequality, the second term in (S9) bounds as:
| (S13) |
In the above normed quantities, expressions like should be interpreted as functions, e.g. . By (S3), the first term on the right-hand side bounds as
For the second term, we start by noting that
Using that whenever and , this then implies that
where above is the exponent from the Hölder condition satisfied by . Combining the Hölder condition with the above, we then see that
Multiplying both sides by , we then see that
The inequality above remains true if we integrate both sides against . The resulting three terms on the right-hand side can be bounded using Hölder’s inequality. In particular, we have that
Hence, we have shown that the second term on the right-hand side of (S13) satisfies
We now study the third term on the right-hand side of (S13). We start by noting that, by Markov’s inequality and (S11),
Moreover, by the generalized Hölder’s inequality with parameters (4, 2, ∞, 4), we see that
Combining our bounds for the three terms on the right-hand side of (S13), we have shown that
| (S14) |
The above provides our bound for the (S6) term from the main expression.
We now study the (S7) term from the main expression. We start by decomposing this term as
where for brevity, we have suppressed the dependence on , and on their arguments. By (S11), the first term is bounded by a constant times . For the second term, we note that the uniform bound on and shows that
As we did when studying (S6), we can use (S12) and the triangle inequality to write
The first term on the right-hand side is upper bounded by a constant times . The analyses of the second and third terms are similar to those of the analogous terms from (S6). A minor difference is that, when applying Hölder’s inequality to separate the terms in each normed expression, we use (v) to ensure that for some . This helps us deal with the fact that , rather than , appears in the normed expressions above. Because the arguments closely parallel those given for (S6), the calculations for controlling the second and third terms are omitted. After the relevant calculations, we end up showing that, like (S6), (S7) is bounded by a constant times the right-hand side of (S14).
To study (S8) from the main expression, we rewrite the integral as
Each of the terms in the expansion can be bounded using similar techniques to those used earlier in this proof. Combining our bounds on (S4) through (S8), we see that
As were arbitrary, we see that, for any sequence in such that in as , it holds that . As was arbitrary, this shows that as . Hence, is continuous in this example. □
D. Further details on numerical experiments
D.1. Meta-Learning Benchmarks
We implemented MAML via the learn2learn Python package (Arnold et al., 2020), which in turn makes use of the Torchmeta package (Deleu et al., 2019) when generating the sinusoid functions. We trained MAML on a total of 10⁶ datasets with a batch size of 25 datasets and used the same learning rates and number of adaptation steps as were used in learn2learn/examples/maml_sine.py. We tried two network architectures, namely the same two-hidden-layer perceptron architecture that was used in the sinusoid experiments in Finn et al. (2017) and a larger network whose hidden layers contained the same number of nodes (40) but that used a total of five hidden layers. For each of the three regression settings considered (sinusoid, Gaussian process with a 1-dimensional feature, and Gaussian process with a 5-dimensional feature), we reported results for the architecture that performed best across the sample sizes considered. This ended up corresponding to reporting results for the smaller network architecture across all three settings.
For the Gaussian process example with a 1-dimensional feature, we used the implementation of CNPs provided by Jiang (2021), which corresponds to a PyTorch implementation of the code from Garnelo et al. (2018). We also modified this code so that it could apply to the sinusoidal regression example and the Gaussian process example where the feature is 5-dimensional. The CNPs were updated over the same number of iterations and using the same batch size as AMC, namely 10⁶ and 25, respectively. We tried two network architectures for the CNPs, namely the same architecture as was used in Garnelo et al. (2018), with the input size modified in one of the Gaussian process settings to account for the 5-dimensional feature, and also a deeper architecture that has a similar number of hidden layers as does the architecture used for AMC. In particular, the encoder and decoder in this larger architecture each had nine hidden layers consisting of 100 nodes. As we did for MAML, for each of the three regression settings considered, we reported results for the architecture that performed best across the sample sizes considered. This corresponded to reporting CNP results for the smaller architecture for the Gaussian process with a 5-dimensional feature, and the larger architecture for the Gaussian process with a 1-dimensional feature and the sinusoidal regression.
D.2. Comparing to Analytically-Derived Estimators with Known Theoretical Performance Guarantees
D.2.1. Preliminaries
We now introduce notation that will be useful for defining in the two examples. In both examples, all priors in imply the same prior over the distribution of the features. This prior imposes that the indexing is equal in distribution to , where is a matrix drawn from a Wishart distribution with scale matrix and 20 degrees of freedom, and denotes a matrix with the same diagonal as and zero in all other entries. The expression for normalizes by to ensure that the diagonal of is equal to , which we require of distributions in . We let be a collection of Markov kernels , so that, for each and , is a distribution on . The collections differ in the two examples and will be presented in the coming subsections. Let denote a uniform distribution over the permutations in . For each , we let represent a prior on from which a draw can be generated by sampling , and , and subsequently returning the distribution of , where and are independent. We let . For a general class of estimators , enforcing that each draw has a regression function of the form for some permutation is useful because it allows us to restrict the class so that each function in this class only depends on the first coordinates of the input, while yielding a regression function that may depend on any arbitrary collection of out of the total coordinates. For the equivariant class that we consider (Algorithm 2), enforcing this turns out to be unnecessary: the invariance of functions in to permutations of the features implies that the Bayes risk of each remains unchanged if the random variable defining is replaced by a degenerate random variable that is always equal to the identity matrix. Nonetheless, allowing to be a random draw from allows us to ensure that our implied collection of priors satisfies P1, P2, and P3, thereby making the implied compatible with the preservation conditions imposed in Section 2.
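The normalized-Wishart construction for the feature correlation matrix can be sketched as follows, assuming an identity scale matrix for illustration (the scale matrix used above may differ); the normalization by the diagonal guarantees that the result has unit diagonal, as required of distributions in the model:

```python
import random

def draw_feature_correlation(d, dof, rng):
    # Sketch: draw W ~ Wishart(identity scale, dof degrees of freedom) as
    # W = G^T G with G a dof-by-d standard normal matrix, then normalize by
    # the diagonal so that R[i][j] = W[i][j] / sqrt(W[i][i] * W[j][j]).
    G = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(dof)]
    W = [[sum(G[k][i] * G[k][j] for k in range(dof)) for j in range(d)]
         for i in range(d)]
    return [[W[i][j] / (W[i][i] * W[j][j]) ** 0.5 for j in range(d)]
            for i in range(d)]

R = draw_feature_correlation(d=5, dof=20, rng=random.Random(2))
```

By construction R is symmetric with unit diagonal, and every off-diagonal entry lies in [-1, 1] by the Cauchy-Schwarz inequality.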
Figure S5:

Bayesian standardized MSE (where is defined in Eq. (1)) of the five meta-learning algorithms considered in the sinusoidal regression example when the feature or the outcome is scaled down by a multiplicative factor (left two columns) or when or is shifted by an additive factor (right two columns). For reference, the numbers reported in Table 1 in the main text are equal to the standardized MSE reported on the far-left side of each facet times the variance of the error (0.09). The three equivariant procedures (MAML-Eq, CNP-Eq, and AMC) have constant standardized MSE under the shifts and rescalings considered. The non-equivariant procedures, namely MAML and CNPs, are sensitive even to small shifts or rescalings of , and CNPs are also sensitive to small shifts in .
We now use the notation of Kingma and Ba (2014) to detail the hyperparameters that we used. In all settings, we set . Whenever we were updating the prior network, we set the momentum parameter to 0, and whenever we were updating the estimator network, we set the momentum parameter to 0.25. The parameter differed across settings. In the sparse linear regression setting with , we found that choosing small helped to improve stability. Specifically, we let when updating both the estimator and prior networks. In the sparse linear regression setting with , we used the more commonly chosen parameter setting of for both networks. In the FLAM example, we chose and for the estimator and prior networks, respectively.
The learning rates of the estimator and prior networks were decayed at rates and , respectively. Such two-timescale learning rate strategies have proven effective in stabilizing the optimization problem pursued by generative adversarial networks (Heusel et al., 2017). As noted in Fiez et al. (2019), using two-timescale strategies can cause the optimization problem to converge to a differential Stackelberg, rather than a differential Nash, equilibrium. Indeed, under some conditions, the two-timescale strategy that we use is expected to converge to a differential Stackelberg equilibrium in the hierarchical two-player game where a prior is first selected from , and then an estimator is selected from to perform well against . An optimal prior in this game is called -least favorable, in the sense that this prior maximizes over . For a given -least favorable prior , an optimal estimator in this game is a Bayes estimator against , that is, an estimator that minimizes over . This may not necessarily be a -minimax strategy, that is, it may not minimize over . Nevertheless, we note that, under appropriate conditions, the two notions of optimality agree. Though such a theoretical guarantee is unlikely to hold in our experiments given the neural network parameterizations that we use, we elected to use this two-timescale strategy because of the improvements in stability that we saw.
In all settings, the prior and estimator were updated over 10⁶ iterations using batches of 100 datasets. For each dataset, performance was evaluated at 100 values of .
D.2.2. Sparse linear regression
We now introduce notation that will be useful for presenting the collection in the sparse linear regression example. For a function and a distribution , we let be equal to the distribution of
where and are drawn independently. Notably, here does not depend on . We let , where takes different values when and when . When consists of all four-hidden layer perceptrons with identity output activation, where each hidden layer consists of forty leaky ReLU units. When consists of all four-hidden layer neural networks with identity output activation, but in this case each layer is a multi-input-output channel equivariant layer as described in Eq. 22 of Zaheer et al. (2017). Each hidden layer is again equipped with a ReLU activation function. The output of each such network is equivariant to permutations of the inputs.
In each sparse linear regression setting considered, we initialized the estimator network by pretraining for 5,000 iterations against the initial fixed prior network. After these 5,000 iterations, we then began to adversarially update the prior network against the estimator network.
Five thousand Monte Carlo replicates were used to obtain the performance estimates in Table 2.
D.2.3. Fused lasso additive model
When discussing the FLAM example, we will write to denote the feature, that is, we denote a generic by . We emphasize this to avoid any notational confusion with the fact that, elsewhere in the text, is used to denote the random variable corresponding to the observation.
In the FLAM example, each prior in is indexed by a function belonging to the collection of four-hidden layer perceptrons with identity output activation, where each hidden layer consists of forty leaky ReLU units. Specifically, is a distribution over generalized additive models for which each component is piecewise-constant and changes values at most 500 times. To obtain a draw from , we can first draw 500 iid observations from and store these observations in the matrix . Each component can only have a jump at the 500 points in . The magnitude of each jump is defined using the function and the sign of the jump is defined uniformly at random. More specifically, these increments are defined based on the independent sources of noise , which is an iid collection of Rademacher random variables, and , which is an iid collection of random variables. The component is chosen to be proportional to the function . The proportionality constant is defined so that the function saturates the constraint that is imposed by . To recap, the random draw from can be obtained by independently drawing , and , and subsequently following the steps described above to define the corresponding proportionality constant and components .
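The sampling of a single piecewise-constant component can be sketched as follows (a simplified illustration: the jump-magnitude distribution and the target total variation here are placeholders for the quantities defined via the function and the saturated constraint above):

```python
import random

def draw_piecewise_constant_component(n_knots, tv_target, rng):
    # Hypothetical sketch: candidate jump points drawn uniformly, jump signs
    # Rademacher, jump magnitudes |N(0,1)|; the component is the running sum
    # of the jumps, rescaled so its total variation equals tv_target.
    knots = sorted(rng.uniform(-2.5, 2.5) for _ in range(n_knots))
    jumps = [rng.choice((-1.0, 1.0)) * abs(rng.gauss(0.0, 1.0)) for _ in range(n_knots)]
    scale = tv_target / sum(abs(j) for j in jumps)
    levels, running = [], 0.0
    for j in jumps:
        running += scale * j
        levels.append(running)

    def component(x):
        # Piecewise constant: the value is the running sum of jumps at knots <= x.
        value = 0.0
        for knot, level in zip(knots, levels):
            if knot <= x:
                value = level
            else:
                break
        return value

    return component, knots, levels

component, knots, levels = draw_piecewise_constant_component(500, 10.0, random.Random(4))
```

The resulting function jumps only at the sampled knots and has total variation equal to the target by construction.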
We evaluated the performance of the learned prediction procedures using a variant of the simulation scenarios 1–4 from the paper that introduced FLAM (Fig. 2 in Petersen et al., 2016). As presented in that work, the four scenarios have independent Unif(−2.5, 2.5) features, with the components corresponding to of these features being nonzero. These scenarios offer a range of smoothness settings, with scenarios 1–4 enforcing that the components be (1) piecewise constant, (2) smooth, (3) a mix of piecewise constant and smooth functions, and (4) constant in some areas of its domain and highly variable in others. To evaluate our procedures trained with , we used the function sim.data in the flam package (Petersen, 2018) to generate training data from the scenarios in Petersen et al. (2016) with features. We then generated new outcomes by rescaling the regression function by a positive multiplicative constant so that , and subsequently added standard Gaussian noise. To evaluate our procedures trained at sparsity level in a given scenario, we defined a prior over the regression function that first randomly selects one of the four signal components, then rescales this component so that it has total variation equal to 10, and then sets all other components equal to zero. Outcomes were generated by adding Gaussian noise to the sampled regression function. We compared our approach to the FLAM method as implemented in the flam package when, in the notation of Petersen et al. (2016), and was chosen numerically to enforce that the resulting regression function estimate satisfied . Choosing in this fashion is reasonable in light of the fact that for all settings considered.
Two thousand Monte Carlo replicates were used to obtain the performance estimates in Table 3.
E. Additional details and results for data experiments
E.1. Datasets
We start by describing the six datasets that we considered that are available through the UCI Machine Learning Repository (Dua and Graff, 2017). The first dataset (“abalone”) contains information on 4,177 abalones. The objective is to predict their age based on 7 features, namely length, diameter, height, whole weight, shucked weight, viscera weight, and shell weight (Nash et al., 1994). The second dataset (“airfoil”) is from the National Aeronautics and Space Administration (NASA) and contains information on 1,503 airfoils at various wind tunnel speeds and angles of attack (Brooks et al., 1989). The objective is to estimate the scaled sound level in decibels. Five features are available, namely frequency, angle of attack, chord length, free-stream velocity, and suction side displacement thickness. The third dataset (“fish”) was originally used to develop quantitative structure-activity relationship (QSAR) models to predict acute aquatic toxicity towards the fathead minnow. This dataset contains 908 total observations, each of which corresponds to a distinct chemical. The outcome is the LC50 for that chemical, which represents the concentration of the chemical that is lethal for 50% of test fish over 96 hours. Six features that describe the molecular characteristics of the chemical are available; see the UCI Machine Learning Repository and Cassotti et al. (2015) for details. The fourth and fifth datasets contain information on 1,599 red wines (“wine-red”) and 4,898 white wines (“wine-white”) (Cortez et al., 2009). The objective is to predict wine quality score based on 11 available features; see the UCI Machine Learning Repository and Cortez et al. (2009) for details. The sixth dataset (“yacht”) contains information on 308 sailing yachts. The objective is to learn to predict a ship’s performance in terms of residuary resistance.
Six features describing a ship’s dimensions and velocity are available, namely: the longitudinal position of the center of buoyancy, the prismatic coefficient, the length-displacement ratio, the beam-draught ratio, the length-beam ratio, and the Froude number. See Gerritsma et al. (1981) for more information on these features.
The seventh and eighth datasets that we considered were used to illustrate regression procedures in James et al. (2013). They are available through the ISLR R package (James et al., 2017). The first of these datasets (“college”) consists of information on 777 colleges in the United States. The objective is to predict out-of-state tuition based on 16 available continuous features. The second of these datasets (“hitters”) contains information on 322 baseball players. The objective is to predict salary based on the 16 available continuous features. The ninth dataset (“LAozone”) was used to illustrate regression procedures in Friedman (2001). It consists of 330 daily meteorological measurements in the Los Angeles basin in 1976. The objective is to predict ozone levels based on 9 available features. The final dataset that we considered (“happiness”) was used in the paper that introduced the FLAM to illustrate the performance of the method (Petersen et al., 2016). This dataset consists of information about 109 countries. The objective is to predict the national happiness level via 12 country-level features.
E.2. Additional results for data experiments
Table S5 displays the cross-validated MSEs across the ten datasets in numerical form. Figure S6 shows the performance of the individual linear algorithms considered at different sparsity levels, and Figure S7 shows the same results but for the stacking algorithms.
Table S5:
Cross-validated MSEs on the ten datasets. The first five datasets had the same number of features (10) as were used during meta-training, whereas the others had fewer. For each of the three categories (linear estimators, FLAM estimators, and stacked estimators) and each dataset, the algorithm with the lowest Monte Carlo MSE is emphasized in bold. There was no clear ordering between the performance of AMC Linear and the existing estimators (OLS and lasso). AMC FLAM tended to outperform FLAM when the number of features was the same as were used during meta-training, and to be slightly outperformed otherwise. When the number of features was the same as were used during meta-training, stacking the existing and AMC estimators consistently outperformed all other approaches. When there were fewer features than were used during meta-training, stacking all available learners performed similarly to stacking only the existing algorithms and still outperformed all individual learners.
| Features | OLS | Lasso | AMC Linear (ours) | FLAM | AMC FLAM (ours) | Stacked Existing | Stacked AMC (ours) | Stacked Both (ours) | |
|---|---|---|---|---|---|---|---|---|---|
| college | 10 | 0.414 | 0.397 | 0.377 | 0.392 | 0.395 | 0.358 | 0.354 | 0.348 |
| happiness | 10 | 0.270 | 0.277 | 0.275 | 0.315 | 0.311 | 0.280 | 0.261 | 0.256 |
| hitters | 10 | 0.667 | 0.660 | 0.662 | 0.626 | 0.619 | 0.602 | 0.615 | 0.585 |
| wine-red | 10 | 0.768 | 0.737 | 0.746 | 0.826 | 0.776 | 0.737 | 0.737 | 0.731 |
| wine-white | 10 | 0.833 | 0.814 | 0.824 | 0.899 | 0.860 | 0.809 | 0.815 | 0.802 |
| LAozone | 9 | 0.341 | 0.335 | 0.337 | 0.335 | 0.367 | 0.310 | 0.320 | 0.309 |
| abalone | 7 | 0.559 | 0.546 | 0.540 | 0.709 | 0.675 | 0.539 | 0.538 | 0.537 |
| fish | 6 | 0.471 | 0.475 | 0.480 | 0.544 | 0.554 | 0.464 | 0.476 | 0.468 |
| yacht | 6 | 0.381 | 0.372 | 0.350 | 0.019 | 0.035 | 0.015 | 0.029 | 0.015 |
| airfoil | 5 | 0.524 | 0.525 | 0.528 | 0.617 | 0.701 | 0.516 | 0.523 | 0.520 |
F. Performance of symmetrized estimators in experiments
We now present the additional experimental results that we alluded to in Section 8. These results were obtained by symmetrizing the meta-learned AMC100 and AMC500 estimators whose performance was reported in Section 5. In particular, we symmetrized a given AMC estimator as
When reporting our experimental results, we refer to the symmetrized estimators derived from the meta-learned AMC100 and AMC500 estimators as 'symmetrized AMC100' and 'symmetrized AMC500', respectively. We emphasize that these symmetrized estimators are derived directly from the AMC100 and AMC500 fits that we reported in Section 5; we did not rerun our AMC meta-learning algorithm to obtain these estimators.
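To illustrate the effect of this kind of group averaging, the sketch below symmetrizes a deliberately non-equivariant toy base estimator over feature permutations only (the estimators in the paper are also symmetrized over shifts and rescalings, which we omit here for simplicity); the symmetrized predictor is exactly invariant to permuting the features:

```python
import itertools, random

def base_predictor(X, y, x_new):
    # A toy, deliberately non-equivariant base estimator: 1-nearest neighbor
    # under a distance that weights coordinate j by (j + 1).
    def dist(u, v):
        return sum((j + 1) * (u[j] - v[j]) ** 2 for j in range(len(u)))
    best = min(range(len(X)), key=lambda i: dist(X[i], x_new))
    return y[best]

def symmetrized_predictor(X, y, x_new):
    # Average the base prediction over all feature permutations, applied
    # jointly to the training features and the test point.
    d = len(x_new)
    perms = list(itertools.permutations(range(d)))
    total = 0.0
    for p in perms:
        Xp = [[row[j] for j in p] for row in X]
        total += base_predictor(Xp, y, [x_new[j] for j in p])
    return total / len(perms)

rng = random.Random(5)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(20)]
y = [rng.gauss(0, 1) for _ in range(20)]
x_new = [0.2, -0.4, 0.9]

pred = symmetrized_predictor(X, y, x_new)
# Permuting the features of both the data and the test point leaves the
# symmetrized prediction unchanged, because it only reindexes the average.
p = (2, 0, 1)
pred_perm = symmetrized_predictor([[row[j] for j in p] for row in X],
                                  y, [x_new[j] for j in p])
```

Averaging over all d! permutations is only feasible for small d; in practice the average can be approximated by sampling permutations at random.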
Figure S6:

Performance of OLS, lasso, and AMC Linear at different sparsity levels. For each training-validation split of the data, between 1 and features are selected at random from the original dataset (x-axis), where is the minimum of 10 and the total number of features in the dataset, and Gaussian noise features are then added so that there are 10 total features. Therefore, the signal is expected to become denser and stronger as the x-axis value increases. AMC Linear consistently outperformed OLS and performed similarly to or better than lasso in most settings (54% of all sparsity-dataset pairs).
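The feature-subsampling protocol described in the caption above can be sketched as follows (dataset sizes, names, and the random seed are illustrative):

```python
import random

def make_sparsity_dataset(X_full, k, d_total, rng):
    # Sketch of the protocol: keep k randomly chosen original features, then
    # append standard Gaussian noise features until there are d_total in all.
    d_orig = len(X_full[0])
    keep = rng.sample(range(d_orig), k)
    out = []
    for row in X_full:
        new_row = [row[j] for j in keep]
        new_row += [rng.gauss(0.0, 1.0) for _ in range(d_total - k)]
        out.append(new_row)
    return out, keep

rng = random.Random(6)
X_full = [[rng.gauss(0, 1) for _ in range(7)] for _ in range(30)]  # e.g., 7 original features
X10, keep = make_sparsity_dataset(X_full, k=3, d_total=10, rng=rng)
```

As k grows toward the number of original features, fewer pure-noise columns remain, so the signal becomes denser and stronger, matching the x-axis of the figure.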
Table S6 reports the results for the linear regression example. In many settings, the two approaches performed similarly. However, in the sparse setting, the improvements that resulted from symmetrization sometimes cut the MSE in half. In one setting (dense, interior, ), AMC100 slightly outperformed symmetrized AMC100; though not deducible from the table, we note here that the difference in MSE in this case was less than 0.003, and it seems likely that this discrepancy is a result of Monte Carlo error. Table S7 reports the results for the fused lasso additive model example. Symmetrization led to a reduction in MSE in most settings. In all other settings, the MSE remained unchanged.
Figure S7:

Performance of the three stacking algorithms at different sparsity levels. For each training-validation split of the data, between 1 and features are selected at random from the original dataset (x-axis), where is the minimum of 10 and the total number of features in the dataset, and Gaussian noise features are then added so that there are 10 total features. Therefore, the signal is expected to become denser and stronger as the x-axis value increases. Though all algorithms performed similarly, the stacking algorithm that combined all available algorithms (Stacked Both) performed slightly better than the others in a majority of the settings (53% of all sparsity-dataset pairs), and Stacked AMC performed best in most other settings (39% of all sparsity-dataset pairs).
Table S6:
MSEs based on datasets of size in the linear regression settings. All Monte Carlo standard errors are less than 0.001. Symmetrized AMC100 entries appear in bold when they had lower MSE (rounded to the nearest hundredth) than the corresponding AMC100 entry, and vice versa. Similarly, symmetrized AMC500 entries appear in bold when they had lower MSE than the corresponding AMC500 entry, and vice versa.
| (a) Sparse signal | ||||||||
|---|---|---|---|---|---|---|---|---|
| Boundary | Interior | |||||||
| n=100 | 500 | 100 | 500 | |||||
| OLS | 0.12 | 0.02 | 0.12 | 0.02 | ||||
| Lasso | 0.06 | 0.01 | 0.06 | 0.01 | ||||
| AMC100 (ours) | 0.02 | <0.01 | 0.11 | 0.09 | ||||
| Symmetrized AMC100 (ours) | 0.02 | <0.01 | 0.06 | 0.04 | ||||
| AMC500 (ours) | 0.02 | <0.01 | 0.07 | 0.04 | ||||
| Symmetrized AMC500 (ours) | 0.02 | <0.01 | 0.06 | 0.03 | ||||
| (b) Dense signal | ||||||||
| Boundary | Interior | |||||||
| n=100 | 500 | 100 | 500 | |||||
| OLS | 0.13 | 0.02 | 0.13 | 0.02 | ||||
| Lasso | 0.11 | 0.02 | 0.09 | 0.02 | ||||
| AMC100 (ours) | 0.10 | 0.04 | 0.08 | 0.02 | ||||
| Symmetrized AMC100 (ours) | 0.09 | 0.03 | 0.09 | 0.02 | ||||
| AMC500 (ours) | 0.09 | 0.02 | 0.09 | 0.02 | ||||
| Symmetrized AMC500 (ours) | 0.09 | 0.02 | 0.09 | 0.02 | ||||
Table S7:
MSEs based on datasets of size in the FLAM settings. The Monte Carlo standard errors for the MSEs of FLAM and (symmetrized) AMC are all less than 0.04 and 0.01, respectively. Symmetrized AMC100 entries appear in bold when they had lower MSE (rounded to the nearest hundredth) than the corresponding AMC100 entry, and vice versa. Similarly, symmetrized AMC500 entries appear in bold when they had lower MSE than the corresponding AMC500 entry, and vice versa.
| (a) Sparse signal | ||||||||
|---|---|---|---|---|---|---|---|---|
| Scenario 1 | Scenario 2 | Scenario 3 | Scenario 4 | |||||
| n=100 | 500 | 100 | 500 | 100 | 500 | 100 | 500 | |
| FLAM | 0.44 | 0.12 | 0.47 | 0.17 | 0.38 | 0.11 | 0.51 | 0.19 |
| AMC100 (ours) | 0.34 | 0.20 | 0.18 | 0.08 | 0.27 | 0.14 | 0.17 | 0.08 |
| Symmetrized AMC100 (ours) | 0.32 | 0.18 | 0.18 | 0.08 | 0.26 | 0.13 | 0.16 | 0.08 |
| AMC500 (ours) | 0.48 | 0.12 | 0.19 | 0.06 | 0.35 | 0.10 | 0.23 | 0.08 |
| Symmetrized AMC500 (ours) | 0.43 | 0.12 | 0.17 | 0.05 | 0.32 | 0.09 | 0.21 | 0.07 |
| (b) Dense signal | ||||||||
| Scenario 1 | Scenario 2 | Scenario 3 | Scenario 4 | |||||
| n=100 | 500 | 100 | 500 | 100 | 500 | 100 | 500 | |
| FLAM | 0.59 | 0.17 | 0.65 | 0.24 | 0.53 | 0.16 | 0.76 | 0.36 |
| AMC100 (ours) | 1.20 | 0.91 | 0.47 | 0.39 | 0.87 | 0.57 | 0.30 | 0.30 |
| Symmetrized AMC100 (ours) | 1.16 | 0.84 | 0.45 | 0.37 | 0.83 | 0.52 | 0.29 | 0.30 |
| AMC500 (ours) | 0.58 | 0.15 | 0.37 | 0.08 | 0.46 | 0.12 | 0.36 | 0.09 |
| Symmetrized AMC500 (ours) | 0.55 | 0.15 | 0.36 | 0.08 | 0.43 | 0.11 | 0.34 | 0.09 |
References
- Arnold SM, Mahajan P, Datta D, Bunner I, and Zarkias KS. learn2learn: A library for meta-learning research. arXiv preprint arXiv:2008.12284, 2020.
- Berger JO. Statistical Decision Theory and Bayesian Analysis. Springer Science & Business Media, 1985.
- Bertinetto L, Henriques JF, Torr PH, and Vedaldi A. Meta-learning with differentiable closed-form solvers. arXiv preprint arXiv:1805.08136, 2018.
- Billingsley P. Convergence of Probability Measures. Wiley, 1999.
- Bosc T. Learning to learn neural networks. arXiv preprint arXiv:1610.06072, 2016.
- Breiman L. Stacked regressions. Machine Learning, 24(1):49–64, 1996.
- Breiman L. Random forests. Machine Learning, 45(1):5–32, 2001.
- Brooks TF, Pope DS, and Marcolini MA. Airfoil Self-Noise and Prediction, volume 1218. National Aeronautics and Space Administration, Office of Management..., 1989.
- Cassotti M, Ballabio D, Todeschini R, and Consonni V. A similarity-based QSAR model for predicting acute toxicity towards the fathead minnow (Pimephales promelas). SAR and QSAR in Environmental Research, 26(3):217–243, 2015.
- Chamberlain G. Econometric applications of maxmin expected utility. Journal of Applied Econometrics, 15(6):625–644, 2000.
- Chang K-C. Methods in Nonlinear Analysis. Springer Science & Business Media, 2006.
- Chen T and Guestrin C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM, 2016.
- Cohn DL. Measure Theory. Springer, 2013.
- Conway JB. A Course in Functional Analysis, volume 96. Springer, 2010.
- Cortez P, Teixeira J, Cerdeira A, Almeida F, Matos T, and Reis J. Using data mining for wine quality assessment. In International Conference on Discovery Science, pages 66–79. Springer, 2009.
- Cybenko G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
- Dalvi N, Domingos P, Sanghai S, and Verma D. Adversarial classification. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 99–108, 2004.
- Day MM. Fixed-point theorems for compact convex sets. Illinois Journal of Mathematics, 5(4):585–590, 1961.
- Deleu T, Würfl T, Samiei M, Cohen JP, and Bengio Y. Torchmeta: A meta-learning library for PyTorch. arXiv preprint arXiv:1909.06576, 2019. Available at: https://github.com/tristandeleu/pytorch-meta.
- Dua D and Graff C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
- Efron B and Morris C. Limiting the risk of Bayes and empirical Bayes estimators—part II: The empirical Bayes case. Journal of the American Statistical Association, 67(337):130–139, 1972.
- Fan K. Minimax theorems. Proceedings of the National Academy of Sciences of the United States of America, 39(1):42, 1953.
- Fiez T, Chasnov B, and Ratliff LJ. Convergence of learning dynamics in Stackelberg games. arXiv preprint arXiv:1906.01217, 2019.
- Finn C, Abbeel P, and Levine S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1126–1135. JMLR.org, 2017.
- Finn C, Xu K, and Levine S. Probabilistic model-agnostic meta-learning. arXiv preprint arXiv:1806.02817, 2018.
- Friedman J, Hastie T, Tibshirani R, et al. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, New York, 2001.
- Friedman JH. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.
- Garnelo M, Rosenbaum D, Maddison C, Ramalho T, Saxton D, Shanahan M, Teh YW, Rezende D, and Eslami SA. Conditional neural processes. In International Conference on Machine Learning, pages 1704–1713. PMLR, 2018.
- Geman S and Geman D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):721–741, 1984.
- Geng S, Nassif H, Manzanares CA, Reppen AM, and Sircar R. Deep PQR: Solving inverse reinforcement learning using anchor actions. arXiv e-prints, 2020.
- Gerritsma J, Onnink R, and Versluis A. Geometry, resistance and stability of the Delft systematic yacht hull series. International Shipbuilding Progress, 28(328):276–297, 1981.
- Ghosal S and Van der Vaart A. Fundamentals of Nonparametric Bayesian Inference, volume 44. Cambridge University Press, 2017.
- Gidel G, Berard H, Vignoud G, Vincent P, and Lacoste-Julien S. A variational inequality perspective on generative adversarial networks. arXiv preprint arXiv:1802.10551, 2018.
- Glynn PW. Likelihood ratio gradient estimation: an overview. In Proceedings of the 19th Conference on Winter Simulation, pages 366–375. ACM, 1987.
- Goldblum M, Fowl L, and Goldstein T. Adversarially robust few-shot learning: A meta-learning approach. arXiv preprint arXiv:1910.00982v2, 2019.
- Goodfellow IJ, Shlens J, and Szegedy C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
- Guo W and Zhao G. An improvement on the relatively compactness criteria. arXiv preprint arXiv:1904.03427, 2019.
- Hartford J, Graham DR, Leyton-Brown K, and Ravanbakhsh S. Deep models of interactions across sets. arXiv preprint arXiv:1803.02879, 2018.
- Hastings WK. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.
- Heusel M, Ramsauer H, Unterthiner T, Nessler B, and Hochreiter S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
- Higuchi A. Symmetric tensor spherical harmonics on the N-sphere and their application to the de Sitter group SO(N, 1). Journal of Mathematical Physics, 28(7):1553–1566, 1987.
- Hochreiter S and Schmidhuber J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- Hochreiter S, Younger AS, and Conwell PR. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, pages 87–94. Springer, 2001.
- Hornik K. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.
- Hospedales T, Antoniou A, Micaelli P, and Storkey A. Meta-learning in neural networks: A survey. arXiv preprint arXiv:2004.05439, 2020.
- Hunt G and Stein C. Most stringent tests of statistical hypotheses. Unpublished manuscript, 1946.
- James G, Witten D, Hastie T, and Tibshirani R. An Introduction to Statistical Learning, volume 112. Springer, 2013.
- James G, Witten D, Hastie T, and Tibshirani R. ISLR: Data for an Introduction to Statistical Learning with Applications in R, 2017. URL https://CRAN.R-project.org/package=ISLR. R package version 1.2.
- Jiang S. Conditional neural process PyTorch implementation, 2021. URL https://github.com/shalijiang/neural-process.
- Kempthorne PJ. Numerical specification of discrete least favorable prior distributions. SIAM Journal on Scientific and Statistical Computing, 8(2):171–184, 1987.
- Kingma DP and Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Korpelevich GM. The extragradient method for finding saddle points and other problems. Matecon, 12:747–756, 1976.
- Le Cam L. Asymptotic Methods in Statistical Decision Theory. Springer Science & Business Media, 2012.
- Lee K, Maji S, Ravichandran A, and Soatto S. Meta-learning with differentiable convex optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10657–10665, 2019.
- Lin T, Jin C, and Jordan MI. On gradient descent ascent for nonconvex-concave minimax problems. arXiv preprint arXiv:1906.00331v6, 2019.
- Luedtke A, Carone M, Simon NR, and Sofrygin O. Learning to learn from data: using deep adversarial learning to construct optimal statistical procedures. Science Advances, 2020.
- Maron H, Fetaya E, Segol N, and Lipman Y. On the universality of invariant networks. arXiv preprint arXiv:1901.09342, 2019.
- Munkres J. Topology. Prentice Hall, 2000. ISBN 9780131816299. URL https://books.google.com/books?id=XjoZAQAAIAAJ.
- Nabi S, Nassif H, Hong J, Mamani H, and Imbens G. Decoupling learning rates using empirical Bayes priors. arXiv preprint arXiv:2002.01129, 2020.
- Nash WJ, Sellers TL, Talbot SR, Cawthorn AJ, and Ford WB. The population biology of abalone (Haliotis species) in Tasmania. I. Blacklip abalone (H. rubra) from the north coast and islands of Bass Strait. Sea Fisheries Division, Technical Report, 48:411, 1994.
- Nelder JA and Wedderburn RW. Generalized linear models. Journal of the Royal Statistical Society: Series A (General), 135(3):370–384, 1972.
- Nelson W. Minimax solution of statistical decision problems by iteration. The Annals of Mathematical Statistics, pages 1643–1657, 1966.
- Noubiap RF and Seidel W. An algorithm for calculating γ-minimax decision rules under generalized moment conditions. The Annals of Statistics, 29(4):1094–1116, 2001.
- Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, and Antiga L. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035, 2019.
- Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, and Duchesnay E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
- Petersen A. flam: Fits Piecewise Constant Models with Data-Adaptive Knots, 2018. URL https://CRAN.R-project.org/package=flam. R package version 3.2.
- Petersen A, Witten D, and Simon N. Fused lasso additive model. Journal of Computational and Graphical Statistics, 25(4):1005–1025, 2016.
- Pier J-P. Amenable Locally Compact Groups. Wiley-Interscience, 1984.
- Polley EC and Van der Laan MJ. Super learner in prediction. Technical report, University of California, Berkeley, 2010.
- Ravanbakhsh S, Schneider J, and Poczos B. Deep learning with sets and point clouds. arXiv preprint arXiv:1611.04500, 2016.
- Ravanbakhsh S, Schneider J, and Poczos B. Equivariance through parameter-sharing. In Proceedings of the 34th International Conference on Machine Learning, pages 2892–2901. JMLR.org, 2017.
- Ravi S and Larochelle H. Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR), 2017.
- Robert C. The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. Springer Science & Business Media, 2007.
- Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
- Russell S. Learning agents for uncertain environments. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 101–103, 1998.
- Santoro A, Bartunov S, Botvinick M, Wierstra D, and Lillicrap T. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pages 1842–1850, 2016.
- Schafer CM and Stark PB. Constructing confidence regions of optimal expected size. Journal of the American Statistical Association, 104(487):1080–1089, 2009.
- Schmidhuber J. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-… hook. PhD thesis, Technische Universität München, 1987.
- Terkelsen F. Some minimax theorems. Mathematica Scandinavica, 31(2):405–413, 1973.
- Thrun S and Pratt L. Learning to learn: Introduction and overview. In Learning to Learn, pages 3–17. Springer, 1998.
- Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
- Van der Laan MJ, Polley EC, and Hubbard AE. Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1), 2007.
- Van der Vaart AW, Dudoit S, and van der Laan MJ. Oracle inequalities for multi-fold cross validation. Statistics and Decisions, 24(3):351–371, 2006.
- van Gaans O. Probability measures on metric spaces. Technical report, Delft University of Technology, 2003.
- Vilalta R and Drissi Y. A perspective view and survey of meta-learning. Artificial Intelligence Review, 18(2):77–95, 2002.
- Vinyals O, Blundell C, Lillicrap T, and Wierstra D. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.
- Vuorio R, Sun S-H, Hu H, and Lim JJ. Toward multimodal model-agnostic meta-learning. arXiv preprint arXiv:1812.07172, 2018.
- Wald A. Statistical decision functions which minimize the maximum risk. Annals of Mathematics, pages 265–280, 1945.
- Yin C, Tang J, Xu Z, and Wang Y. Adversarial meta-learning. arXiv preprint arXiv:1806.03316, 2018.
- Zaheer M, Kottur S, Ravanbakhsh S, Poczos B, Salakhutdinov RR, and Smola AJ. Deep sets. In Advances in Neural Information Processing Systems, pages 3391–3401, 2017.
