Abstract
Model-assisted estimators have attracted a great deal of attention over the last three decades. These estimators attempt to make efficient use of auxiliary information available at the estimation stage. A working model linking the survey variable to the auxiliary variables is specified and fitted on the sample data to obtain a set of predictions, which are then incorporated in the estimation procedure. A nice feature of model-assisted procedures is that they maintain important design properties, such as consistency and asymptotic unbiasedness, irrespective of whether or not the working model is correctly specified. In this article, we examine several model-assisted estimators from a design-based point of view and in a high-dimensional setting, including linear regression and penalized estimators. We conduct an extensive simulation study using data from the Irish Commission for Energy Regulation Smart Metering Project to assess the performance of several model-assisted estimators in terms of bias and efficiency in this high-dimensional data set.
Keywords: Design consistency, elastic net, Lasso, random forest, ridge regression, XGBoost
1. Introduction
Surveys conducted by national statistical offices (NSOs) aim at estimating finite population parameters, that is, parameters describing some aspect of the finite population under study. In this article, the interest lies in estimating the population total of a survey variable Y. Population totals can be estimated unbiasedly using the well-known Horvitz–Thompson estimator [23]. In the absence of nonsampling errors, the Horvitz–Thompson estimator is unbiased with respect to the customary design-based inferential approach, whereby the properties of estimators are evaluated with respect to the sampling design; e.g., see [36]. However, Horvitz–Thompson type estimators may exhibit a large variance in some situations. The efficiency of the Horvitz–Thompson estimator can be improved by incorporating auxiliary information, capitalizing on the relationship between the survey variable Y and a set of auxiliary variables. The resulting estimation procedures, referred to as model-assisted estimation procedures, use a working model as a vehicle for constructing point estimators. Model-assisted estimators remain design-consistent even if the working model is misspecified, which is a desirable feature. When the working model provides an adequate description of the relationship between Y and the auxiliary variables, model-assisted estimators are expected to be more efficient than the Horvitz–Thompson estimator.
The class of model-assisted estimators includes a wide variety of procedures, some of which have been extensively studied in the literature both theoretically and empirically. When the working model is the customary linear regression model, the resulting estimator is the well-known generalized regression estimator (GREG), e.g. Särndal [35], Särndal and Wright [37] and Särndal et al. [36]. Other works include model-assisted procedures based on generalized linear models [15,26], local polynomial regression [4], splines [3,16,17,27], neural nets [30], generalized additive models [31], nonparametric additive models [42], regression trees [29,41] and random forests [12].
Owing to recent advances in information technology, NSOs now have access to a variety of data sources, some of which may exhibit a large number of observations on a large number of variables. So far, the properties of model-assisted estimators have been established under the customary asymptotic framework in finite population sampling [24], for which both the population size N and the sample size n increase to infinity while the number of auxiliary variables p is assumed fixed. In other words, existing results require n to be large relative to p. This framework is generally not adequate in the context of high-dimensional data sets, as p may be of the same order as n, or even larger, i.e. p>n. A more appropriate asymptotic framework would let p increase to infinity in addition to N and n. Cardot et al. [8] studied dimension reduction through principal component analysis and established the design consistency of the resulting calibration estimator. More recently, Ta et al. [38] investigated the properties of the GREG estimator from a model point of view when p is allowed to diverge, and [10] studied the asymptotic variance of the calibration estimator when the number p of calibration variables goes to infinity.
The aim of this paper is to give a general consistency result for a class of model-assisted estimators when the number p of auxiliary variables is allowed to grow to infinity. This class includes the GREG estimator as well as model-assisted estimators based on penalization methods such as ridge, lasso and elastic net. The latter methods were proposed to cope with multicollinearity between predictors in a high-dimensional setting. Under mild regularity assumptions, we show that these model-assisted estimators are design consistent provided that a suitable function of p and n goes to zero. As we argue in Section 3, this rate can be improved if one is willing to make additional assumptions about the rate of convergence of the estimated regression coefficient. In particular, we lay out a set of additional conditions under which the model-assisted ridge estimator is consistent if p/n goes to zero and, moreover, attains the usual $\sqrt{n}$-consistency rate under a further condition on the growth of p relative to n. Also, provided that the predictors are orthogonal, we show that both the model-assisted lasso and elastic net estimators are consistent provided that p/n goes to zero.
To the best of our knowledge, an empirical comparison of penalized or nonparametric model-assisted estimators in terms of bias and efficiency in a high-dimensional setting is currently lacking. We aim to fill this gap in this article. To assess the performance of several model-assisted estimators in a high-dimensional setting, we conduct a large simulation study using data from the Irish Commission for Energy Regulation Smart Metering Project. The data set consists of electricity consumption recorded every half-hour over a 2-year period for more than 6000 households and businesses, leading to highly correlated data. Because of this high-dimensional structure, model-assisted estimators based on a linear model tend to break down, and penalized and dimension-reduction-based estimators may provide good alternatives.
The paper is organized as follows. In Section 2, we introduce the theoretical setup. In Section 3, we investigate the asymptotic properties of several model-assisted estimators: the GREG estimator as well as estimators based on ridge regression, lasso and elastic net. Section 4 contains an empirical comparison to assess the performance of several model-assisted estimators in terms of bias and efficiency. In our experiments, we included model-assisted estimators based on ridge regression, lasso and elastic net, principal component regression, as well as model-assisted estimators based on CART, random forests, XGBoost and Cubist. We considered three sampling designs: simple random sampling without replacement, stratified simple random sampling without replacement and stratified fixed-size probability-proportional-to-size sampling without replacement. We make some final remarks in Section 5. The technical details, including the proofs of some results, are relegated to the Supplementary Material.
2. The setup
Consider a finite population U of size N. We are interested in estimating $t_y = \sum_{i \in U} y_i$, the population total of the survey variable Y. We select a sample S from U according to a sampling design $p(\cdot)$ with first-order and second-order inclusion probabilities $\pi_i$ and $\pi_{ij}$, respectively. In the absence of nonsampling errors, the Horvitz–Thompson estimator
$$\hat{t}_{\pi} = \sum_{i \in S} \frac{y_i}{\pi_i} \qquad (1)$$
is design-unbiased for $t_y$ provided that $\pi_i > 0$ for all $i \in U$; that is, $\mathbb{E}_p(\hat{t}_{\pi}) = t_y$, where $\mathbb{E}_p(\cdot)$ denotes the expectation operator with respect to the sampling design $p(\cdot)$. In the sequel, unless stated otherwise, the properties of estimators are evaluated with respect to the design-based approach. Under mild conditions [4,34], it can be shown that the Horvitz–Thompson estimator is design consistent for $t_y$.
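For concreteness, the following minimal R sketch computes the Horvitz–Thompson estimator (1) under simple random sampling without replacement; the synthetic population and all object names are illustrative and are not taken from the CER data analysed later.

```r
# Minimal illustration of the Horvitz-Thompson estimator (1) under SRSWOR;
# all names and values are illustrative.
set.seed(1)
N <- 6291; n <- 600
y_U  <- rgamma(N, shape = 2, scale = 10)  # synthetic survey variable for the population
s    <- sample(N, n)                      # SRSWOR sample of unit indices
pi_s <- rep(n / N, n)                     # first-order inclusion probabilities
t_ht <- sum(y_U[s] / pi_s)                # Horvitz-Thompson estimate of t_y
c(true_total = sum(y_U), ht_estimate = t_ht)
```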
At the estimation stage, we assume that the values of p auxiliary variables are recorded for all $i \in U$. Moreover, we assume that the corresponding population totals are available from an external source (e.g. a census or an administrative file). Let $\mathbf{x}_i = (x_{i1}, \ldots, x_{ip})^\top$ be the p-vector of auxiliary values associated with unit i. Also, we denote by $\mathbf{X}_U$ the $N \times p$ design matrix with rows $\mathbf{x}_i^\top$, $i \in U$, and by $\mathbf{X}_S$ its sample counterpart.
Model-assisted estimation starts with postulating the following working model:
$$y_i = m(\mathbf{x}_i) + \varepsilon_i, \quad i \in U, \qquad (2)$$
where $m(\cdot)$ is an unknown function and the errors $\varepsilon_i$ are independent random variables such that $\mathbb{E}_m(\varepsilon_i) = 0$ and $\mathbb{V}_m(\varepsilon_i) = \sigma^2$, where $\sigma^2$ is an unknown parameter. Although we assume a homoscedastic variance structure, our results can be easily extended to the case of unequal variances of the form $\mathbb{V}_m(\varepsilon_i) = \sigma^2 \nu(\mathbf{x}_i)$ for some known function $\nu(\cdot)$.
The unknown function $m(\cdot)$ is estimated by $\widehat{m}(\cdot)$ from the sample data $\{(y_i, \mathbf{x}_i)\colon i \in S\}$. The fitted model is then used to construct the model-assisted estimator
$$\hat{t}_{ma} = \sum_{i \in U} \widehat{m}(\mathbf{x}_i) + \sum_{i \in S} \frac{y_i - \widehat{m}(\mathbf{x}_i)}{\pi_i}, \qquad (3)$$
where $\widehat{m}(\mathbf{x}_i)$ denotes the prediction at $\mathbf{x}_i$ under the working model (2). Whenever the predictor $\widehat{m}(\cdot)$ is sample dependent, the estimator $\hat{t}_{ma}$ is design biased, but it can be shown to be asymptotically design unbiased and design consistent for a wide class of working models as the population size N and the sample size n increase.
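The estimator (3) can be computed for any fitted working model. The hedged R sketch below (continuing the previous one, with a single synthetic auxiliary variable and a design-weighted linear fit) illustrates the two terms of (3); the data and names are ours, not the authors'.

```r
# Generic model-assisted estimator (3): population total of the predictions
# plus a Horvitz-Thompson correction based on the sample residuals.
# Assumes y_U, s, pi_s from the previous sketch.
x_U  <- y_U + rnorm(N, sd = 5)                       # auxiliary variable known for all of U
fit  <- lm(y ~ x, data = data.frame(y = y_U[s], x = x_U[s]),
           weights = 1 / pi_s)                       # design-weighted working model
m_U  <- predict(fit, newdata = data.frame(x = x_U))  # predictions for every population unit
t_ma <- sum(m_U) + sum((y_U[s] - m_U[s]) / pi_s)     # model-assisted estimate of t_y
t_ma
```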
3. Least squares and penalized model-assisted estimators
3.1. The GREG estimator
Suppose that the regression function $m(\cdot)$ is approximated by a linear combination of the auxiliary variables. The working model (2) then reduces to
$$y_i = \mathbf{x}_i^\top \boldsymbol{\beta} + \varepsilon_i, \quad i \in U, \qquad (4)$$
where $\boldsymbol{\beta}$ is a p-vector of unknown coefficients. Under a hypothetical census, where $y_i$ and $\mathbf{x}_i$ would be observed for all $i \in U$, the vector $\boldsymbol{\beta}$ would be estimated by $\boldsymbol{\beta}_U$ obtained through the ordinary least squares criterion at the population level:
$$\boldsymbol{\beta}_U = \underset{\boldsymbol{\beta} \in \mathbb{R}^p}{\arg\min}\; (\mathbf{y}_U - \mathbf{X}_U \boldsymbol{\beta})^\top (\mathbf{y}_U - \mathbf{X}_U \boldsymbol{\beta}), \qquad (5)$$
where $\mathbf{y}_U = (y_1, \ldots, y_N)^\top$. Provided that the matrix $\mathbf{X}_U^\top \mathbf{X}_U$ is of full rank, the solution to (5) is unique and given by
$$\boldsymbol{\beta}_U = \left(\mathbf{X}_U^\top \mathbf{X}_U\right)^{-1} \mathbf{X}_U^\top \mathbf{y}_U = \left(\sum_{i \in U} \mathbf{x}_i \mathbf{x}_i^\top\right)^{-1} \sum_{i \in U} \mathbf{x}_i y_i. \qquad (6)$$
In practice, the vector $\boldsymbol{\beta}_U$ in (6) cannot be computed, as the y-values are recorded for the sample units only. An estimator of $\boldsymbol{\beta}_U$, denoted by $\widehat{\boldsymbol{\beta}}$, is obtained from (6) by estimating each population total separately using the corresponding Horvitz–Thompson estimator. Alternatively, the estimator $\widehat{\boldsymbol{\beta}}$ can be obtained from the following weighted least squares criterion at the sample level:
$$\widehat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta} \in \mathbb{R}^p}{\arg\min}\; (\mathbf{y}_S - \mathbf{X}_S \boldsymbol{\beta})^\top \boldsymbol{\Pi}_S^{-1} (\mathbf{y}_S - \mathbf{X}_S \boldsymbol{\beta}), \qquad (7)$$
where $\mathbf{y}_S = (y_i)_{i \in S}$ and $\boldsymbol{\Pi}_S = \mathrm{diag}(\pi_i;\, i \in S)$. Again, the solution to (7) is unique provided that $\mathbf{X}_S^\top \boldsymbol{\Pi}_S^{-1} \mathbf{X}_S$ is of full rank, and it is given by
$$\widehat{\boldsymbol{\beta}} = \left(\mathbf{X}_S^\top \boldsymbol{\Pi}_S^{-1} \mathbf{X}_S\right)^{-1} \mathbf{X}_S^\top \boldsymbol{\Pi}_S^{-1} \mathbf{y}_S = \left(\sum_{i \in S} \frac{\mathbf{x}_i \mathbf{x}_i^\top}{\pi_i}\right)^{-1} \sum_{i \in S} \frac{\mathbf{x}_i y_i}{\pi_i}. \qquad (8)$$
The prediction at $\mathbf{x}_i$ under the working model (4) is $\widehat{m}(\mathbf{x}_i) = \mathbf{x}_i^\top \widehat{\boldsymbol{\beta}}$. Plugging these predictions into (3) leads to the well-known GREG estimator [36]:
$$\hat{t}_{greg} = \sum_{i \in U} \mathbf{x}_i^\top \widehat{\boldsymbol{\beta}} + \sum_{i \in S} \frac{y_i - \mathbf{x}_i^\top \widehat{\boldsymbol{\beta}}}{\pi_i}. \qquad (9)$$
If the intercept is included in the working model, the GREG estimator reduces to the population total of the fitted values; that is, $\hat{t}_{greg} = \sum_{i \in U} \mathbf{x}_i^\top \widehat{\boldsymbol{\beta}}$. Also, the GREG estimator can be written as a weighted sum of the sample y-values:
$$\hat{t}_{greg} = \sum_{i \in S} w_i\, y_i, \qquad (10)$$
where
$$w_i = \frac{1}{\pi_i}\left\{1 + \left(\sum_{j \in U} \mathbf{x}_j - \sum_{j \in S} \frac{\mathbf{x}_j}{\pi_j}\right)^\top \left(\sum_{j \in S} \frac{\mathbf{x}_j \mathbf{x}_j^\top}{\pi_j}\right)^{-1} \mathbf{x}_i\right\}, \quad i \in S.$$
These weights can also be obtained as the solution of a calibration problem [14]. More specifically, the weights $w_i$ minimize the generalized chi-square distance $\sum_{i \in S} \pi_i\,(w_i - 1/\pi_i)^2$ subject to the calibration constraints $\sum_{i \in S} w_i \mathbf{x}_i = \sum_{i \in U} \mathbf{x}_i$. This attractive feature may not be shared by model-assisted estimators derived under more general working models.
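As a hedged illustration of the weight representation (10) and of the calibration property, the following R sketch builds the GREG weights directly and checks that they reproduce the population x-totals; it reuses the synthetic objects of the previous sketches and is not the authors' code.

```r
# GREG weights of the form (10) and their calibration property.
# Assumes y_U, x_U, s, pi_s from the earlier sketches.
X_U <- cbind(1, x_U)                                  # auxiliary matrix with an intercept column
X_s <- X_U[s, , drop = FALSE]
d_s <- 1 / pi_s                                       # basic design weights
A   <- crossprod(X_s * d_s, X_s)                      # sum_{i in S} x_i x_i^T / pi_i
gap <- colSums(X_U) - colSums(X_s * d_s)              # t_x minus its Horvitz-Thompson estimate
w   <- d_s * (1 + as.vector(X_s %*% solve(A, gap)))   # GREG / calibration weights
all.equal(colSums(w * X_s), colSums(X_U))             # weights reproduce the x-totals
sum(w * y_U[s])                                       # GREG estimate of t_y
```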
3.2. Penalized least squares estimators
While model-assisted estimators based on linear regression working models are easy to implement, they tend to break down when the number of auxiliary variables p grows large. Also, when some of the predictors are strongly related to each other, a problem known as multicollinearity, the ordinary least squares estimator given by (6) may be highly unstable. As noted by Hoerl and Kennard [22], 'the worse the conditioning of $\mathbf{X}^\top\mathbf{X}$, the more $\widehat{\boldsymbol{\beta}}$ can be expected to be too long and the distance from $\widehat{\boldsymbol{\beta}}$ to $\boldsymbol{\beta}$ will tend to be large'. In survey sampling, the effect of multicollinearity on the stability of point estimators was first studied by Bardsley and Chambers [1] under the model-based approach. Chambers [9] and Rao and Singh [33] studied this problem in the context of calibration. These authors noted that the use of a large number of calibration constraints may lead to highly dispersed calibration weights, potentially resulting in unstable estimators.
In a classical i.i.d. linear regression setting, penalization procedures such as ridge, lasso or elastic net can be used to help circumvent some of the difficulties associated with the usual least squares estimator. Let $\widetilde{\boldsymbol{\beta}}$ be an estimator of $\boldsymbol{\beta}$ obtained through the following penalized least squares criterion at the population level:
$$\widetilde{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta} \in \mathbb{R}^p}{\arg\min}\; \left\{\sum_{i \in U} \left(y_i - \mathbf{x}_i^\top \boldsymbol{\beta}\right)^2 + \lambda \sum_{k=1}^{t} c_k\, \|\boldsymbol{\beta}\|_{(k)}^{q_k}\right\}, \qquad (11)$$
where $c_k$ and $q_k$, $k = 1, \ldots, t$, are positive real numbers, $\|\cdot\|_{(k)}$ is a given norm and t is a fixed positive integer representing the number of different norm constraints. The values of the $q_k$ and t are typically predetermined. The tuning parameter $\lambda$ controls the strength of the penalty that one wants to impose on the norm of $\boldsymbol{\beta}$. Most often, the value of $\lambda$ is selected through a cross-validation procedure. The coefficients $c_k$ and $q_k$ are specific to the penalization method; hence, they affect the properties of the resulting estimator $\widetilde{\boldsymbol{\beta}}$. Three special cases are considered below.
When $t = 1$, $c_1 = 1$, $q_1 = 2$ and $\|\cdot\|_{(1)}$ is the Euclidean norm $\|\cdot\|_2$, the estimator $\widetilde{\boldsymbol{\beta}}$ is known as the ridge regression estimator [21]:
$$\widetilde{\boldsymbol{\beta}}_{ridge} = \underset{\boldsymbol{\beta} \in \mathbb{R}^p}{\arg\min}\; \left\{\sum_{i \in U} \left(y_i - \mathbf{x}_i^\top \boldsymbol{\beta}\right)^2 + \lambda\, \|\boldsymbol{\beta}\|_2^2\right\},$$
where $\|\boldsymbol{\beta}\|_2 = \left(\sum_{j=1}^{p} \beta_j^2\right)^{1/2}$ is the usual Euclidean norm of $\boldsymbol{\beta}$. The solution is given explicitly by
$$\widetilde{\boldsymbol{\beta}}_{ridge} = \left(\mathbf{X}_U^\top \mathbf{X}_U + \lambda\, \mathbf{I}_p\right)^{-1} \mathbf{X}_U^\top \mathbf{y}_U, \qquad (12)$$
where $\mathbf{I}_p$ denotes the $p \times p$ identity matrix.
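To make the effect of the penalty in (12) concrete, here is a small hedged R illustration on synthetic data (our own example, not from the paper): the ridge solution is computed directly from its closed form and compared with ordinary least squares in the presence of two nearly collinear columns.

```r
# Closed-form ridge solution (X^T X + lambda I)^{-1} X^T y on a synthetic example.
set.seed(2)
p <- 5; n_obs <- 200
X <- matrix(rnorm(n_obs * p), n_obs, p)
X[, 2] <- X[, 1] + rnorm(n_obs, sd = 0.01)     # two nearly collinear columns
beta_true <- c(1, 1, 0.5, 0, 0)
y <- drop(X %*% beta_true) + rnorm(n_obs)
lambda <- 10
beta_ols   <- drop(solve(crossprod(X), crossprod(X, y)))                 # unstable under collinearity
beta_ridge <- drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
cbind(ols = beta_ols, ridge = beta_ridge)      # ridge shrinks the unstable coefficients
```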
When $t = 1$, $c_1 = 1$, $q_1 = 1$ and $\|\cdot\|_{(1)}$ is the $\ell_1$-norm, the estimator $\widetilde{\boldsymbol{\beta}}$ is known as the lasso estimator [39]:
$$\widetilde{\boldsymbol{\beta}}_{lasso} = \underset{\boldsymbol{\beta} \in \mathbb{R}^p}{\arg\min}\; \left\{\sum_{i \in U} \left(y_i - \mathbf{x}_i^\top \boldsymbol{\beta}\right)^2 + \lambda\, \|\boldsymbol{\beta}\|_1\right\}, \qquad (13)$$
where $\|\boldsymbol{\beta}\|_1 = \sum_{j=1}^{p} |\beta_j|$ is the $\ell_1$-norm of $\boldsymbol{\beta}$. As with ridge, the lasso has the effect of shrinking the coefficients but, unlike ridge, it can set some coefficients exactly to zero. Except when the auxiliary variables are orthogonal, there is no closed-form formula for the lasso estimator [20]. In survey sampling, McConville et al. [28] investigated the design-based properties of the lasso model-assisted estimator for fixed p.
The elastic-net estimator, suggested by Zou and Hastie [43], combines two norms: the Euclidean norm $\|\cdot\|_2$ and the $\ell_1$-norm $\|\cdot\|_1$. If, in (11), we set $t = 2$, $\|\cdot\|_{(1)} = \|\cdot\|_1$, $\|\cdot\|_{(2)} = \|\cdot\|_2$, $(q_1, q_2) = (1, 2)$ and $(c_1, c_2) = (\alpha, 1 - \alpha)$, the resulting estimator is the elastic-net estimator, which can be viewed as a trade-off between the ridge estimator and the lasso estimator, achieving variable selection and regularization simultaneously:
$$\widetilde{\boldsymbol{\beta}}_{en} = \underset{\boldsymbol{\beta} \in \mathbb{R}^p}{\arg\min}\; \left\{\sum_{i \in U} \left(y_i - \mathbf{x}_i^\top \boldsymbol{\beta}\right)^2 + \lambda\left(\alpha\, \|\boldsymbol{\beta}\|_1 + (1 - \alpha)\, \|\boldsymbol{\beta}\|_2^2\right)\right\},$$
for $\lambda > 0$ and a mixing parameter $\alpha \in (0, 1)$ that is usually chosen over a grid of values. The penalized regression estimator $\widetilde{\boldsymbol{\beta}}$ in (11) is unknown, as the y-values are not observed for the non-sample units. To overcome this issue, we use the following weighted penalized least squares criterion at the sample level:
$$\widehat{\boldsymbol{\beta}}_{pen} = \underset{\boldsymbol{\beta} \in \mathbb{R}^p}{\arg\min}\; \left\{\sum_{i \in S} \frac{\left(y_i - \mathbf{x}_i^\top \boldsymbol{\beta}\right)^2}{\pi_i} + \lambda \sum_{k=1}^{t} c_k\, \|\boldsymbol{\beta}\|_{(k)}^{q_k}\right\}. \qquad (14)$$
A model-assisted estimator based on a penalized regression procedure is obtained from (3) by replacing $\widehat{m}(\mathbf{x}_i)$ with $\mathbf{x}_i^\top \widehat{\boldsymbol{\beta}}_{pen}$, leading to
$$\hat{t}_{pen} = \sum_{i \in U} \mathbf{x}_i^\top \widehat{\boldsymbol{\beta}}_{pen} + \sum_{i \in S} \frac{y_i - \mathbf{x}_i^\top \widehat{\boldsymbol{\beta}}_{pen}}{\pi_i}, \qquad (15)$$
where $\widehat{\boldsymbol{\beta}}_{pen}$ is a generic notation for the estimated regression coefficient obtained through either ridge, lasso or elastic net. Unlike the GREG estimator, the penalized model-assisted estimator is sensitive to a change of units of the X-variables because $\widehat{\boldsymbol{\beta}}_{pen}$ is sensitive to such changes. This is why, as in the classical regression setting, standardization of the X-variables is recommended before computing $\widehat{\boldsymbol{\beta}}_{pen}$. If the intercept is included in the model, it is usually left unpenalized.
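As a hedged sketch (not the authors' code) of how the penalized model-assisted estimator (15) can be computed in practice, the following R fragment uses cv.glmnet with the basic design weights 1/π passed through its weights argument, so that the fitted criterion mimics (14); alpha = 0 gives ridge, alpha = 1 the lasso, and intermediate values the elastic net. The synthetic data and all object names are illustrative.

```r
# Penalized model-assisted estimator (15) with glmnet; illustrative data.
library(glmnet)
set.seed(3)
N <- 6291; n <- 600; p <- 50
X_pop <- scale(matrix(rnorm(N * p), N, p))            # standardized auxiliary matrix
colnames(X_pop) <- paste0("x", seq_len(p))
y_pop <- drop(X_pop[, 1:5] %*% rep(2, 5)) + rnorm(N)  # synthetic survey variable
s     <- sample(N, n)                                 # SRSWOR sample
pi_s  <- rep(n / N, n)
cvfit <- cv.glmnet(X_pop[s, ], y_pop[s], weights = 1 / pi_s, alpha = 0)  # ridge; lambda by CV
m_hat <- drop(predict(cvfit, newx = X_pop, s = "lambda.min"))            # predictions on all of U
t_pen <- sum(m_hat) + sum((y_pop[s] - m_hat[s]) / pi_s)                  # penalized MA estimate
c(true = sum(y_pop), estimate = t_pen)
```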
Remark 3.1
In the case of ridge regression, the estimator $\widehat{\boldsymbol{\beta}}_{ridge}$ is given by
$$\widehat{\boldsymbol{\beta}}_{ridge} = \left(\mathbf{X}_S^\top \boldsymbol{\Pi}_S^{-1} \mathbf{X}_S + \lambda\, \mathbf{I}_p\right)^{-1} \mathbf{X}_S^\top \boldsymbol{\Pi}_S^{-1} \mathbf{y}_S. \qquad (16)$$
Using (16) in (15) leads to the ridge model-assisted estimator, which can be expressed as a weighted sum of sampled y-values with weights given by
$$w_i^{ridge} = \frac{1}{\pi_i}\left\{1 + \left(\sum_{j \in U} \mathbf{x}_j - \sum_{j \in S} \frac{\mathbf{x}_j}{\pi_j}\right)^\top \left(\sum_{j \in S} \frac{\mathbf{x}_j \mathbf{x}_j^\top}{\pi_j} + \lambda\, \mathbf{I}_p\right)^{-1} \mathbf{x}_i\right\}, \quad i \in S.$$
These weights can also be obtained through a penalized calibration problem: it can be shown that they minimize a penalized generalized chi-square distance [2,9]. If some X-variables are left unpenalized in (11), the resulting weights ensure consistency between the survey estimates and the corresponding population totals associated with these variables.
We end this section by noting that the penalized model-assisted estimator is sensitive to the choice of the penalty parameter $\lambda$. In the case of ridge regression, Bardsley and Chambers [1] suggested the ridge trace method for selecting the penalty parameter λ. This method consists of plotting the weights for values of λ from a predetermined grid and choosing a value of λ for which the weights are positive for all sample units. Using the fact that a suitably modified penalty lies between 0 and 1 and is an increasing function of λ, Beaumont and Bocci [2] proposed a method based on the bisection algorithm to first determine this modified penalty and then λ. Guggemos and Tillé [19] implemented a Fisher scoring algorithm in order to find the value of λ which maximizes a design-based estimated log-likelihood criterion. In the case of the lasso model-assisted estimator, McConville et al. [28] used a cross-validation procedure to choose the best value of λ. More research is needed to suggest a unified criterion for finding the best penalty in a sample-based framework; this is beyond the scope of this article. Most software implementations use a cross-validation criterion to choose the penalty parameter.
3.3. Consistency of the GREG and penalized GREG estimators in a high-dimensional setting
We adopt the asymptotic framework of [24] and consider an increasing sequence of embedded finite populations $\{U_v\}$ of size $N_v$. In each finite population $U_v$, a sample $S_v$ of size $n_v$ is selected according to a sampling design $p_v(\cdot)$ with first-order inclusion probabilities $\pi_{i,v}$ and second-order inclusion probabilities $\pi_{ij,v}$. While the finite populations are considered to be embedded, we do not require this property to hold for the samples $S_v$. This asymptotic framework assumes that v goes to infinity, so that the finite population sizes $N_v$, the sample sizes $n_v$ and the numbers of auxiliary variables $p_v$ all go to infinity. To improve readability, we shall use the subscript v only where it is needed; for instance, quantities such as $S_v$ shall be simply denoted by S.
The following assumptions are required to establish the consistency of the GREG and penalized GREG estimators in a high-dimensional setting.
(H1) We assume that there exists a positive constant such that .
(H2) We assume that .
(H3) There exists a positive constant c such that ; also, we assume that .
(H4) We assume that there exists a positive constant such that, for all , , where denotes the usual Euclidean norm.
(H5) We assume that where is the least square estimator given in (8) and denotes the norm.
The assumptions (H1), (H2) and (H3) were used by Breidt and Opsomer [4] in a nonparametric setting, and similar assumptions were used by Robinson and Särndal [34] to establish the consistency of the GREG estimator in a fixed-dimensional setting. These assumptions hold for many usual sampling designs such as simple random sampling without replacement, stratified designs [4] and high-entropy sampling designs. Assumptions (H4) and (H5) can be viewed, respectively, as extensions of Assumption A.1 and Assumption A.3 in [34] to p-dimensional vectors with p growing to infinity. Assumption (H5) is not very restrictive in this high-dimensional setting, as it only requires that the components of the regression coefficient vector are all bounded. When p is fixed, our assumptions essentially reduce to those of [34].
Result 3.1
Assume (H1)–(H5). Consider a sequence of GREG estimators of . Then,
If the numbers of auxiliary variables and the sample sizes satisfy , then
The $\sqrt{n}$-consistency obtained by Robinson and Särndal [34] is a special case of Result 3.1 with p fixed. Result 3.1 highlights the fact that the rate of convergence decreases as the number of auxiliary variables increases. Yet, this result guarantees the existence of a consistent GREG estimator even when the number of auxiliary variables is allowed to diverge. An improved consistency rate may be obtained if, in (H5), the usual Euclidean norm is used instead of the $\ell_1$-norm. Establishing the rate of convergence of the sampling error of $\widehat{\boldsymbol{\beta}}$ may also be used to obtain a sharper consistency rate; e.g. see [10].
The next result establishes the design-consistency of model-assisted penalized regression estimators. The proof is similar to that of Result 3.1 and is given in the Supplementary Material.
Result 3.2
Assume (H1)–(H4). Consider a sequence of penalized model-assisted estimators of obtained by either ridge, lasso or elastic-net. Then,
If the numbers of auxiliary variables and the sample sizes satisfy , then
The above result makes no use of the asymptotic convergence rate of which depends on the penalization method. For example, if one can establish that , then Alternatively, improved consistency rates of may be obtained if one can establish the magnitude of the sampling error in a high-dimension setting. In other words, obtaining these improved rates requires additional assumptions, unlike Result 3.2 which is obtained under relatively mild assumptions.
Next, we show that, under additional assumptions on the auxiliary variables, the model-assisted ridge estimator is design-consistent for $t_y$ if p/n goes to zero, and that it attains the usual $\sqrt{n}$-consistency rate under a further condition on the growth of p relative to n, which constitutes a significant improvement over Result 3.2.
Result 3.3
Assume (H1)–(H4). Also, assume that there exists a positive constant such that where is the largest eigenvalue of Assume also that
Then, there exists a positive constant C such that andIf the numbers of auxiliary variables and the sample sizes satisfy , then
Thus, if then
We have the following asymptotic equivalence:where
and
If with then
and
It follows from Result 3.3 that, for with the asymptotic variance of the model-assisted ridge estimator is equal to the variance of the generalized difference estimator. For we note that the model-assisted estimator is still -design consistent, but the remainder term is no longer negligible with respect to and the variability of this term should be considered when computing the asymptotic variance of the estimator. The case of model-assisted estimators based on lasso and elastic net is more intricate. This is due to the fact that both estimators involve the $\ell_1$-norm; as a result, a closed-form expression for these estimators cannot be obtained. However, if the predictors are orthogonal, a closed-form expression exists for the lasso and elastic-net estimators and improved consistency rates can be obtained; see Proposition 3.1 below. The case of non-orthogonal predictors is more challenging and is beyond the scope of this article.
Proposition 3.1
Suppose assumptions (H1)–(H3) and that the sampling design and the X-variables are such that the columns of are orthogonal. Suppose also that there exist positive quantities and such that and Then, and where denotes either the lasso or the elastic-net estimator.
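For context, under orthonormal predictors the population-level lasso and elastic-net solutions reduce to coordinatewise soft-thresholding of the ordinary least squares coefficients. The display below is a standard closed form for the penalties written in Section 3.2 (with $\mathbf{X}_U^\top\mathbf{X}_U = \mathbf{I}_p$); it is given for illustration only and is not necessarily the exact expression used in the Supplementary Material.

$$
\widetilde{\beta}_{j}^{\,lasso} = \operatorname{sign}\bigl(\beta_{U,j}\bigr)\left(\bigl|\beta_{U,j}\bigr| - \tfrac{\lambda}{2}\right)_{+},
\qquad
\widetilde{\beta}_{j}^{\,en} = \frac{\operatorname{sign}\bigl(\beta_{U,j}\bigr)\left(\bigl|\beta_{U,j}\bigr| - \tfrac{\lambda\alpha}{2}\right)_{+}}{1 + \lambda(1-\alpha)},
\qquad j = 1, \ldots, p,
$$

where $\beta_{U,j}$ denotes the jth component of the ordinary least squares solution (6).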
4. Simulation study
In this section, we provide an empirical comparison of several model-assisted estimators. In addition to the estimators discussed in Section 3, we considered model-assisted estimators based on principal component regression [8], regression trees [5], random forests [6], k-nearest neighbours, XGBoost [11] and Cubist [32]. For a description of these methods, see [13,20] and the references therein.
We used data from the Irish Commission for Energy Regulation (CER) Smart Metering Project that was conducted in 2009–2010 (CER, 2011)1 [8]. This project focused on energy consumption and energy regulation. About 6000 smart meters were installed to collect the electricity consumption of Irish residential and business customers every half an hour over a period of about 2 years.
We considered a period of 14 consecutive days and a population of N = 6291 smart meters (households and companies). Each day consisted of 48 measurements, leading to 672 measurements for each household. We denote by $X_j$ the electricity consumption (in kW) at the jth instant and by $x_{ij}$ the value of $X_j$ recorded by the ith smart meter, $j = 1, \ldots, 672$. It should be noted that the resulting design matrix was severely ill-conditioned, which suggests that some of the X-variables were highly correlated with each other.
We generated four survey variables based on these auxiliary variables according to the following models:
where denotes the empirical variance and the errors in the model for were generated from an and these errors were centred so as to obtain a mean equal to zero.
Our goal was to estimate the corresponding population totals. From the population, we selected R = 2500 samples of size n = 600, which corresponds to a sampling fraction n/N of about 10%. We considered three sampling schemes: simple random sampling without replacement, stratified simple random sampling without replacement with optimal allocation, and stratified inclusion probability proportional-to-size sampling without replacement with proportional allocation.
In each sample, we computed 12 model-assisted estimators of the form (3), where the predictions $\widehat{m}(\mathbf{x}_i)$, $i \in U$, were obtained using the following procedures (an illustrative sketch of two of them is given after the list):
- Procedure 1 ('LR'): Deterministic linear regression, leading to the GREG estimator.
- Procedure 2 ('CART'): Classification and regression tree algorithm [5], leading to an estimator closely related to that of [29] and implemented with the R-package rpart.
- Procedure 3 ('RF'): Random forests based on the algorithm of [6], with B = 1000 trees, a minimal number of elements in each terminal node and a number of variables selected randomly at each split defined through the customary floor function. The algorithm leads to the estimator described in [12]. Simulations were implemented with the R-package ranger.
- Procedure 4 ('Ridge'): Ridge regression with a regularization parameter determined by cross-validation and implemented with the R-package glmnet. The estimator was studied by [18].
- Procedure 5 ('Lasso'): Lasso regression with a regularization parameter determined by cross-validation and implemented with the R-package glmnet [28].
- Procedure 6 ('EN'): Elastic net regression with penalization coefficients determined by cross-validation with the R-package glmnet.
- Procedure 7 ('XGB'): XGBoost algorithm [20] with 50 trees in the additive model, each tree having a depth of at most 6, and a fixed learning rate. Simulations were implemented with the R-package XGBoost.
- Procedure 8 ('5NN'): 5-nearest-neighbours predictor with the Euclidean distance, implemented with the R-package caret.
- Procedure 9 ('Cubist'): A Cubist algorithm [25] with 5 models in each predictor, implemented with the R-package cubist; the algorithm and its adaptation to survey data are described in [13].
- Procedure 10 ('PCR1'): Principal component regression based on a first choice of the number of retained components and implemented with the R-package pls [8].
- Procedure 11 ('PCR2'): Principal component regression based on a second choice of the number of retained components.
- Procedure 12 ('PCR3'): Principal component regression based on a larger number of retained components.
As a measure of bias of a model-assisted estimator $\hat{t}_{ma}$, we computed the Monte Carlo percent relative bias, defined as
$$\mathrm{RB}(\hat{t}_{ma}) = \frac{1}{R}\sum_{r=1}^{R} \frac{\hat{t}_{ma}^{(r)} - t_y}{t_y} \times 100,$$
where $\hat{t}_{ma}^{(r)}$ denotes the estimator computed in the rth sample, $r = 1, \ldots, R$. As a measure of efficiency, we computed the relative efficiency, using the Horvitz–Thompson estimator given by (1) as the reference; that is,
$$\mathrm{RE}(\hat{t}_{ma}) = \frac{\mathrm{MSE}(\hat{t}_{ma})}{\mathrm{MSE}(\hat{t}_{\pi})} \times 100,$$
where $\mathrm{MSE}(\hat{t}_{ma}) = \frac{1}{R}\sum_{r=1}^{R} \bigl(\hat{t}_{ma}^{(r)} - t_y\bigr)^2$ and $\mathrm{MSE}(\hat{t}_{\pi})$ is defined similarly.
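A hedged sketch of how these Monte Carlo measures can be computed from replicated estimates (vectors of length R; all names are ours):

```r
# Percent relative bias and relative efficiency (MSE relative to Horvitz-Thompson, times 100).
rb  <- function(t_hat, t_y) 100 * mean((t_hat - t_y) / t_y)
mse <- function(t_hat, t_y) mean((t_hat - t_y)^2)
re  <- function(t_ma, t_ht, t_y) 100 * mse(t_ma, t_y) / mse(t_ht, t_y)
```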
We were also interested in investigating to what extent the model-assisted estimators were affected by the inclusion of a large number of predictors in the working models. To that end, we varied the number p of predictors included in the working models, all of which were available in the Irish data set. We used the following values for p: 5, 10, 20, 50, 100, 200, 300 and 400.
4.1. Simple random sampling without replacement
In this section, we present the results obtained under simple random sampling without replacement (SRSWOR) of size n = 600. All the point estimators exhibited a negligible or small percent RB with a maximum value of about (obtained in the case of the GREG estimator). For this reason, results pertaining to relative bias are not reported here.
Figures 1–4 display the relative efficiency of the model-assisted estimators as a function of the number of auxiliary variables incorporated in the working models. To improve readability, we have truncated some large values of RE, when applicable.
Figure 1.
Relative efficiency of model-assisted estimators for the estimation of the total of with SRSWOR (n = 600) and increasing number of auxiliary variables.
Figure 4.
Relative efficiency of model-assisted estimators for the estimation of the total of with SRSWOR, n = 600 and increasing number of auxiliary variables.
We begin by discussing the results on relative efficiency pertaining to the estimation of the total of the first survey variable. For low-dimensional settings, the GREG estimator was very efficient, with small values of RE. These results can be explained by the fact that this survey variable was linearly related to the X-variables. However, as the number of variables increased, the efficiency of the GREG estimator rapidly deteriorated, suggesting that the performance of the GREG estimator is sensitive to the dimension of the x-vector. As expected, model-assisted estimators based on regularization methods such as ridge, lasso and elastic net, or on dimension-reduction methods such as principal component regression, performed generally very well. Unlike the GREG, these estimators were not much affected by the number of auxiliary variables incorporated in the model. Turning to the model-assisted estimator based on 5-nn, we note that it was less efficient than most competitors and that its efficiency worsened as p increased, a phenomenon referred to as the curse of dimensionality. The model-assisted estimators based on XGBoost, Cubist and random forests performed quite well and did not seem to be affected by the number of auxiliary variables incorporated in the model. Finally, the estimators based on CART were less efficient than those obtained through the other machine learning methods.
The results pertaining to the survey variable and displayed in Figure 2 were fairly consistent with those obtained for the survey variable with one exception: the Cubist algorithm was significantly more efficient than the other procedures in all the scenarios.
Figure 2.
Relative efficiency of model-assisted estimators for the estimation of the total of with SRSWOR, n = 600 and increasing number of auxiliary variables.
Turning to the survey variable considered in Figure 3, the model-assisted estimator based on random forests was significantly more efficient than the Horvitz–Thompson estimator, especially for large values of p. The other procedures led to estimators less efficient than the Horvitz–Thompson estimator, with values of RE above 100. In particular, the GREG estimator broke down as the number of auxiliary variables increased. The performance of the model-assisted estimators based on the CART and XGBoost algorithms deteriorated as the dimension increased. In a high-dimensional setting with highly correlated predictors, random forests improved over CART due to the random subsampling of variables among the p variables, thereby generating decorrelated trees [20].
Figure 3.
Relative efficiency of model-assisted estimators for the estimation of the total of with SRSWOR, n = 600 and increasing number of auxiliary variables.
The results in Figure 4 about the survey variable were similar to the ones in previous figures. Most estimators remained mostly unaffected by the number of auxiliary variables . Again, the model-assisted estimator based on the Cubist algorithm was the best in all the scenarios.
4.2. Stratified simple random sampling with optimal allocation
In the second simulation study, we partitioned the Irish residential and business customer population into four strata, using an equal-quantile method with respect to the electricity consumption recorded at a given instant. From the population, we selected R = 2500 stratified simple random samples of size n = 600. The stratum sample sizes were determined using an optimal allocation with respect to the electricity consumption recorded at a given instant. The first-order inclusion probabilities and the sampling weights within strata are shown in Table 1.
Table 1.
First-order inclusion probabilities and sampling weights within strata.
| Stratum | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Inclusion probability | 0.012 | 0.022 | 0.028 | 0.316 |
| Sampling weight | 77.85 | 43.83 | 35.11 | 3.16 |
We confined to the survey variables and only and we aimed at estimating and . It is worth pointing out that the resulting sampling design was informative as the variables used at the design stage ( and ) were also related to the survey variables and . In fact, the Monte Carlo coefficient of correlation between the sampling weights and was approximately equal to We do not report the coefficient of correlation between the sampling weights and as the relationship between and the set of predictors is not linear.
Again, in each sample we computed the twelve model-assisted estimators for each of the two survey variables. Since most machine learning software packages do not take the sampling weights into account, we included the design variables in the set of predictors.
We begin by discussing the results pertaining to the estimation of the total of the first of these survey variables. Figures 5 and 6 display the Monte Carlo percent relative bias and the Monte Carlo relative efficiency as functions of the number of variables p. Except for the model-assisted estimators based on 5-nn and random forests, the estimators exhibited a small value of RB for all values of p. Again, the 5-nn model-assisted estimator suffered from the curse of dimensionality. Turning to the estimator based on random forests, we note from Figure 5 that the bias increased as the number of predictors increased; for the largest value of p considered, the value of RB was just above 10%. This significant bias may be explained by the fact that random forests is the only procedure among the ones considered in our simulation that randomly selects a subset of variables among the initial p predictors at each split. For instance, for p = 400, only 20 variables are randomly selected at each split. As a result, most predictions obtained through the random forests algorithm were based on misspecified working models, leading to potentially poor fits and large residuals. Also, each prediction corresponds to a weighted mean computed within a terminal node containing as few as five observations; therefore, each prediction corresponds to a ratio-type estimate based on five observations only. This, together with the fact that the sampling weights are highly variable, constitutes a conducive ground for the occurrence of small sample bias. In terms of efficiency, except for the GREG, the 5-nn and the random forest estimators, the procedures performed well, with values of RE ranging from 60% to 80%. The best procedures were Cubist and Lasso.
Figure 5.
Relative bias of model-assisted estimators for the estimation of the total of with stratified simple random sampling with -optimal allocation, n = 600 with increasing number of auxiliary variables.
Figure 6.
Relative efficiency of model-assisted estimators for the estimation of the total of with stratified simple random sampling with -optimal allocation, n = 600 and increasing number of auxiliary variables.
We now turn to the second of these survey variables. First, the Monte Carlo relative bias was negligible for all the estimation procedures and is not reported here. Results on relative efficiency are plotted in Figure 7. Random forests performed extremely well and their performance improved as p increased. This suggests that the method was able to extract the information contained in the predictors. This was also true for Cubist and XGBoost, although to a lesser extent.
Figure 7.
Relative efficiency of model-assisted estimators for the estimation of the total of with stratified simple random sampling with -optimal allocation, n = 600 and increasing number of auxiliary variables.
To get a better understanding of the performance of random forests for the estimation of this total, we conducted additional scenarios based on different values of the hyperparameters: the minimal number of observations within each terminal node and the number of variables randomly selected at each split among the initial p model variables. We used the following configurations:
observations and variables which are the default choices in the R-package ranger;
observations and variables;
observations and variables, with, in addition, the design variables , , as well as the vector of inclusion probabilities and the vector of strata that were selected with probability 1, at each split, besides the variables;
observations and variables.
The Monte Carlo percent relative bias is displayed in Figure 8. We note that the relative bias was much smaller when the design variables were considered at each split in addition to the randomly selected variables. To a lesser extent, the bias also decreased when more observations were allowed in each terminal node. These results suggest that, when the sampling design is informative, the design variables should be forced into the set of candidate splitting variables at each split in order to avoid a significant small sample bias. This option is available in the R package ranger.
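A hedged R sketch of this recommendation (illustrative variable names; not the exact configuration used in the paper): ranger's always.split.variables argument forces the listed variables to be among the candidate splitting variables at every split.

```r
# Random forest with the design variables forced into every split.
# Assumes a sample data frame dat_s containing the outcome y, the predictors,
# and the design variables 'pi' (inclusion probabilities) and 'stratum'.
library(ranger)
fit_rf <- ranger(
  y ~ ., data = dat_s,
  num.trees              = 1000,
  min.node.size          = 30,                  # larger terminal nodes to limit small-sample bias
  always.split.variables = c("pi", "stratum")   # design variables considered at every split
)
```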
Figure 8.
Comparison of different configurations of hyper-parameters for for the estimation of the total of with stratified simple random sampling and -optimal allocation, n = 600.
4.3. Stratified inclusion probability proportional-to-size sampling without replacement
We considered the stratified population described in Section 4.2. In each stratum, units were selected according to a fixed-size inclusion probability proportional-to-size sampling design without replacement, using the electricity consumption recorded at a given instant as the size variable. In each stratum, the sample size $n_h$ was determined according to proportional allocation; i.e. $n_h = n\,N_h/N$, where $N_h$ denotes the stratum size. The first-order inclusion probabilities were then given by $\pi_i = n_h\, x_i \big/ \sum_{j \in U_h} x_j$ for a unit i belonging to stratum $U_h$, where $x_i$ denotes the value of the size variable for unit i.
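A hedged R sketch of these within-stratum inclusion probabilities (illustrative names; rounding of the allocation and capping of probabilities above one are omitted for brevity):

```r
# pi_i = n_h * x_i / sum of the size variable over the unit's stratum,
# with proportional allocation n_h = n * N_h / N.
pps_pi <- function(size_x, stratum, n) {
  N_h <- table(stratum)
  n_h <- round(n * N_h / sum(N_h))                      # proportional allocation
  share <- ave(size_x, stratum, FUN = function(x) x / sum(x))
  share * as.numeric(n_h[as.character(stratum)])        # first-order inclusion probabilities
}
```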
As in Section 4.2, we focused on estimating and and we computed the same twelve model-assisted estimators . The inclusion probabilities were highly correlated with the survey variable , with a correlation coefficient of about 0.62; we do not report the coefficient of correlation in the case of as the underlying relationship was nonlinear. Based on findings from Section 4.2, we adopted the following configuration for the random forest algorithm: we considered observations in each terminal node and, at each split, we randomly selected variables. Note that the design variables and as well as the vector of inclusion probabilities and the vector of stratum indicators were selected with probability 1 at each split in addition to the variables.
All the estimators exhibited a negligible relative bias. Figures 9 and 10 show the relative efficiency for the two survey variables, respectively.
Figure 9.
Relative efficiency of model-assisted estimators for the estimation of the total of with stratified without replacement -proportional to size sampling, n = 600 and increasing number of auxiliary variables.
Figure 10.
Relative efficiency of model-assisted estimators for the estimation of the total of with stratified without replacement -proportional to size sampling, n = 600 and increasing number of auxiliary variables.
From Figure 9, we note that most estimators exhibited a behaviour similar to that obtained in the case of stratified simple random sampling with optimal allocation (see Section 4.2). However, the estimators PCR1 and PCR2 performed poorly, unlike in the case of stratified simple random sampling with optimal allocation. This poor behaviour may be due to the fact that the sampling design was now much more informative, and keeping only a few principal components may have led to a loss of information. The estimator PCR3, based on more principal components, did better than PCR1 and PCR2. From Figure 10, we note that the use of model-assisted estimators led to significant improvements over the Horvitz–Thompson estimator, with values of relative efficiency well below 100.
4.4. Stratified simple random sampling with proportional allocation
In this section, we consider a more realistic scenario based again on the Irish residential and business customer data. As a stratification variable, we used the mean electricity consumption recorded during the first week. Again, we constructed four strata using an equal-quantile method based, this time, on the mean electricity consumption; see also [7] who used a similar design. The mean trajectories during the first week within each stratum are plotted in Figure 11. From Figure 11, we note that Stratum 1 corresponds to consumers with low global levels of electricity consumption, whereas Stratum 4 consists of consumers who have high levels of electricity consumption.
Figure 11.
Average electricity consumption on each stratum during first week.
Our aim was to estimate the total electricity consumption recorded on the Monday of the second week, given by $t_y = \sum_{i \in U} \sum_{j \in J} x_{ij}$, where $x_{ij}$ is the electricity consumption recorded for the ith unit at the jth instant and J denotes the set of instants corresponding to that Monday. Within each stratum, we selected a sample of size $n_h$ according to simple random sampling without replacement. The $n_h$'s were determined according to proportional allocation; i.e. $n_h = n\,N_h/N$ with n = 600. In each of the 2500 samples, we computed the same 12 model-assisted estimators as in the previous sections. Again, we computed the Monte Carlo percent relative bias and the relative efficiency for each of the 12 estimators. The results are presented in Table 2.
Table 2.
Monte Carlo percent relative bias and relative efficiency of several model-assisted estimators under stratified simple random sampling with proportional allocation.
| Estimator | Relative bias (%) | Relative efficiency (%) |
|---|---|---|
| LR | 0.2 | 9.3 |
| CART | −0.1 | 41.0 |
| RF | −1.1 | 17.0 |
| Ridge | 0.1 | 4.0 |
| Lasso | 0.2 | 4.1 |
| EN | 0.2 | 4.1 |
| XGB | −1.7 | 24.9 |
| NN5 | −4.0 | 65.6 |
| Cubist | −0.0 | 4.3 |
| PCR1 | 0.1 | 4.9 |
| PCR2 | 0.1 | 4.2 |
| PCR3 | 0.1 | 4.2 |
From Table 2, we note that the 5-nn model-assisted estimator was the only estimator to exhibit a non-negligible bias. Although it was less efficient than its competitors, it was more efficient than the Horvitz–Thompson estimator, with a value of RE of about 66%. The ridge estimator was the most efficient, with a value of RE equal to 4.0%, and was closely followed by the lasso, elastic-net, Cubist and principal component model-assisted estimators. The GREG estimator performed very well, with a value of RE of about 9.3%. Random forests led to considerable improvement over the CART model-assisted estimator, with values of RE of 17% and 41%, respectively. Still, random forests were less efficient than the GREG estimator, which is not surprising as the relationship between the survey variable and the auxiliary variables was linear.
5. Final remarks
In this paper, we have examined a number of model-assisted estimation procedures in a high-dimensional setting, both theoretically and empirically. If the relationship between the survey variable and the auxiliary information can be well described by a linear model, our results suggest that penalized estimators such as ridge, lasso and elastic net perform very well in terms of bias and efficiency, even in the case p = n. Model-assisted estimators based on random forests, Cubist and XGBoost were mostly unaffected by the number of predictors incorporated in the working model, even in the case of complex relationships between the study variable and the auxiliary variables. As expected, the GREG estimator performed poorly when the number of auxiliary variables was large.
The Cubist procedure stood out from the other machine learning procedures, with very good performance in virtually all the scenarios. Further work is needed to establish the theoretical properties of model-assisted estimators based on Cubist in both low-dimensional and high-dimensional settings.
Variance estimation is an important stage of the estimation process. Further research includes identifying the regularity conditions under which the variance estimators are design consistent in a high-dimensional setting.
We end this article by mentioning that most machine learning software packages cannot handle design features such as unequal weights and stratification. For instance, some random forest algorithms may involve a bootstrapping procedure and/or a cross-validation procedure; to fully account for the sampling design, both procedures must be modified so as to reflect the design features. One notable exception is the R package rpms [40], which has the ability to incorporate sampling weights for CART and random forests. Not fully accounting for the sampling design may be viewed as a form of model misspecification. However, model-assisted estimation procedures remain design consistent even if the model is misspecified. In our experiments, several machine learning procedures (e.g. random forests, Cubist, XGBoost) performed very well in most scenarios, even though we did not modify the bootstrapping and cross-validation procedures to account for design features. In other words, including predictors that are highly predictive of the Y-variable seems to be the preponderant factor with respect to the efficiency of model-assisted estimators. We conjecture that fully accounting for the sampling design will likely lead to additional efficiency gains, but that the predictive power of the model likely constitutes the ‘determining factor’. Developing machine learning procedures that fully account for the sampling design is currently under investigation.
Supplementary Material
Acknowledgments
We thank the Editor, an Associate Editor and two referees for their comments and suggestions, which helped improve the paper substantially. We are very grateful to Professor Patrick Tardivel from the Université de Bourgogne for enlightening discussions about the lasso method.
Funding Statement
The work of Mehdi Dagdoug was supported by grants of the Franche-Comté region and Médiamétrie. The work of David Haziza was supported by a grant of the Natural Sciences and Engineering Research Council of Canada.
Note
The data are available on request at: https://www.ucd.ie/issda/data/commissionforenergyregulationcer/.
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
- 1. Bardsley P. and Chambers R., Multipurpose estimation from unbalanced samples, Appl. Stat. 33 (1984), pp. 290–299.
- 2. Beaumont J.-F. and Bocci C., Another look at ridge calibration, Metron-Int. J. Statist. 66 (2008), pp. 260–262.
- 3. Breidt F., Claeskens G., and Opsomer J., Model-assisted estimation for complex surveys using penalized splines, Biometrika 92 (2005), pp. 831–846.
- 4. Breidt F.-J. and Opsomer J.-D., Local polynomial regression estimators in survey sampling, Ann. Statist. 28 (2000), pp. 1023–1053.
- 5. Breiman L., Friedman J.H., Olshen R.A., and Stone C.J., Classification and Regression Trees, Routledge, New York, 1984.
- 6. Breiman L., Random forests, Mach. Learn. 45 (2001), pp. 5–32.
- 7. Cardot H., Dessertaine A., Goga C., Josserand E., and Lardin P., Comparison of different sample designs and construction of confidence bands to estimate the mean of functional data: an illustration on electricity consumption, Surv. Methodol. 39 (2013), pp. 283–301.
- 8. Cardot H., Goga C., and Shehzad M.-A., Calibration and partial calibration on principal components when the number of auxiliary variables is large, Statist. Sin. 27 (2017), pp. 243–260.
- 9. Chambers R., Robust case-weighting for multipurpose establishment surveys, J. Off. Stat. 12 (1996), pp. 3–32.
- 10. Chauvet G. and Goga C., Asymptotic efficiency of the calibration estimator in a high-dimensional data setting, J. Statist. Plann. Inference 217 (2021), pp. 177–187.
- 11. Chen T. and Guestrin C., XGBoost, in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining – KDD 16, ACM Press, New York, 2016.
- 12. Dagdoug M., Goga C., and Haziza D., Model-assisted estimation through random forests in finite population sampling, preprint (2022), to appear in J. Am. Stat. Assoc.
- 13. Dagdoug M., Goga C., and Haziza D., Imputation procedures in surveys using nonparametric and machine learning methods: an empirical comparison, to appear in J. Survey Statist. Methodol. (2021).
- 14. Deville J.-C. and Särndal C.-E., Calibration estimators in survey sampling, J. Am. Stat. Assoc. 87 (1992), pp. 376–382.
- 15. Firth D. and Bennett K., Robust models in probability sampling, J. R. Statist. Soc. Ser. B 60 (1998), pp. 3–21.
- 16. Goga C., Réduction de la variance dans les sondages en présence d'information auxiliaire: une approche non paramétrique par splines de régression, Canad. J. Statist. 33 (2005), pp. 163–180.
- 17. Goga C. and Ruiz-Gazen A., Efficient estimation of non-linear finite population parameters by using non-parametrics, J. R. Statist. Soc. Ser. B 76 (2014), pp. 113–140.
- 18. Goga C. and Shehzad M.A., Overview of ridge regression estimators in survey sampling, Université de Bourgogne, Dijon, France, 2010.
- 19. Guggemos F. and Tillé Y., Penalized calibration in survey sampling: design-based estimation assisted by mixed models, J. Statist. Plann. Inference 140 (2010), pp. 3199–3212.
- 20. Hastie T., Tibshirani R., and Friedman J., The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer, New York, 2011.
- 21. Hoerl A.E. and Kennard R.W., Ridge regression: biased estimation for nonorthogonal problems, Technometrics 12 (1970), pp. 55–67.
- 22. Hoerl A.E. and Kennard R.W., Ridge regression: biased estimation for nonorthogonal problems, Technometrics 42 (2000), pp. 80–86.
- 23. Horvitz D. and Thompson D., A generalization of sampling without replacement from a finite universe, J. Am. Stat. Assoc. 47 (1952), pp. 663–685.
- 24. Isaki C.-T. and Fuller W.-A., Survey design under the regression superpopulation model, J. Am. Stat. Assoc. 77 (1982), pp. 49–61.
- 25. Kuhn M. and Johnson K., Applied Predictive Modelling, Springer, New York, 2013.
- 26. Lehtonen R. and Veijanen A., Logistic generalized regression estimators, Surv. Methodol. 24 (1998), pp. 51–56.
- 27. McConville K. and Breidt F.J., Survey design asymptotics for the model-assisted penalised spline regression estimator, J. Nonparametr. Stat. 25 (2013), pp. 745–763.
- 28. McConville K.S., Breidt F.J., Lee T.C., and Moisen G.G., Model-assisted survey regression estimation with the lasso, J. Surv. Stat. Methodol. 5 (2017), pp. 131–158.
- 29. McConville K. and Toth D., Automated selection of post-strata using a model-assisted regression tree estimator, Scand. J. Statist. 46 (2019), pp. 389–413.
- 30. Montanari G.E. and Ranalli M.G., Nonparametric model calibration in survey sampling, J. Am. Stat. Assoc. 100 (2005), pp. 1429–1442.
- 31. Opsomer J.D., Breidt F.J., Moisen G., and Kauermann G., Model-assisted estimation of forest resources with generalized additive models, J. Am. Stat. Assoc. 102 (2007), pp. 400–409.
- 32. Quinlan J., Learning with continuous classes, in 5th Australian Joint Conference on Artificial Intelligence, Vol. 92, World Scientific, Singapore, 1992, pp. 343–348.
- 33. Rao J. and Singh A.C., A ridge-shrinkage method for range-restricted weight calibration in survey sampling, in Proceedings of the Section on Survey Research Methods, American Statistical Association, Alexandria, 1997.
- 34. Robinson P.M. and Särndal C.-E., Asymptotic properties of the generalized regression estimator in probability sampling, Sankhyā Ser. B 45 (1983), pp. 240–248.
- 35. Särndal C.-E., On π-inverse weighting versus best linear unbiased weighting in probability sampling, Biometrika 67 (1980), pp. 639–650.
- 36. Särndal C.-E., Swensson B., and Wretman J., Model Assisted Survey Sampling, Springer Series in Statistics, Springer-Verlag, New York, 1992.
- 37. Särndal C.-E. and Wright R., Cosmetic form of estimators in survey sampling, Scand. J. Statist. 11 (1984), pp. 146–156.
- 38. Ta T., Shao J., Li Q., and Wang L., Generalized regression estimators with high-dimensional covariates, Stat. Sin. 30 (2020), pp. 1135–1154.
- 39. Tibshirani R., Regression shrinkage and selection via the lasso, J. R. Statist. Soc. Ser. B 58 (1996), pp. 267–288.
- 40. Toth D., rpms: Recursive Partitioning for Modeling Survey Data, 2021. R package version 0.5.1.
- 41. Toth D. and Eltinge J.L., Building consistent regression trees from complex sample data, J. Am. Stat. Assoc. 106 (2011), pp. 1626–1636.
- 42. Wang L. and Wang S., Nonparametric additive model assisted estimation for survey data, J. Multivar. Anal. 102 (2011), pp. 1126–1140.
- 43. Zou H. and Hastie T., Regularization and variable selection via the elastic net, J. R. Statist. Soc. Ser. B 67 (2005), pp. 301–320.