Abstract
Modeling and inference for heterogeneous data have gained great interest recently due to rapid developments in personalized marketing. Most existing regression approaches are based on the conditional mean and may require additional cluster information to accommodate data heterogeneity. In this paper, we propose a novel nonparametric resolution-wise regression procedure to provide an estimated distribution of the response instead of one single value. We achieve this by decomposing the information of the response and the predictors into resolutions and patterns respectively based on marginal binary expansions. The relationships between resolutions and patterns are modeled by penalized logistic regressions. Combining the resolution-wise prediction, we deliver a histogram of the conditional response to approximate the distribution. Moreover, we show a sure independence screening property and the consistency of the proposed method for growing dimensions. Simulations and a real estate valuation dataset further illustrate the effectiveness of the proposed method.
Keywords: Binary Expansion, Data heterogeneity, Nonparametric Statistics, SSANOVA, Sure independence screening
1. Introduction
A common nonparametric regression model establishes the effects of the explanatory variables on the response variable in the form of
$$Y = f(X) + \varepsilon, \qquad (1)$$
where Y is the response variable, X = (X1, …, Xq)T is the q-dimensional explanatory variable vector, and ε is the random error, which is often assumed to have mean 0 and variance σ2 and to be independent of X.
In recent years there has been a growing demand for exploring regression methods for heterogeneous populations, which has broad applications in personalized marketing and other fields. One characteristic of data heterogeneity is the existence of subpopulations in the data. In practice, the heterogeneity can be regarded as the result of some latent variables. This happens frequently since it is difficult to collect all the explanatory variables for the response. For example, in the real estate data in Section 6, a river and a highway through the city create subpopulations and heterogeneous distributions of housing prices. However, the information of this river and this highway is not available in the data.
Denote the unobserved categorical variable by Z, taking values in {1, …, T}, where T is unknown. Suppose the potential true relationship between the response and all the explanatory variables can be expressed by
$$Y = f_Z(X) + \varepsilon, \qquad (2)$$
with unknown functions ft’s, t = 1, …, T. In this paper, our goal is to relate Y with X without knowing Z. However, this differs from fitting model (1), since the true relationship between Y and X may not even be a function. As an illustration, the housing prices on the two sides of Tamsui River with respect to longitude and latitude are shown in Figure 1. The plot shows a mixture of two subgroups: the housing prices on the west and the east of the river behave differently. The latent variable Z, that is, the indicator of which side of the river a location is on, determines the two subgroups. Without knowing Z, the relationship between Y and X cannot be captured by a single function. Hence new methods to model the effects of X on Y with such a challenging heterogeneous population are in great need.
Fig. 1.

Left panel: An illustration of heterogeneous data in the housing prices on two sides of Tamsui River (light blue curve). Right panel: The housing prices follow different distributions: The housing prices on the west monotonically increase with latitude, while those on the east are concave and parabolic.
One possible idea is to use smoothing splines (Green and Silverman, 1994) which can capture local behaviors. A more general setting is the smoothing spline analysis of variance (SSANOVA) (Wahba, 1990; Gu, 2002), which fits an additive model for main effects and interactions. These approaches use regression to estimate the overall conditional mean function, which tends to fit a compromised effect of the subgroups. Hence they may fail to identify the subgroups of the population, and the results might not be informative for either of the subgroups. Moreover, the estimated distribution might not really reflect the pattern of the true one.
Another possible strategy is to cluster the data first, then fit a regression model within each subgroup. Existing model-based clustering approaches include Jacobs et al. (1991), Pan and Shen (2006), Raftery and Dean (2006), and Guo et al. (2010). As alternative approaches, Lindsten et al. (2011), Hocking et al. (2011), and Pan et al. (2013) formulated clustering as a penalized regression problem with fusion-type penalties. However, these methods focus on finding the groups based on the similarity of the explanatory variables, instead of identifying groups with different effects on the response.
In the literature, some individualized methods have been proposed to handle heterogeneity. Ma and Huang (2017) employed subject-specific intercepts to model the unobserved factors which lead to the heterogeneity. They used a concave pairwise fusion penalty to shrink some intercepts to be the same, which can produce a partition of subgroups. Chen, Tran-Dinh, Kosorok and Liu (2021) considered a more general fusion method than that of Ma and Huang (2017) to identify subgroups. Tang and Qu (2017) proposed a multi-directional penalty to shrink individuals to different groups. The performance of these methods depends on how well the subgroups are separated. If the subgroups are close to each other, the performance can be less accurate.
In this paper, we tackle the heterogeneity from a new perspective. Instead of estimating ft’s in (2) through nonparametric regressions, we propose to estimate the conditional distribution of the response variable given observed explanatory variables. The estimated distribution provides an overall picture of the response variable, and can indicate the heterogeneity by the modes of the probability density function (PDF). To achieve this goal, one single regression is not enough, because the pattern of two or more possible values of the response corresponding to one observation of predictor variables cannot be expressed by an explicit function. Our idea is to consider binary expansion statistics proposed in Zhang (2019) and to decompose the response variable into several resolutions which can capture the local information. By establishing a set of logistic regressions, we relate the resolution information to the predictors. The set of regressions can model the heterogeneity since various estimations can be obtained from different local logistic regression models. To achieve the localization, we decompose the response variable by marginal binary expansions, which provides a balanced design and orthogonal resolutions. Our method eventually estimates the distribution of the response variable by a histogram, which shows the possible heterogeneity and even more complicated distributions, without any assumption of subgroup patterns. We show that the method has a sure independence screening property (Fan and Lv, 2008; Fan and Song, 2010) and provides consistent estimates for cell probabilities of the histogram for growing dimensions.
The rest of this paper is organized as follows. In Section 2, we introduce resolution-wise regression, including the decomposition of the response variable and the establishment of the logistic regressions. Section 3 extends the proposed method to high-dimensional settings. In Section 4, we show the consistency of the estimated histogram. In Sections 5 and 6, we demonstrate the performance of our method by the simulated data and the real estate valuation dataset. Section 7 concludes this paper. Some technical proofs and additional simulation results are presented in the Appendix and Supplement.
2. Methodology
A distribution estimation provides more information than a point estimation for heterogeneous data, as the estimated distribution can identify the subgroups by the shape of the PDF. For the case that the subpopulations are not obviously distinguishable from each other, the estimated distribution can still reflect the dispersion of the data.
A direct idea is splitting the range of Y by a partition min(Y) = a0 ≤ a1 ≤ ⋯ ≤ aB = max(Y) and modeling the probability of Y falling into each interval with X. This idea handles the heterogeneity by decomposing the information of Y into several non-overlapping intervals. These intervals capture the local information of Y and work together to show the whole histogram. However, a drawback of this approach is the possible loss of information from neglecting the joint information across intervals and from insufficient samples within each interval. In this paper, we propose to construct overlapping resolutions based on binary expansions, where each resolution groups the distribution information of the union of several intervals and includes all corresponding samples. In essence, the proposed construction leads to a balanced design and has a non-redundant orthogonality property. A histogram can be obtained by a transformation from resolution probabilities to cell probabilities. Here we refer to a cell as a bin of the histogram. In this section, we consider the one-dimensional X, and then extend to the high-dimensional case in Section 3. The rest of this section is organized as follows. In Section 2.1, we introduce the construction of the resolutions. In Section 2.2, a set of resolution-wise penalized logistic regressions is established. In Section 2.3, we introduce the binary interaction design (BID) equation to accomplish the transformation from the frequency domain to the probability domain.
2.1. Frequency domain from binary expansions
To overcome the imbalance of non-overlapping intervals, we use a balanced design based on binary expansions. A classical result on the binary expansion of a uniform random variable (Kac, 1959) is given as follows:
Lemma 1. For U ~ Uniform[0, 1], we have $U = \sum_{k=1}^{\infty} V_k / 2^k$, where V1, V2, …, Vk, … i.i.d. follow Bernoulli(1/2).
Denote the cumulative distribution function (CDF) transformation of Y by UY. By Lemma 1, we have
$$U_Y = \sum_{k=1}^{\infty} B_k / 2^k. \qquad (3)$$
Through the expansion, the information of Y is decomposed into the information of the Bk’s. In (3), the binary variable Bk can be regarded as the indicator that the k-th binary digit of UY equals one; for instance, B1 = I(UY ≥ 1/2).
Figure 2 shows the binary variables B1, B2, B3 with respect to UY. As a finite approximation of the infinite binary expansion, we can truncate the binary expansion of UY up to the dY-th order:
$$U_Y \approx \sum_{k=1}^{d_Y} B_k / 2^k. \qquad (4)$$
Fig. 2.

Binary variables B1, B2, B3 from binary expansions of UY. Regions with Bk = 1, k = 1, 2, 3 are in white and regions with Bk = 0, k = 1, 2, 3 are in blue.
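To make the construction concrete, the following minimal sketch (in Python; not the authors' implementation, and the function name is hypothetical) computes the empirical CDF transform of a response vector and its first dY binary digits from (3)–(4), together with the {−1, 1} coding used for the resolutions below.

```python
import numpy as np

def binary_digits(y, d):
    """Empirical CDF transform of y, then the first d binary digits B_k
    of U_Y = sum_k B_k / 2^k, as in (3)-(4)."""
    n = len(y)
    # empirical CDF transform: ranks rescaled into (0, 1)
    u = (np.argsort(np.argsort(y)) + 0.5) / n
    digits = np.empty((n, d), dtype=int)
    frac = u.copy()
    for k in range(d):
        frac = frac * 2.0
        digits[:, k] = (frac >= 1.0).astype(int)   # B_{k+1}
        frac = frac - digits[:, k]
    return digits                                   # values in {0, 1}

y = np.random.randn(8)
B = binary_digits(y, d=3)
B_dot = 2 * B - 1    # the {-1, 1} coding used for resolutions and patterns
```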
Now we introduce the notation of resolutions. Using binary variables taking values in {−1, 1} instead of {0, 1} through the transformation Ḃk = 2Bk − 1, the interactions of the Bk’s can be written as products. For example, the event {B1 = 1, B2 = 1} ∪ {B1 = 0, B2 = 0} is equivalent to {Ḃ1Ḃ2 = 1}. In the remainder of this paper, we shall work with the Ḃk’s.
To approximate the information given by Y, say the σ-field σ(Y), we can use the σ-field generated by the Ḃk’s. For the truncation up to the dY-th order, we can find a basis with the variables
$$\big\{\dot{B}_{k_1}\dot{B}_{k_2}\cdots \dot{B}_{k_p} : 1 \le k_1 < k_2 < \cdots < k_p \le d_Y \big\}. \qquad (5)$$
We shall refer to these binary interaction variables as resolutions of Y, and the set of all possible values of these resolutions as the frequency domain. Figure 3 shows these variables with UY expanded up to the second order. Through this resolution decomposition, each variable takes value one on half of [0, 1] and value negative one on the other half.
Fig. 3.

Basis binary variables where UY is expanded up to the second order.
2.2. Logistic regression in the frequency domain
With the resolutions decomposed from the binary expansion, we aim to model the relationship between each resolution and the predictors. The resolutions constructed by the binary expansion are independent of each other, so they can be modeled marginally. Since the resolutions are binary, this is essentially a classification problem. Note that for every resolution, Y is divided into two classes, each a union of intervals, according to the sign of the binary interaction. Hence the decision boundary can be nonlinear. Therefore, we propose to use binary expansions of the predictors as a nonparametric basis and to fit a logistic regression on each resolution. Similar to the construction for UY, the binary expansion of UX up to the dX-th order is $U_X \approx \sum_{k=1}^{d_X} A_k / 2^k$. Denote Ȧk = 2Ak − 1. The σ-field generated by the Ȧk’s has a basis with the $2^{d_X} - 1$ variables {Ȧk1⋯Ȧkp : 1 ≤ k1 < ⋯ < kp ≤ dX}. We shall refer to these variables as patterns of X. After the construction of the patterns, the complicated effect of X on Y can be captured by logistic regression, which enjoys efficiency from the orthogonality of the patterns. We establish a set of penalized logistic regressions with the ℓ1 penalty (Tibshirani, 1996), one for each resolution of Y, with all patterns of X as predictors. Denote the vector collecting all patterns of X by Ȧ, and the vector collecting all resolutions of Y by Ḃ. Let (xi, yi), i = 1, …, n, be n independent observations of (X, Y). Denote ȧi and ḃi as the i-th pattern vector and the i-th resolution vector obtained from the binary expansions of the empirical CDF transformations of xi and yi respectively. The m-th logistic regression, which models the effect of Ȧ on the m-th resolution Ḃ(m), is established as
$$\log \frac{P(\dot{B}^{(m)} = 1 \mid \dot{\mathbf{A}})}{P(\dot{B}^{(m)} = -1 \mid \dot{\mathbf{A}})} = \dot{\mathbf{A}}^{\mathrm{T}} \beta_m, \qquad (6)$$
where βm is the coefficient vector. We employ an ℓ1 regularization to give the estimator
$$\hat{\beta}_m = \arg\min_{\beta_m} \Big\{ -\frac{1}{n} \sum_{i=1}^{n} \log P_{\beta_m}\big(\dot{B}^{(m)} = \dot{b}_i^{(m)} \,\big|\, \dot{\mathbf{A}} = \dot{\mathbf{a}}_i\big) + \lambda_m \|\beta_m\|_1 \Big\}, \qquad (7)$$
where λm ≥ 0 is a tuning parameter. The conditional expectation of Ḃ(m) given Ȧ = ȧ can then be estimated by 2P̂(Ḃ(m) = 1 | ȧ) − 1, where P̂ is the fitted probability from (7).
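As an illustration of (6)–(7), the sketch below fits one ℓ1-penalized logistic regression per resolution using scikit-learn's LogisticRegression as an off-the-shelf solver; the helper names, the tuning constant C, and the use of scikit-learn are assumptions for illustration rather than the authors' actual implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_resolution_regressions(A_dot, B_dot, C=1.0):
    """One l1-penalized logistic regression per resolution of Y.

    A_dot : (n, L) matrix of {-1, 1} patterns of X (predictors).
    B_dot : (n, M) matrix of {-1, 1} resolutions of Y (responses).
    Returns the list of fitted models, one per resolution.
    """
    models = []
    for m in range(B_dot.shape[1]):
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        clf.fit(A_dot, (B_dot[:, m] > 0).astype(int))
        models.append(clf)
    return models

def estimate_resolution_expectations(models, a_new):
    """Estimate E[B_dot^(m) | pattern a_new] = 2 P(B^(m) = 1 | a_new) - 1."""
    probs = np.array([m.predict_proba(a_new.reshape(1, -1))[0, 1] for m in models])
    return 2.0 * probs - 1.0
```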
2.3. Binary interaction design: from frequency domain back to probability domain
As a final step of estimating the distribution of Y, we aim to transform the conditional expectations of the resolutions into the conditional cell probabilities of the corresponding histogram. First, we simplify the notation of the conditional expectation by using a dY-dimensional binary index. Namely, denote the conditional expectation $E[\dot{B}_{k_1}\cdots\dot{B}_{k_p} \mid \dot{\mathbf{A}}]$, p ∈ {1, …, dY}, {k1, …, kp} ⊂ {1, …, dY}, by Eb, where b is a vector of length dY with value one at positions k1, …, kp and zero otherwise. Let E be the $2^{d_Y}$-dimensional conditional expectation vector whose entries are sorted in ascending order of b in the binary system. Hence, in some sense, b identifies the resolutions. Note that we set E(0,…,0) = 1 as the first entry, and the entry indexed by the decimal value of b plus one is Eb. For example, the expectation vector with dY = 3 is
$$\mathbf{E} = \big(1,\, E_{001},\, E_{010},\, E_{011},\, E_{100},\, E_{101},\, E_{110},\, E_{111}\big)^{\mathrm{T}}.$$
We denote the conditional probabilities of the cells in terms of the same binary index. Define the conditional cell probability pb as the conditional probability of the Bk’s taking the values specified by b given the patterns of X, i.e., pb = P(B1 = b1, …, BdY = bdY | Ȧ). As an example, for dY = 3, p101 = P(B1 = 1, B2 = 0, B3 = 1 | Ȧ). Let p be the $2^{d_Y}$-dimensional conditional probability vector of the cells whose entries are sorted in descending order of b in the binary system. For example, the conditional probability vector of the cells with dY = 3 is
$$\mathbf{p} = \big(p_{111},\, p_{110},\, p_{101},\, p_{100},\, p_{011},\, p_{010},\, p_{001},\, p_{000}\big)^{\mathrm{T}}.$$
With the above notations, we establish the binary interaction design (BID) equation (Zhang, 2019) to transform the expectations of resolutions into cell probabilities. The equation is established by the Sylvester’s construction of Hadamard matrix (Sylvester, 1867).
Lemma 2 (BID equation). Let E be the conditional expectation vector of the resolutions from the binary expansion, and p be the conditional probability vector of the cells. Then
$$\mathbf{E} = \mathbf{H}\,\mathbf{p}, \qquad (8)$$
where H is the $2^{d_Y} \times 2^{d_Y}$ Hadamard matrix from Sylvester’s construction (Sylvester, 1867).
From the BID equation, the conditional probabilities of Y falling into each cell given X can be obtained by estimating the conditional expectations of the resolutions. Denote the estimator of E by Ê. Naturally, p can be estimated by p̂ = H−1Ê, since the Hadamard matrix H is invertible with $H^{-1} = H / 2^{d_Y}$. However, this p̂ may not be a probability measure. From the structure of the Hadamard matrix, the following lemma shows that the sum of the estimated cell probabilities is one.
Lemma 3. For any estimator Ê (whose first entry is one by construction), the estimator p̂ = H−1Ê has entries that sum to one.
Proof of Lemma 3. Since H is symmetric and $HH = 2^{d_Y} I$, we have $H^{-1} = H / 2^{d_Y}$. The first column of H consists of ones, and every other column has an equal number of entries equal to 1 and −1, so $\mathbf{1}^{\mathrm{T}} H = (2^{d_Y}, 0, \ldots, 0)$. Hence the sum of the entries of p̂ is
$$\mathbf{1}^{\mathrm{T}} \hat{\mathbf{p}} = \mathbf{1}^{\mathrm{T}} \mathbf{H}^{-1} \hat{\mathbf{E}} = 2^{-d_Y} (2^{d_Y}, 0, \ldots, 0)\, \hat{\mathbf{E}} = \hat{E}_{(0,\ldots,0)} = 1. \qquad \square$$
Note that we cannot guarantee the entries of p̂ = H−1Ê to be nonnegative. Instead, we consider the following optimization problem to solve for p:
$$\hat{\mathbf{p}} = \arg\min_{\mathbf{p}} \;\|\mathbf{H}\mathbf{p} - \hat{\mathbf{E}}\|_1 \quad \text{subject to} \quad \mathbf{p} \ge \mathbf{0}, \;\; \mathbf{1}^{\mathrm{T}}\mathbf{p} = 1. \qquad (9)$$
Since the cells can be viewed as the bins of a histogram of Y, we essentially estimate the distribution of Y when the resolutions are decomposed in an arbitrarily fine fashion. In practice, a finite dY is used, and we can smooth the histogram to approximate the distribution.
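To illustrate the transformation from the frequency domain back to the probability domain, the sketch below builds the Sylvester Hadamard matrix with scipy.linalg.hadamard and solves the constrained problem as a small linear program; it assumes the ℓ1 form of (9) written above (a minimal sketch, with a hypothetical function name, not the authors' implementation).

```python
import numpy as np
from scipy.linalg import hadamard
from scipy.optimize import linprog

def cells_from_expectations(E_hat):
    """Recover cell probabilities p from the resolution expectation vector E_hat
    via the BID equation E = H p, solved as an l1 problem with p >= 0 and
    sum(p) = 1, reformulated as a linear program."""
    K = len(E_hat)                    # K = 2 ** d_Y
    H = hadamard(K)                   # Sylvester Hadamard matrix
    # variables: [p (K), t (K)]; minimize sum(t) s.t. -t <= H p - E_hat <= t
    c = np.concatenate([np.zeros(K), np.ones(K)])
    A_ub = np.block([[H, -np.eye(K)], [-H, -np.eye(K)]])
    b_ub = np.concatenate([E_hat, -E_hat])
    A_eq = np.concatenate([np.ones(K), np.zeros(K)]).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0, None)] * (2 * K)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:K]

# toy check: expectations built from a known cell-probability vector are recovered
p_true = np.array([0.1, 0.2, 0.3, 0.4])
E_true = hadamard(4) @ p_true         # first entry equals 1 by construction
print(np.round(cells_from_expectations(E_true), 3))
```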
3. Multivariate extensions
Now we extend our framework to the multivariate case. For a q-dimensional X = (X1, …, Xq)T, we perform a binary expansion on every marginal CDF-transformed variable, denoted by UXj, up to the dX-th order, and we have
$$U_{X_j} \approx \sum_{k=1}^{d_X} A_{j,k} / 2^k, \qquad j = 1, \ldots, q. \qquad (10)$$
Denote Ȧj,k = 2Aj,k − 1, j = 1, …, q, k = 1, …, dX. The σ-field generated by all the Ȧj,k’s, which is generated by the binary filtration of all q covariates, has a basis with $2^{q d_X} - 1$ variables in total. This basis set includes all possible patterns with respect to Xj, j = 1, …, q. These patterns can be divided into two groups. One group has terms involving binary variables from only one dimension of X, which capture the marginal patterns. For example, both Ȧ1,1 and Ȧ1,1Ȧ1,2 are marginal patterns with respect to X1. The second group has terms with binary variables from at least two covariates, which reflect interactions of the corresponding covariates. For example, Ȧ1,1Ȧ2,1 corresponds to the interaction of (X1, X2), and Ȧ1,1Ȧ2,1Ȧ3,1 is a term with respect to the three-way interaction (X1, X2, X3). Therefore we refer to interaction terms as the patterns reflecting interactions among the covariates, rather than products of Ȧj,k’s within one covariate. Note that the basis set contains up to q-way interaction terms. However, three-way and higher-order interactions often contribute little to the model, and they are quite complex and difficult to interpret. Thus we only consider the main effects and two-way interaction terms in the basis set. Including the main-effect terms for each of the q explanatory variables and the interaction terms for each pair of the explanatory variables, there are $q(2^{d_X}-1) + \binom{q}{2}(2^{d_X}-1)^2$ patterns in total.
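The pattern construction just described can be sketched as follows; marginal_patterns and design_patterns are hypothetical helper names, and the input is a list of the {−1, 1} digit matrices of the q covariates.

```python
import numpy as np
from itertools import combinations

def marginal_patterns(A_dot_j):
    """All 2**d - 1 within-covariate products of the +-1 digits of one covariate."""
    n, d = A_dot_j.shape
    cols = []
    for r in range(1, d + 1):
        for idx in combinations(range(d), r):
            cols.append(np.prod(A_dot_j[:, list(idx)], axis=1))
    return np.column_stack(cols)

def design_patterns(A_dot_list):
    """Main-effect patterns for each covariate plus all two-way interaction
    patterns (a product of one nontrivial pattern from each of two covariates)."""
    mains = [marginal_patterns(A) for A in A_dot_list]
    blocks = list(mains)
    for i, j in combinations(range(len(A_dot_list)), 2):
        # every pair of columns, one from covariate i and one from covariate j
        inter = mains[i][:, :, None] * mains[j][:, None, :]
        blocks.append(inter.reshape(len(inter), -1))
    return np.column_stack(blocks)
```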
To cope with high-dimensional data, some pre-screening procedures can be performed to reduce the number of patterns. We do not simply reduce the maximum number of variables allowed in a pattern term, since each pattern carries information about a specific form of dependence, due to the orthogonality property of the binary expansion. Instead, a reasonable approach is to test the effect of each pattern on the response variable pairwise. To this end, we modify the binary expansion testing (BET) method (Zhang, 2019), which was originally developed to test independence of two variables, into a pre-screening method for patterns. We also extend BET to a generalized version, which tests the independence of multiple variables and is used to pre-screen interactions.
In the following, we first revisit the BET method, and extend it as a method of pattern pre-screening in Section 3.1. In Section 3.2, we generalize BET to pre-screen interactions by testing the independence of the response and the interaction patterns.
3.1. BET as a pre-screening approach
BET is a nonparametric method for testing dependence between two continuous variables in a distribution-free setting. Hence BET can be used on Y and X. With the binary expansions of UY and UX, the interactions of the bases of the two σ-fields show all possible dependence patterns. Similar to the definition of b in Section 2.3, we use a binary index a to identify the patterns of X. We denote the interaction pattern of a and b by ab. The interaction pattern ab partitions the unit square [0, 1]2 into half positive regions and half negative regions. The difference between the counts in the two regions reflects whether Y and X are independent in terms of the particular interaction pattern. When UY and UX are independent, the counts of the observations in the positive and negative regions should be similar. When they are not independent, there will be a significant difference in counts. Denote by Sab the difference of the counts with respect to the interaction pattern ab. Zhang (2019) gave the distribution of Sab in the following lemma.
Lemma 4.
- When marginal distributions are known, UY and UX are independent if and only if
- When marginal distributions are unknown, UY and UX are estimated by the empirical CDF transformations and respectively, then and are independent if and only if
where denotes the difference of the counts with respect to the interaction pattern ab according to and .
In this way, BET decomposes the information of the relationship between Y and X into interaction patterns. Figure 4 shows all the 9 interaction patterns with depth dX = 2 and dY = 2. An obvious dependence pattern is , which includes most points in the white region.
Fig. 4.

The dependence of Y on X is from the model Y = X2 + ε, where X ~ Uniform (−2, 2) and ε ~ N (0, 0.25). BET detects the dependence through the nine patterns. The pattern shows the most obvious difference of counts of blue and white regions.
With dX and dY large enough, BET can detect arbitrarily complicated dependence. BET also helps indicate the pattern of dependence, since the significant patterns from BET imply how Y depends on X. This inspires us to focus on the detection of the significant patterns and to regard BET as a pattern-screening approach. Performing BET pairwise on {(Y, Xj), j = 1, …, q}, one can reduce all the patterns of Xj to only the dependent ones. We regard the patterns of X that are detected to be dependent with at least one resolution of Y as relevant variables in the penalized logistic regressions. Namely, for each Xj we obtain a set of relevant patterns, denoted by ℛj, j = 1, …, q. Note that we consider the patterns in ℛj as predictors in the regressions for all resolutions of Y, rather than only in the regression for the particular resolution that is detected to be dependent. This helps to avoid false negatives. False positives can be controlled by the lasso shrinkage.
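A minimal sketch of the marginal screening idea follows: for each covariate, the symmetry statistic Sab = Σi aibi (the difference between the counts in the positive and negative regions of the pattern ab) is computed for every pattern–resolution pair, and patterns whose largest absolute statistic exceeds a threshold are kept. The function name and the plain thresholding rule shown are simplified stand-ins for the paper's screening rule.

```python
import numpy as np

def bet_screen_marginal(A_dot_j, B_dot, threshold):
    """Keep the patterns of one covariate whose cross symmetry statistic with
    some resolution of Y exceeds the threshold.

    A_dot_j : (n, Lj) marginal {-1, 1} patterns of covariate X_j.
    B_dot   : (n, M)  {-1, 1} resolutions of Y.
    S_ab = sum_i a_i * b_i equals the count difference between the positive
    and negative regions of the interaction pattern ab.
    Returns the indices of the retained columns of A_dot_j.
    """
    S = np.abs(A_dot_j.T @ B_dot)          # |S_ab| for every (pattern, resolution)
    keep = np.where(S.max(axis=1) > threshold)[0]
    return keep
```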
3.2. A generalized BET as an interaction pre-screening method
The pairwise BET procedure selects dependent marginal patterns. Aiming further to select interaction patterns, we first generalize the original BET to test the independence of Y and the joint distribution of Xi and Xj, 1 ≤ i < j ≤ q. With marginal binary expansions on UY, UXi and UXj, we aim to test all possible dependence patterns from the corresponding σ-fields. Denote the pattern index of Xj, j = 1, …, q, by aj, and the three-way interaction pattern of ai, aj and b by aiajb. Similar to the idea of the original BET, aiajb can be viewed as a partition of the cube [0, 1]3 with half positive and half negative regions. One can test the dependence of Y and the joint (Xi, Xj) through the difference of the counts in the two regions. Figure 5 shows three aspects of a significant interaction pattern. Hence we use the generalized BET to pre-screen the interaction predictors in the regressions. We perform the generalized BET pairwise on {(Y, Xi, Xj), 1 ≤ i < j ≤ q} and obtain the sets of significant interaction patterns, denoted by ℛij.
Fig. 5.

The dependence of Y and X1, X2 is from the model Y = X1X2 + ε, where and ε ~ N (0, 0.25). The generalized BET detects the pattern with the most obvious difference in counts of the positive region (white region with red points) and the negative region (blue region with yellow points).
With the two pre-screening procedures, eventually, the predictor set is
$$\mathcal{R} = \Big(\bigcup_{j=1}^{q} \mathcal{R}_j\Big) \cup \Big(\bigcup_{1 \le i < j \le q} \mathcal{R}_{ij}\Big). \qquad (11)$$
We refer to the pre-screening based on BET and the generalized BET as the BET screening. Algorithm 1 gives the procedure of resolution-wise regression, including the pre-screening and the framework of the estimation.

4. Theoretical Studies
In this section, we first show that the BET screening is a sure independence screening approach (Fan and Song, 2010), which reduces the number of patterns L from exponential growth to O(n). The consistency result of the estimated cell probabilities is also established with the random design and a fixed dY. We allow the dimension q and dX to grow with n.
For the m-th logistic regression, assume that the binary data from the marginal empirical CDF transformation observations are i.i.d. copies of , where is the i-th sample of the L-dimensional binary random vector , and is the i-th sample of the binary response . Denote , i = 1, …, n, as the samples of the covariates that are standardized to have mean zero and standard deviation one for each covariate. We have . Denote as the i-th sample of taking values from {0, 1}. We have . The maximum marginal likelihood estimator (MMLE) for the logistic regression (6), which is a special case of the models in Fan and Song (2010), is defined as the minimizer of the negative log-likelihood of the component-wise regression,
| (12) |
We correspondingly define the population version of the MMLE by
Denote the true regression coefficient vector by . Let be the true index set of non-zero coefficients. We remark here that the overall goal of our analysis is prediction of the response rather than inference on the slopes. Therefore, although the overall parameterization might become unidentifiable when dY and dX are large, this will not harm the prediction results, as studied in Greenshtein and Ritov (2004).
We now provide the theoretical justifications of our method.
Assumption 1. for with constants c1,m > 0 and 0 < κm < 1 / 2.
Assumption 1 is analogous to Condition E of Fan and Song (2010). It ensures that the marginal signals are stronger than the stochastic noise. Within the selected set R, denote a(j) b(m) as the pattern corresponding to and , and let the index set of selected variables using BET screening be , for some threshold δn,m. The following theorem shows that BET screening possesses the sure independence screening property.
Theorem 1. For any c2,m > 0, there exists a positive constant c3,m such that
where kn,m, Kn,m, h0,m, h1,m, αm are some positive constants. If, in addition, Assumption 1 holds, the BET screening possesses a sure independence screening property. By taking , we have
where , the number of nonsparse elements.
Assumption 2. The variance is bounded from above and below.
Assumption 2 is analogous to Condition F of Fan and Song (2010). The following theorem shows that the BET screening can reduce the dimension from to .
Theorem 2. Under Assumption 2, we have for any , and the same constants c3,m, kn,m, Kn,m, h0,m, h1,m, αm as in Theorem 1 such that
Here, we briefly describe the results, whose details are given in the Appendix. Let and is the predictor vector including the selected r patterns. Denote the true coefficient vector of βm by . The estimate of βm is
where ||·||1 is the ℓ1-norm, λm is a tuning parameter , is the binary expansion corresponding to for the ith observation, and fm is the mth logistic regression function . Let be the true function between and . Denote the index set of non-zero coefficients by , and the cardinality of by .
According to the BID equation, we estimate p by solving the optimization (9). From the optimization, Hm+1 p is an approximation of , where Hm+1 is the (m + 1)-th row of H, since is the (m + 1)-th entry of E. Hence g(Hm+1 p) is the estimated m-th regression function corresponding to p. The following theorem gives the consistency of the cell probability vector p in terms of the excess risk of g(Hm+1 p).
Theorem 3. Assume that Assumptions 1 and 2, and Assumptions 3 to 5 given in the Appendix, hold, where Assumption 3 holds with the set . For the logistic regression with covariates corresponding to the BET screening set , suppose that λm satisfies . Then on the set 𝒯m, we have
where K > 0 is a constant, , , , is a compatibility constant, and .
5. Simulation Studies
In this section, we perform simulations to show the performance of resolution-wise regression approach. We compare our method with the following four methods:
Naive method, which first finds a small neighborhood of each test sample in the training set, where ∥X − Xnew∥2 is bounded by a constant, and predicts the distribution of Y | Xnew by the kernel density estimation of the responses in this neighborhood.
SSANOVA, which fits a cubic spline with all main effects and interaction effects. Its prediction distribution is N(Ŷss, σ̂2), where Ŷss is the SSANOVA estimate of Y given a new Xnew, and σ̂2 is the estimated variance of the random error.
Random Forest, which fits a multitude of regression trees and then averages the predictions. Its prediction distribution is N(Ŷrf, ŝ2), where Ŷrf is the estimate of Y from the random forest given a new Xnew, and ŝ is the standard error.
Regression mixture model (used only for Example 1), which identifies the subgroups of the dataset and fits multiple linear regression models. Its prediction distribution is N(Ŷmix, ŝ2), where Ŷmix is the estimate of Y from the regression mixture model given a new Xnew, which is randomly assigned to a subgroup with the weights derived from the training data, and ŝ is the standard error.
We study the following four examples with 1024 samples for both training and testing sets.
Example 1. (Crossing lines) The predictor , i = 1, …, n. For the example with one cross on the plane, the response yi is generated by yi = xiI(gi = 0) − xiI(gi = 1) + εi, where the error , i = 1, …, n, and I(·) is the indicator function with , i = 1, …, n. For the example with multiple crosses, the response yi = (xiI(g1i = 1) − xiI(g1i = 2) + (xi − 10)I(g1i = 3) + (−xi + 10)I(g1i = 4)) I(xi ≥ 0) + (xiI(g2i = 1) − xiI(g2i = 2) + (−xi − 10)I(g2i = 3) + (xi + 10)I(g2i = 4))I(xi < 0) + εi, where the error , i = 1, …, n, and I(·) is the indicator function with , i = 1, …, n, k = 1, 2.
Example 2. (A mixture of linear and quadratic effects) The predictor vector (xi1, …, xiq)T is generated by , i = 1, …, n, j = 1, …, q, with q = 1, 5, 10. The response yi is generated by , which depends only on the first variable xi1, and the other variables are regarded as noise. The error , i = 1, …, n, and I(·) is the indicator function with , i = 1, …, n.
Example 3. (Circular and spherical implicit functional relationship) The predictor vector (xi1, …, xiq)T has q = 5. For the circle example, the predictors and the responses are generated from the polar coordinates xi1 = sin(θi), yi = cos(θi) + εi, where the latent variable , i = 1, …, n, and the error , i = 1, …, n. The noise variables (xi2,…, xiq)T are generated by i = 1, …, n, j = 2, …, q. For the sphere example, where the latent variables , the predictors and the responses are generated from xi1 = sin(θi)cos(ϕi), xi2 = sin(θi)sin(ϕi), yi = cos(θi) + εi, where , i = 1, …, n, and the error , i = 1, …, n. The noise variables are generated by , i = 1, …, n, j = 3, …, q.
Example 4. (Heterogeneous mean versus heteroscedastic error) The predictor , i = 1, …, n. The response yi is generated by , where the error , i = 1, …, n, σ2 = 0.05, 0.25, 0.5.
In each example, BET with depth 5 and a threshold of for the symmetry statistics, where p is the total number of interactions and n is the sample size, is performed for main-effect screening, while the generalized BET with depth 4 and the same threshold is performed for interaction-effect screening. Regarding the choice of depth, BET already reaches high power with a small depth of 3 (Zhang, Zhao and Zhou, 2021). We perform simulations with different depths and report the results in the Appendix; we pick depths 5 and 4, which are high enough. Two types of smoothing approaches are considered: a fixed smoothing parameter (“Fixed smoothness”), and tuning the smoothing parameter by cross-validation (“CV”).
We repeat the simulation 100 times for each example. To measure the test error, we calculate the difference between the prediction distribution from each method and the underlying true distribution using the following distance measures: (1) the Kolmogorov–Smirnov statistic supx |FP(x) − FQ(x)|, (2) the Kullback–Leibler divergence ∫ p(x) log{p(x)/q(x)} dx, and (3) the L1 distance ∫ |p(x) − q(x)| dx, where P and Q are two distributions with CDFs FP and FQ and corresponding PDFs p(·) and q(·) respectively.
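For reference, the three distances can be approximated numerically when the two densities are evaluated on a common uniform grid, e.g. as in the following sketch (the small constant guarding the logarithm is an assumption).

```python
import numpy as np

def distribution_distances(p, q, grid):
    """Kolmogorov-Smirnov, Kullback-Leibler, and L1 distances between two
    densities p and q evaluated on a common uniform grid (numerical approximation)."""
    dx = grid[1] - grid[0]                         # uniform grid spacing assumed
    ks = np.max(np.abs(np.cumsum(p) * dx - np.cumsum(q) * dx))
    eps = 1e-12                                    # guard against log(0)
    kl = np.sum(p * np.log((p + eps) / (q + eps))) * dx
    l1 = np.sum(np.abs(p - q)) * dx
    return ks, kl, l1
```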
Here we display the results of Examples 1–4, which involve heterogeneous data; the result for a case with a nonlinear functional relationship is given in the supplementary materials. Tables 1–4 list the test errors for the four simulation examples. Figures 6–9 show the heatmaps of the prediction distributions of all test data, where the x-axis is the involved predictor variable. We omit the heatmap for the spherical case, since it has two involved variables and cannot be shown explicitly in a heatmap.
Table 1.
Comparison of average test errors (and corresponding standard errors in parentheses) for Example 1 with respect to one or multiple crosses and three distance measures. The results for the naive method, the regression mixture (Mixreg), SSANOVA, Random Forest, resolution-wise regression with fixed smoothness, and resolution-wise regression with CV are listed in the columns from left to right respectively.
| Example | Measure | Naive | Mixreg | SSANOVA | Random Forest | Fixed smoothness | CV |
|---|---|---|---|---|---|---|---|
| one cross | KS | 0.151 | 0.351 | 0.294 | 0.377 | 0.165 | 0.129 |
| | | (0.015) | (0.010) | (0.016) | (0.017) | (0.009) | (0.005) |
| | KL | 0.362 | 1.396 | 1.031 | 1.835 | 0.386 | 0.265 |
| | | (0.032) | (0.067) | (0.062) | (0.156) | (0.025) | (0.014) |
| | L1 | 0.639 | 1.086 | 1.181 | 1.120 | 0.717 | 0.364 |
| | | (0.034) | (0.063) | (0.073) | (0.069) | (0.035) | (0.019) |
| multiple | KS | 0.188 | 0.405 | 0.229 | 0.394 | 0.165 | 0.145 |
| | | (0.007) | (0.026) | (0.016) | (0.020) | (0.010) | (0.008) |
| | KL | 0.319 | 2.511 | 0.548 | 1.615 | 0.338 | 0.257 |
| | | (0.016) | (0.128) | (0.036) | (0.084) | (0.024) | (0.016) |
| | L1 | 0.653 | 1.212 | 0.850 | 1.054 | 0.658 | 0.397 |
| | | (0.033) | (0.052) | (0.042) | (0.057) | (0.034) | (0.019) |
Table 4.
Comparison of average test errors (and corresponding standard errors in parentheses) for Example 4 with respect to different variances of random errors and three distance measures. The results for the naive method, SSANOVA, Random Forest, resolution-wise regression with fixed smoothness, and resolution-wise regression with CV are listed in the columns from left to right respectively.
| Example | Measure | Naive | SSANOVA | Random Forest | Fixed smoothness | CV |
|---|---|---|---|---|---|---|
| σ2 = 0.05 | KS | 0.192 | 0.189 | 0.243 | 0.153 | 0.154 |
| | | (0.008) | (0.009) | (0.016) | (0.005) | (0.005) |
| | KL | 0.435 | 0.417 | 0.512 | 0.229 | 0.743 |
| | | (0.022) | (0.026) | (0.030) | (0.012) | (0.034) |
| | L1 | 0.656 | 0.622 | 0.652 | 0.424 | 0.569 |
| | | (0.034) | (0.035) | (0.033) | (0.022) | (0.024) |
| σ2 = 0.25 | KS | 0.156 | 0.131 | 0.239 | 0.123 | 0.123 |
| | | (0.006) | (0.006) | (0.014) | (0.009) | (0.010) |
| | KL | 0.258 | 0.245 | 0.430 | 0.166 | 0.330 |
| | | (0.016) | (0.014) | (0.026) | (0.009) | (0.020) |
| | L1 | 0.491 | 0.452 | 0.571 | 0.346 | 0.400 |
| | | (0.025) | (0.030) | (0.033) | (0.017) | (0.023) |
| σ2 = 0.5 | KS | 0.139 | 0.121 | 0.236 | 0.098 | 0.110 |
| | | (0.008) | (0.007) | (0.018) | (0.007) | (0.009) |
| | KL | 0.201 | 0.208 | 0.421 | 0.142 | 0.167 |
| | | (0.013) | (0.017) | (0.035) | (0.009) | (0.012) |
| | L1 | 0.422 | 0.408 | 0.554 | 0.296 | 0.308 |
| | | (0.035) | (0.024) | (0.038) | (0.019) | (0.020) |
Fig. 6.

Heatmaps of prediction distributions for Example 1 with respect to one or multiple crosses and six methods: naive method, regression mixture, SSANOVA, Random Forest, resolution-wise regression with fixed smoothness, and resolution-wise regression with CV, from left to right respectively. A darker color indicates a larger PDF value at the corresponding predicted response.
Fig. 9.

Heatmaps of prediction distributions for Example 4 with respect to different variances of random errors and five methods: naive method, SSANOVA, random forest, resolution-wise regression with fixed smoothness, and resolution-wise regression with CV, from left to right respectively. A darker color indicates a larger PDF value at the corresponding predicted response.
The results indicate that resolution-wise regression achieves the best overall performance. For Example 1, resolution-wise regression, especially with cross-validated smoothness, can identify the subgroups, while the regression mixture model does not perform well as the number of subgroups increases. The naive method has good performance since the dependence is linear. For Example 2, resolution-wise regression can predict the probabilities around the two subgroups and thus has the best performance. SSANOVA does not perform well since it cannot recognize the subgroups. The naive method performs well only in the low-dimensional case, because in the high-dimensional case it is difficult to find a small neighborhood with substantial training data. Random forest does not perform well because it averages the predictions from multiple regression trees and mixes up the two subgroups. Resolution-wise regression performs well in the high-dimensional case, and its distance to the true distribution increases only slightly with the effect of noise variables. For Example 3, resolution-wise regression performs the best, while the naive method has poor performance due to the dimension issue; SSANOVA fails to capture the relationship, since it cannot be expressed in an explicit regression function form; random forest gives an averaged prediction and fails to recognize the multiple patterns at one position. For Example 4, only the resolution-wise regression successfully distinguishes the two subgroups of the heterogeneous-mean model with homogeneous errors from the homogeneous-mean model with heteroscedastic errors.
Based on the simulation results, we can see that resolution-wise regression yields a better distribution prediction when the variance σ2 of the error is larger, which may seem paradoxical compared with point prediction by common regression methods. In fact, for point prediction, the loss comes from the random error term, and a large variance leads to a large loss. For distribution prediction, however, the loss comes from the accumulated probability over possible response values.
The variance affects the shape of the distribution of the response. For a smaller variance, the data are concentrated, and more precise resolutions are needed, which brings difficulties to the estimation of the corresponding logistic regressions. Hence, for the same expansion order dY, our method gives less accurate results for smaller variances. This also offers insight for general prediction problems: distribution prediction is a good alternative for capturing the whole picture when the random error is relatively large.
6. Real Data Analysis
We analyze the real estate valuation dataset (Yeh and Hsu, 2018), obtained from the UCI Machine Learning Repository (https://archive.ics.uci.edu). The dataset contains the unit-area price and the corresponding six explanatory variables of 414 houses collected from Sindian District, New Taipei City. The six explanatory variables are the transaction date, the house age, the distance to the nearest MRT station, the number of convenience stores in the living circle on foot, the latitude, and the longitude. We are interested in how the house price can be explained by these variables. We consider three methods: (1) the naive method with the small neighborhood of the nearest ten samples, where the distance is measured by the Mahalanobis distance (Rosenbaum, 1995) of the rank vector within each predictor; (2) SSANOVA with cubic splines on each main effect and interaction; (3) resolution-wise regression with dX = 5 and dY = 5.
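As an illustration of the naive method's neighborhood, the following sketch computes a Mahalanobis distance on the within-predictor rank vectors and returns the nearest ten training samples; the function name and the use of a pseudo-inverse of the rank covariance are assumptions for illustration, not the authors' exact procedure.

```python
import numpy as np
from scipy.stats import rankdata

def nearest_by_rank_mahalanobis(X_train, x_new, k=10):
    """Indices of the k training samples closest to x_new in Mahalanobis
    distance computed on the within-predictor rank vectors."""
    X_all = np.vstack([X_train, x_new])
    # rank each predictor over training samples plus the new sample
    R = np.column_stack([rankdata(X_all[:, j]) for j in range(X_all.shape[1])])
    cov_inv = np.linalg.pinv(np.cov(R[:-1].T))     # inverse rank covariance
    diffs = R[:-1] - R[-1]
    d2 = np.einsum("ij,jk,ik->i", diffs, cov_inv, diffs)
    return np.argsort(d2)[:k]
```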
There are three interesting results we find from the application of resolution-wise regression.
House prices depend on some features through nonlinear relationships. Such nonlinear patterns can be detected by the BET screening, as shown in Figures 10 and 11. The latitude and the longitude show skewed quadratic effects, which are captured by the relevant patterns in depth 1 and depth 2 respectively. For the screening of interaction patterns, all the interactions are significant. As an example shown in Figure 11, the interaction of the latitude and the longitude shows a concentration of high-price houses, which can be pinpointed around the downtown area. There are also linear patterns found in the data. As shown in Figure 10, the distance to the nearest MRT station and the number of convenience stores in the living circle on foot have linear effects on the house price, which are captured by the relevant patterns in depth 1.
Fig. 10.

Relevant variables and the corresponding most significant patterns. For the distance to the nearest MRT station and the latitude, there exists a linear relationship to the housing prices. For the longitude, the most asymmetric interaction is A1A2B1, which implies a nonlinear dependence.
Fig. 11.

Interaction of latitude and longitude. The high-price houses concentrate around the downtown area.
Potential heterogeneity can be detected by resolution-wise regression. The proposed method clearly predicts the house price to concentrate in two groups, around 30 and 60, which is new information not provided by existing methods. For example, SSANOVA completely misses such heterogeneity. The naive method provides a distribution that vaguely suggests probability mass towards the right tail. However, the subgroup information identified by the naive method is not very clear.
The detected heterogeneity can also be demonstrated through additional information from the map of the city. As shown in Figure 13, the nearest ten samples measured by the Mahalanobis distance form two groups separated by the river and the highway. The two groups differ in house price, and both contribute to the distribution estimation. The effects of the river and the highway play the role of an unobserved variable, which leads to the bimodal shape of the prediction distribution. The prediction from our method suggests that the price of the particular house is more likely to be close to those of the three houses on the lower-left side of the river, which have an average price of 25.75. This prediction is verified as correct on the actual map. In particular, the location of the house is indeed on the lower-left side of the river. This example thus illustrates the advantage of the proposed method: it can detect heterogeneity in the data and provide accurate probability statements about subgroup information.
Fig. 13.

The testing sample (red pin) and the nearest 10 samples measured by the Mahalanobis distance of the rank vector within every predictor (blue pins, where two samples on the upper left side have the same location, and two samples on the lower left side have the same location) on the map. Among these 10 houses, four are on the lower left side of the river with an average price of 25.75, and six are on the upper right side of the river with an average price of 43.87.
In summary, this real data analysis indicates that the resolution-wise regression model can capture heterogeneous patterns and thus can deliver more detailed prediction information than traditional methods. Since no distributional assumption is required, our method is rather general and robust.
7. Conclusion
In this paper, we propose the resolution-wise regression model to predict the distribution of the response with heterogeneous data. The complicated relationship between the response and the explanatory variables is decomposed into relationships between resolutions of the response and patterns of the predictors based on binary expansions. A set of penalized logistic regressions establishes the effects of the patterns on the resolutions. Through the BID transformation, our method can estimate the cell probabilities of the histogram of the response, which approximates the distribution of the response. We also show the consistency of the estimated cell probabilities. Numerical studies demonstrate the effectiveness of the proposed method.
Supplementary Material
Fig. 7.

Heatmaps of prediction distributions for Example 2 with respect to dimension q = 1, 5, 10 and five methods: naive method, SSANOVA, Random Forest, resolution-wise regression with fixed smoothness, and resolution-wise regression with CV, from left to right respectively. A darker color indicates a larger PDF value at the corresponding predicted response.
Fig. 8.

Heatmaps of prediction distributions for Example 3 with respect to circular implicit functional relationship and five methods: naive method, SSANOVA, Random Forest, resolution-wise regression with fixed smoothness, and resolution-wise regression with CV, from left to right respectively. A darker color indicates a larger PDF value at the corresponding predicted response.
Fig. 12.

Predicted distributions by naive method by the nearest ten samples, SSANOVA, and resolution-wise regression.
Table 2.
Comparison of average test errors (and corresponding standard errors in parentheses) for Example 2 with respect to dimension q = 1, 5, 10 and three distance measures. The results for naive method, SSANOVA, Random Forest, resolution-wise regression with fixed smoothness, and resolution-wise regression with CV are listed in the columns from left to right respectively.
| Example | Measure | Naive | SSANOVA | Random Forest | Fixed smoothness | CV |
|---|---|---|---|---|---|---|
| q = 1 | KS | 0.103 | 0.215 | 0.256 | 0.169 | 0.167 |
| | | (0.006) | (0.005) | (0.008) | (0.013) | (0.014) |
| | KL | 0.261 | 0.605 | 0.787 | 0.089 | 0.693 |
| | | (0.010) | (0.021) | (0.049) | (0.008) | (0.037) |
| | L1 | 0.383 | 0.765 | 0.716 | 0.283 | 0.517 |
| | | (0.018) | (0.033) | (0.035) | (0.015) | (0.027) |
| q = 5 | KS | 0.345 | 0.219 | 0.222 | 0.180 | 0.179 |
| | | (0.019) | (0.016) | (0.012) | (0.015) | (0.014) |
| | KL | 0.701 | 0.589 | 0.730 | 0.146 | 0.532 |
| | | (0.034) | (0.025) | (0.037) | (0.011) | (0.028) |
| | L1 | 0.937 | 0.774 | 0.693 | 0.348 | 0.506 |
| | | (0.039) | (0.035) | (0.028) | (0.015) | (0.022) |
| q = 10 | KS | 0.344 | 0.210 | 0.213 | 0.169 | 0.183 |
| | | (0.006) | (0.006) | (0.015) | (0.018) | (0.018) |
| | KL | 0.767 | 0.589 | 0.666 | 0.185 | 0.572 |
| | | (0.033) | (0.024) | (0.050) | (0.022) | (0.035) |
| | L1 | 0.952 | 0.735 | 0.693 | 0.346 | 0.498 |
| | | (0.038) | (0.032) | (0.035) | (0.015) | (0.026) |
Table 3.
Comparison of average test errors (and corresponding standard errors in parentheses) for Example 3 with respect to circular and spherical implicit functional relationship and three distance measures. The results for naive method, SSANOVA, Random Forest, resolution-wise regression with fixed smoothness, and resolution-wise regression with CV are listed in the columns from left to right respectively.
| Example | Measure | Naive | SSANOVA | Random Forest | Fixed smoothness | CV |
|---|---|---|---|---|---|---|
| Circle | KS | 0.175 | 0.201 | 0.398 | 0.172 | 0.156 |
| | | (0.010) | (0.014) | (0.030) | (0.010) | (0.008) |
| | KL | 0.378 | 0.436 | 4.180 | 0.264 | 0.404 |
| | | (0.019) | (0.032) | (0.322) | (0.008) | (0.027) |
| | L1 | 0.677 | 0.761 | 1.384 | 0.515 | 0.535 |
| | | (0.039) | (0.034) | (0.108) | (0.022) | (0.025) |
| Sphere | KS | 0.170 | 0.184 | 0.413 | 0.183 | 0.161 |
| | | (0.012) | (0.010) | (0.037) | (0.009) | (0.006) |
| | KL | 0.369 | 0.411 | 4.553 | 0.310 | 0.339 |
| | | (0.015) | (0.021) | (0.355) | (0.009) | (0.016) |
| | L1 | 0.664 | 0.735 | 1.439 | 0.580 | 0.570 |
| | | (0.026) | (0.038) | (0.080) | (0.024) | (0.025) |
Acknowledgments
The authors would like to thank the editor, the associate editor, and two anonymous referees for their valuable comments and suggestions.
Funding
Research of Qizhai Li was partially supported by Beijing Natural Science Foundation (Z180006) and the National Natural Science Foundation of China (11722113). Kai Zhang was supported in part by NSF grants DMS-1613112, IIS-1633212, and DMS-1916237. Yufeng Liu was supported in part by NSF grant DMS-2100729 and NIH grant R01GM126550.
Appendix
Proof of Theorem 1:
We split the whole proof into two steps: (1) the selected variables from the BET screening are equivalent to those from the sure independence screening based on the MMLE, (2) the proposed logistic regression satisfies the conditions in Fan and Song (2010) to achieve the sure independence screening property.
-
The BET test statistic
By the definition in (12), and , the MMLE can be obtained by the optimization with respect to ’s, i.e.,
Setting the derivative of the above objective function with respect to βm,j to be zero, we have that satisfies
The second equation holds since there are n / 2 samples with and the other n / 2 samples with , due to the binary expansion from the empirical CDF transformation. Denote . Differentiating with respect to , we obtain . Hence is strictly increasing with respect to . For γm,n > 0, if , . If ,
Since , we have . Hence we have the variable selection index set . Similarly, we have . Denote . Taking for some c4,m > 0, we have
Hence the BET screening is equivalent to the sure independence screening based on MMLE. This ensures that the estimation methods are the same as those in Fan and Song (2010).
-
Under the binary expansion, the variables are bounded in [0,1] after empirical CDF transformation, so that the conditions A – C in Fan and Song (2010) are naturally satisfied. Assumption 1 is analogous to Condition E. So we only need to check Condition D. For the proposed logistic regression, let w0 = 1, and we have
Taking w1 = 2, h1,m = 3, h0,m = 1, α = 1 satisfies Condition D.
Hence by Theorem 4 in Fan and Song (2010), for , the sure independence screening property is achieved.
Proof of Theorem 2:
For logistic regression, the function b (·) of Condition G in Fan and Song (2010) is b(x) = log(1 + ex). Since . Condition G in Fan and Song (2010) is met. Together with Assumption 2, by Theorem 5 in Fan and Song (2010), the proof is completed.
Let ℱ be a normed real vector space. For the m-th logistic regression function , where , the negative log-likelihood loss is , where is the predictor vector including the selected r patterns. For a loss function , define the empirical risk for by , and the theoretical risk by . Consider the collection ℱ to be a linear-model class, i.e., , where β ↦ fβ is linear. For the m-th regression, note that the true coefficient vector is the minimizer of the theoretical risk
| (13) |
and . We assume for simplicity that the minimum exists and is unique. For , the excess risk is defined by . The lasso estimator is , , where ∥·∥1 is the ℓ1-norm and λm is a tuning parameter. The estimation of the regression function is .
Denote and . We have . Hence based on the link function of logistic regression, we can define a functional g mapping em (·) to the regression function :
| (14) |
Denoting as the true expectation corresponding to , by (14), we have . Similarly, recall that is the estimated expectation, and thus we have .
For a given index set Sm ⊂ {1, …, r}, define , , j = 1, …, r. Denote the estimator restricted to by . Write . Restricted to ’s, the best approximation of is , where .
The following assumption requires a certain compatibility of ℓ1-norm with the norm on ℱ, which is a regular assumption for the theoretical framework for lasso.
Assumption 3. (Compatibility condition) We say that the compatibility condition is met for the set Sm with constant ϕm > 0, if for all βm satisfying , it holds that .
Next, we show the definition of the margin condition (Bühlmann and van de Geer, 2011) and demonstrate that the penalized logistic regression satisfies the condition with a quadratic margin.
Definition 1 (Margin condition). Denote a “neighborhood” of by with constant ηm > 0. We say that the margin condition holds with a strictly convex function G, if for all , we have , where ∥·∥ is the norm defined on ℱ.
Assumption 4. For any fixed , there exists some constant such that , .
Lemma 5. Under Assumption 4, the margin condition holds for all penalized logistic regressions with a quadratic margin, i.e., Gm(u) = cmu2 for the m-th regression.
The technical proof of Lemma 5 can be found in the supplement. For the m-th regression, the oracle (Bühlmann and van de Geer, 2011) is defined by
| (15) |
where Sβ ≔ {j: βj ≠ 0}, sβ ≔ |Sβ| denotes the cardinality of Sβ, is a compatibility constant, and Ψ is a suitably large collection of index sets. Denote the index set of non-zero coefficients by , and the cardinality of by . Assuming is linear, we can take . Hence the definition of is consistent with the definition of in (13), since the second term of (15) does not rely on β. In this context, we only use the notation . Denote the minimum of (15) by . Define , where is the empirical process. Set , and . Bühlmann and van de Geer (2011) showed that one can choose such that the set 𝒯m has large probability.
Assumption 5. For some constant ηm > 0, for all , as well as .
According to the BID equation, we estimate p by solving the optimization (9). From the optimization, Hm+1 p is an approximation of , where Hm+1 is the (m + 1)-th row of H, since is the (m + 1)-th entry of E. Hence g(Hm+1 p) is the estimated m-th regression function corresponding to p. The following theorem gives the consistency of cell probability vector p in terms of excess risk of g (Hm+1 p).
We now turn to the proof of Theorem 3. Below is its statement again for convenience.
Theorem 3 (restated). Assume Assumptions 1–5 hold, where Assumption 3 holds with the set . For the logistic regression with covariates corresponding to the BET screening set , suppose that λm satisfies . Then on the set 𝒯m, we have,
where K > 0 is a constant,
Before the proof of Theorem 3, we first state the oracle inequality for penalized logistic regression from Bühlmann and van de Geer (2011) as follows.
Lemma 6. Assume Assumptions 3–5 hold, where 3 holds with the set . Suppose that λm satisfies the inequality . Then on the set 𝒯m, we have
where .
Then we have the proof for Theorem 3 as follows.
Proof of Theorem 3:
For the excess risk of g (Hm+1 p), we have
It can be shown that the function is Lipschitz continuous, with the Lipschitz constant Km obtained from the first derivative
For part I, denoting the true expectation vector by E0 and the corresponding true cell probability vector by p0 = H−1 E0, we have
By Lemma 6, . Since with predictors taking values from {− 1, 1}, we have . One can similarly show that the function g−1 (·) is Lipschitz continuous with Lipschitz constant two. Hence we have , and thus , where , , , and .
For the true expectation vector and the true cell probability vector, we have ∥H p0 − E0∥1 = 0. Denote . Together with the inequality in Lemma 6 to handle the part II, the proof is completed.
Footnotes
Conflict of Interest Statement
The authors declare that they have no competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References
- Bühlmann P & van de Geer S (2011). Statistics for High-Dimensional Data. Springer.
- Chen J, Tran-Dinh Q, Kosorok MR & Liu Y (2021). Identifying heterogeneous effect using latent supervised clustering with adaptive fusion. Journal of Computational and Graphical Statistics 30, 43–54.
- Cox DR & Reid N (2000). The Theory of the Design of Experiments. CRC Press.
- Fan J & Song R (2010). Sure independence screening in generalized linear models with NP-dimensionality. Ann. Stat. 38, 3567–3604.
- Fan J & Lv J (2008). Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B Stat. Methodol. 70, 849–911.
- Golubov B, Efimov A & Skvortsov V (2012). Walsh Series and Transforms: Theory and Applications. Springer Science & Business Media.
- Green P & Silverman B (1994). Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman and Hall, London.
- Greenshtein E & Ritov Y (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10, 971–988.
- Gu C (2002). Smoothing Spline ANOVA Models. Springer-Verlag, New York.
- Guo FJ, Levina E, Michailidis G & Zhu J (2010). Pairwise variable selection for high-dimensional model-based clustering. Biometrics 66, 793–804.
- Harmuth H (2013). Transmission of Information by Orthogonal Functions. Springer Berlin Heidelberg.
- Hocking T, Joulin A, Bach F & Vert JP (2011). Clusterpath: An algorithm for clustering using convex fusion penalties. In Proceedings of the 28th International Conference on Machine Learning (ICML'11), Getoor L and Scheffer T, eds., New York: Omnipress, pp. 745–52.
- Jacobs RA, Jordan MI, Nowlan SJ & Hinton GE (1991). Adaptive mixtures of local experts. Neural Comp. 3, 79–87.
- Kac M (1959). Statistical Independence in Probability, Analysis and Number Theory. Mathematical Association of America.
- Lindsten F, Ohlsson H & Ljung L (2011). Clustering using sum-of-norms regularization: With application to particle filter output computation. In 2011 IEEE Statistical Signal Processing Workshop (SSP), pp. 201–4.
- Lynn PA (1973). An Introduction to the Analysis and Processing of Signals. London: Macmillan.
- Ma S & Huang J (2017). A concave pairwise fusion approach to subgroup analysis. J. Am. Statist. Assoc. 112, 410–23.
- MacQueen J (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, LeCam LM and Neyman J, eds., Berkeley, CA: University of California Press, pp. 281–97.
- Pan W & Shen X (2006). Penalized model-based clustering with application to variable selection. J. Mach. Learn. Res. 8, 1145–64.
- Pan W, Shen X & Liu B (2013). Cluster analysis: Unsupervised learning via supervised learning with a non-convex penalty. J. Mach. Learn. Res. 14, 1865–89.
- Pearl J (1971). Application of Walsh transform to statistical analysis. IEEE Trans. Syst. Man. Cybern. SMC-1, 111–9.
- Raftery A & Dean N (2006). Variable selection for model-based clustering. J. Am. Statist. Assoc. 101, 168–78.
- Rosenbaum P (1995). Observational Studies. Springer.
- Sylvester JJ (1867). LX. Thoughts on inverse orthogonal matrices, simultaneous sign-successions, and tessellated pavements in two or more colours, with applications to Newton's rule, ornamental tile-work, and the theory of numbers. Lond. Edinb. Dubl. Phil. Mag. 34, 461–75.
- Tang X & Qu A (2017). Individualized multi-directional variable selection. arXiv: 1709.05062.
- Tibshirani R (1996). Regression shrinkage and selection via the lasso. J. Royal Stat. Soc. B 58, 267–88.
- Wahba G (1990). Spline Models for Observational Data. SIAM, Philadelphia.
- Yeh IC & Hsu TK (2018). Building real estate valuation models with comparative approach through case-based reasoning. Appl. Soft Comput. 65, 260–71.
- Zhang K (2019). BET on independence. J. Am. Statist. Assoc., DOI: 10.1080/01621459.2018.1537921.
- Zhang K, Zhao Z & Zhou W (2021). BEAUTY powered BEAST. arXiv: 2103.00674.