Abstract
Modeling and inference for heterogeneous data have gained great interest recently due to rapid developments in personalized marketing. Most existing regression approaches are based on the conditional mean and may require additional cluster information to accommodate data heterogeneity. In this paper, we propose a novel nonparametric resolution-wise regression procedure to provide an estimated distribution of the response instead of one single value. We achieve this by decomposing the information of the response and the predictors into resolutions and patterns respectively based on marginal binary expansions. The relationships between resolutions and patterns are modeled by penalized logistic regressions. Combining the resolution-wise prediction, we deliver a histogram of the conditional response to approximate the distribution. Moreover, we show a sure independence screening property and the consistency of the proposed method for growing dimensions. Simulations and a real estate valuation dataset further illustrate the effectiveness of the proposed method.
Keywords: Binary Expansion, Data heterogeneity, Nonparametric Statistics, SSANOVA, Sure independence screening
1. Introduction
A common nonparametric regression model establishes the effects of the explanatory variables on the response variable in the form of
$$Y = f(X) + \varepsilon, \qquad (1)$$
where Y is the response variable, X = (X1, …, Xq)T is the q-dimensional explanatory variable vector, and ε is the random error, which is often assumed to have mean 0 and variance σ2 and to be independent of X.
In recent years there has been a growing demand for exploring regression methods for heterogeneous populations, which has broad applications in personalized marketing and other fields. One characteristic of data heterogeneity is the existence of subpopulations in the data. In practice, the heterogeneity can be regarded as the result of some latent variables. This happens frequently since it is difficult to collect all the explanatory variables for the response. For example, in the real estate data in Section 6, a river and a highway through the city create subpopulations and heterogeneous distributions of housing prices. However, the information of this river and this highway is not available in the data.
Denote the unobserved categorical variable by Z, taking values in {1, …, T}, where T is unknown. Suppose the potential true relationship between the response and all the explanatory variables can be expressed by
$$Y = f_Z(X) + \varepsilon, \qquad (2)$$
with unknown functions ft’s, t = 1, …, T. In this paper, our goal is to relate Y with X without knowing Z. However, this differs from fitting model (1), since the true relationship between Y and X may not even be a function. As an illustration, the housing prices on the two sides of Tamsui River with respect to longitude and latitude are shown in Figure 1. The plot shows a mixture of two subgroups: the housing prices on the west and the east of the river behave differently. The latent variable Z, that is, the indicator of which side of the river a location is on, determines the two subgroups. Without knowing Z, the relationship between Y and X cannot be captured by a single function. Hence new methods to model the effects of X on Y with such a challenging heterogeneous population are in great need.
Fig. 1.

Left panel: An illustration of heterogeneous data in the housing prices on two sides of Tamsui River (light blue curve). Right panel: The housing prices follow different distributions: The housing prices on the west monotonically increase with latitude, while those on the east are concave and parabolic.
One possible idea is to use smoothing splines (Green and Silverman, 1994) which can capture local behaviors. A more general setting is the smoothing spline analysis of variance (SSANOVA) (Wahba, 1990; Gu, 2002), which fits an additive model for main effects and interactions. These approaches use regression to estimate the overall conditional mean function, which tends to fit a compromised effect of the subgroups. Hence they may fail to identify the subgroups of the population, and the results might not be informative for either of the subgroups. Moreover, the estimated distribution might not really reflect the pattern of the true one.
Another possible strategy is to cluster the data first, then fit a regression model within each subgroup. Existing model-based clustering approaches include Jacobs et al. (1991), Pan and Shen (2006), Raftery and Dean (2006), and Guo et al. (2010). As alternative approaches, Lindsten et al. (2011), Hocking et al. (2011), and Pan et al. (2013) formulated clustering as a penalized regression problem with fusion-type penalties. However, these methods focus on finding the groups based on the similarity of the explanatory variables, instead of identifying groups with different effects on the response.
In the literature, some individualized methods have been proposed to handle heterogeneity. Ma and Huang (2017) employed subject-specific intercepts to model the unobserved factors which lead to the heterogeneity. They used a concave pairwise fusion penalty to shrink some intercepts to be the same, which can produce a partition of subgroups. Chen, Tran-Dinh, Kosorok and Liu (2021) considered a more general fusion method than that of Ma and Huang (2017) to identify subgroups. Tang and Qu (2017) proposed a multi-directional penalty to shrink individuals to different groups. The performance of these methods depends on how well the subgroups are separated. If the subgroups are close to each other, the performance can be less accurate.
In this paper, we tackle the heterogeneity from a new perspective. Instead of estimating ft’s in (2) through nonparametric regressions, we propose to estimate the conditional distribution of the response variable given observed explanatory variables. The estimated distribution provides an overall picture of the response variable, and can indicate the heterogeneity by the modes of the probability density function (PDF). To achieve this goal, one single regression is not enough, because the pattern of two or more possible values of the response corresponding to one observation of predictor variables cannot be expressed by an explicit function. Our idea is to consider binary expansion statistics proposed in Zhang (2019) and to decompose the response variable into several resolutions which can capture the local information. By establishing a set of logistic regressions, we relate the resolution information to the predictors. The set of regressions can model the heterogeneity since various estimations can be obtained from different local logistic regression models. To achieve the localization, we decompose the response variable by marginal binary expansions, which provides a balanced design and orthogonal resolutions. Our method eventually estimates the distribution of the response variable by a histogram, which shows the possible heterogeneity and even more complicated distributions, without any assumption of subgroup patterns. We show that the method has a sure independence screening property (Fan and Lv, 2008; Fan and Song, 2010) and provides consistent estimates for cell probabilities of the histogram for growing dimensions.
The rest of this paper is organized as follows. In Section 2, we introduce resolution-wise regression, including the decomposition of the response variable and the establishment of the logistic regressions. Section 3 extends the proposed method to high-dimensional settings. In Section 4, we show the consistency of the estimated histogram. In Sections 5 and 6, we demonstrate the performance of our method by the simulated data and the real estate valuation dataset. Section 7 concludes this paper. Some technical proofs and additional simulation results are presented in the Appendix and Supplement.
2. Methodology
A distribution estimation provides more information than a point estimation for heterogeneous data, as the estimated distribution can identify the subgroups by the shape of the PDF. For the case that the subpopulations are not obviously distinguishable from each other, the estimated distribution can still reflect the dispersion of the data.
A direct idea is splitting the range of Y by a partition min(Y) = a0 ≤ a1 ≤ ⋯ ≤ aB = max(Y) and modeling the probability of Y falling into each interval with X. This idea handles the heterogeneity by decomposing the information of Y into several non-overlapping intervals. These intervals capture the local information of Y and work together to show the whole histogram. However, a drawback of this approach is the possible loss of information from neglecting the joint information across intervals and from insufficient samples within each interval. In this paper, we propose to construct overlapping resolutions based on binary expansions, where each resolution groups the distribution information of the union of several intervals and includes all corresponding samples. In essence, the proposed construction leads to a balanced design and has a non-redundant orthogonality property. A histogram can be obtained by a transformation from resolution probabilities to cell probabilities. Here we refer to a cell as a bin of the histogram. In this section, we consider the one-dimensional X, and then extend to the high-dimensional case in Section 3. The rest of this section is organized as follows. In Section 2.1, we introduce the construction of the resolutions. In Section 2.2, a set of resolution-wise penalized logistic regressions is established. In Section 2.3, we introduce the binary interaction design (BID) equation to accomplish the transformation from the frequency domain to the probability domain.
2.1. Frequency domain from binary expansions
To overcome the imbalance of non-overlapping intervals, we use a balanced design based on binary expansions. A classical result on the binary expansion of a uniform random variable (Kac, 1959) is given as follows:
Lemma 1. For U ~ Uniform[0, 1], we have $U = \sum_{k=1}^{\infty} V_k / 2^k$, where V1, V2, …, Vk, … i.i.d. follow Bernoulli(1/2).
Denote the cumulative distribution function (CDF) transformation of Y by UY. By Lemma 1, we have
$$U_Y = \sum_{k=1}^{\infty} B_k / 2^k. \qquad (3)$$
Through the expansion, the information of Y is decomposed into the information of the Bk’s. In (3), the binary variable Bk can be regarded as the indicator that the k-th binary digit of UY equals one; for instance, B1 = I(UY ≥ 1/2).
Figure 2 shows the binary variables B1, B2, B3 with respect to UY. As a finite approximation of the infinite binary expansion, we can truncate the binary expansion of UY up to the dY-th order:
$$U_Y \approx \sum_{k=1}^{d_Y} B_k / 2^k. \qquad (4)$$
Fig. 2.

Binary variables B1, B2, B3 from binary expansions of UY. Regions with Bk = 1, k = 1, 2, 3 are in white and regions with Bk = 0, k = 1, 2, 3 are in blue.
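To make the construction concrete, the following minimal sketch (in Python; not the authors' implementation, and the function name is hypothetical) computes the empirical CDF transform of a response vector and its first dY binary digits from (3)–(4), together with the {−1, 1} coding used for the resolutions below.

```python
import numpy as np

def binary_digits(y, d):
    """Empirical CDF transform of y, then the first d binary digits B_k
    of U_Y = sum_k B_k / 2^k, as in (3)-(4)."""
    n = len(y)
    # empirical CDF transform: ranks rescaled into (0, 1)
    u = (np.argsort(np.argsort(y)) + 0.5) / n
    digits = np.empty((n, d), dtype=int)
    frac = u.copy()
    for k in range(d):
        frac = frac * 2.0
        digits[:, k] = (frac >= 1.0).astype(int)   # B_{k+1}
        frac = frac - digits[:, k]
    return digits                                   # values in {0, 1}

y = np.random.randn(8)
B = binary_digits(y, d=3)
B_dot = 2 * B - 1    # the {-1, 1} coding used for resolutions and patterns
```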
Now we introduce the notation of resolutions. Using binary variables taking values in {−1, 1} instead of {0, 1} through the transformation Ḃk = 2Bk − 1, the interactions of the Bk’s can be written as products. For example, the event {B1 = 1, B2 = 1} ∪ {B1 = 0, B2 = 0} is equivalent to {Ḃ1Ḃ2 = 1}. In the remainder of this paper, we shall work with the Ḃk’s.
To approximate the information given by Y, say the σ-field σ(Y), we can use the σ-field generated by the Ḃk’s. For the truncation up to the dY-th order, we can find a basis with the variables
$$\big\{\dot{B}_{k_1}\dot{B}_{k_2}\cdots \dot{B}_{k_p} : 1 \le k_1 < k_2 < \cdots < k_p \le d_Y \big\}. \qquad (5)$$
We shall refer to these binary interaction variables as resolutions of Y, and the set of all possible values of these resolutions as the frequency domain. Figure 3 shows these variables with UY expanded up to the second order. Through this resolution decomposition, each variable takes value one on half of [0, 1] and value negative one on the other half.
Fig. 3.

Basis binary variables where UY is expanded up to the second order.
2.2. Logistic regression in the frequency domain
With the resolutions decomposed from the binary expansion, we aim to model the relationship between each resolution and the predictors. The resolutions constructed by the binary expansion are independent of each other, so they can be modeled marginally. Since the resolutions are binary, this is essentially a classification problem. Note that for every resolution, Y is divided into two classes, each a union of intervals, according to the sign of the binary interaction. Hence the decision boundary can be nonlinear. Therefore, we propose to use binary expansions of the predictors as a nonparametric basis and to fit a logistic regression on each resolution. Similar to the construction for UY, the binary expansion of UX up to the dX-th order is $U_X \approx \sum_{k=1}^{d_X} A_k / 2^k$. Denote Ȧk = 2Ak − 1. The σ-field generated by the Ȧk’s has a basis with the $2^{d_X} - 1$ variables {Ȧk1⋯Ȧkp : 1 ≤ k1 < ⋯ < kp ≤ dX}. We shall refer to these variables as patterns of X. After the construction of the patterns, the complicated effect of X on Y can be captured by logistic regression, which enjoys efficiency from the orthogonality of the patterns. We establish a set of penalized logistic regressions with the ℓ1 penalty (Tibshirani, 1996), one for each resolution of Y, with all patterns of X as predictors. Denote the vector collecting all patterns of X by Ȧ, and the vector collecting all resolutions of Y by Ḃ. Let (xi, yi), i = 1, …, n, be n independent observations of (X, Y). Denote ȧi and ḃi as the i-th pattern vector and the i-th resolution vector obtained from the binary expansions of the empirical CDF transformations of xi and yi respectively. The m-th logistic regression, which models the effect of Ȧ on the m-th resolution Ḃ(m), is established as
$$\log \frac{P(\dot{B}^{(m)} = 1 \mid \dot{\mathbf{A}})}{P(\dot{B}^{(m)} = -1 \mid \dot{\mathbf{A}})} = \dot{\mathbf{A}}^{\mathrm{T}} \beta_m, \qquad (6)$$
where βm is the coefficient vector. We employ an ℓ1 regularization to give the estimator
$$\hat{\beta}_m = \arg\min_{\beta_m} \Big\{ -\frac{1}{n} \sum_{i=1}^{n} \log P_{\beta_m}\big(\dot{B}^{(m)} = \dot{b}_i^{(m)} \,\big|\, \dot{\mathbf{A}} = \dot{\mathbf{a}}_i\big) + \lambda_m \|\beta_m\|_1 \Big\}, \qquad (7)$$
where λm ≥ 0 is a tuning parameter. The conditional expectation of Ḃ(m) given Ȧ = ȧ can then be estimated by 2P̂(Ḃ(m) = 1 | ȧ) − 1, where P̂ is the fitted probability from (7).
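As an illustration of (6)–(7), the sketch below fits one ℓ1-penalized logistic regression per resolution using scikit-learn's LogisticRegression as an off-the-shelf solver; the helper names, the tuning constant C, and the use of scikit-learn are assumptions for illustration rather than the authors' actual implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_resolution_regressions(A_dot, B_dot, C=1.0):
    """One l1-penalized logistic regression per resolution of Y.

    A_dot : (n, L) matrix of {-1, 1} patterns of X (predictors).
    B_dot : (n, M) matrix of {-1, 1} resolutions of Y (responses).
    Returns the list of fitted models, one per resolution.
    """
    models = []
    for m in range(B_dot.shape[1]):
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        clf.fit(A_dot, (B_dot[:, m] > 0).astype(int))
        models.append(clf)
    return models

def estimate_resolution_expectations(models, a_new):
    """Estimate E[B_dot^(m) | pattern a_new] = 2 P(B^(m) = 1 | a_new) - 1."""
    probs = np.array([m.predict_proba(a_new.reshape(1, -1))[0, 1] for m in models])
    return 2.0 * probs - 1.0
```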
2.3. Binary interaction design: from frequency domain back to probability domain
As a final step of estimating the distribution of Y, we aim to transform the conditional expectations of the resolutions into the conditional cell probabilities of the corresponding histogram. First, we simplify the notation of the conditional expectation by using a dY-dimensional binary index. Namely, denote the conditional expectation $E[\dot{B}_{k_1}\cdots\dot{B}_{k_p} \mid \dot{\mathbf{A}}]$, p ∈ {1, …, dY}, {k1, …, kp} ⊂ {1, …, dY}, by Eb, where b is a vector of length dY with value one at positions k1, …, kp and zero otherwise. Let E be the $2^{d_Y}$-dimensional conditional expectation vector whose entries are sorted in ascending order of b in the binary system. Hence, in some sense, b identifies the resolutions. Note that we set E(0,…,0) = 1 as the first entry, and the entry indexed by the decimal value of b plus one is Eb. For example, the expectation vector with dY = 3 is
$$\mathbf{E} = \big(1,\, E_{001},\, E_{010},\, E_{011},\, E_{100},\, E_{101},\, E_{110},\, E_{111}\big)^{\mathrm{T}}.$$
We denote the conditional probabilities of the cells in terms of the same binary index. Define the conditional cell probability pb as the conditional probability of the Bk’s taking the values specified by b given the patterns of X, i.e., pb = P(B1 = b1, …, BdY = bdY | Ȧ). As an example, for dY = 3, p101 = P(B1 = 1, B2 = 0, B3 = 1 | Ȧ). Let p be the $2^{d_Y}$-dimensional conditional probability vector of the cells whose entries are sorted in descending order of b in the binary system. For example, the conditional probability vector of the cells with dY = 3 is
$$\mathbf{p} = \big(p_{111},\, p_{110},\, p_{101},\, p_{100},\, p_{011},\, p_{010},\, p_{001},\, p_{000}\big)^{\mathrm{T}}.$$
With the above notations, we establish the binary interaction design (BID) equation (Zhang, 2019) to transform the expectations of resolutions into cell probabilities. The equation is established by the Sylvester’s construction of Hadamard matrix (Sylvester, 1867).
Lemma 2 (BID equation). Let E be the conditional expectation vector of the resolutions from the binary expansion, and p be the conditional probability vector of the cells. Then
$$\mathbf{E} = \mathbf{H}\,\mathbf{p}, \qquad (8)$$
where H is the $2^{d_Y} \times 2^{d_Y}$ Hadamard matrix from Sylvester’s construction (Sylvester, 1867).
From the BID equation, the conditional probabilities of Y falling into each cell given X can be obtained by estimating the conditional expectations of the resolutions. Denote the estimator of E by Ê. Naturally, p can be estimated by p̂ = H−1Ê, since the Hadamard matrix H is invertible with $H^{-1} = H / 2^{d_Y}$. However, this p̂ may not be a probability measure. From the structure of the Hadamard matrix, the following lemma shows that the sum of the estimated cell probabilities is one.
Lemma 3. For any estimator Ê (whose first entry is one by construction), the estimator p̂ = H−1Ê has entries that sum to one.
Proof of Lemma 3. Since H is symmetric and $HH = 2^{d_Y} I$, we have $H^{-1} = H / 2^{d_Y}$. The first column of H consists of ones, and every other column has an equal number of entries equal to 1 and −1, so $\mathbf{1}^{\mathrm{T}} H = (2^{d_Y}, 0, \ldots, 0)$. Hence the sum of the entries of p̂ is
$$\mathbf{1}^{\mathrm{T}} \hat{\mathbf{p}} = \mathbf{1}^{\mathrm{T}} \mathbf{H}^{-1} \hat{\mathbf{E}} = 2^{-d_Y} (2^{d_Y}, 0, \ldots, 0)\, \hat{\mathbf{E}} = \hat{E}_{(0,\ldots,0)} = 1. \qquad \square$$
Note that we cannot guarantee the entries of p̂ = H−1Ê to be nonnegative. Instead, we consider the following optimization problem to solve for p:
$$\hat{\mathbf{p}} = \arg\min_{\mathbf{p}} \;\|\mathbf{H}\mathbf{p} - \hat{\mathbf{E}}\|_1 \quad \text{subject to} \quad \mathbf{p} \ge \mathbf{0}, \;\; \mathbf{1}^{\mathrm{T}}\mathbf{p} = 1. \qquad (9)$$
Since the cells can be viewed as the bins of a histogram of Y, we essentially estimate the distribution of Y when the resolutions are decomposed in an arbitrarily fine fashion. In practice, a finite dY is used, and we can smooth the histogram to approximate the distribution.
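To illustrate the transformation from the frequency domain back to the probability domain, the sketch below builds the Sylvester Hadamard matrix with scipy.linalg.hadamard and solves the constrained problem as a small linear program; it assumes the ℓ1 form of (9) written above (a minimal sketch, with a hypothetical function name, not the authors' implementation).

```python
import numpy as np
from scipy.linalg import hadamard
from scipy.optimize import linprog

def cells_from_expectations(E_hat):
    """Recover cell probabilities p from the resolution expectation vector E_hat
    via the BID equation E = H p, solved as an l1 problem with p >= 0 and
    sum(p) = 1, reformulated as a linear program."""
    K = len(E_hat)                    # K = 2 ** d_Y
    H = hadamard(K)                   # Sylvester Hadamard matrix
    # variables: [p (K), t (K)]; minimize sum(t) s.t. -t <= H p - E_hat <= t
    c = np.concatenate([np.zeros(K), np.ones(K)])
    A_ub = np.block([[H, -np.eye(K)], [-H, -np.eye(K)]])
    b_ub = np.concatenate([E_hat, -E_hat])
    A_eq = np.concatenate([np.ones(K), np.zeros(K)]).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0, None)] * (2 * K)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:K]

# toy check: expectations built from a known cell-probability vector are recovered
p_true = np.array([0.1, 0.2, 0.3, 0.4])
E_true = hadamard(4) @ p_true         # first entry equals 1 by construction
print(np.round(cells_from_expectations(E_true), 3))
```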
3. Multivariate extensions
Now we extend our framework to the multivariate case. For a q-dimensional X = (X1, …, Xq)T, we perform a binary expansion on every marginal CDF-transformed variable, denoted by UXj, up to the dX-th order, and we have
$$U_{X_j} \approx \sum_{k=1}^{d_X} A_{j,k} / 2^k, \qquad j = 1, \ldots, q. \qquad (10)$$
Denote Ȧj,k = 2Aj,k − 1, j = 1, …, q, k = 1, …, dX. The σ-field generated by all the Ȧj,k’s, which is generated by the binary filtration of all q covariates, has a basis with $2^{q d_X} - 1$ variables in total. This basis set includes all possible patterns with respect to Xj, j = 1, …, q. These patterns can be divided into two groups. One group has terms involving binary variables from only one dimension of X, which capture the marginal patterns. For example, both Ȧ1,1 and Ȧ1,1Ȧ1,2 are marginal patterns with respect to X1. The second group has terms with binary variables from at least two covariates, which reflect interactions of the corresponding covariates. For example, Ȧ1,1Ȧ2,1 corresponds to the interaction of (X1, X2), and Ȧ1,1Ȧ2,1Ȧ3,1 is a term with respect to the three-way interaction (X1, X2, X3). Therefore we refer to interaction terms as the patterns reflecting interactions among the covariates, rather than products of Ȧj,k’s within one covariate. Note that the basis set contains up to q-way interaction terms. However, three-way and higher-order interactions often contribute little to the model, and they are quite complex and difficult to interpret. Thus we only consider the main effects and two-way interaction terms in the basis set. Including the main-effect terms for each of the q explanatory variables and the interaction terms for each pair of the explanatory variables, there are $q(2^{d_X}-1) + \binom{q}{2}(2^{d_X}-1)^2$ patterns in total.
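The pattern construction just described can be sketched as follows; marginal_patterns and design_patterns are hypothetical helper names, and the input is a list of the {−1, 1} digit matrices of the q covariates.

```python
import numpy as np
from itertools import combinations

def marginal_patterns(A_dot_j):
    """All 2**d - 1 within-covariate products of the +-1 digits of one covariate."""
    n, d = A_dot_j.shape
    cols = []
    for r in range(1, d + 1):
        for idx in combinations(range(d), r):
            cols.append(np.prod(A_dot_j[:, list(idx)], axis=1))
    return np.column_stack(cols)

def design_patterns(A_dot_list):
    """Main-effect patterns for each covariate plus all two-way interaction
    patterns (a product of one nontrivial pattern from each of two covariates)."""
    mains = [marginal_patterns(A) for A in A_dot_list]
    blocks = list(mains)
    for i, j in combinations(range(len(A_dot_list)), 2):
        # every pair of columns, one from covariate i and one from covariate j
        inter = mains[i][:, :, None] * mains[j][:, None, :]
        blocks.append(inter.reshape(len(inter), -1))
    return np.column_stack(blocks)
```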
To cope with high-dimensional data, some pre-screening procedures can be performed to reduce the number of patterns. We do not simply reduce the maximum number of variables allowed in a pattern term, since each pattern carries information about a specific form of dependence, due to the orthogonality property of the binary expansion. Instead, a reasonable approach is to test the effect of each pattern on the response variable pairwise. To this end, we modify the binary expansion testing (BET) method (Zhang, 2019), which was originally developed to test independence of two variables, into a pre-screening method for patterns. We also extend BET to a generalized version, which tests the independence of multiple variables and is used to pre-screen interactions.
In the following, we first revisit the BET method, and extend it as a method of pattern pre-screening in Section 3.1. In Section 3.2, we generalize BET to pre-screen interactions by testing the independence of the response and the interaction patterns.
3.1. BET as a pre-screening approach
BET is a nonparametric method for testing dependence between two continuous variables in a distribution-free setting. Hence BET can be used on Y and X. With the binary expansions of UY and UX, the interactions of the bases of the two σ-fields show all possible dependence patterns. Similar to the definition of b in Section 2.3, we use a binary index a to identify the patterns of X. We denote the interaction pattern of a and b by ab. The interaction pattern ab partitions the unit square [0, 1]2 into half positive regions and half negative regions. The difference between the counts in the two regions reflects whether Y and X are independent in terms of the particular interaction pattern. When UY and UX are independent, the counts of the observations in the positive and negative regions should be similar. When they are not independent, there will be a significant difference in counts. Denote by Sab the difference of the counts with respect to the interaction pattern ab. Zhang (2019) gave the distribution of Sab in the following lemma.
Lemma 4.
- When marginal distributions are known, UY and UX are independent if and only if
- When marginal distributions are unknown, UY and UX are estimated by the empirical CDF transformations and respectively, then and are independent if and only if
where denotes the difference of the counts with respect to the interaction pattern ab according to and .
In this way, BET decomposes the information of the relationship between Y and X into interaction patterns. Figure 4 shows all the 9 interaction patterns with depth dX = 2 and dY = 2. An obvious dependence pattern is , which includes most points in the white region.
Fig. 4.

The dependence of Y on X is from the model Y = X2 + ε, where X ~ Uniform (−2, 2) and ε ~ N (0, 0.25). BET detects the dependence through the nine patterns. The pattern shows the most obvious difference of counts of blue and white regions.
With dX and dY large enough, BET can detect arbitrarily complicated dependence. BET also helps indicate the pattern of dependence, since the significant patterns from BET imply how Y depends on X. This inspires us to focus on the detection of the significant patterns and to regard BET as a pattern-screening approach. Performing BET pairwise on {(Y, Xj), j = 1, …, q}, one can reduce all the patterns of Xj to only the dependent ones. We regard the patterns of X that are detected to be dependent with at least one resolution of Y as relevant variables in the penalized logistic regressions. Namely, for each Xj we obtain a set of relevant patterns, denoted by ℛj, j = 1, …, q. Note that we consider the patterns in ℛj as predictors in the regressions for all resolutions of Y, rather than only in the regression for the particular resolution that is detected to be dependent. This helps to avoid false negatives. False positives can be controlled by the lasso shrinkage.
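A minimal sketch of the marginal screening idea follows: for each covariate, the symmetry statistic Sab = Σi aibi (the difference between the counts in the positive and negative regions of the pattern ab) is computed for every pattern–resolution pair, and patterns whose largest absolute statistic exceeds a threshold are kept. The function name and the plain thresholding rule shown are simplified stand-ins for the paper's screening rule.

```python
import numpy as np

def bet_screen_marginal(A_dot_j, B_dot, threshold):
    """Keep the patterns of one covariate whose cross symmetry statistic with
    some resolution of Y exceeds the threshold.

    A_dot_j : (n, Lj) marginal {-1, 1} patterns of covariate X_j.
    B_dot   : (n, M)  {-1, 1} resolutions of Y.
    S_ab = sum_i a_i * b_i equals the count difference between the positive
    and negative regions of the interaction pattern ab.
    Returns the indices of the retained columns of A_dot_j.
    """
    S = np.abs(A_dot_j.T @ B_dot)          # |S_ab| for every (pattern, resolution)
    keep = np.where(S.max(axis=1) > threshold)[0]
    return keep
```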
3.2. A generalized BET as an interaction pre-screening method
The pairwise BET procedure selects dependent marginal patterns. Aiming further to select interaction patterns, we first generalize the original BET to test the independence of Y and the joint distribution of Xi and Xj, 1 ≤ i < j ≤ q. With marginal binary expansions on UY, UXi and UXj, we aim to test all possible dependence patterns from the corresponding σ-fields. Denote the pattern index of Xj, j = 1, …, q, by aj, and the three-way interaction pattern of ai, aj and b by aiajb. Similar to the idea of the original BET, aiajb can be viewed as a partition of the cube [0, 1]3 with half positive and half negative regions. One can test the dependence of Y and the joint (Xi, Xj) through the difference of the counts in the two regions. Figure 5 shows three aspects of a significant interaction pattern. Hence we use the generalized BET to pre-screen the interaction predictors in the regressions. We perform the generalized BET pairwise on {(Y, Xi, Xj), 1 ≤ i < j ≤ q} and obtain the sets of significant interaction patterns, denoted by ℛij.
Fig. 5.

The dependence of Y and X1, X2 is from the model Y = X1X2 + ε, where and ε ~ N (0, 0.25). The generalized BET detects the pattern with the most obvious difference in counts of the positive region (white region with red points) and the negative region (blue region with yellow points).
With the two pre-screening procedures, eventually, the predictor set is
$$\mathcal{R} = \Big(\bigcup_{j=1}^{q} \mathcal{R}_j\Big) \cup \Big(\bigcup_{1 \le i < j \le q} \mathcal{R}_{ij}\Big). \qquad (11)$$
We refer to the pre-screening based on BET and the generalized BET as the BET screening. Algorithm 1 gives the procedure of resolution-wise regression, including the pre-screening and the framework of the estimation.

4. Theoretical Studies
In this section, we first show that the BET screening is a sure independence screening approach (Fan and Song, 2010), which reduces the number of patterns L from exponential growth to O(n). The consistency result of the estimated cell probabilities is also established with the random design and a fixed dY. We allow the dimension q and dX to grow with n.
For the m-th logistic regression, assume that the binary data from the marginal empirical CDF transformation observations are i.i.d. copies of , where is the i-th sample of the L-dimensional binary random vector , and is the i-th sample of the binary response . Denote , i = 1, …, n, as the samples of the covariates that are standardized to have mean zero and standard deviation one for each covariate. We have . Denote as the i-th sample of taking values from {0, 1}. We have . The maximum marginal likelihood estimator (MMLE) for the logistic regression (6), which is a special case of the models in Fan and Song (2010), is defined as the minimizer of the negative log-likelihood of the component-wise regression,
| (12) |
We correspondingly define the population version of the MMLE by
Denote the true regression coefficient vector by . Let be the true index set of non-zero coefficients. We remark here that the overall goal of our analysis is prediction of the response rather than inference on the slopes. Therefore, although the overall parameterization might become unidentifiable when dY and dX are large, this will not harm the prediction results, as studied in Greenshtein and Ritov (2004).
We now provide the theoretical justifications of our method.
Assumption 1. for with constants c1,m > 0 and 0 < κm < 1 / 2.
Assumption 1 is analogous to Condition E of Fan and Song (2010). It ensures that the marginal signals are stronger than the stochastic noise. Within the selected set R, denote a(j) b(m) as the pattern corresponding to and , and let the index set of selected variables using BET screening be , for some threshold δn,m. The following theorem shows that BET screening possesses the sure independence screening property.
Theorem 1. For any c2,m > 0, there exists a positive constant c3,m such that
where kn,m, Kn,m, h0,m, h1,m, αm are some positive constants. If, in addition, Assumption 1 holds, the BET screening possesses a sure independence screening property. By taking , we have
where , the number of nonsparse elements.
Assumption 2. The variance is bounded from above and below.
Assumption 2 is analogous to Condition F of Fan and Song (2010). The following theorem shows that the BET screening can reduce the dimension from to .
Theorem 2. Under Assumption 2, we have for any , and the same constants c3,m, kn,m, Kn,m, h0,m, h1,m, αm as in Theorem 1 such that
Here, we briefly describe the results, whose details are given in the Appendix. Let and is the predictor vector including the selected r patterns. Denote the true coefficient vector of βm by . The estimate of βm is
where ||·||1 is the ℓ1-norm, λm is a tuning parameter , is the binary expansion corresponding to for the ith observation, and fm is the mth logistic regression function . Let be the true function between and . Denote the index set of non-zero coefficients by , and the cardinality of by .
According to the BID equation, we estimate p by solving the optimization (9). From the optimization, Hm+1 p is an approximation of , where Hm+1 is the (m + 1)-th row of H, since is the (m + 1)-th entry of E. Hence g(Hm+1 p) is the estimated m-th regression function corresponding to p. The following theorem gives the consistency of the cell probability vector p in terms of the excess risk of g(Hm+1 p).
Theorem 3. Assume that Assumptions 1 and 2, and Assumptions 3 to 5 given in the Appendix, hold, where Assumption 3 holds with the set . For the logistic regression with covariates corresponding to the BET screening set , suppose that λm satisfies . Then on the set 𝒯m, we have
where K > 0 is a constant, , , , is a compatibility constant, and .
5. Simulation Studies
In this section, we perform simulations to show the performance of resolution-wise regression approach. We compare our method with the following four methods:
Naive method, which first finds a small neighborhood of each test sample in the training set, where ∥X − Xnew∥2 is bounded by a constant, and predicts the distribution of Y | Xnew by the kernel density estimation of the responses in this neighborhood.
SSANOVA, which fits a cubic spline with all main effects and interaction effects. Its prediction distribution is N(Ŷss, σ̂2), where Ŷss is the SSANOVA estimate of Y given a new Xnew, and σ̂2 is the estimated variance of the random error.
Random Forest, which fits a multitude of regression trees and then averages the predictions. Its prediction distribution is N(Ŷrf, ŝ2), where Ŷrf is the estimate of Y from the random forest given a new Xnew, and ŝ is the standard error.
Regression mixture model (used only for Example 1), which identifies the subgroups of the dataset and fits multiple linear regression models. Its prediction distribution is N(Ŷmix, ŝ2), where Ŷmix is the estimate of Y from the regression mixture model given a new Xnew, which is randomly assigned to a subgroup with the weights derived from the training data, and ŝ is the standard error.
We study the following four examples with 1024 samples for both training and testing sets.
Example 1. (Crossing lines) The predictor , i = 1, …, n. For the example with one cross on the plane, the response yi is generated by yi = xiI(gi = 0) − xiI(gi = 1) + εi, where the error , i = 1, …, n, and I(·) is the indicator function with , i = 1, …, n. For the example with multiple crosses, the response yi = (xiI(g1i = 1) − xiI(g1i = 2) + (xi − 10)I(g1i = 3) + (−xi + 10)I(g1i = 4)) I(xi ≥ 0) + (xiI(g2i = 1) − xiI(g2i = 2) + (−xi − 10)I(g2i = 3) + (xi + 10)I(g2i = 4))I(xi < 0) + εi, where the error , i = 1, …, n, and I(·) is the indicator function with , i = 1, …, n, k = 1, 2.
Example 2. (A mixture of linear and quadratic effects) The predictor vector (xi1, …, xiq)T is generated by , i = 1, …, n, j = 1, …, q, with q = 1, 5, 10. The response yi is generated by , which depends only on the first variable xi1, and the other variables are regarded as noise. The error , i = 1, …, n, and I(·) is the indicator function with , i = 1, …, n.
Example 3. (Circular and spherical implicit functional relationship) The predictor vector (xi1, …, xiq)T has q = 5. For the circle example, the predictors and the responses are generated from the polar coordinates xi1 = sin(θi), yi = cos(θi) + εi, where the latent variable , i = 1, …, n, and the error , i = 1, …, n. The noise variables (xi2,…, xiq)T are generated by i = 1, …, n, j = 2, …, q. For the sphere example, where the latent variables , the predictors and the responses are generated from xi1 = sin(θi)cos(ϕi), xi2 = sin(θi)sin(ϕi), yi = cos(θi) + εi, where , i = 1, …, n, and the error , i = 1, …, n. The noise variables are generated by , i = 1, …, n, j = 3, …, q.
Example 4. (Heterogeneous mean versus heteroscedastic error) The predictor , i = 1, …, n. The response yi is generated by , where the error , i = 1, …, n, σ2 = 0.05, 0.25, 0.5.
In each example, BET with depth 5 and a threshold of for the symmetry statistics, where p is the total number of interactions and n is the sample size, is performed for main-effect screening, while the generalized BET with depth 4 and the same threshold is performed for interaction-effect screening. Regarding the choice of depth, BET already reaches high power with a small depth of 3 (Zhang, Zhao and Zhou, 2021). We perform simulations with different depths and report the results in the Appendix; we pick depths 5 and 4, which are high enough. Two types of smoothing approaches are considered: a fixed smoothing parameter (“Fixed smoothness”), and tuning the smoothing parameter by cross-validation (“CV”).
We repeat the simulation 100 times for each example. To measure the test error, we calculate the difference between the prediction distribution from each method and the underlying true distribution using the following distance measures: (1) the Kolmogorov–Smirnov statistic supx |FP(x) − FQ(x)|, (2) the Kullback–Leibler divergence ∫ p(x) log{p(x)/q(x)} dx, and (3) the L1 distance ∫ |p(x) − q(x)| dx, where P and Q are two distributions with CDFs FP and FQ and corresponding PDFs p(·) and q(·) respectively.
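For reference, the three distances can be approximated numerically when the two densities are evaluated on a common uniform grid, e.g. as in the following sketch (the small constant guarding the logarithm is an assumption).

```python
import numpy as np

def distribution_distances(p, q, grid):
    """Kolmogorov-Smirnov, Kullback-Leibler, and L1 distances between two
    densities p and q evaluated on a common uniform grid (numerical approximation)."""
    dx = grid[1] - grid[0]                         # uniform grid spacing assumed
    ks = np.max(np.abs(np.cumsum(p) * dx - np.cumsum(q) * dx))
    eps = 1e-12                                    # guard against log(0)
    kl = np.sum(p * np.log((p + eps) / (q + eps))) * dx
    l1 = np.sum(np.abs(p - q)) * dx
    return ks, kl, l1
```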
Here we display the results of Examples 1–4, which involve heterogeneous data; the result for a case with a nonlinear functional relationship is given in the supplementary materials. Tables 1–4 list the test errors for the four simulation examples. Figures 6–9 show the heatmaps of the prediction distributions of all test data, where the x-axis is the involved predictor variable. We omit the heatmap for the spherical case, since it has two involved variables and cannot be shown explicitly in a heatmap.
Table 1.
Comparison of average test errors (and corresponding standard errors in parentheses) for Example 1 with respect to one or multiple crosses and three distance measures. The results for the naive method, the regression mixture (Mixreg), SSANOVA, Random Forest, resolution-wise regression with fixed smoothness, and resolution-wise regression with CV are listed in the columns from left to right respectively.
| Example | Measure | Naive | Mixreg | SSANOVA | Random Forest | Fixed smoothness | CV |
|---|---|---|---|---|---|---|---|
| one cross | KS | 0.151 | 0.351 | 0.294 | 0.377 | 0.165 | 0.129 |
| | | (0.015) | (0.010) | (0.016) | (0.017) | (0.009) | (0.005) |
| | KL | 0.362 | 1.396 | 1.031 | 1.835 | 0.386 | 0.265 |
| | | (0.032) | (0.067) | (0.062) | (0.156) | (0.025) | (0.014) |
| | L1 | 0.639 | 1.086 | 1.181 | 1.120 | 0.717 | 0.364 |
| | | (0.034) | (0.063) | (0.073) | (0.069) | (0.035) | (0.019) |
| multiple | KS | 0.188 | 0.405 | 0.229 | 0.394 | 0.165 | 0.145 |
| | | (0.007) | (0.026) | (0.016) | (0.020) | (0.010) | (0.008) |
| | KL | 0.319 | 2.511 | 0.548 | 1.615 | 0.338 | 0.257 |
| | | (0.016) | (0.128) | (0.036) | (0.084) | (0.024) | (0.016) |
| | L1 | 0.653 | 1.212 | 0.850 | 1.054 | 0.658 | 0.397 |
| | | (0.033) | (0.052) | (0.042) | (0.057) | (0.034) | (0.019) |
Table 4.
Comparison of average test errors (and corresponding standard errors in parentheses) for Example 4 with respect to different variances of random errors and three distance measures. The results for the naive method, SSANOVA, Random Forest, resolution-wise regression with fixed smoothness, and resolution-wise regression with CV are listed in the columns from left to right respectively.
| Example | Measure | Naive | SSANOVA | Random Forest | Fixed smoothness | CV |
|---|---|---|---|---|---|---|
| σ2 = 0.05 | KS | 0.192 | 0.189 | 0.243 | 0.153 | 0.154 |
| | | (0.008) | (0.009) | (0.016) | (0.005) | (0.005) |
| | KL | 0.435 | 0.417 | 0.512 | 0.229 | 0.743 |
| | | (0.022) | (0.026) | (0.030) | (0.012) | (0.034) |
| | L1 | 0.656 | 0.622 | 0.652 | 0.424 | 0.569 |
| | | (0.034) | (0.035) | (0.033) | (0.022) | (0.024) |
| σ2 = 0.25 | KS | 0.156 | 0.131 | 0.239 | 0.123 | 0.123 |
| | | (0.006) | (0.006) | (0.014) | (0.009) | (0.010) |
| | KL | 0.258 | 0.245 | 0.430 | 0.166 | 0.330 |
| | | (0.016) | (0.014) | (0.026) | (0.009) | (0.020) |
| | L1 | 0.491 | 0.452 | 0.571 | 0.346 | 0.400 |
| | | (0.025) | (0.030) | (0.033) | (0.017) | (0.023) |
| σ2 = 0.5 | KS | 0.139 | 0.121 | 0.236 | 0.098 | 0.110 |
| | | (0.008) | (0.007) | (0.018) | (0.007) | (0.009) |
| | KL | 0.201 | 0.208 | 0.421 | 0.142 | 0.167 |
| | | (0.013) | (0.017) | (0.035) | (0.009) | (0.012) |
| | L1 | 0.422 | 0.408 | 0.554 | 0.296 | 0.308 |
| | | (0.035) | (0.024) | (0.038) | (0.019) | (0.020) |
Fig. 6.

Heatmaps of prediction distributions for Example 1 with respect to one or multiple crosses and six methods: naive method, regression mixture, SSANOVA, Random Forest, resolution-wise regression with fixed smoothness, and resolution-wise regression with CV, from left to right respectively. A darker color indicates a larger PDF value at the corresponding predicted response.
Fig. 9.

Heatmaps of prediction distributions for Example 4 with respect to different variances of random errors and five methods: naive method, SSANOVA, random forest, resolution-wise regression with fixed smoothness, and resolution-wise regression with CV, from left to right respectively. A darker color indicates a larger PDF value at the corresponding predicted response.
The results indicate that resolution-wise regression achieves the best overall performance. For Example 1, resolution-wise regression, especially with cross-validated smoothness, can identify the subgroups, while the regression mixture model does not perform well as the number of subgroups increases. The naive method has good performance since the dependence is linear. For Example 2, resolution-wise regression can predict the probabilities around the two subgroups and thus has the best performance. SSANOVA does not perform well since it cannot recognize the subgroups. The naive method performs well only in the low-dimensional case, because in the high-dimensional case it is difficult to find a small neighborhood with substantial training data. Random forest does not perform well because it averages the predictions from multiple regression trees and mixes up the two subgroups. Resolution-wise regression performs well in the high-dimensional case, and its distance to the true distribution increases only slightly with the effect of noise variables. For Example 3, resolution-wise regression performs the best, while the naive method has poor performance due to the dimension issue; SSANOVA fails to capture the relationship, since it cannot be expressed in an explicit regression function form; random forest gives an averaged prediction and fails to recognize the multiple patterns at one position. For Example 4, only the resolution-wise regression successfully distinguishes the two subgroups of the heterogeneous-mean model with homogeneous errors from the homogeneous-mean model with heteroscedastic errors.
Based on the simulation results, we can see that resolution-wise regression yields a better distribution prediction when the variance σ2 of the error is larger, which may seem paradoxical compared with point prediction by common regression methods. In fact, for point prediction, the loss comes from the random error term, and a large variance leads to a large loss. For distribution prediction, however, the loss comes from the accumulated probability over possible response values.
The variance affects the shape of the distribution of the response. For a smaller variance, the data are concentrated, and more precise resolutions are needed, which brings difficulties to the estimation of the corresponding logistic regressions. Hence, for the same expansion order dY, our method gives less accurate results for smaller variances. This also offers insight for general prediction problems: distribution prediction is a good alternative for capturing the whole picture when the random error is relatively large.
6. Real Data Analysis
We analyze the real estate valuation dataset (Yeh and Hsu, 2018), obtained from the UCI Machine Learning Repository (https://archive.ics.uci.edu). The dataset contains the unit-area price and the corresponding six explanatory variables of 414 houses collected from Sindian District, New Taipei City. The six explanatory variables are the transaction date, the house age, the distance to the nearest MRT station, the number of convenience stores in the living circle on foot, the latitude, and the longitude. We are interested in how the house price can be explained by these variables. We consider three methods: (1) the naive method with the small neighborhood of the nearest ten samples, where the distance is measured by the Mahalanobis distance (Rosenbaum, 1995) of the rank vector within each predictor; (2) SSANOVA with cubic splines on each main effect and interaction; (3) resolution-wise regression with dX = 5 and dY = 5.
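As an illustration of the naive method's neighborhood, the following sketch computes a Mahalanobis distance on the within-predictor rank vectors and returns the nearest ten training samples; the function name and the use of a pseudo-inverse of the rank covariance are assumptions for illustration, not the authors' exact procedure.

```python
import numpy as np
from scipy.stats import rankdata

def nearest_by_rank_mahalanobis(X_train, x_new, k=10):
    """Indices of the k training samples closest to x_new in Mahalanobis
    distance computed on the within-predictor rank vectors."""
    X_all = np.vstack([X_train, x_new])
    # rank each predictor over training samples plus the new sample
    R = np.column_stack([rankdata(X_all[:, j]) for j in range(X_all.shape[1])])
    cov_inv = np.linalg.pinv(np.cov(R[:-1].T))     # inverse rank covariance
    diffs = R[:-1] - R[-1]
    d2 = np.einsum("ij,jk,ik->i", diffs, cov_inv, diffs)
    return np.argsort(d2)[:k]
```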
There are three interesting results we find from the application of resolution-wise regression.
House prices depend on some features through nonlinear relationships. Such nonlinear patterns can be detected by the BET screening, as shown in Figures 10 and 11. The latitude and the longitude show skewed quadratic effects, which are captured by the relevant patterns in depth 1 and depth 2 respectively. For the screening of interaction patterns, all the interactions are significant. As an example shown in Figure 11, the interaction of the latitude and the longitude shows a concentration of high-price houses, which can be pinpointed around the downtown area. There are also linear patterns found in the data. As shown in Figure 10, the distance to the nearest MRT station and the number of convenience stores in the living circle on foot have linear effects on the house price, which are captured by the relevant patterns in depth 1.
Fig. 10.

Relevant variables and the corresponding most significant patterns. For the distance to the nearest MRT station and the latitude, there exists a linear relationship to the housing prices. For the longitude, the most asymmetric interaction is A1A2B1, which implies a nonlinear dependence.
Fig. 11.

Interaction of latitude and longitude. The high-price houses concentrate around the downtown area.
Potential heterogeneity can be detected by resolution-wise regression. The proposed method clearly predicts the house price to concentrate in two groups, around 30 and 60, which is new information not provided by existing methods. For example, SSANOVA completely misses such heterogeneity. The naive method provides a distribution that vaguely suggests probability mass towards the right tail. However, the subgroup information identified by the naive method is not very clear.
The detected heterogeneity can also be demonstrated through additional information from the map of the city. As shown in Figure 13, the nearest ten samples measured by the Mahalanobis distance form two groups separated by the river and the highway. The two groups differ in house price, and both contribute to the distribution estimation. The effects of the river and the highway play the role of an unobserved variable, which leads to the bimodal shape of the prediction distribution. The prediction from our method suggests that the price of the particular house is more likely to be close to those of the three houses on the lower-left side of the river, which have an average price of 25.75. This prediction is verified as correct on the actual map. In particular, the location of the house is indeed on the lower-left side of the river. This example thus illustrates the advantage of the proposed method: it can detect heterogeneity in the data and provide accurate probability statements about subgroup information.
Fig. 13.

The testing sample (red pin) and the nearest 10 samples measured by the Mahalanobis distance of the rank vector within every predictor (blue pins, where two samples on the upper left side have the same location, and two samples on the lower left side have the same location) on the map. Among these 10 houses, four are on the lower left side of the river with an average price of 25.75, and six are on the upper right side of the river with an average price of 43.87.
In summary, this real data analysis indicates that the resolution-wise regression model can capture heterogeneous patterns and thus can deliver more detailed prediction information than traditional methods. Since no distributional assumption is required, our method is rather general and robust.
7. Conclusion
In this paper, we propose the resolution-wise regression model to predict the distribution of the response with heterogeneous data. The complicated relationship between the response and the explanatory variables is decomposed into relationships between resolutions of the response and patterns of the predictors based on binary expansions. A set of penalized logistic regressions establishes the effects of the patterns on the resolutions. Through the BID transformation, our method can estimate the cell probabilities of the histogram of the response, which approximates the distribution of the response. We also show the consistency of the estimated cell probabilities. Numerical studies demonstrate the effectiveness of the proposed method.
Supplementary Material
Fig. 7.

Heatmaps of prediction distributions for Example 2 with respect to dimension q = 1, 5, 10 and five methods: naive method, SSANOVA, Random Forest, resolution-wise regression with fixed smoothness, and resolution-wise regression with CV, from left to right respectively. A darker color indicates a larger PDF value at the corresponding predicted response.
Fig. 8.

Heatmaps of prediction distributions for Example 3 with respect to circular implicit functional relationship and five methods: naive method, SSANOVA, Random Forest, resolution-wise regression with fixed smoothness, and resolution-wise regression with CV, from left to right respectively. A darker color indicates a larger PDF value at the corresponding predicted response.
Fig. 12.

Predicted distributions by naive method by the nearest ten samples, SSANOVA, and resolution-wise regression.
Table 2.
Comparison of average test errors (and corresponding standard errors in parentheses) for Example 2 with respect to dimension q = 1, 5, 10 and three distance measures. The results for naive method, SSANOVA, Random Forest, resolution-wise regression with fixed smoothness, and resolution-wise regression with CV are listed in the columns from left to right respectively.
| Example | Measure | Naive | SSANOVA | Random Forest | Fixed smoothness | CV |
|---|---|---|---|---|---|---|
| q = 1 | KS | 0.103 | 0.215 | 0.256 | 0.169 | 0.167 |
| | | (0.006) | (0.005) | (0.008) | (0.013) | (0.014) |
| | KL | 0.261 | 0.605 | 0.787 | 0.089 | 0.693 |
| | | (0.010) | (0.021) | (0.049) | (0.008) | (0.037) |
| | L1 | 0.383 | 0.765 | 0.716 | 0.283 | 0.517 |
| | | (0.018) | (0.033) | (0.035) | (0.015) | (0.027) |
| q = 5 | KS | 0.345 | 0.219 | 0.222 | 0.180 | 0.179 |
| | | (0.019) | (0.016) | (0.012) | (0.015) | (0.014) |
| | KL | 0.701 | 0.589 | 0.730 | 0.146 | 0.532 |
| | | (0.034) | (0.025) | (0.037) | (0.011) | (0.028) |
| | L1 | 0.937 | 0.774 | 0.693 | 0.348 | 0.506 |
| | | (0.039) | (0.035) | (0.028) | (0.015) | (0.022) |
| q = 10 | KS | 0.344 | 0.210 | 0.213 | 0.169 | 0.183 |
| | | (0.006) | (0.006) | (0.015) | (0.018) | (0.018) |
| | KL | 0.767 | 0.589 | 0.666 | 0.185 | 0.572 |
| | | (0.033) | (0.024) | (0.050) | (0.022) | (0.035) |
| | L1 | 0.952 | 0.735 | 0.693 | 0.346 | 0.498 |
| | | (0.038) | (0.032) | (0.035) | (0.015) | (0.026) |
Table 3.
Comparison of average test errors (and corresponding standard errors in parentheses) for Example 3 with respect to circular and spherical implicit functional relationship and three distance measures. The results for naive method, SSANOVA, Random Forest, resolution-wise regression with fixed smoothness, and resolution-wise regression with CV are listed in the columns from left to right respectively.
| Example | Measure | Naive | SSANOVA | Random Forest | Fixed smoothness | CV |
|---|---|---|---|---|---|---|
| Circle | KS | 0.175 | 0.201 | 0.398 | 0.172 | 0.156 |
| | | (0.010) | (0.014) | (0.030) | (0.010) | (0.008) |
| | KL | 0.378 | 0.436 | 4.180 | 0.264 | 0.404 |
| | | (0.019) | (0.032) | (0.322) | (0.008) | (0.027) |
| | L1 | 0.677 | 0.761 | 1.384 | 0.515 | 0.535 |
| | | (0.039) | (0.034) | (0.108) | (0.022) | (0.025) |
| Sphere | KS | 0.170 | 0.184 | 0.413 | 0.183 | 0.161 |
| | | (0.012) | (0.010) | (0.037) | (0.009) | (0.006) |
| | KL | 0.369 | 0.411 | 4.553 | 0.310 | 0.339 |
| | | (0.015) | (0.021) | (0.355) | (0.009) | (0.016) |
| | L1 | 0.664 | 0.735 | 1.439 | 0.580 | 0.570 |
| | | (0.026) | (0.038) | (0.080) | (0.024) | (0.025) |
Acknowledgments
The authors would like to thank the editor, the associate editor, and two anonymous referees for their valuable comments and suggestions.
Funding
Research of Qizhai Li was partially supported by Beijing Natural Science Foundation (Z180006) and the National Natural Science Foundation of China (11722113). Kai Zhang was supported in part by NSF grants DMS-1613112, IIS-1633212, and DMS-1916237. Yufeng Liu was supported in part by NSF grant DMS-2100729 and NIH grant R01GM126550.
Appendix
Proof of Theorem 1:
We split the whole proof into two steps: (1) the selected variables from the BET screening are equivalent to those from the sure independence screening based on the MMLE, (2) the proposed logistic regression satisfies the conditions in Fan and Song (2010) to achieve the sure independence screening property.
-
The BET test statistic
By the definition in (12), and , the MMLE can be obtained by the optimization with respect to ’s, i.e.,
Setting the derivative of the above objective function with respect to βm,j to be zero, we have that satisfies
The second equation holds since there are n / 2 samples with and the other n / 2 samples with , due to the binary expansion from the empirical CDF transformation. Denote . Differentiating with respect to , we obtain . Hence is strictly increasing with respect to . For γm,n > 0, if , . If ,
Since , we have . Hence we have the variable selection index set . Similarly, we have . Denote . Taking for some c4,m > 0, we have
Hence the BET screening is equivalent to the sure independence screening based on MMLE. This ensures that the estimation methods are the same as those in Fan and Song (2010).
-
Under the binary expansion, the variables are bounded in [0,1] after empirical CDF transformation, so that the conditions A – C in Fan and Song (2010) are naturally satisfied. Assumption 1 is analogous to Condition E. So we only need to check Condition D. For the proposed logistic regression, let w0 = 1, and we have
Taking w1 = 2, h1,m = 3, h0,m = 1, α = 1 satisfies Condition D.
Hence by Theorem 4 in Fan and Song (2010), for , the sure independence screening property is achieved.
Proof of Theorem 2:
For logistic regression, the function b (·) of Condition G in Fan and Song (2010) is b(x) = log(1 + ex). Since . Condition G in Fan and Song (2010) is met. Together with Assumption 2, by Theorem 5 in Fan and Song (2010), the proof is completed.
Let ℱ be a normed real vector space. For the m-th logistic regression function , where , the negative log-likelihood loss is , where is the predictor vector including the selected r patterns. For a loss function , define the empirical risk for by , and the theoretical risk by . Consider the collection ℱ to be a linear-model class, i.e., , where β ↦ fβ is linear. For the m-th regression, note that the true coefficient vector is the minimizer of the theoretical risk
| (13) |
and . We assume for simplicity that the minimum exists and is unique. For , the excess risk is defined by . The lasso estimator is , , where ∥·∥1 is the ℓ1-norm and λm is a tuning parameter. The estimation of the regression function is .
Denote and . We have . Hence based on the link function of logistic regression, we can define a functional g mapping em (·) to the regression function :
| (14) |
Denoting as the true expectation corresponding to , by (14), we have . Similarly, recall that is the estimated expectation, and thus we have .
For a given index set Sm ⊂ {1, …, r}, define , , j = 1, …, r. Denote the estimator restricted to by . Write . Restricted to ’s, the best approximation of is , where .
The following assumption requires a certain compatibility of ℓ1-norm with the norm on ℱ, which is a regular assumption for the theoretical framework for lasso.
Assumption 3. (Compatibility condition) We say that the compatibility condition is met for the set Sm with constant ϕm > 0, if for all βm satisfying , it holds that .
Next, we show the definition of the margin condition (Bühlmann and van de Geer, 2011) and demonstrate that the penalized logistic regression satisfies the condition with a quadratic margin.
Definition 1 (Margin condition). Denote a “neighborhood” of by with constant ηm > 0. We say that the margin condition holds with a strictly convex function G, if for all , we have , where ∥·∥ is the norm defined on ℱ.
Assumption 4. For any fixed , there exists some constant such that , .
Lemma 5. Under Assumption 4, the margin condition holds for all penalized logistic regressions with a quadratic margin, i.e., Gm(u) = cmu2 for the m-th regression.
The technical proof of Lemma 5 can be found in the supplement. For the m-th regression, the oracle (Bühlmann and van de Geer, 2011) is defined by
| (15) |
where Sβ ≔ {j: βj ≠ 0}, sβ ≔ |Sβ| denotes the cardinality of Sβ, is a compatibility constant, and Ψ is a suitably large collection of index sets. Denote the index set of non-zero coefficients by , and the cardinality of by . Assuming is linear, we can take . Hence the definition of is consistent with the definition of in (13), since the second term of (15) does not rely on β. In this context, we only use the notation . Denote the minimum of (15) by . Define , where is the empirical process. Set , and . Bühlmann and van de Geer (2011) showed that one can choose such that the set 𝒯m has large probability.
Assumption 5. For some constant ηm > 0, for all , as well as .
According to the BID equation, we estimate p by solving the optimization (9). From the optimization, Hm+1 p is an approximation of , where Hm+1 is the (m + 1)-th row of H, since is the (m + 1)-th entry of E. Hence g(Hm+1 p) is the estimated m-th regression function corresponding to p. The following theorem gives the consistency of cell probability vector p in terms of excess risk of g (Hm+1 p).
We now turn to the proof of Theorem 3. Below is its statement again for convenience.
Theorem 3 (restated). Assume Assumptions 1–5 hold, where Assumption 3 holds with the set . For the logistic regression with covariates corresponding to the BET screening set , suppose that λm satisfies . Then on the set 𝒯m, we have,
where K > 0 is a constant,
Before the proof of Theorem 3, we first state the oracle inequality for penalized logistic regression from Bühlmann and van de Geer (2011) as follows.
Lemma 6. Assume Assumptions 3–5 hold, where 3 holds with the set . Suppose that λm satisfies the inequality . Then on the set 𝒯m, we have
where .
Then we have the proof for Theorem 3 as follows.
Proof of Theorem 3:
For the excess risk of g (Hm+1 p), we have
It can be shown that the function is Lipschitz continuous, with the Lipschitz constant Km obtained from the first derivative
For part I, denoting the true expectation vector by E0 and the corresponding true cell probability vector by p0 = H−1 E0, we have
By Lemma 6, . Since with predictors taking values from {− 1, 1}, we have . One can similarly show that the function g−1 (·) is Lipschitz continuous with Lipschitz constant two. Hence we have , and thus , where , , , and .
For the true expectation vector and the true cell probability vector, we have ∥H p0 − E0∥1 = 0. Denote . Together with the inequality in Lemma 6 to handle the part II, the proof is completed.
Footnotes
Conflict of Interest Statement
The authors declare that they have no competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References
- Bühlmann P & van de Geer S (2011). Statistics for High-Dimensional Data. Springer.
- Chen J, Tran-Dinh Q, Kosorok MR & Liu Y (2021). Identifying heterogeneous effect using latent supervised clustering with adaptive fusion. Journal of Computational and Graphical Statistics 30, 43–54.
- Cox DR & Reid N (2000). The Theory of the Design of Experiments. CRC Press.
- Fan J & Song R (2010). Sure independence screening in generalized linear models with NP-dimensionality. Ann. Stat. 38, 3567–3604.
- Fan J & Lv J (2008). Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B Stat. Methodol. 70, 849–911.
- Golubov B, Efimov A & Skvortsov V (2012). Walsh Series and Transforms: Theory and Applications. Springer Science & Business Media.
- Green P & Silverman B (1994). Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman and Hall, London.
- Greenshtein E & Ritov Y (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10, 971–988.
- Gu C (2002). Smoothing Spline ANOVA Models. Springer-Verlag, New York.
- Guo FJ, Levina E, Michailidis G & Zhu J (2010). Pairwise variable selection for high-dimensional model-based clustering. Biometrics 66, 793–804.
- Harmuth H (2013). Transmission of Information by Orthogonal Functions. Springer Berlin Heidelberg.
- Hocking T, Joulin A, Bach F & Vert JP (2011). Clusterpath: An algorithm for clustering using convex fusion penalties. In Proceedings of the 28th International Conference on Machine Learning (ICML'11), Getoor L and Scheffer T, eds., New York: Omnipress, pp. 745–52.
- Jacobs RA, Jordan MI, Nowlan SJ & Hinton GE (1991). Adaptive mixtures of local experts. Neural Comp. 3, 79–87.
- Kac M (1959). Statistical Independence in Probability, Analysis and Number Theory. Mathematical Association of America.
- Lindsten F, Ohlsson H & Ljung L (2011). Clustering using sum-of-norms regularization: With application to particle filter output computation. In 2011 IEEE Statistical Signal Processing Workshop (SSP), pp. 201–4.
- Lynn PA (1973). An Introduction to the Analysis and Processing of Signals. London: Macmillan.
- Ma S & Huang J (2017). A concave pairwise fusion approach to subgroup analysis. J. Am. Statist. Assoc. 112, 410–23.
- MacQueen J (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, LeCam LM and Neyman J, eds., Berkeley, CA: University of California Press, pp. 281–97.
- Pan W & Shen X (2006). Penalized model-based clustering with application to variable selection. J. Mach. Learn. Res. 8, 1145–64.
- Pan W, Shen X & Liu B (2013). Cluster analysis: Unsupervised learning via supervised learning with a non-convex penalty. J. Mach. Learn. Res. 14, 1865–89.
- Pearl J (1971). Application of Walsh transform to statistical analysis. IEEE Trans. Syst. Man. Cybern. SMC-1, 111–9.
- Raftery A & Dean N (2006). Variable selection for model-based clustering. J. Am. Statist. Assoc. 101, 168–78.
- Rosenbaum P (1995). Observational Studies. Springer.
- Sylvester JJ (1867). LX. Thoughts on inverse orthogonal matrices, simultaneous sign-successions, and tessellated pavements in two or more colours, with applications to Newton's rule, ornamental tile-work, and the theory of numbers. Lond. Edinb. Dubl. Phil. Mag. 34, 461–75.
- Tang X & Qu A (2017). Individualized multi-directional variable selection. arXiv: 1709.05062.
- Tibshirani R (1996). Regression shrinkage and selection via the lasso. J. Royal Stat. Soc. B 58, 267–88.
- Wahba G (1990). Spline Models for Observational Data. SIAM, Philadelphia.
- Yeh IC & Hsu TK (2018). Building real estate valuation models with comparative approach through case-based reasoning. Appl. Soft Comput. 65, 260–71.
- Zhang K (2019). BET on independence. J. Am. Statist. Assoc., DOI: 10.1080/01621459.2018.1537921.
- Zhang K, Zhao Z & Zhou W (2021). BEAUTY powered BEAST. arXiv: 2103.00674.