Abstract
Many existing sufficient dimension reduction methods are designed for regression with elliptically distributed predictors, which limits their use in real data analyses. We propose projection expectile regression (PER), a new linear sufficient dimension reduction method for handling complex predictor structures, including continuous, discrete, and mixed predictor variables. PER requires the link function between the response and the predictor to be monotone, but not necessarily smooth, which makes it suitable for handling stratified response surfaces. By design, PER involves neither matrix inversion nor high-dimensional smoothing, so it is well suited to problems with multicollinearity, high dimensionality, and sparsity in the predictor. An extensive simulation study demonstrates the performance of projection expectile regression on synthetic data, and a real data analysis of health insurance charges in the United States is also provided. The asymptotic properties of the PER estimator are established as well.
Keywords: expectile regression, complex predictor structure, dimension reduction, linearity assumption, discrete predictors
1. Introduction
Recently, there has been a growing interest in the fields of economics, health, advertising, and other social and physical sciences to collect and analyze data characterized by high volume, velocity, and variety. Implementing traditional regression on these complex and high-dimensional data sets is very challenging. Sufficient dimension reduction has emerged as a field of statistics that allows us to transform high-dimensional predictors to a low-dimensional representation without loss of relevant regression information.
Many existing sufficient dimension reduction methods require the linear conditional mean (LCM) assumption, also referred to as the linearity assumption. This assumption imposes a probabilistic structure on the predictor and is known to be satisfied when the distribution of the predictor is elliptically contoured, which excludes predictors with categorical variables; see Eaton (1986). Methods that require the linearity condition include ordinary least squares [OLS; Li and Duan (1989)], sliced inverse regression [SIR; Li (1991)], sliced average variance estimation [SAVE; Cook and Weisberg (1991)], directional regression [DR; Li and Wang (2007)], principal quantile regression [PQR; Wang et al. (2018)], expectile-assisted inverse regression estimation [EA-IRE; Soale and Dong (2021)], and principal asymmetric least squares [PALS; Soale and Dong (2022)], to name a few. The restrictions imposed by the LCM limit the application of these methods, as most real data contain categorical predictor variables. In what follows, we refer to the estimators in this family as the linearity-based methods.
To expand the scope of application of SIR, Chiaromonte et al. (2002) proposed the partial central space for predictors with a mix of quantitative and qualitative variables. Their method was later extended to SAVE by Shao et al. (2009). The idea of a partial central space is to find the central space of the quantitative predictors by applying traditional SIR and SAVE within subpopulations defined by the categorical variables. The problem with this approach is that if there are many categorical variables with multiple levels, the number of subpopulations grows exponentially, so the method easily suffers from the curse of dimensionality. Also, the extension of these methods to purely discrete-valued predictors is not clearly specified. Moreover, the partial methods inherit all the shortcomings of the original SIR and SAVE.
To avoid the linearity assumption, minimum average variance estimation [MAVE; Xia et al. (2002)], which utilizes local kernel regression, was proposed. Several extensions and improvements of the original MAVE have been proposed as alternatives, including minimum average variance estimation based on conditional density functions [dMAVE; Xia (2007)], sliced regression [SR; Wang and Xia (2008)], adaptive MAVE [Wang and Yao (2012)], and the composite quantile outer-product of gradients method [qOPG; Kong and Xia (2014)]. Both dMAVE and SR can recover the central space exhaustively, but SR is more robust against extreme values. qOPG exploits the gradients of the local conditional quantiles and is able to handle heteroscedastic error variance, but it is computationally intensive.
The aforementioned local kernel methods also struggle with discrete predictor variables. Discrete variables exacerbate the emptiness of the high-dimensional space, as some combinations of these variables contain few or no data points at all. As a result, they create temporal clusters of point clouds and stratification in the response surface. Interpoint distances then become unreliable, because it is difficult to separate the signal from the noisy transitory features in the data. Kernel smoothing methods compensate for this problem by increasing the bandwidth to balance the squared bias and variance, which flattens the features and throws the baby out with the bathwater [Hall (1989)]. For the rest of the paper, we call this class of estimators the MAVE-type methods.
Now, let X ∈ ℝp denote a p-dimensional predictor and Y be a continuous response. For a generic vector t ∈ ℝp, the characteristic function of t⊤X equals the section of the characteristic function of X along t. Thus, the distribution of any X is uniquely characterized by its 1-dimensional projections [Huber (1985)]. We exploit this property of the characteristic function to handle predictors with complex structures in dimension reduction. The basic idea is to project X onto a random unit vector t to obtain a univariate projected predictor Z, which serves as the surrogate for X. Next, we regress Y on Z at different expectile levels. Then, we estimate the dimension reduction subspace in the direction of t from the outer product of the expectile slopes and t. We repeat the process for several unit vectors and average their corresponding subspaces to estimate the central space. We call our method projection expectile regression (PER).
Our proposal has some appealing features. Firstly, the univariate projections help to reveal interesting features such as outliers and temporal clusters in the data [Huber (1985)]. Thus, PER fares well with non-smooth monotone links, which means it can handle stratification and clustering in the response surface. Secondly, because PER utilizes expectiles, it is robust to extreme values. In addition, by regressing Y on a univariate projected predictor, PER avoids matrix inversion, and hence it does not impose rank constraints on the covariance matrix of X. Our key contribution lies in the simultaneous estimation of the central space in the presence of discrete, continuous, or mixed predictor variables. Overall, our proposal accommodates complex predictor structures and is applicable in a wide variety of applications. However, the projection technique is not suited for nonlinear structures; therefore, PER is confined to regression with monotone link functions.
The rest of the paper is organized as follows. In Section 2, we present the population level development and sample estimation algorithm for projection expectile regression. Then in Section 3, we give the large sample properties of the PER estimator. Next, we demonstrate the performance of PER on synthetic data in Section 4. In Section 5, we apply projection expectile regression to a real application on medical cost in the United States. We conclude the paper in Section 6. All proofs are relegated to the appendix.
2. Projection expectile regression
2.1. Definitions and notation
Let Y be a continuous response and X ∈ ℝp be a p-dimensional predictor. Sufficient dimension reduction (SDR) seeks to find a subspace 𝒮 ⊆ ℝp such that
Y ⫫ X | P𝒮X,  (1)
where ⫫ denotes statistical independence and P𝒮X is the orthogonal projection of X onto 𝒮. We call 𝒮 a dimension reduction subspace if it satisfies (1). If 𝒮 is the smallest such subspace, it is called the central space for the regression of Y on X, which we denote by 𝒮Y|X. Suppose η ∈ ℝp×d with d < p is a matrix with orthonormal columns such that span(η) = 𝒮Y|X, where span(.) denotes the column space. Then, we call η the basis of 𝒮Y|X. The dimension of the central space, dim(𝒮Y|X) = d, is called the structural dimension.
2.2. Population level development
Consider the model
Y = f(η⊤X) + ϵ,  (2)
where f is an unknown smooth monotone link function, η ∈ ℝp×d is an orthogonal matrix with d < p and η⊤η = Id×d, and E(ϵ|X) = 0 almost surely, which allows ϵ to depend on X. By Proposition 1 of Dong (2021), suppose E(X|η⊤X) is linear in η⊤X; then β0 ∈ 𝒮Y|X, where
(α0, β0) = arg min α,β E[ρ(ψ(Y) − α − β⊤X)] + λβ⊤Σβ,  (3)
ρ(.) is a convex function, ψ(Y) is a transformation of Y, λ ⩾ 0, and Σ denotes the covariance matrix of X. Several SDR methods, including ordinary least squares, sliced inverse regression, principal quantile regression, and principal asymmetric least squares, to name a few, follow (3) with different forms of ρ(.) and ψ(.). For example, setting ρ: w ⟼ w2, ψ(Y) = Y, and λ = 0 yields the ordinary least squares estimator.
In practice, however, the assumption that E(X|η⊤X) is linear in η⊤X may not hold. A typical scenario is when X is a mix of continuous and categorical variables. Moreover, the link f may not be a smooth function. For instance, f may be piecewise linear, in which case two or more linear regression functions are needed to capture the underlying structure of the data. To overcome these challenges, we exploit the fact that there exist a unit vector t ∈ ℝp and a constant b ∈ ℝ such that β = bt. Hence, we can replace β in (3) with bt, and if t is known a priori, (3) becomes a simple optimization over α and b.
Assume t is a given unit vector in ℝp. Projection expectile regression first projects the centered p-dimensional X onto t to obtain a univariate predictor Z = t⊤(X − μ), where μ = E(X). Then we proceed to minimize the objective function
E[ρτ(Y − α − bZ)]  (4)
over α and b, where ρτ(w) = |τ − 1(w < 0)|w2 for τ ∈ (0, 1) [Newey and Powell (1987)] and 1(A) denotes the indicator function of the event A. Let (ατ, bτ) denote the minimizer of (4) and βτ(t) = bτt. Then, at any fixed τ, we can obtain multiple βτ(t) by changing t and finding the corresponding bτ.
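To make (4) concrete, the following is a minimal base R sketch that fits the linear τ-expectile of Y on a univariate projection Z by iteratively reweighted least squares. The function name expectile_fit and the stopping rule are our illustration and not part of the paper; the authors' implementation relies on the expectreg package (see Section 2.3).

```r
# A minimal sketch (ours): fit the linear tau-expectile of y on a univariate projection z,
# i.e. minimize sum_i |tau - 1(y_i - a - b z_i < 0)| (y_i - a - b z_i)^2 over (a, b),
# by iteratively reweighted least squares.
expectile_fit <- function(z, y, tau, max_iter = 100, tol = 1e-8) {
  w <- rep(0.5, length(y))              # equal weights give an ordinary least squares start
  coef_old <- c(Inf, Inf)
  for (it in seq_len(max_iter)) {
    fit <- lm(y ~ z, weights = w)
    coef_new <- coef(fit)
    if (sum(abs(coef_new - coef_old)) < tol) break
    coef_old <- coef_new
    w <- ifelse(residuals(fit) < 0, 1 - tau, tau)   # asymmetric squared-error weights
  }
  coef_new                              # c(alpha_tau, b_tau)
}

# Example: slope of the 0.9-expectile of y on z
set.seed(1)
z <- rnorm(200)
y <- 2 * z + rnorm(200)
expectile_fit(z, y, tau = 0.9)[2]
```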
Combining the projection technique with expectiles (asymmetric least squares) gives PER a number of advantages. The initial projection step acts as a smoother and also helps to bypass the curse of dimensionality we would incur by trying to smooth over the p-dimensional X. Also, because expectiles are robust to extreme values, PER is able to handle outliers and heteroscedastic errors. Lastly, by synthesizing information across different expectile levels, PER obtains complete information about the regression relationship between Y and X along any given t.
Remark 1. The univariate projection t⊤X has the following properties:
- t⊤X smooths the points of discontinuity in X and also represents a sample from the subspace that contains X.
- If we choose oblique projection vectors, t⊤X can reveal information in X missed by regular orthogonal projections of X, such as principal components.
These properties of univariate projections, along with many other interesting characteristics, make the projection technique attractive. For instance, by Theorems 5 and 6 in Bickel et al. (2018), if we choose projection vectors on a unit hypersphere, all projections are, or can be made, arbitrarily close to the standard normal distribution under certain conditions. For any p > 1, let 𝕊p−1 denote the unit hypersphere in ℝp. The following Lemma provides the conditions which yield asymptotically normally distributed univariate projections of X.
Lemma 1. Let x1, …, xn be a given sample of X. For any t ∈ 𝕊p−1, define the empirical cumulative distribution function of t⊤X as Fn,t(z) = n−1 Σi=1n 1(t⊤xi ⩽ z). Assume p, n → ∞ with p/n → γ. Then Fn,t converges to the standard normal distribution function if
- (a) γ = 0, or
- (b) γ > 0 and s/n → 0 with s << p, where the projection vector ts is a sparse unit vector with s nonzero coordinates.
The proof of Lemma 1 follows directly from the proofs of Theorems 5 and 6 in Bickel et al. (2018). Since the distribution of t⊤X converges to the standard normal distribution for all projections under conditions (a) and (b), E(t⊤X|η⊤X) is guaranteed to be linear in η⊤X for all such t. Thus, under conditions (a) and (b), linearity is given rather than assumed, regardless of the distribution of X.
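To see this phenomenon numerically, the short sketch below (our illustration, not part of the paper) projects a marginally standardized predictor with discrete uniform and Bernoulli components onto a random unit vector and measures the Kolmogorov–Smirnov distance between the projection and the standard normal distribution; all object names are ours.

```r
# Small numerical illustration (ours) of Lemma 1: a random projection of a marginally
# standardized predictor with discrete components is close to the standard normal law.
set.seed(2)
n <- 500; p <- 10
x <- cbind(matrix(sample(1:60, n * p / 2, replace = TRUE), n, p / 2),  # discrete uniform columns
           matrix(rbinom(n * p / 2, size = 1, prob = 0.5), n, p / 2))  # Bernoulli columns
x_std <- scale(x)                        # marginal standardization, as in Step 1 of Section 2.3
t_vec <- rnorm(p)
t_vec <- t_vec / sqrt(sum(t_vec^2))      # random unit vector on the hypersphere
z <- drop(x_std %*% t_vec)               # univariate projection t'X
suppressWarnings(ks.test(z, "pnorm"))$statistic   # small KS distance to the N(0, 1) cdf
```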
Another interesting property of univariate projections that is useful for PER is given in Theorem 1 of Bierens (1982). By this result, if Y has a finite first moment, then E(Y|X) = 0 almost surely if and only if E(Y|t⊤X) = 0 almost surely for every t ∈ ℝp. The next Lemma extends this property to conditional expectiles.
Lemma 2. Assume E|Y| < ∞ and let ξY|X(τ) denote the τth conditional expectile of Y given X. For any τ ∈ (0, 1), ξY|X(τ) = 0 almost surely if and only if ξY|t⊤X(τ) = 0 almost surely for every t ∈ 𝕊p−1.
Lemma 2 guarantees that for monotone response links, if the central space exists, we can recover a direction that belongs to it for any given t at any expectile level τ.
Theorem 1. Let τ ∈ (0, 1) and t ∈ 𝕊p−1. Assume the conditions in Lemma 1 are satisfied. If E|Y| < ∞ and bτ is the minimizer of (4), then βτ(t) ∈ 𝒮Y|X.
Theorem 1 implies that for a given t ∈ 𝕊p−1 and 0 < τ1 < … < τK < 1, Λ(t) is a candidate matrix for estimating 𝒮Y|X, where βτk(t) = bτkt and
Λ(t) = Σk=1K βτk(t)βτk(t)⊤.  (5)
Therefore, we can average Λ(t) over 𝕊p−1 to estimate 𝒮Y|X. However, since there are infinitely many vectors in 𝕊p−1, it suffices to take a large random sample of vectors on 𝕊p−1, and these vectors need not be orthogonal.
Corollary 1. Let Tm = {t1, …, tm} be a collection of random vectors on 𝕊p−1 for some large positive integer m. For 0 < τ1 < … < τK < 1 and tj, j = 1, …, m, denote the minimizer of (4) by (ατk,j, bτk,j) and let βτk(tj) = bτk,jtj. Then span[Λ(Tm)] ⊆ 𝒮Y|X, where Λ(Tm) = m−1 Σj=1m Λ(tj).
2.3. Sample level estimation
Let (x1, y1), …, (xn, yn) be a random sample of (X, Y). Based on this sample, we propose the following algorithm for estimating 𝒮Y |X for a given structural dimension d.
Step 1: Marginally standardize x = (x1, …, xn) as x̃i = D−1/2(xi − x̄), i = 1, …, n, where x̄ is the sample mean of x and D is the diagonal matrix whose diagonal entries are the sample variances of the columns of x.

Step 2: Choose a large sample size m and generate a random sample Tm = (t1, …, tm) from the unit hypersphere 𝕊p−1. For example, tj = Zj/∥Zj∥, where Zj ~ N(0, Ip) for j = 1, …, m.

Step 3: For j = 1, …, m, compute the projected predictor zj = (tj⊤x̃1, …, tj⊤x̃n)⊤, fit the linear expectile regression of y on zj at each level τk, k = 1, …, K, to obtain the slope estimates b̂τk,j, and let Λ̂(tj) = Σk=1K β̂τk(tj)β̂τk(tj)⊤, where β̂τk(tj) = b̂τk,jtj.

Step 4: For a given structural dimension d, let η̂ = (η̂1, …, η̂d) denote the eigenvectors corresponding to the d leading eigenvalues of Λ̂(Tm) = m−1 Σj=1m Λ̂(tj). Then the estimate of 𝒮Y|X is span(η̂).
The marginal standardization in Step 1 ensures that variables measured on larger scales do not overshadow the contributions of those on smaller scales. The linear expectile regression in Step 3 is implemented using the expectreg.ls() function in the expectreg package in R.
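The steps above can be condensed into a few lines of R. The sketch below is our own condensation under the same notation; it reuses the expectile_fit() helper sketched after (4), whereas the paper's implementation calls expectreg.ls() in Step 3.

```r
# Condensed sketch (ours) of the PER algorithm in Steps 1-4; expectile_fit() is the helper
# sketched after (4), whereas the paper uses expectreg::expectreg.ls() in Step 3.
per <- function(x, y, d, m = 2000, taus = seq(0.1, 0.9, by = 0.1)) {
  x_std <- scale(x)                                   # Step 1: marginal standardization
  p <- ncol(x_std)
  Lambda <- matrix(0, p, p)
  for (j in seq_len(m)) {                             # Step 2: random unit vectors on the sphere
    t_j <- rnorm(p)
    t_j <- t_j / sqrt(sum(t_j^2))
    z <- drop(x_std %*% t_j)                          # univariate projected predictor
    for (tau in taus) {                               # Step 3: expectile slopes and candidate matrix
      b_tau <- expectile_fit(z, y, tau)[2]
      beta  <- b_tau * t_j
      Lambda <- Lambda + tcrossprod(beta) / m
    }
  }
  eigen(Lambda, symmetric = TRUE)$vectors[, 1:d, drop = FALSE]   # Step 4: leading eigenvectors
}
```

For example, per(x, y, d = 1) returns a single estimated direction η̂, up to sign; the default m and τ values above mirror the settings reported in Section 4.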
3. Theoretical properties
The sample expectile estimates b̂τk for 0 < τ1 < … < τK < 1 are shown by Newey and Powell (1987) to be consistent and asymptotically normal under some regularity conditions. Thus, we focus on the theoretical properties of Λ̂(Tm). Firstly, we note that since each Λ̂(tj) is positive semi-definite, the contribution of every t1, …, tm is taken into account in estimating Λ̂(Tm). Therefore, if we choose m to be large enough, the weak law of large numbers guarantees that Λ̂(Tm) will converge to E[Λ(T)] for a random vector T uniformly distributed on 𝕊p−1. Furthermore, if Λ̂(tj) is consistent for each t1, …, tm, then under some uniformity conditions, Λ̂(Tm) would also achieve consistency.
Now, for any t ∈ 𝕊p−1, because Λ̂(t) is a function of the moments of the sample (t⊤x1, y1), …, (t⊤xn, yn), Λ̂(t) is a functional of the empirical distribution of the sample. Let F0(t) be the true cumulative distribution function of (t⊤X, Y), and let Fn(t) denote the empirical cumulative distribution function based on the sample (t⊤x1, y1), …, (t⊤xn, yn). Then Λ̂(t) = ζ(Fn(t)), where ζ is a function mapping Fn(t) to a p×p positive semi-definite matrix. If ζ meets some regularity conditions and is differentiable to a certain order, we can expand ζ(Fn(t)) around F0(t). Here, we are interested in the von Mises expansion of ζ(Fn(t)), which requires ζ to be Hadamard differentiable. This requirement is mild, and the interested reader may refer to Fernholz (1983) and von Mises (1947) for more on these conditions.
Assume that for any t ∈ 𝕊p−1, Λ̂(t) has the first-order expansion
Λ̂(t) = Λ(t) + En[ξ(X, Y, t)] + Rn(t),  (6)
where ξ(X, Y, t) is a square integrable function with E[ξ(X, Y, t)] = 0 and En(.) denotes the sample mean based on a sample of size n. The remainder Rn(t) is such that
supt∈𝕊p−1 ∥Rn(t)∥F = op(n−1/2),  (7)
where ∥.∥F is the matrix Frobenius norm. Suppose the conditions for the von Mises expansion are met; then we can state the expansion in terms of influence functions as
Λ̂(t) = Λ(t) + En{IF[(t⊤X, Y); F0(t)]} + Rem[Fn(t) − F0(t)],  (8)
where IF[(t⊤X, Y); F0(t)] is the influence function at F0(t), which has mean 0 and finite variance. The remainder Rem[Fn(t) − F0(t)] is a continuous function of t which varies over the compact set 𝕊p−1 and satisfies
supt∈𝕊p−1 ∥Rem[Fn(t) − F0(t)]∥F = op(n−1/2).  (9)
We show in the next theorem that if the above conditions are satisfied, we can achieve a consistent estimator of 𝒮Y|X by averaging over O(n) projection vectors. Moreover, if m → ∞ faster than n, then Vec(Λ̂(Tm)) is asymptotically normal with mean Vec(E[Λ(T)]).
Theorem 2. Assume the conditions in (6) and (7) are satisfied and that n = O(m). Then
∥Λ̂(Tm) − E[Λ(T)]∥F = Op(n−1/2),  (10)
√n {Vec(Λ̂(Tm)) − Vec(E[Λ(T)])} →d N(0, Ω),  (11)
where Ω = Var{E[Vec(ξ(X, Y, T))|X, Y]}, →d denotes convergence in distribution, and Vec(.) denotes the vectorization of a matrix; i.e., if A = [a1 … ak], then Vec(A) = (a1⊤, …, ak⊤)⊤.
4. Simulation studies
In this section, we compare the performance of our new projection expectile regression method with existing dimension reduction methods on synthetic data. We choose three linearity-based methods: sliced inverse regression (SIR), principal asymmetric least squares (PALS), and principal quantile regression (PQR). For the MAVE-type methods, we consider sliced regression (SR) and minimum average variance estimation based on the conditional density function (dMAVE). The selected methods are balanced in terms of performance and efficiency. For instance, Wang et al. (2018) showed that although PQR offers no significant improvement over qOPG for monotone links, PQR is much more computationally efficient than qOPG (about 100 times faster). SIR, PALS, and PQR are designed for monotone links, while SR and dMAVE can handle both monotone and symmetric links. Lastly, both PALS and PQR can handle heteroscedastic errors.
The models considered are as follows:
where ϵ ~ N(0, 1) and . For model I, we let . The predictor X is generated as , where Σ is a positive definite matrix with Σij = 0.5|i−j| and ei ~ Unif(0, 5). This model is the same as Example 1 in the sliced regression paper of Wang and Xia (2008). In addition, the model has some extreme values around X = 0 and does not satisfy the linearity assumption. In model II, we let and generate X ~ N(0, Ip). Thus, X has no extreme values and satisfies the linearity assumption. For model III, we let , . The predictor is generated as follows: Xi ~ discrete Unif(1, 60) for i = 1, …, p/2, and Xi ~ Bern(0.5) for i = (p/2 + 1), …, p. Here, X does not satisfy the linearity assumption and is very sparse. Lastly, in model IV, we let , and . We generate X as follows: X1 ~ Bern(0.7), X2 ~ exp(1), X3 ~ Unif(−1, 1), X4 ~ Pois(5), (X5, X6)⊤ ~ N(0, I2). For p > 6, the remaining variables are generated from Bern(0.5). Again, X does not satisfy the linearity assumption here and is sparse, especially in high dimensions (p > 6).
Next, we set the sample size and predictor dimension to satisfy conditions (a) and (b) in Lemma 1. For γ = p/n → 0, we consider combinations of (n, p) with n = {100, 200, 500} and p = {6, 10}. The results of the simulation studies based on these settings are summarized in Table 1 for p = 6 and Table 2 for p = 10. For γ > 0, we consider (n, p) = (100, 50) and (500, 200) and report the simulation results in Table 3. The results in all three tables are based on 100 random replicates. For Tables 1 and 2, we implement PER with regular projection vectors, while for Table 3, we use sparse projection vectors with s = {5, 10, 20} nonzero coordinates. To generate a sparse unit vector, we first generate a regular unit vector on 𝕊s−1. Then, we augment the vector with p − s zeros, and the positions of the elements in the augmented vector are randomized so that the positions of the nonzero coordinates differ in at least one location for any pair of vectors.
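The sparse projection-vector construction just described can be sketched in a few lines of R; the helper name below is ours.

```r
# Sketch (ours) of the sparse projection vectors used for Table 3: a regular unit vector on the
# s-dimensional sphere, padded with p - s zeros whose positions are randomized.
sparse_unit_vector <- function(p, s) {
  v <- rnorm(s)
  v <- v / sqrt(sum(v^2))            # regular unit vector with s coordinates
  t_s <- numeric(p)
  t_s[sample.int(p, s)] <- v         # place the s nonzero coordinates at random positions
  t_s
}
sparse_unit_vector(p = 50, s = 5)    # e.g. the s = 5 setting with (n, p) = (100, 50)
```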
Table 1:
Mean (standard deviation) of the estimation errors Δ with p = 6
| Model | n | SIR | SR | dMAVE | PALS | PQR | PER |
|---|---|---|---|---|---|---|---|
| I | 100 | 1.3465(0.0108) | 1.3434(0.0097) | 1.3585(0.0081) | 1.3275(0.0117) | 1.3562(0.0097) | 1.2525(0.0179) |
| | 200 | 1.3279(0.0105) | 1.3372(0.0097) | 1.3265(0.0150) | 1.3070(0.0108) | 1.3334(0.0089) | 1.2094(0.0180) |
| | 500 | 1.2421(0.0198) | 1.3160(0.0127) | 1.3034(0.0117) | 1.1897(0.0179) | 1.2540(0.0148) | 1.0414(0.0227) |
| II | 100 | 1.0030(0.0267) | 1.0338(0.0281) | 0.9750(0.0301) | 0.8142(0.0249) | 0.8225(0.0245) | 0.8145(0.0247) |
| | 200 | 0.7653(0.0294) | 0.7969(0.0338) | 0.7778(0.0337) | 0.6787(0.0226) | 0.7015(0.0233) | 0.6851(0.0226) |
| | 500 | 0.4634(0.0136) | 0.4436(0.0185) | 0.4316(0.0199) | 0.4363(0.0129) | 0.4394(0.0125) | 0.4515(0.0129) |
| III | 100 | 1.5888(0.0165) | 1.6193(0.0187) | 1.5866(0.0163) | 1.6136(0.0165) | 1.6067(0.0163) | 1.3572(0.0167) |
| | 200 | 1.6067(0.0174) | 1.6077(0.0176) | 1.5856(0.0164) | 1.6452(0.0166) | 1.6406(0.0185) | 1.3372(0.0160) |
| | 500 | 1.5963(0.0184) | 1.5941(0.0193) | 1.5743(0.0186) | 1.5754(0.0163) | 1.6285(0.0170) | 1.2622(0.0204) |
| IV | 100 | 1.4913(0.0141) | 1.4292(0.0222) | 1.4532(0.0188) | 1.4872(0.0141) | 1.4873(0.0098) | 1.4494(0.0175) |
| | 200 | 1.4309(0.0114) | 1.2850(0.0243) | 1.3064(0.0230) | 1.4334(0.0107) | 1.4531(0.0085) | 1.3586(0.0208) |
| | 500 | 1.3946(0.0117) | 1.0522(0.0300) | 1.0772(0.0310) | 1.3818(0.0095) | 1.3982(0.0081) | 1.3377(0.0134) |
Table 2:
Mean (standard deviation) of the estimation errors Δ with p = 10
| Model | n | SIR | SR | dMAVE | PALS | PQR | PER |
|---|---|---|---|---|---|---|---|
| I | 100 | 1.3831(0.0044) | 1.3804(0.0051) | 1.3834(0.0048) | 1.3751(0.0046) | 1.3889(0.0036) | 1.2966(0.0124) |
| | 200 | 1.3611(0.0076) | 1.3630(0.0079) | 1.3715(0.0068) | 1.3471(0.0083) | 1.3624(0.0062) | 1.2487(0.0149) |
| | 500 | 1.3150(0.0118) | 1.3699(0.0052) | 1.3564(0.0071) | 1.2695(0.0121) | 1.3011(0.0101) | 1.1068(0.0182) |
| II | 100 | 1.1372(0.0222) | 1.1627(0.0216) | 1.1422(0.0221) | 1.0215(0.0218) | 1.0476(0.0207) | 1.0030(0.0216) |
| | 200 | 0.9275(0.0205) | 1.0073(0.0252) | 1.0242(0.0261) | 0.8126(0.0168) | 0.8229(0.0177) | 0.7997(0.0159) |
| | 500 | 0.6344(0.0150) | 0.7091(0.0264) | 0.6978(0.0251) | 0.6032(0.0122) | 0.6168(0.0137) | 0.6030(0.0114) |
| III | 100 | 1.8939(0.0080) | 1.9079(0.0062) | 1.9018(0.0064) | 1.9042(0.0065) | 1.9030(0.0073) | 1.4888(0.0121) |
| | 200 | 1.8700(0.0072) | 1.8864(0.0068) | 1.8833(0.0072) | 1.8670(0.0072) | 1.8869(0.0069) | 1.3311(0.0149) |
| | 500 | 1.8724(0.0079) | 1.8703(0.0075) | 1.8778(0.0077) | 1.8705(0.0077) | 1.8907(0.0066) | 1.2448(0.0128) |
| IV | 100 | 1.7478(0.0140) | 1.7405(0.0142) | 1.7307(0.0139) | 1.7351(0.0128) | 1.7192(0.0133) | 1.5468(0.0158) |
| | 200 | 1.6862(0.0124) | 1.6505(0.0130) | 1.6495(0.0146) | 1.6577(0.0123) | 1.6634(0.0117) | 1.4576(0.0125) |
| | 500 | 1.5813(0.0103) | 1.4326(0.0164) | 1.4694(0.0163) | 1.5506(0.0084) | 1.5523(0.0086) | 1.4251(0.0094) |
Table 3:
Mean (standard deviation) of the estimation errors Δ with (n, p) = (100, 50) and (500, 200)
| (n, p) | Model | SIR | SR | dMAVE | PALS | PQR | PER (s=5) | PER (s=10) | PER (s=20) |
|---|---|---|---|---|---|---|---|---|---|
| (100, 50) | I | 1.4101(0.0006) | 1.4005(0.0021) | 1.4009(0.0023) | 1.4075(0.0011) | 1.4078(0.0010) | 1.3760(0.0067) | 1.3684(0.0065) | 1.3680(0.0059) |
| | II | 1.3812(0.0043) | 1.3330(0.0070) | 1.3198(0.0093) | 1.3453(0.0053) | 1.3517(0.0050) | 1.1347(0.0239) | 1.1994(0.0172) | 1.2602(0.0116) |
| | III | 1.9998(0.0000) | 1.9995(0.0004) | NA | 1.9994(0.0000) | 1.9998(0.0001) | 1.7829(0.0176) | 1.6960(0.0172) | 1.8095(0.0108) |
| | IV | 1.9689(0.0027) | 1.9459(0.0579) | NA | 1.9656(0.0030) | 1.9686(0.0025) | 1.7805(0.0163) | 1.8141(0.0138) | 1.8461(0.0100) |
| (500, 200) | I | 1.4120(0.0003) | 1.4092(0.0007) | 1.4077(0.0008) | NA | 1.4097(0.0004) | 1.3629(0.0742) | 1.3824(0.0490) | 1.3824(0.0492) |
| | II | 1.3657(0.0035) | 1.3413(0.0060) | 1.3240(0.0060) | NA | 1.3318(0.0033) | 1.0476(0.0064) | 1.0504(0.0101) | 1.1473(0.0110) |
| | III | 1.9998(0.0000) | 1.9597(0.0281) | NA | NA | 1.9998(0.0000) | 1.4919(0.0195) | 1.5672(0.0172) | 1.7959(0.0090) |
| | IV | 1.9779(0.0014) | 1.6970(0.0688) | NA | NA | 1.9759(0.0014) | 1.5875(0.0218) | 1.6813(0.0182) | 1.8176(0.0084) |
Under all settings, we use τ = 0.1, 0.2, …, 0.9 and m = 2000 for PER. To implement SIR, we use the dr package in R and set the number of slices to 5 under all settings. We use the MAVE package in R to implement SR and dMAVE. For the implementation of PALS and PQR, we use τ = 0.1, 0.2, …, 0.9 and λ = 1 under all settings. We measure the estimation accuracy of each method using the Frobenius norm distance between the estimated and true basis matrices. Let η̂ denote the estimate of η, the true basis of 𝒮Y|X. We measure the accuracy of η̂ using
Δ = ∥Pη̂ − Pη∥F,  (12)
where PA = A(A⊤A)−1A⊤ and ∥.∥F is the matrix Frobenius norm. Smaller values of Δ indicate better estimation accuracy. Models I and II are single-index models with η being β1 and β2, respectively. For models III and IV, η is (β3, β4) and (β5, β6), respectively.
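In R, the error measure in (12) can be sketched as follows; the helper name is ours and is not part of the paper's code.

```r
# Sketch (ours) of the estimation error (12): Frobenius distance between projection matrices.
delta_error <- function(eta_hat, eta) {
  proj <- function(A) A %*% solve(crossprod(A)) %*% t(A)   # P_A = A (A'A)^{-1} A'
  norm(proj(as.matrix(eta_hat)) - proj(as.matrix(eta)), type = "F")
}
```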
We start with the results in Table 1. For model I, PER gives the most accurate estimate of η compared with the other methods. The performance of SIR, SR, dMAVE, PALS, and PQR appears similar, although PALS fares slightly better than the rest. SR and dMAVE perform poorly, especially when the sample size is large, even though these MAVE-type methods are not handicapped by the constraints of linearity and extreme values. For model II, PALS and PER emerge as the best, with PALS slightly better than PER when the sample size is at most 200. However, when we increase the sample size to 500, SR and dMAVE show significant improvement. SIR slightly lags behind all the other methods even though the predictor satisfies the linearity condition. In model III, where all the components of the predictor are discrete and sparse, PER outperforms all the other methods, and dMAVE appears slightly better than the remaining methods. The results for SIR, PALS, and PQR are not surprising, as the linearity condition does not hold. Lastly, in model IV, where X is a complex mix of continuous and discrete variables with less sparsity, SR and dMAVE give the best estimates, followed by PER.
In Table 2, we summarize the results for p = 10 with the same sample sizes as in Table 1. Here, PER performs better than all the other methods under all settings. This is because as the dimensionality of the predictor increases, the data become sparser. For instance, in model IV, introducing four additional binary variables further increases the sparsity, which makes it difficult to separate the relevant information from the transitory features using interpoint distances. Thus, the MAVE-type methods no longer give the best performance, as we observed in Table 1.
Table 3 shows what happens when the predictor dimensionality increases at almost the same rate as the sample size. The results show that PER based on sparse projection vectors, especially those with 5 and 10 nonzero coordinates, performs better than the other methods where they are applicable. We also see in Table 3 that the performance of SIR, SR, dMAVE, PALS, and PQR, where applicable, appears similar across models when (n, p) = (100, 50). dMAVE did not converge for the models with discrete predictors, i.e., models III and IV. For (n, p) = (500, 200), PALS does not converge for any model, while dMAVE again does not converge for the models with discrete predictors. Almost all the methods besides PER struggle to converge when we further increase the dimension of X. Therefore, we did not investigate scenarios with p ≈ n or p > n, although PER is applicable under such settings. Additionally, Table 3 demonstrates that sparse directions with fewer nonzero coordinates, i.e., s/n → 0, yield better estimates for PER. While we do not have a theoretical way to determine s, we observed in our simulation studies that choosing s such that s/n ⩽ 0.1 gives better estimates.
4.1. Practical concerns: effect of tuning parameters
Projection expectile regression, like SIR, SR, dMAVE, PALS, and PQR, relies on tuning parameters in its estimation. PER depends on the expectile levels τ and the number of sampled projection vectors m. However, our proposal appears to be robust to the choice of these tuning parameters. We demonstrate how varying τ and m affects the performance of PER with additional simulations. For each parameter we consider a list of values; i.e., we set m = {1000, 2000, 5000, 10000} and consider four different sets of expectile levels: τ1 = {0.2, 0.4, 0.6, 0.8}, τ2 = {0.1, 0.2, …, 0.8, 0.9}, τ3 = {0.05, 0.10, …, 0.90, 0.95}, and τ4 = {0.01, 0.02, …, 0.98, 0.99}.
To investigate the effect of m, we fix τ and vary m, and vice versa. The results of these simulation studies are summarized in Tables 4 and 5. All results are based on 100 random replicates. As Table 4 demonstrates, the difference in performance becomes negligible beyond m = 2000 when τ is fixed at τ2, which is why we used m = 2000 in the previous simulations. In Table 5, we also see that the performance of PER is robust to the expectile levels when m is fixed at 2000.
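This sensitivity check can be scripted as a small wrapper around the per() and delta_error() sketches given earlier; the snippet below is our illustration and assumes that a simulated sample x, y and the corresponding true basis eta are already in the workspace.

```r
# Sketch (ours): sensitivity of PER to m and to the expectile grid, reusing per() and
# delta_error(); assumes x, y, and the true basis eta from one simulation model exist.
m_grid   <- c(1000, 2000, 5000, 10000)
tau_sets <- list(tau1 = c(0.2, 0.4, 0.6, 0.8),
                 tau2 = seq(0.1, 0.9, by = 0.1),
                 tau3 = seq(0.05, 0.95, by = 0.05),
                 tau4 = seq(0.01, 0.99, by = 0.01))
# Effect of m at fixed tau2 (one replicate; the paper averages over 100 replicates):
sapply(m_grid, function(m) delta_error(per(x, y, d = 1, m = m, taus = tau_sets$tau2), eta))
# Effect of the expectile grid at fixed m = 2000:
sapply(tau_sets, function(ts) delta_error(per(x, y, d = 1, m = 2000, taus = ts), eta))
```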
Table 4:
Simulation results for the effects of m with (n, p) = (100, 10) and τ2
| Model | m = 1000 | m = 2000 | m = 5000 | m = 10000 |
|---|---|---|---|---|
| I | 1.2928(0.0127) | 1.2966(0.0124) | 1.2953(0.0126) | 1.2949(0.0129) |
| II | 1.0179(0.0217) | 1.0030(0.0216) | 1.0034(0.022) | 1.0092(0.0216) |
| III | 1.5334(0.0184) | 1.4906(0.0124) | 1.4835(0.0188) | 1.5046(0.0130) |
| IV | 1.6177(0.0123) | 1.5468(0.0158) | 1.5478(0.0181) | 1.5561(0.0132) |
Table 5:
Simulation results for the effects of τ with (n, p) = (100, 10) and m = 2000
| Model | τ 1 | τ 2 | τ 3 | τ 4 |
|---|---|---|---|---|
| I | 1.2954(0.0126) | 1.2966(0.0124) | 1.2978(0.0122) | 1.3000(0.012) |
| II | 1.0036(0.0215) | 1.0030(0.0216) | 1.0024(0.0217) | 1.0017(0.0217) |
| III | 1.5046(0.0129) | 1.4906(0.0124) | 1.4835(0.0126) | 1.4837(0.0122) |
| IV | 1.5470(0.0157) | 1.5468(0.0158) | 1.5459(0.0155) | 1.5445(0.0155) |
5. Data analysis of personal medical cost in the United States
The United States has the highest per capita health cost in the world, which makes health insurance vital to accessing health care [Wager and Cox]. Hoffman and Paradise (2008) showed a strong association between health insurance coverage and access to primary and preventive care, as well as the medical management of chronic illness. For vulnerable subpopulations, especially people living with chronic conditions such as high blood pressure and AIDS, health insurance is shown to improve health [Levy and Meltzer (2008)]. Therefore, it is of interest to know how insurance service providers in the United States determine the premiums they charge their customers.
In this study, we seek to determine how the age, sex, BMI, number of children, smoking habit, and regional location of an individual are associated with health insurance charges in the United States. The data are simulated based on demographic statistics from the U.S. Census Bureau to reflect real conditions for patients in the United States. The data can be found in [dataset][Lantz (2019)] and on Kaggle at https://www.kaggle.com/mirichoi0218/insurance. The data consist of 1338 observations with the following predictor variables:
age: integer between 18 and 64 indicating the age of the primary beneficiary.
bmi: body mass index (BMI) of the insured, ranging between 15.96 and 53.13.
children: integer between 0 and 5 indicating the number of children/dependents covered by the insurance plan.
sex: policy holder’s sex, restricted to male or female.
smoker: binary variable indicating whether the insured regularly smokes tobacco.
region: beneficiary’s place of residence in the U.S., divided into four geographic regions: northeast, southeast, southwest, or northwest.
Descriptions of the insurance charges based on the above features are given in Figures 1 and 2. The boxplots in Figure 1 show that smokers tend to have higher charges than non-smokers across regions. The insurance charges also appear similar for males and females, but with varying levels of dispersion across regions. We also see in Figure 2 that insurance charges are linearly associated with age and BMI. In addition, the scatter plot reveals three different clusters stacked on top of each other, with some sparsity and dispersion.
Figure 1: Insurance charges by region, sex, and smoking habit.
Figure 2: Insurance charges by age and BMI.
To analyze this dataset, we let Y denote the insurance charges, and X1, X2, X3 denote age, BMI, and number of children, respectively. For the factor variables, we create dummy variables X4 = 1(sex = male), X5 = 1(smoker = yes), X6 = 1(region = northwest), X7 = 1(region = southeast), and X8 = 1(region = southwest), where 1(.) is the indicator function.
Lastly, we denote the predictor as X = (X1, …, X8). Notice that X does not satisfy the linearity assumption, as it contains a mixture of discrete and continuous variables. The dummy variables also make X sparse. We apply all the methods used in the simulation study, in addition to ordinary least squares (OLS), to analyze the insurance charges. OLS is included because it is the method originally used in Lantz (2019) and it is usually the first method that comes to mind for such an analysis.
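For concreteness, the coding of the predictors can be sketched as below. This is our illustration and assumes the downloaded Kaggle file insurance.csv with columns age, sex, bmi, children, smoker, region, and charges, as described above.

```r
# Sketch (ours) of the coding in Section 5, assuming the Kaggle file insurance.csv with columns
# age, sex, bmi, children, smoker, region, and charges.
insur <- read.csv("insurance.csv", stringsAsFactors = TRUE)
y <- insur$charges
x <- cbind(age       = insur$age,                                # X1
           bmi       = insur$bmi,                                # X2
           children  = insur$children,                           # X3
           male      = as.numeric(insur$sex == "male"),          # X4
           smoker    = as.numeric(insur$smoker == "yes"),        # X5
           northwest = as.numeric(insur$region == "northwest"),  # X6
           southeast = as.numeric(insur$region == "southeast"),  # X7
           southwest = as.numeric(insur$region == "southwest"))  # X8
```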
Here, since we do not know η, we cannot use the Frobenius norm distance metric in (12) to evaluate performance. Therefore, we use the conditional distance correlation of Wang et al. (2015), which is in line with the fundamental principle of SDR. For random vectors P, Q, and U with arbitrary dimensions, the conditional distance correlation between P and Q given U, denoted ϱ(P, Q, U), takes values between 0 and 1, and ϱ(P, Q, U) = 0 if and only if P ⫫ Q|U. Thus, by the definition of SDR, ϱ(Y, X, η⊤X) = 0. For our analysis, we use the sample conditional distance correlation ϱn(Y, X, η̂⊤X) to measure the accuracy of η̂ for each method. Smaller values of ϱn(.) indicate better performance. To implement SIR, we set the number of slices to 5, while for PALS and PQR, we set τ = 0.1, 0.2, …, 0.9 and λ = 1. For PER, we set τ = 0.1, 0.2, …, 0.9, try m = {1000, 2000, 5000, 10000}, and report the best result.
To estimate 𝒮Y|X, we first need to estimate the structural dimension d. Several methods have been proposed in the literature for doing this. Here, we use a sequential approach based on conditional independence via the conditional distance correlation. Let η̂d denote the estimated basis for a given d, and η̂d⊤X be the corresponding first d sufficient predictors. We start with d = 1 and estimate ϱn(Y, X, η̂d⊤X). We continue to increase d by 1 until we obtain a value of ϱn(Y, X, η̂d⊤X) that is higher than the one preceding it. Using this approach, we determined d to be 1 for all methods except SR. And even with the first two directions, SR gives a higher ϱn(.) than the single direction of PER. Hence, we fix d = 1 and report η̂ for the various methods in Table 6. The PER estimate is based on m = 2000.
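This sequential choice of d can be sketched as follows. In the snippet below, cdc() is a placeholder for any implementation of the sample conditional distance correlation of Wang et al. (2015) (for example, from the cdcsis package, whose exact interface we do not assume), and per_fit is any SDR estimator returning a basis with d columns; both names are ours.

```r
# Sketch (ours) of the sequential choice of d: increase d until rho_n(Y, X, eta_hat' X) stops
# decreasing; cdc() is a placeholder for a sample conditional distance correlation routine.
choose_d <- function(x, y, per_fit, cdc, d_max = ncol(x)) {
  rho_prev <- Inf
  for (d in seq_len(d_max)) {
    eta_hat <- per_fit(x, y, d)              # estimated basis with d directions
    rho <- cdc(y, x, x %*% eta_hat)          # rho_n(Y, X, eta_hat' X)
    if (rho > rho_prev) return(d - 1)        # stop once the criterion increases
    rho_prev <- rho
  }
  d_max
}
```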
Table 6:
Estimated bases and conditional distance correlations for the health insurance data
| | SIR | SR | dMAVE | PALS | PQR | OLS | PER |
|---|---|---|---|---|---|---|---|
| age | −0.0236 | 0.0290 | 0.0256 | 0.0111 | −0.0098 | 256.8564 | −0.3716 |
| bmi | −0.0063 | 0.0023 | 0.0013 | 0.0145 | −0.0034 | 339.1935 | −0.1703 |
| children | −0.0486 | 0.0617 | 0.0520 | 0.0213 | −0.0169 | 475.5005 | −0.0617 |
| sex is male | 0.0296 | −0.0552 | −0.0501 | −0.0098 | 0.0166 | −131.3144 | −0.0470 |
| smoker is yes | −0.9899 | 0.9886 | 0.9920 | 0.9975 | −0.9989 | 23848.5345 | −0.9064 |
| lives in Northwest | 0.0437 | −0.0353 | −0.0312 | −0.0099 | 0.0132 | −352.9639 | 0.0190 |
| lives in Southeast | 0.0890 | −0.0799 | −0.0656 | −0.0438 | 0.0254 | −1035.0220 | −0.0321 |
| lives in Southwest | 0.0800 | −0.0854 | −0.0689 | −0.0452 | 0.0280 | −960.0510 | 0.0623 |
| ϱn(Y, X, η̂⊤X) | 0.2865 | 0.2501 | 0.2729 | 0.4532 | 0.4908 | 0.6422 | 0.0869 |
We see in Table 6 that PER yields the smallest ϱn(Y, X, η̂⊤X) of 8.69%, which indicates that the first sufficient predictor η̂⊤X captures most of the dependence relationship between Y and X. SR gives the second best estimate, followed closely by dMAVE and SIR. Compared with PER, these three methods perform much worse, with sample conditional distance correlations about three times that of PER. The estimates based on PALS, PQR, and OLS barely capture the dependence relationship between Y and X, with sample conditional distance correlations above 45%.
Figure 3 illustrates how each method captures the relationship between Y and X. The scatter plots of Y versus the first sufficient predictor show that, besides PER, all other methods seem to get caught in the temporal clusters. Recall from Figures 1 and 2 that smokers with lower charges overlap with non-smokers with higher charges. However, the patterns in exhibits B, D, and F separate smokers from non-smokers and split each group into two different clusters. Exhibits A, C, and E manage to capture a slight overlap, but still deviate from what we actually observed in Figure 2. Only the pattern in exhibit G reflects what we observed in Figure 2. These scatter plots buttress the conditional distance correlations reported in Table 6.
Figure 3: Scatter plots of charges versus the first sufficient predictor based on the various SDR methods.
In a nutshell, the PER estimate shows that personal health insurance charges load heavily on the smoking habit of the insured, followed by age and BMI. The number of children, sex, and location of the insured appear to have no strong influence on how much insurance providers charge. This is convincing in the sense that smokers are more predisposed than non-smokers to, say, lung cancer and other respiratory diseases. We also expect older people and people with high body mass index to have a higher risk of obesity, heart disease, and other chronic illnesses than those who are younger and/or have lower BMI. The estimates based on the other methods, however, emphasize smoking habit as the sole determinant of insurance charges, which is less convincing. Overall, exhibit G reveals that a piecewise regression with three linear functions is needed to model the relationship between insurance charges and the features of the insured.
6. Concluding Remarks
In this paper, we present projection expectile regression, a new method for estimating the central space in regression with a complex predictor structure. Our extensive simulation studies show that when the predictor is sparse, projection expectile regression performs better than the linearity-based and MAVE-type methods. The analysis of health insurance charges also demonstrates that when the response surface is stratified, existing methods struggle to recover the central space because they get caught up in the clusters of data points. Projection expectile regression, however, is not handicapped by this non-smoothness in the response link.
We did not investigate how to determine the structural dimension based on PER in this paper. We suggest choosing d based on the conditional distance correlation, as this metric is in line with the core principle of SDR. In addition, our proposal is limited to regression with monotone links, since the linear projection technique is ill-equipped to handle nonlinear structures. PER shares this limitation with SIR, PALS, PQR, and OLS. Therefore, future work will extend this idea to other response link functions. Finally, although we focused on small and moderate predictor dimensions, our proposal readily extends to cases with n << p by Lemma 1(b).
Acknowledgements
We thank the editor, the associate editor, and the two reviewers for their insightful comments, which significantly improved the original manuscript. The author also sincerely thanks Drs. Marie Lynn Miranda and Lizhen Lin of the Department of Applied & Computational Mathematics and Statistics at the University of Notre Dame, and Dr. Yuexiao Dong of the Department of Statistics, Operations, and Data Science at Temple University for their feedback and comments, which greatly improved the initial idea. This work was partly supported by the National Institute of Environmental Health Sciences (NIEHS), Powering Research through Innovative Methods for Mixtures (PRIME) Program, grant R01ES028819. The content is solely the responsibility of the author and does not necessarily represent the official views of the National Institutes of Health.
Appendix A. Appendix of proofs
Proof of Lemma 2.
Let t ∈ 𝕊p−1 and τ ∈ (0, 1). By Jensen's inequality, E[ρτ(Y)|t⊤X] ⩾ ρτ[E(Y|t⊤X)] due to the convexity of ρτ. Notice that by the definition of ρτ(.) in (4), ρτ[E(Y|t⊤X)] = 0 if and only if E(Y|t⊤X) = 0. However, E(Y|t⊤X) = 0 for all t holds if and only if E(Y|X) = 0 by Theorem 1 in Bierens (1982). □
Proof of Theorem 1.
Without loss of generality, assume E(X) = 0, and let t ∈ 𝕊p−1 be given.
The inequality is due to the convexity of ρτ, and the second equality holds by the definition of SDR in (1). Under the conditions in Lemma 1, E(t⊤X|η⊤X) is linear in η⊤X, which implies that E(t⊤X|η⊤X) = t⊤Ση(η⊤Ση)−1η⊤X. Therefore, the minimizer bτ must be such that βτ(t) = bτt ∈ span(η). □
Proof of Corollary 1.
For any fixed t ∈ 𝕊p−1, span[Λ(t)] ⊆ 𝒮Y|X by Theorem 1. Let v be any vector orthogonal to the central space. Then v ⊥ Λ(t) for every t ∈ 𝕊p−1, which implies E[Λ(T)]v = 0 for any random vector T uniformly distributed on 𝕊p−1. Therefore, v ⊥ span{E[Λ(T)]}. □
We provide the following Lemma to help with the proof of Theorem 2.
Lemma 3. Let Sn be a sequence of random vectors with E(Sn) = μn and Var(Sn) = Σn. Then Sn − μn = Op(∥Σn∥1/2), where ∥Σn∥ denotes the spectral norm of Σn.
Proof of Lemma 3.
Here, Σn is allowed to be singular. If Pn denotes the orthogonal projection onto span(Σn), then Pn(Sn − μn) = Sn − μn almost surely, since any direction orthogonal to span(Σn) has zero variance. Thus, we have
∥Sn − μn∥ = ∥Σn1/2Σn†/2(Sn − μn)∥ ⩽ ∥Σn∥1/2∥Σn†/2(Sn − μn)∥,  (A.1)
where A† is the Moore–Penrose inverse of a symmetric matrix A. Next, by Chebyshev's inequality,
P{∥Σn†/2(Sn − μn)∥ > M} ⩽ E∥Σn†/2(Sn − μn)∥2/M2 = tr(Σn†Σn)/M2  (A.2)
for any M > 0. Notice that tr(Σn†Σn) = rank(Σn) is bounded by the dimension of Sn. Therefore, ∥Σn†/2(Sn − μn)∥ = Op(1), which together with (A.1) implies that Sn − μn = Op(∥Σn∥1/2). □
Proof of Theorem 2.
To prove (10), we start by expressing as
| (A.3) |
where is a sequence of random matrices. Because n/m = O(1) and is no greater than Op(n−1/2), the term by the Lindeberg-Levy central limit theorem. Also,
| (A.4) |
Thus,
| (A.5) |
Next, we show that the remaining term is also Op(n−1/2). It will suffice to show that Wn is Op(n−1/2). Because ξ(X, Y, t) is square integrable with E[ξ(X, Y, t)] = 0, Vec(Wn) also has mean 0 and variance given by
| (A.6) |
Notice that if i ≠ i′ and j ≠ j′, . Similarly, if i ≠ i′ and j ≠ j′, then . Therefore, (A.6) reduces to 0 whenever i ≠ i′ as such we can express (A.6) as
| (A.7) |
| (A.8) |
where T1 ⫫ T2 and (T1, T2) ⫫ (X, Y). Finally, by Lemma 3,
Hence, the term in (A.5) is also Op(n−1/2) which completes the proof of (10).
To prove (11), note that because m → ∞ as n → ∞,
Now, let . Then, E(Ec) = 0. Next, if we let T ⫫ (X, Y), then
Hence, Var[Vec(Wn)] = Var[Vec(Ec)] + o(n−1). We now proceed to find the covariance of Vec(Wn) and Vec(Ec). That is,
where the second equality holds because the distribution of is the same for all j. Whenever i ≠ i′, the summand goes to 0, which reduces the covariance to
Therefore, Cov[Vec(Wn), Vec(Ec)] = Var[Vec(Ec)] + o(n−1). Thus, Var[Vec(Wn) − Vec(Ec)] = o(n−1) since E(Wn) = E(Ec) = 0, and by Lemma 3, Wn = Ec + op(n−1/2) as n → ∞. Therefore, it can be deduced that
which implies (11). □
References
- Bickel PJ, Kur G, Nadler B, 2018. Projection pursuit in high dimensions. Proceedings of the National Academy of Sciences 115, 9151–9156.
- Bierens HJ, 1982. Consistent model specification tests. Journal of Econometrics 20, 105–134.
- Chiaromonte F, Cook RD, Li B, 2002. Sufficient dimension reduction in regressions with categorical predictors. Annals of Statistics, 475–497.
- Cook RD, Weisberg S, 1991. Sliced inverse regression for dimension reduction: Comment. Journal of the American Statistical Association 86, 328–332.
- Dong Y, 2021. A brief review of linear sufficient dimension reduction through optimization. Journal of Statistical Planning and Inference 211, 154–161.
- Eaton ML, 1986. A characterization of spherical distributions. Journal of Multivariate Analysis 20, 272–276.
- Hall P, 1989. On projection pursuit regression. The Annals of Statistics, 573–588.
- Hoffman C, Paradise J, 2008. Health insurance and access to health care in the United States. Annals of the New York Academy of Sciences 1136, 149–160.
- Huber PJ, 1985. Projection pursuit. The Annals of Statistics, 435–475.
- Kong E, Xia Y, 2014. An adaptive composite quantile approach to dimension reduction. The Annals of Statistics 42, 1657–1688.
- Lantz B, 2019. Machine Learning with R: Expert Techniques for Predictive Modeling. Packt Publishing Ltd.
- Levy H, Meltzer D, 2008. The impact of health insurance on health. Annual Review of Public Health 29, 399–409.
- Li B, Wang S, 2007. On directional regression for dimension reduction. Journal of the American Statistical Association 102, 997–1008.
- Li KC, 1991. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association 86, 316–327.
- Li KC, Duan N, 1989. Regression analysis under link violation. The Annals of Statistics 17, 1009–1052.
- Newey WK, Powell JL, 1987. Asymmetric least squares estimation and testing. Econometrica: Journal of the Econometric Society, 819–847.
- Shao Y, Cook RD, Weisberg S, 2009. Partial central subspace and sliced average variance estimation. Journal of Statistical Planning and Inference 139, 952–961.
- Soale AN, Dong Y, 2021. On expectile-assisted inverse regression estimation for sufficient dimension reduction. Journal of Statistical Planning and Inference 213, 80–92.
- Soale AN, Dong Y, 2022. On sufficient dimension reduction via principal asymmetric least squares. Journal of Nonparametric Statistics 34, 77–94.
- Wager E, Cox C. How does health spending in the U.S. compare to other countries? URL: https://www.healthsystemtracker.org/chart-collection/health-spending-u-s-compare-countries-2/#GDP%20per%20capita%20and%20health%20consumption%20spending%20per%20capita,%202020%20(U.S.%20dollars,%20PPP%20adjusted).
- Wang C, Shin SJ, Wu Y, 2018. Principal quantile regression for sufficient dimension reduction with heteroscedasticity. Electronic Journal of Statistics 12, 2114–2140.
- Wang H, Xia Y, 2008. Sliced regression for dimension reduction. Journal of the American Statistical Association 103, 811–821.
- Wang Q, Yao W, 2012. An adaptive estimation of MAVE. Journal of Multivariate Analysis 104, 88–100.
- Wang X, Pan W, Hu W, Tian Y, Zhang H, 2015. Conditional distance correlation. Journal of the American Statistical Association 110, 1726–1734.
- Xia Y, 2007. A constructive approach to the estimation of dimension reduction directions. The Annals of Statistics 35, 2654–2690.
- Xia Y, Tong H, Li WK, Zhu LX, 2002. An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64, 363–410.
