Abstract
In this study, we propose an estimation method for the normal mean problem with unknown sparsity and correlated signals. Our method first decomposes the arbitrary covariance matrix of the observed signals into two parts: a common-dependence component and weakly dependent error terms. Subtracting the common dependence significantly weakens the correlations among the signals, and the sparsity of the signals makes this step practical. The sparsity is then estimated by an empirical Bayesian method based on the likelihood of the signals with the common dependence removed. Using simulated examples with moderate to high degrees of sparsity and different dependence structures in the signals, we demonstrate that our proposed algorithm performs favorably compared to an existing method that assumes the signals are independent and identically distributed. Furthermore, we apply our approach to the widely used “HapMap” gene expression data, and our results are consistent with the findings of other studies.
Introduction
Feature selection in the normal mean problem with high-dimensional data is one of the main challenges in genetics studies. The feature selection problem can be described through a p-dimensional vector Y satisfying Y = X + Z. Here X is a sparse vector, that is, a large proportion of the elements of X are zero, and this proportion is unknown. Z is a vector of noises with distribution N_p(0, Σ), where Σ is a covariance matrix with arbitrary dependence structure. The purpose of this study is to find an estimate of X that accurately recovers its sparsity while accounting for the correlations in the error vector Z.
Traditionally, there are two main branches of studies in feature selection for the normal mean problem. One branch focuses primarily on constructing and selecting subsets of features that are useful for building a good predictor, where Y may be a vector of regression coefficients, e.g. [1–4]. The other branch uses multiple hypothesis testing in genomics/bioinformatics, where Y may be a vector of test statistics, e.g. [5–7].
The study [8] was the first to employ an empirical Bayesian method to estimate the sparsity of sparse vectors observed in Gaussian white noise. It assumed a mixture of a point mass at zero and a parametric distribution as the prior for the sparse vectors, and derived theoretical statistical properties of the estimators. [4] extended the work of [8] to allow more flexibility by assuming a prior that mixes a point mass with a nonparametric distribution. Nonetheless, both studies assumed that the sparse signals in the normal mean problem were independent and identically distributed.
Another study [7] investigated the problem of controlling the false discovery rate (FDR) under dependence in the test statistics of multiple testing. It used an innovative principal factor approximation (PFA) procedure to significantly weaken the correlation structure. However, that study used a common p-value threshold to select the relevant features, which cannot be adapted to estimate the sparsity of the signals.
Our study extends the work of [4, 8] to allow sparse signals to be correlated with arbitrary covariance structure, and adopts the PFA procedure of [7] to estimate the sparsity automatically by assuming a mixture prior in an empirical Bayesian setup. The motivation for allowing correlation in sparse signals comes from gene expression data analyses that need to identify the proportion of genes associated with an outcome. Genes in expression data are usually statistically correlated, for either biological reasons (e.g. some genes are connected in biochemical pathways [9]) or technical reasons (e.g. imperfect normalization [10]). Ignoring the correlation between genes may cause high variability of statistical estimators and even compromise their consistency [6, 11]. Furthermore, there is a sparsity problem: although the number of genes is large, often only a very small subset of genes contributes to the outcome [7].
To find the sparsity in a high-dimensional sparse vector with correlations, we first apply a spectral decomposition to the covariance matrix of the error terms and take out the principal factors that drive the strong dependence, so that the remaining part has only weak dependence. Because of the existence of sparsity, this approximation is accurate [7]. Then we employ an empirical Bayesian method to estimate the sparsity based on the marginal likelihood of the signals with the strong dependence removed. We consider a number of dependence structures in the sparse signals in simulation studies. The results demonstrate the advantages of our dependence-aware empirical Bayesian (DepEB) estimator over the empirical Bayesian (EB) estimator proposed by [4], which ignores dependence, when the covariance exhibits strong dependence: 1) our DepEB estimator provides more accurate estimates of the sparsity than the EB estimator; 2) our DepEB estimator gives smaller (integrated) mean squared errors (MSEs) than the EB estimator. In a real gene data application, we take the observed signals to be the marginal regression coefficients of the genes on the outcome. There are strong correlations among the marginal coefficients because of the high association between genes. Our proposed DepEB estimator outperforms the EB estimator, and our results are also consistent with other studies.
In sum, our estimating method has two main features. First, it allows the covariance of the observed high-dimensional signals to have an arbitrary dependence structure, instead of assuming the signal vector is independent and identically distributed. Second, it incorporates the possibility of sparsity through a mixture prior with an atom of probability at zero and a nonparametric density for the nonzero part. We treat the mixture probability as well as the nonparametric part as hyperparameters, and estimate them automatically by an empirical Bayesian procedure.
The remainder of this paper is organized as follows. Section Materials and Methods explicitly describes our model and estimation algorithm for our proposed estimator. In section Simulation Studies, the performance of the proposed estimator is evaluated by a number of simulation studies. Section Real Data Analysis presents the real data analysis using our proposed estimator. Section Conclusion concludes the paper.
Materials and methods
Model
We are interested in estimating a measurement error model in which we observe only the error-contaminated variables Y_i in the equation
Y_i = X_i + Z_i,  i = 1, …, p,  (1)
where Z_1, …, Z_p are measurement errors with joint distribution N_p(0, Σ). This study allows an arbitrary dependence structure in the covariance matrix Σ. X_i is assumed to come from a mixture of a delta function with point mass at zero and a completely unspecified nonparametric density θ:
π(X_i) = ψ δ(X_i) + (1 − ψ) θ(X_i),  (2)
where δ(X_i) is the Dirac delta function placing point mass at X_i = 0, and ψ ∈ [0, 1] is the prior probability that X_i = 0. This prior structure represents our belief that some of X_1, …, X_p are exactly zero; that is, the mixing parameter ψ is the fraction of zeros in X_1, …, X_p, corresponding to the proportion of irrelevant features. We treat ψ as a hyperparameter and estimate it using an empirical Bayesian approach. For the nonzero part of the prior, θ(X_i), previous studies used a number of parametric specifications, such as the Laplace distribution in [12], the Student-t distribution in [13], and the normal distribution in [14]. Although these parametric specifications have been shown to be successful, they all require assuming a specific shape for the nonzero part. This study takes a more flexible approach by leaving the nonzero part of the prior, θ(X_i), completely unspecified, as proposed in [4].
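As a concrete illustration of the mixture prior in Eq 2, a sparse vector can be simulated by drawing a zero with probability ψ and otherwise drawing from some nonzero density θ. The N(3, 1) choice for θ and the value ψ = 0.8 below are our own illustrative assumptions, not part of the method.

```python
import numpy as np

rng = np.random.default_rng(0)
p, psi = 1000, 0.8  # illustrative dimension and sparsity level

# Draw X_i from the mixture prior: exactly zero with probability psi,
# otherwise from a nonzero density theta (here N(3, 1), an arbitrary choice).
is_zero = rng.random(p) < psi
x = np.where(is_zero, 0.0, rng.normal(3.0, 1.0, size=p))
```

The realized fraction of zeros in x fluctuates around ψ, which is exactly the quantity the empirical Bayesian procedure described below tries to recover.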
Principal factor approximation
Based on the observations Y_1, …, Y_p in Eq 1, we estimate the values of X_1, …, X_p. In the sparse scenario, the estimates need to recover the correct degree of sparsity in X.
Because the covariance matrix Σ of Z_1, …, Z_p can have an arbitrary dependence structure, the first step of estimation is to remove the strong dependence among the error terms Z_i. We therefore approximate the likelihood function of the signals Y_1, …, Y_p using weakly dependent normal random variables, as proposed in [7].
Weakly dependent normal random variables are defined as follows.
Definition 1 Suppose a set of random variables (I_1, …, I_p)^T has a normal distribution with covariance matrix C. If the (i, j)th element c_ij of C satisfies the condition
p^{-2} Σ_{i=1}^p Σ_{j=1}^p |c_ij| → 0 as p → ∞,  (3)
then I_1, …, I_p are called weakly dependent normal random variables.
We apply the principal factor approximation (PFA) technique of [7] to decompose the covariance matrix Σ, so that Eq 1 becomes a factor model with weakly dependent normal random errors. Before the PFA procedure, an estimate of Σ needs to be computed, since in real data sets the true Σ is usually unknown. We propose using the sample covariance of the errors as the estimate of Σ.
The details of PFA procedure are described in the following steps:
- Apply a spectral decomposition to the covariance matrix Σ. Denote the eigenvalues of Σ, arranged from largest to smallest, by λ_1, …, λ_p, and the corresponding eigenvectors by γ_1, …, γ_p, so that Σ = Σ_{j=1}^p λ_j γ_j γ_j^T.
- Further separate this sum into two parts for an appropriate integer k, so that Σ = LL^T + A, where L = (√λ_1 γ_1, …, √λ_k γ_k) and A = Σ_{j=k+1}^p λ_j γ_j γ_j^T. Here k is chosen as the smallest integer such that
√(λ_{k+1}^2 + … + λ_p^2) / (λ_1 + … + λ_p) < ϵ  (4)
for a predetermined small value ϵ, say 0.05. Rewrite Eq 1 as Y_i = X_i + b_i^T W + K_i, i = 1, …, p, where b_i = (b_{i1}, …, b_{ik})^T with b_{ij} = √λ_j (γ_j)_i for j = 1, …, k. Each factor W_h is i.i.d. N(0, 1) for h = 1, …, k, and the random errors (K_1, …, K_p)^T ~ N_p(0, A). As shown in [7], K_1, …, K_p are weakly dependent normal random variables when k is chosen by Eq 4.
- Estimate the factors W_1, …, W_k from the data. Among the observed values Y_1, …, Y_p, choose the s = 0.9p values with the smallest absolute values. For these i, approximately Y_i ≈ b_i^T W + K_i, because the existence of sparsity makes this approximation practical. The estimates Ŵ_1, …, Ŵ_k are obtained by least-absolute-deviation regression on this approximation equation.
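The decomposition steps above can be sketched numerically. The example below uses an equal-correlation matrix, for which one dominant factor is expected (so k = 1), and checks that the remainder A is weakly dependent; the dimension, ρ = 0.8, and ϵ = 0.05 are illustrative choices of ours.

```python
import numpy as np

p, rho, eps = 200, 0.8, 0.05
sigma = np.full((p, p), rho) + (1 - rho) * np.eye(p)  # equal-correlation matrix

# Spectral decomposition; eigh returns ascending order, so reverse it.
lam, gam = np.linalg.eigh(sigma)
lam, gam = lam[::-1], gam[:, ::-1]

# Smallest k with sqrt(lam_{k+1}^2 + ... + lam_p^2) / (lam_1 + ... + lam_p) < eps.
tail = np.sqrt(np.cumsum(lam[::-1] ** 2)[::-1])  # tail[j] = sqrt(sum_{i>=j} lam_i^2)
k = int(np.argmax(tail[1:] / lam.sum() < eps)) + 1

L = gam[:, :k] * np.sqrt(lam[:k])  # p x k matrix of loadings sqrt(lam_j) gamma_j
A = sigma - L @ L.T                # remaining, weakly dependent part
```

For the equal-correlation matrix the largest eigenvalue is 1 + (p − 1)ρ and the rest equal 1 − ρ, so a single factor absorbs almost all of the dependence and the average absolute entry of A is tiny, matching Definition 1.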
After applying the PFA procedure to the covariance matrix Σ, we can rewrite the normal mean problem with dependent error terms in Eq 1 as
Y_i = X_i + Σ_{h=1}^k b_{ih} Ŵ_h + K_i,  i = 1, …, p.  (5)
Denote ϕ_i = Y_i − Σ_{h=1}^k b_{ih} Ŵ_h; then Eq 5 can be rewritten as
ϕ_i = X_i + K_i,  i = 1, …, p,  (6)
where the K_i's are weakly dependent normal variables with K_i ~ N(0, a_i), and a_i is the ith diagonal element of A, i.e. a_i = σ_ii − Σ_{h=1}^k b_{ih}^2, for i = 1, …, p.
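A minimal numerical sketch of the factor-removal step and Eq 6, with a single factor (k = 1) and constant loadings so that the least-absolute-deviation fit reduces to a median (for general k the full LAD regression would be used). All distributional choices here are our own illustrations, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(2)
p, psi = 300, 0.9

# One-factor noise Z_i = b_i W + K_i; with Var(Z_i) = 1, a_i = 1 - b_i^2.
b = np.full(p, 0.9)                         # illustrative constant loadings
w_true = rng.normal()
x = np.where(rng.random(p) < psi, 0.0, rng.normal(3.0, 1.0, p))
K = rng.normal(0.0, np.sqrt(1.0 - b ** 2))  # the weakly dependent part (here independent)
y = x + b * w_true + K

# Keep the 90% of signals with smallest |Y_i|, which are mostly true zeros
# under sparsity, so Y_i is approximately b_i W + K_i there.  With one factor
# and constant positive loadings, the LAD solution is the median of Y_i / b_i.
idx = np.argsort(np.abs(y))[: int(0.9 * p)]
w_hat = np.median(y[idx] / b[idx])

phi = y - b * w_hat   # Eq 6: phi_i is approximately X_i + K_i
```

Subtracting b_i Ŵ leaves signals whose errors have the small variances a_i, which is what makes the empirical Bayesian step that follows well behaved.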
Empirical Bayesian estimation
Here, we estimate the hyperparameter ψ in Eq 2 using an empirical Bayesian method.
Posterior distribution
We first derive the posterior distribution based on the likelihood function and the prior distribution of Xi, for i = 1, …, p.
From Eq 6 we have ϕ_i | X_i ~ N(X_i, a_i). Since the K_i's are weakly dependent, the likelihood of the observations ϕ given the parameters X can be approximated as
f(ϕ | X) ≈ Π_{i=1}^p f(ϕ_i | X_i) = Π_{i=1}^p N(ϕ_i; X_i, a_i),  (7)
where N(·; μ, a) denotes the normal density with mean μ and variance a.
Given the prior distribution of X in Eq 2, i.e. π(X) = Π_{i=1}^p π(X_i), and the likelihood f(ϕ | X), the posterior of X given ϕ can be written as
p(X | ϕ, ψ, θ) = Π_{i=1}^p f(ϕ_i | X_i) π(X_i) / m(ϕ | ψ, θ),  (8)
where m(ϕ | ψ, θ) is the marginal of the data given the hyperparameter ψ and the nonparametric distribution θ:
m(ϕ | ψ, θ) = ∫ Π_{i=1}^p f(ϕ_i | X_i) π(X_i) dX.  (9)
Replacing f(ϕ_i | X_i) by N(ϕ_i; X_i, a_i) and substituting π(X_i) from Eq 2, Eq 9 can be rewritten as
m(ϕ | ψ, θ) = Π_{i=1}^p ∫ N(ϕ_i; X_i, a_i) [ψ δ(X_i) + (1 − ψ) θ(X_i)] dX_i.
That is,
m(ϕ | ψ, θ) = Π_{i=1}^p [ψ N(ϕ_i; 0, a_i) + (1 − ψ) g(ϕ_i)],  (10)
where g(ϕ_i) = ∫ N(ϕ_i; X_i, a_i) θ(X_i) dX_i.
Then the posterior of X_i can be written as
p(X_i | ϕ_i) = N(ϕ_i; X_i, a_i) [ψ δ(X_i) + (1 − ψ) θ(X_i)] / [ψ N(ϕ_i; 0, a_i) + (1 − ψ) g(ϕ_i)].
Define
q_i = ψ N(ϕ_i; 0, a_i) / [ψ N(ϕ_i; 0, a_i) + (1 − ψ) g(ϕ_i)]
and
G(X_i) = N(ϕ_i; X_i, a_i) θ(X_i) / g(ϕ_i);
then
p(X_i | ϕ_i) = q_i δ(X_i) + (1 − q_i) G(X_i).  (11)
Notice that q_i is the posterior probability that X_i = 0, and G(X_i) is the posterior density of X_i when X_i ≠ 0. That is, q_i = P(X_i = 0 | ϕ_i) and G(X_i) = p(X_i | ϕ_i, X_i ≠ 0).
Estimate the hyperparameter ψ
We use the empirical Bayesian approach of [15] to estimate the hyperparameter ψ by maximizing the marginal likelihood m(ϕ | ψ, θ):
ψ̂ = argmax_{ψ ∈ [0,1]} Σ_{i=1}^p log [ψ N(ϕ_i; 0, a_i) + (1 − ψ) g(ϕ_i)].  (12)
Notice that g(ϕ_i) = ∫ N(ϕ_i; X_i, a_i) θ(X_i) dX_i is the marginal density of ϕ_i for the nonzero X_i's. [4] proposed estimating g(ϕ_i) directly, so that we do not have to specify any prior distribution form for the nonzero part of the X_i's. We use the weighted nonparametric kernel density estimator of [4] to estimate g, of the form
ĝ(ϕ) = Σ_{j=1}^p ŵ_j (1/ν) R((ϕ − ϕ_j)/ν),  ŵ_j = (1 − q_j) / Σ_{l=1}^p (1 − q_l),
where R is a kernel function satisfying ∫ R(x) dx = 1 and ν is a positive number called the bandwidth of the kernel. The most widely used kernel is the normal density with zero mean and unit variance, that is, R(x) = (2π)^{−1/2} exp(−x²/2). We set the bandwidth of the kernel using the normal reference rule, as in [16].
In summary, the hyperparameter ψ can be estimated by the following iterative steps
Algorithm
1. Given the current estimate ĝ, obtain ψ̂ by numerically maximizing the log-marginal likelihood in Eq 12.
2. Compute q̂_i for i = 1, …, p using the current estimates ψ̂ and ĝ.
3. Re-estimate ĝ using the current estimates q̂_i:
ĝ(ϕ) = Σ_{j=1}^p [(1 − q̂_j) / Σ_{l=1}^p (1 − q̂_l)] (1/ν) R((ϕ − ϕ_j)/ν).  (13)
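The three steps can be combined into a small fixed-point routine. The sketch below replaces the numerical maximization of Eq 12 in step 1 with the EM update ψ ← mean(q̂_i), which increases the same log-marginal; the toy data (80% zeros, unit error variances, nonzero means near 4) are our own assumptions.

```python
import numpy as np

def norm_pdf(u, var):
    return np.exp(-u * u / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def estimate_psi(phi, a, n_iter=100):
    """Iterate: q_i from (psi, g); weighted kernel re-estimate of g (Eq 13);
    psi <- mean(q_i), an EM stand-in for the numerical maximization of Eq 12."""
    p = len(phi)
    f0 = norm_pdf(phi, a)                                 # null density N(0, a_i)
    nu = 1.06 * phi.std() * p ** (-1 / 5)                 # normal reference bandwidth
    D = norm_pdf(phi[:, None] - phi[None, :], nu ** 2)    # kernel matrix K_nu(phi_i - phi_j)
    psi, q = 0.5, np.full(p, 0.5)
    for _ in range(n_iter):
        w = 1.0 - q
        g = D @ w / w.sum()                               # Eq 13
        q = psi * f0 / (psi * f0 + (1.0 - psi) * g)       # posterior null probabilities
        psi = q.mean()                                    # EM update of the mixture weight
    return psi, q

# Toy check: 80% zeros, well-separated nonzero means.
rng = np.random.default_rng(3)
p = 1000
x = np.where(rng.random(p) < 0.8, 0.0, rng.normal(4.0, 1.0, p))
phi = x + rng.normal(size=p)
psi_hat, q = estimate_psi(phi, np.ones(p))
```

The recovered ψ̂ lands in the neighborhood of the true zero fraction, and q̂_i is large for the true zeros and small for the true signals, which is what Eq 13 needs in order to weight the kernel estimate toward the nonzero observations.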
Posterior mean
We use the mean of the posterior as a point estimate of the X_i's:
E(X_i | ϕ_i) = (1 − q_i) [ϕ_i + a_i g′(ϕ_i)/g(ϕ_i)],  (14)
where g′(ϕ) = dg(ϕ)/dϕ,
and the estimate of the marginal g is obtained using Eq 13, while the estimate of its derivative is given by ĝ′(ϕ) = Σ_{j=1}^p [(1 − q̂_j) / Σ_{l=1}^p (1 − q̂_l)] (1/ν²) R′((ϕ − ϕ_j)/ν), with R′(x) = −x R(x) for the normal kernel.
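Eq 14 is a Tweedie-type shrinkage rule: the kernel estimates of g and g′ plug directly into (1 − q_i)(ϕ_i + a_i g′(ϕ_i)/g(ϕ_i)). In the sketch below the kernel weights use the oracle nonzero indicator as a stand-in for the normalized 1 − q̂_j (which the iterative algorithm would supply), and the simulated data are our own illustrative choices.

```python
import numpy as np

def norm_pdf(u, var):
    return np.exp(-u * u / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

rng = np.random.default_rng(4)
p, psi, a = 1000, 0.8, 1.0
x = np.where(rng.random(p) < psi, 0.0, rng.normal(4.0, 1.0, p))
phi = x + rng.normal(size=p)

# Kernel weights: oracle stand-in for the normalized (1 - q_j) of Eq 13.
w = (x != 0).astype(float)
w /= w.sum()

nu = 1.06 * phi.std() * p ** (-1 / 5)        # normal reference bandwidth
diff = phi[:, None] - phi[None, :]
Kmat = norm_pdf(diff, nu ** 2)
g = Kmat @ w                                 # kernel estimate of g at each phi_i
dg = (-diff / nu ** 2 * Kmat) @ w            # derivative of the kernel estimate

f0 = norm_pdf(phi, a)
q = psi * f0 / (psi * f0 + (1 - psi) * g)
post_mean = (1 - q) * (phi + a * dg / g)     # Eq 14
```

Observations near zero are shrunk essentially to zero (q_i ≈ 1), while clear signals keep posterior means close to their true values.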
Simulation studies
In this section, we conduct simulation studies to evaluate our proposed estimation procedure and compare our method with the one in [4]. We simulate data from Eq 1. A sequence of p = 1000 signals Y_i is generated, with 200 repetitions in the simulation study.
We generate the X_i's using two types of nonzero distribution and different degrees of sparsity: 1) Uninormal and 2) Binormal. The detailed descriptions of the distributions of the X_i's used in the simulation studies are summarized in Table 1. We set X_i to zero at randomly selected positions, and the proportion of zeros in the X_i's ranges from 0.6 to 0.9 in increments of 0.1, representing situations from low to high sparsity. Because the existence of sparsity is what makes the PFA approximation in Eq 5 practical, we consider relatively high values of the sparsity parameter ψ in the simulation studies.
Table 1. Distribution types of Xi’s for different sparsity of ψ.
| Uninormal (Unimodal Normal) | |
| Binormal (Bimodal Normal) |
The observed value Y_i is generated by adding noise Z_i to each X_i according to Eq 1, with Z = (Z_1, …, Z_p)^T ~ N_p(0, Σ). Our simulation studies consider six different dependence structures for the covariance of the noise Z. The dependence structures of Σ are similar to the settings used in [7]: each is generated from random variables B_1, …, B_p with a sample size of n = 100, and Σ is taken as the sample correlation matrix of B_1, ⋯, B_p.
Independent Cauchy: let Bi be i.i.d. Cauchy random variables with location parameter 0 and scale parameter 1.
Three-Factor Model: let , where , and are i.i.d. , , and , and Hi are i.i.d. .
Nonlinear-Factor Model: let , where and are i.i.d. , D(1) and D(2) are i.i.d. , and Hi are i.i.d. .
Fan & Song’s Model: let be i.i.d. , and , where ηi for i = 901, …, 1000 are standard normally distributed.
Equal Correlation of 0.4: let diagonal elements of covariance matrix Σ be 1 and off-diagonal elements be 0.4 representing the strength of correlation in the error terms Z.
Equal Correlation of 0.8: let diagonal elements of covariance matrix Σ be 1 and off-diagonal elements be 0.8.
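As one illustration of how these designs are built, the following sketch constructs Σ as the sample correlation matrix of factor-generated variables, in the spirit of the Three-Factor Model. The loading and factor distributions shown are placeholders (the original specifics were lost from the text above), so only the construction pattern should be taken from this example.

```python
import numpy as np

rng = np.random.default_rng(5)
p, n = 1000, 100

# B_i = rho_i1 D(1) + rho_i2 D(2) + rho_i3 D(3) + H_i for n independent draws;
# the U(-1, 1) loadings and N(0, 1) factors/noise are illustrative placeholders.
rho = rng.uniform(-1.0, 1.0, size=(p, 3))   # loadings, one row per variable
D = rng.normal(size=(n, 3))                 # factor draws, one row per sample
H = rng.normal(size=(n, p))                 # idiosyncratic noise
B = D @ rho.T + H                           # n x p sample matrix

sigma = np.corrcoef(B, rowvar=False)        # p x p sample correlation matrix
```

The resulting Σ has unit diagonal by construction and moderate off-diagonal dependence driven by the shared factors, plus sampling noise from the finite n = 100.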
The heatmaps for the six covariance structures of the error terms Z are shown in Fig 1. It can be seen from Fig 1 that the correlations of the error terms in the Independent Cauchy, Three-Factor and Nonlinear-Factor Models are very small. Fan & Song's Model is very close to an independent structure but shows some dependence in a small subset of the data points. The correlations of the error terms in the Equal Correlation of 0.4 and 0.8 models are much stronger than in the other covariance structures.
Fig 1. Heatmaps for six covariance structures of error terms Z.
Given the simulated data, we estimate the mixture probability ψ and recover the posterior distribution of X. Figs 2 and 3 show the estimated ψ and the mean squared error (MSE) for the Uninormal and Binormal types of X and the six covariance structures of the error terms Z over 200 simulations. For comparison, Figs 2 and 3 show results obtained both with our proposed DepEB estimation algorithm, which accounts for dependence in the error terms (in red), and with the EB estimator proposed by [4], which does not (in gray).
Fig 2. Estimated ψ and MSE for uninormal type of X and six covariance structures of error terms Z.
Fig 3. Estimated ψ and MSE for binormal type of X and six covariance structures of error terms Z.
Specifically, panels (a) in Figs 2 and 3 show the means and standard deviations of the estimated ψ over 200 repetitions versus the true values of ψ when the distribution of X follows the Uninormal and Binormal types, respectively. The first four sub-plots in panels (a) of Figs 2 and 3 show the estimated mixture probability ψ when the error terms are generated using the Independent Cauchy, Three-Factor, Nonlinear-Factor and Fan & Song's Models. Notice that the correlation structures of the error terms in these four models are very close to an independent structure. In these cases, both our proposed DepEB estimation algorithm (in red) and the EB estimator (in gray) give estimates close to the true values of ψ, with small standard deviations.
The last two sub-plots in panels (a) of Figs 2 and 3 show the estimated results when the error terms are generated using the Equal Correlation of 0.4 and 0.8 structures, respectively. As shown in the heatmaps in Fig 1, these two dependence structures have much stronger dependence among the error terms than the previous four. It can be seen that our proposed DepEB estimator gives quite accurate estimates with small standard deviations (in red), indicating that our estimation algorithm adapts well to high correlation in the error terms. If the dependence structure is ignored, the distribution of the error terms is clearly mis-specified; the EB estimator therefore tends to underestimate ψ and gives large standard deviations (in gray), meaning that the proportion of nonzeros is overestimated and a large bias exists.
To evaluate the accuracy of the estimation of the nonzero part of X, we calculate the mean squared error (MSE) between X and its estimate X̂, where X̂_i is the mean of the posterior of X_i. Panels (b) in Figs 2 and 3 plot the means and standard deviations of the MSE over 200 repetitions versus the true values of ψ when the distribution of X follows the Uninormal and Binormal types, respectively.
The following findings can be seen from panels (b) in Figs 2 and 3: 1) the MSEs of our DepEB algorithm (in red) are always lower than those of the EB algorithm (in gray); when the correlations in the error terms are very strong (e.g. the Equal Correlation of 0.8 model), the MSEs of our proposed DepEB algorithm are much lower than those of the EB algorithm; 2) as the signal becomes more sparse, the MSEs of both the DepEB and EB algorithms become lower; 3) similar to the findings for the standard deviation of the estimated ψ, the standard deviations of the MSEs of our proposed DepEB algorithm are small across the different correlation structures in the error terms, while the standard deviations of the MSEs of the EB algorithm are large when there are moderate to strong correlations in the error terms (e.g. the Equal Correlation of 0.4 and 0.8 models); 4) as the signal becomes more sparse, the standard deviations of the MSEs of both the DepEB and EB algorithms generally become smaller.
Real data analysis
We use a gene expression data set to test the validity of our proposed DepEB estimation algorithm. The gene expression data used in this study are for 90 unrelated Asians from the international “HapMap” project [17], comprising 45 Japanese in Tokyo, Japan (JPT) and 45 Han Chinese in Beijing, China (CHB). The gene expression data were generated with an Illumina Sentrix Human-6 Expression BeadChip [18] and have been normalized using quantile normalization across replicates and median normalization across individuals. These gene expression data have been used in [19, 20], and they are available at ftp://ftp.sanger.ac.uk/pub/genevar/ or https://doi.org/10.6084/m9.figshare.22491772.
Distribution of marginal gene effects
We consider an outcome of interest ζ and gene data stored in an n × p matrix S. The gene data represent p gene expressions for n individuals; the element S_ij of S is the jth gene of the ith individual. With these data, we perform a marginal linear regression of the outcome variable ζ on each gene S_j:
ζ = α_j + β_j S_j + ε,  j = 1, …, p.  (15)
Let α_j and β_j be the solutions to Eq 15, and let β̂_j = Σ_{i=1}^n (S_ij − S̄_j)(ζ_i − ζ̄) / Σ_{i=1}^n (S_ij − S̄_j)² be the least squares estimator of β_j for j = 1, …, p, where S̄_j = n^{−1} Σ_{i=1}^n S_ij and ζ̄ = n^{−1} Σ_{i=1}^n ζ_i.
Let r̂_jk be the sample correlation between S_j and S_k, and σ̂_j the sample standard deviation of S_j, and assume the conditional distribution of ζ given S_1, …, S_p is normal with variance σ². The covariance of any two least squares estimators β̂_j and β̂_k is
Cov(β̂_j, β̂_k) = σ² r̂_jk / (n σ̂_j σ̂_k).  (16)
Furthermore, since β_j is the solution to Eq 15, E(β̂_j) = β_j.
Therefore, the joint distribution of the least squares estimators is (β̂_1, …, β̂_p)^T ~ N_p((β_1, …, β_p)^T, Σ*), where the (j, k)th element of the covariance matrix Σ* is σ² r̂_jk / (n σ̂_j σ̂_k).
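The covariance of marginal least-squares slopes, Cov(β̂_j, β̂_k) = σ² r̂_jk / (n σ̂_j σ̂_k), can be verified by Monte Carlo over repeated draws of ζ for a fixed design; the two-gene design below, with β_j = 0 for simplicity, is our own illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps, sigma2 = 200, 4000, 1.0

# Fixed two-column design with built-in correlation.
S = rng.normal(size=(n, 2))
S[:, 1] = 0.6 * S[:, 0] + 0.8 * S[:, 1]
Sc = S - S.mean(0)
sd = Sc.std(0)                               # sample sds (n in the denominator)
r = np.corrcoef(Sc, rowvar=False)[0, 1]      # sample correlation r_jk

# Marginal least-squares slopes for many independent outcomes zeta ~ N(0, sigma2 I).
zeta = rng.normal(scale=np.sqrt(sigma2), size=(reps, n))
bh = zeta @ Sc / (Sc ** 2).sum(0)            # reps x 2: beta_hat_j per replicate

emp = np.cov(bh, rowvar=False)[0, 1]         # empirical Cov(beta_hat_0, beta_hat_1)
theo = sigma2 * r / (n * sd[0] * sd[1])      # the covariance formula above
```

The empirical and theoretical covariances agree up to Monte Carlo error, which is why the marginal coefficients inherit exactly the gene-gene correlation structure.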
Next, we standardize the β̂_j using their standard deviations, as suggested in [7]:
ϕ_j = √n σ̂_j β̂_j / σ,  j = 1, …, p.  (17)
Then the joint distribution of the standardized estimators is (ϕ_1, …, ϕ_p)^T ~ N_p(μ, Σ), where μ_j = √n σ̂_j β_j / σ for j = 1, …, p, and the (j, k)th element of the covariance matrix Σ is r̂_jk. It is obvious that if S_1, …, S_p are correlated, then ϕ_1, …, ϕ_p are also correlated.
We can rewrite the standardized marginal regression estimators in the form of Eq 1, as a model of true marginal regression coefficients plus error terms:
ϕ_j = μ_j + υ_j,  j = 1, …, p,  (18)
where the true marginal genetic effects μ_j are assumed to follow a mixture distribution of a point mass at 0 (corresponding to no effect) and a nonparametric distribution (corresponding to having an effect). The error terms of the standardized marginal regression coefficients, υ_1, …, υ_p, follow a N_p(0, Σ) distribution, and the covariance matrix Σ can have an arbitrary dependence structure. We use the empirical estimates r̂_jk, σ̂_j and σ̂ to calculate the standardized estimators ϕ_j and the covariance matrix Σ.
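Putting Eqs 15–18 together, the sketch below computes marginal slopes for a small synthetic expression matrix and standardizes them as in Eq 17. The design, the effect size, and the use of the sample standard deviation of ζ as a rough stand-in for σ are all our own simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 90, 50

S = rng.normal(size=(n, p))
S[:, 1] = 0.7 * S[:, 0] + 0.3 * S[:, 1]       # make genes 0 and 1 highly correlated
zeta = S[:, 0] + rng.normal(size=n)           # only gene 0 truly affects the outcome

Sc = S - S.mean(0)
zc = zeta - zeta.mean()
sd_j = Sc.std(0)                              # sigma_hat_j, sample sd of each gene
beta_hat = Sc.T @ zc / (Sc ** 2).sum(0)       # marginal least-squares slopes
sigma_hat = zc.std()                          # rough stand-in for sigma

phi = np.sqrt(n) * sd_j * beta_hat / sigma_hat   # Eq 17
corr = np.corrcoef(Sc, rowvar=False)             # (j, k)th element of Sigma
```

Correlated genes inherit correlated ϕ_j's (here gene 1 shows a large ϕ purely through its correlation with gene 0), which is exactly why the DepEB machinery is needed before reading sparsity off the ϕ_j's.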
Gene expression outcome and highly dependent covariance structure
From the gene expression data, we take the expression of CHRNA6 (cholinergic receptor, nicotinic, alpha 6) as the outcome variable (ζ) in Eq 15. Because CHRNA6 is known to be involved in the nicotine-induced activation of dopamine-releasing neurons [21], it is a widely studied subject in nicotine addiction studies.
Among the remaining gene expressions, we follow the method in [19]: for each gene we first calculate its correlations with all other genes, then count the number of correlations greater than 0.6. If this count exceeds 2,200, we consider the gene highly correlated with other genes. There are 2,269 genes satisfying this criterion. We take each of the 2,269 genes as an S_j in Eq 15, and store the gene data in the n × p matrix S, where n = 90 and p = 2,269. Fig 4 illustrates the heatmap of the absolute values of the correlations among the 2,269 genes.
Fig 4. Heatmap of correlation between highly correlated genes.
Next, we regress CHRNA6 (ζ) on each of the 2,269 gene expressions (S_j). We use the empirical estimates r̂_jk, σ̂_j and σ̂ to calculate the standardized least squares regression coefficients ϕ_j, which have the distribution shown in Eq 17. Lastly, using our proposed empirical Bayesian method (DepEB) described in the Materials and methods section, we obtain weakly dependent estimates and estimate the sparsity parameter and the posterior distribution of the genes' true marginal effects on CHRNA6. As a comparison, we also apply the EB estimation procedure proposed by [4] to the highly correlated gene data set.
Analysis results
Our proposed DepEB estimation procedure, which accounts for dependence, identifies 14 genes (0.617% of the studied 2,269 genes) as associated with CHRNA6, while the EB estimation procedure, which ignores dependence, cannot identify any genes related to CHRNA6. Table 2 lists the gene names and the posterior means of the genes' effects on CHRNA6 estimated by the DepEB procedure.
Table 2. Genes associated with CHRNA6 estimated by the DepEB procedure.
| Gene Name | Posterior Mean of the Gene’s effect on CHRNA6 |
|---|---|
| GI_14149729-S | 0.0403 |
| GI_14249217-S | 0.0886 |
| GI_19923528-S | 0.0011 |
| GI_21314756-S | 0.0002 |
| GI_22907051-S | 0.0002 |
| GI_32189368-S | 0.0034 |
| GI_32261328-S | 0.0002 |
| GI_33239450-A | 0.0002 |
| GI_38327038-I | -0.0248 |
| GI_42659728-S | 0.0002 |
| GI_4504190-S | 0.0021 |
| GI_4506330-S | -0.2308 |
| GI_8922084-S | 0.0003 |
| GI_9256536-S | 0.0019 |
Our findings are consistent with the findings in [19, 20], both of which also used the 90 JPT-CHB population’s gene data to identify significant genes associated with CHRNA6. Our proposed DepEB procedure selects genes GI_14249217-S, GI_19923528-S, GI_32189368-S, GI_32261328-S, GI_42659728-S, GI_4506330-S and GI_9256536-S, which are also among genes identified by [19] to be related to CHRNA6. In particular, [19] found that gene GI_42659728-S is very likely and gene GI_4506330-S is extremely likely to be related to the outcome CHRNA6. In addition, gene GI_32189368-S, i.e. POLE2, Homo sapiens polymerase (DNA directed), epsilon 2 (p59 subunit), was also discovered by [20] to be related to CHRNA6.
Conclusion
For feature selection in the normal mean problem, we propose a new DepEB estimator which extends the empirical Bayesian (EB) estimation method proposed by [4]. Our estimator allows an arbitrary dependence structure in the error terms. We first apply an eigenvalue decomposition to decompose the correlated signals into a common-dependence component and weakly dependent random errors. Next, we subtract the common dependence from the signals. We then obtain an approximate likelihood of the sparse signals based on the weakly dependent errors; the existence of sparsity makes this approximation feasible. Finally, we use an iterative maximization algorithm based on a nonparametric kernel density to find the sparsity and the posterior distribution of the signals in a Bayesian estimation model.
In our simulation studies, we consider a number of covariance structures for relatively sparse signals. The simulation results illustrate that when there are moderate or strong correlations in the signals, our DepEB estimation procedure can correctly find the sparsity of the signals and produces a lower MSE than the EB estimation procedure, which ignores the correlation structure in the sparse signals. Furthermore, we apply our proposed estimation procedure to the 90 JPT-CHB population's gene expression data set, which has a highly dependent covariance structure. Our proposed DepEB estimation method outperforms the EB estimation method and identifies genes that are associated with the outcome gene CHRNA6. The findings of the real data analysis are consistent with those of other studies. Hence, this study provides a useful application for feature selection in the presence of strongly dependent covariates.
In the real data analysis of this study, we use marginal regression coefficients as the observed signals for the sparse vector. The reason for using marginal information is to cope with high dimensionality; see the studies in [7, 22]. One possible direction for future research is therefore to use appropriate initial estimators from the full regression model as the observed signals for the sparse vector.
Data Availability
The raw data files for data analysis are available from figshare at: https://doi.org/10.6084/m9.figshare.22491772. The complete R and Matlab code for simulation and real data analysis are available from Github at: https://github.com/wangling03/Mixture/.
Funding Statement
The author(s) received no specific funding for this work.
References
1. George E. & Foster D. Calibration and empirical Bayes variable selection. Biometrika. 87, 731–747 (2000). doi: 10.1093/biomet/87.4.731
2. Tibshirani R., Hastie T., Narasimhan B. & Chu G. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences. 99, 6567–6572 (2002). doi: 10.1073/pnas.082099299
3. Guyon I. & Elisseeff A. An introduction to variable and feature selection. Journal of Machine Learning Research. 3, 1157–1182 (2003)
4. Raykar V. & Zhao L. Nonparametric prior for adaptive sparsity. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. pp. 629–636 (2010)
5. Abramovich F., Benjamini Y., Donoho D. & Johnstone I. Adapting to unknown sparsity by controlling the false discovery rate. The Annals of Statistics. 34, 584–653 (2006). doi: 10.1214/009053606000000074
6. Efron B. & Tibshirani R. On testing the significance of sets of genes. The Annals of Applied Statistics. 1, 107–129 (2007). doi: 10.1214/07-AOAS101
7. Fan J., Han X. & Gu W. Estimating false discovery proportion under arbitrary covariance dependence. Journal of the American Statistical Association. 107, 1019–1035 (2012). doi: 10.1080/01621459.2012.720478
8. Johnstone I. & Silverman B. Needles and straw in haystacks: Empirical Bayes estimates of possibly sparse sequences. The Annals of Statistics. 32, 1594–1649 (2004). doi: 10.1214/009053604000000030
9. Pawitan Y., Calza S. & Ploner A. Estimation of false discovery proportion under general dependence. Bioinformatics. 22, 3025–3031 (2006). doi: 10.1093/bioinformatics/btl527
10. Ploner A., Miller L., Hall P., Bergh J. & Pawitan Y. Using correlations to evaluate low-level analysis procedures for high-density oligonucleotide microarray data. BMC Bioinformatics. 6, 80 (2005). doi: 10.1186/1471-2105-6-80
11. Qiu X., Brooks A., Klebanov L. & Yakovlev A. The effects of normalization on the correlation structure of microarray data. BMC Bioinformatics. 6, 1–11 (2005). doi: 10.1186/1471-2105-6-120
12. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological). 58, 267–288 (1996)
13. Tipping M. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research. 1, 211–244 (2001)
14. Wang L., Shen H., Liu H. & Guo G. Mixture SNPs effect on phenotype in genome-wide association studies. BMC Genomics. 16, 1–11 (2015)
15. Berger J. Statistical Decision Theory and Bayesian Analysis. (Springer Science & Business Media, 2013)
16. Wand M. & Jones M. Kernel Smoothing. (CRC Press, 1994)
17. Thorisson G., Smith A., Krishnan L. & Stein L. The international HapMap project web site. Genome Research. 15, 1592–1593 (2005). doi: 10.1101/gr.4413105
18. Stranger B., Nica A., Forrest M., Dimas A., Bird C., Beazley C., et al. Population genomics of human gene expression. Nature Genetics. 39, 1217–1224 (2007). doi: 10.1038/ng2142
19. Xue F. & Qu A. Variable selection for highly correlated predictors. arXiv preprint arXiv:1709.04840 (2017)
20. Fan J., Shao Q. & Zhou W. Are discoveries spurious? Distributions of maximum spurious correlations and their applications. Annals of Statistics. 46, 989 (2018). doi: 10.1214/17-AOS1575
21. Thorgeirsson T., Gudbjartsson D., Surakka I., Vink J., Amin N., Geller F., et al. Sequence variants at CHRNB3–CHRNA6 and CYP2A6 affect smoking behavior. Nature Genetics. 42, 448–453 (2010). doi: 10.1038/ng.573
22. Fan J. & Lv J. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 70, 849–911 (2008). doi: 10.1111/j.1467-9868.2008.00674.x