Unified tests for fine-scale mapping and identifying sparse high-dimensional sequence associations

Shaolong Cao; Huaizhen Qin; Alexej Gossmann; Hong-Wen Deng; Yu-Ping Wang

doi:10.1093/bioinformatics/btv586

2015 Oct 12;32(3):330–337.

PMCID: PMC5006306 PMID: 26458888

Abstract

Motivation: In searching for genetic variants for complex diseases with deep sequencing data, genomic marker sets of high-dimensional genotypic data and sparse functional variants are quite common. Existing sequence association tests are incapable of identifying such marker sets or individual causal loci, although they appear powerful for identifying small marker sets with dense functional variants. In sequence association studies of admixed individuals, cryptic relatedness and population structure are known to confound the association analyses.

Method: We here propose a unified marker wise test (uFineMap) to accurately localize causal loci and a unified high-dimensional set based test (uHDSet) to identify high-dimensional sparse associations in deep sequencing genomic data of multi-ethnic individuals with random relatedness. These two novel tests are based on scaled sparse linear mixed regressions with L_p (0 < p < 1) norm regularization. They jointly adjust for cryptic relatedness, population structure and other confounders to prevent false discoveries and improve statistical power for identifying promising individual markers and marker sets that harbor functional genetic variants of a complex trait.

Results: In large-scale simulations and real data analyses, the proposed tests appropriately controlled Type I error rates and appeared to be more powerful than several prominent methods. We illustrated their practical utility by applying them to DNA sequence data of the Framingham Heart Study for osteoporosis. The proposed tests identified 11 novel significant genes that were missed by the prominent famSKAT and GEMMA. In particular, four of the six most significant pathways identified by uHDSet but missed by famSKAT have been reported to be related to BMD or osteoporosis in the literature.

Availability and implementation: The computational toolkit is available for academic use: https://sites.google.com/site/shaolongscode/home/uhdset

Contact: wyp@tulane.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

1 Introduction

Deep sequencing technologies have been generating huge amounts of data on rare and common DNA sequence variants. A number of sequence association tests have been developed to identify marker sets, e.g. a group of SNPs or CNVs (copy-number variations), that contain functional genetic variants. Most of these tests, however, do not jointly model cryptic relatedness, population structure and other covariates. With the growing demand for analyzing next generation sequencing data of multi-ethnic individuals, linear mixed models have become popular because of their demonstrated effectiveness in accounting for sample relatedness (Amos, 1994) and population structure, which occurs when there are large-scale systematic differences in genetic ancestry among individuals in a sample. Typical examples include individuals with various levels of immigrant ancestry and individuals who share more recent ancestors than one would expect in a homogeneous population. Cryptic relatedness, which refers to the presence of relatives in a sample of ostensibly unrelated individuals, could pose more serious confounding than population structure (Devlin and Roeder, 1999), especially for samples from small and isolated populations (Voight and Pritchard, 2005). Accounting for population structure is more challenging when family structure or cryptic relatedness is also present (Price et al., 2010). Here we pave the way for correcting the effects of both confounders jointly.

Within the framework of linear mixed models, famSKAT (Chen et al., 2013) and GEMMA (Genome-wide Efficient Mixed Model Association) (Zhou and Stephens, 2012) have emerged as two powerful sequence association tests for identifying small marker sets that harbor dense functional genetic variants. famSKAT is a set based test that extends SKAT to family data. GEMMA is a computationally efficient method for fitting multivariate linear mixed models. These prominent tests require that the number of markers in a testing set be much smaller than the sample size. However, in deep sequencing studies, one quite often encounters high-dimensional data sets (HDS), where the number of marker loci is larger than the sample size and the number of functional variants is very small. The aforementioned tests are incapable of identifying such sparse HDS and their functional variants. Some sparse regression methods were developed to localize individual functional markers from high-dimensional marker sets, jointly modeling pedigree structure and population structure. They include Lasso (Rakitsch et al., 2013), Ridge regression (Endelman, 2011), Elastic-net (Zou and Hastie, 2005) and the USR that we proposed recently (Cao et al., 2014). However, these methods yield biased solutions and are ineffective at preventing false discoveries of random markers and high-dimensional marker sets irrelevant to functional variants.

In this article, we first present a unified test (uFineMap) for accurately localizing individual causal loci. The uFineMap is a marker wise test under a scaled sparse linear mixed regression, which jointly models marker wise effect, relatedness and population stratification. It applies scaled L_p (0 < p < 1) norm regularization to generate a de-biased solution. Next, we present an additional significance test (unified high-dimensional set based test, uHDSet) for identifying high-dimensional sparse associations in deep sequencing genomic data of related individuals. The uHDSet integrates the marker wise statistics of the uFineMap to identify susceptible high-dimensional marker sets. In the uHDSet, the dependence among markers is modeled to appropriately control set-based Type I error rates. Under extensive simulations, the uFineMap outperformed GEMMA (Zhou and Stephens, 2012) and a Scaled Lasso based method (Javanmard and Montanari, 2014). The uHDSet yields higher statistical power than famSKAT and GEMMA. Applications to the Framingham Heart Study also show that our methods yield novel, interesting candidate genes and pathways for follow-up studies, demonstrating their advantages over the two prominent alternative methods. Finally, caveats of the proposed methods and prospective future efforts are discussed.

2 Methods

We focus on constructing statistical tests for high-dimensional genetic data with cryptic relatedness. We propose two significance tests: the uFineMap test (a single marker/variant test) and the uHDSet test (a unified high-dimensional set test, or whole regional test). Similar to Bühlmann (2013) and Javanmard and Montanari (2014), we develop the uFineMap significance test for single variants based on scaled sparse regression (Sun and Zhang, 2012), which is a generalization of ordinary sparse regression. Furthermore, we build a new statistic for the uHDSet test from a combination of marker wise statistics. The uHDSet test allows us to identify susceptible genes or genetic regions instead of single variants.

2.1 Unified scaled L_p norm regularized regression

We first define some basic notation. Let n denote the number of subjects, m the number of independent variables (SNPs) and L the number of covariates. The dependent variable $Y = {(y_{1}, y_{2}, \dots, y_{n})}^{T}$ holds the phenotype of each subject. $X = {(x_{1}, x_{2}, \dots, x_{n})}^{T}$ is an $n \times m$ matrix whose ith row, given by $x_{i} = {(x_{i 1}, x_{i 2}, \dots, x_{i m})}^{T}$, represents the genotype data for the ith subject. Typically, genotypes are coded as 0, 1 or 2, denoting the number of copies of the minor allele. $Φ_{n \times n} = (φ_{i, j})$ is the kinship matrix or IBD (identity-by-descent) matrix. The kinship coefficient $φ_{i, j}$ measures the relatedness between individuals i and j. $W = {(w_{1}, w_{2}, \dots, w_{n})}^{T}$ is an $n \times L$ matrix whose ith row, given by $w_{i} = {(w_{i 1}, w_{i 2}, \dots, w_{i L})}^{T}$, represents the covariates of the ith subject, e.g. age, sex, height and weight.

We assume that the phenotypes, genotypes and covariates are associated with the following linear mixed model:

$$Y = W α + X β + ε \tag{1}$$

where $ε \sim N (0, Σ)$ with $Σ = σ_{Φ}^{2} Φ + σ_{e}^{2} I_{n}$ , and $α = {(α_{1}, α_{2}, \dots, α_{L})}^{T}$ and $β = {(β_{1}, β_{2}, \dots, β_{m})}^{T}$ are the corresponding regression coefficients. Both the EMMA and GEMMA methods can estimate the variance component ratio $σ_{Φ}^{2} / σ_{e}^{2}$ of the covariance matrix $Σ$ . In this article, we use the GEMMA method to estimate this ratio.

In model (1), the regression coefficients $β = {(β_{1}, β_{2}, \dots, β_{m})}^{T}$ represent the effects of the variants, which are the quantities of primary interest. However, the high dimensionality of genetic data easily leads to over-fitting under an ordinary regression model. To overcome this issue, a general form of the unified sparse regression model with L_p (0 < p < 1) norm regularization was proposed in the USR paper (Cao et al., 2014) as the following minimization problem:

$$(\hat{β}, \hat{α}) = \underset{β \in R^{m}, α \in R^{L}}{\arg\min} \; {(Y - W α - X β)}^{T} Σ^{- 1} (Y - W α - X β) + λ \| β \|_{p}^{p} \tag{2}$$

where the L_p (0 < p < 1) norm regularization is defined by

$$λ \| β \|_{p}^{p} = λ \sum_{i = 1}^{m} | β_{i} |^{p}, \quad 0 < p < 1$$

It is well known (Cao et al., 2014; Chen et al., 2010; Xu et al., 2012) that L_p (0 < p < 1) norm regularization results in a sparser solution than L1 norm regularization, which was widely popularized by the Lasso (least absolute shrinkage and selection operator) (Tibshirani et al., 1996). In particular, previous simulation results in Cao et al. (2014) suggest using L_0.3 norm regularization to achieve a proper sparsity level of the solution with great computational efficiency. To keep the method flexible, we also offer users different choices for the L_p (0 < p < 1) norm in our R code.
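As a small numerical illustration of the penalty in Equation (2), the following sketch evaluates $λ \sum_i |β_i|^p$ for p = 0.3 and for the Lasso case p = 1. The function name `lp_penalty` and the two example vectors are illustrative assumptions and are not taken from the authors' R code. Both vectors have the same L1 norm, so the L1 penalty cannot distinguish them, whereas the L_0.3 penalty is smaller for the vector with a single non-zero entry.

```python
def lp_penalty(beta, lam, p):
    """Return lam * sum_i |beta_i|**p for 0 < p <= 1 (p = 1 is the Lasso penalty)."""
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1]")
    return lam * sum(abs(b) ** p for b in beta)


# Two hypothetical coefficient vectors with identical L1 norm.
sparse = [1.0, 0.0, 0.0, 0.0]      # one non-zero coefficient
dense = [0.25, 0.25, 0.25, 0.25]   # same total magnitude spread over four

for p in (1.0, 0.3):
    print(p, lp_penalty(sparse, 1.0, p), lp_penalty(dense, 1.0, p))
```

With p = 0.3 the dense vector is penalized more heavily than the sparse one, which is one way to see why minimizing Equation (2) tends to favor sparser solutions than the L1 penalty does.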

In addition to the selection of the L_p norm, the regularization (tuning) parameter $λ$ largely affects the solution of Equation (2) as well. In general, the choice of $λ$ is regarded as a difficult problem. Popular methods for selecting $λ$ include minimization of either the Bayesian information criterion (BIC) or the Akaike information criterion (AIC) as a function of $λ$ , cross-validation, and stability selection (Meinshausen and Bühlmann, 2010). However, none of these methods can be applied to control the Type I error, especially for a region-based significance test.

By adopting the idea of scaled sparse linear regression (Sun and Zhang, 2012), which jointly estimates the regression coefficients and the noise level of the data, we avoid the regularization parameter selection problem. The estimated noise level is used for bias correction. The obtained de-biased estimator is applied to perform marker wise significance tests for each variant.

The scaled L_p norm based sparse regression model is given by

$$(\hat{β}, \hat{α}, \hat{σ}) = \underset{β \in R^{m}, α \in R^{L}, σ > 0}{\arg\min} \left\{ \frac{{(Y - W α - X β)}^{T} Σ^{- 1} (Y - W α - X β)}{2 n σ} + \frac{σ}{2} + λ \| β \|_{p}^{p} \right\} \tag{3}$$

In the unified scaled sparse regression the tuning parameter $λ$ is updated iteratively, which requires an initial value $λ_{0}$ . However, the sensitivity of the results to the selection of $λ_{0}$ is low. Moreover, de-biased estimators can be constructed to balance out the bias in the estimated noise level $\hat{σ}$ and the bias caused by the L_p norm regularization, which are both proportional to the initial $λ_{0}$ . The asymptotic distribution of the de-biased estimators can then be derived without major difficulties.

To solve the optimization problem (3), we combine the algorithm for unified L_p norm based sparse regression with that for the general scaled sparse regression (Sun and Zhang, 2012) and propose the following algorithm.

The algorithm for unified scaled sparse regression (3)

Step 1: Center the data so that $\sum_{i = 1}^{n} x_{i j} = 0$ for $j = 1, 2, \dots, m$ .
Step 2: Initialize $λ^{(0)} = λ_{0} = 2 \sqrt{\frac{\log (m)}{n}}$ , ${\hat{σ}}^{(0)} = \sqrt{\frac{Y^{T} Σ^{- 1} Y}{n}}$ , ${\hat{α}}^{(0)} = 0$ and ${\hat{β}}^{(0)} = 0$ ; set the iteration index r = 0 and the convergence tolerance $ε = 0.0001$ (not to be confused with the noise term of model (1)).
Step 3: Update $\hat{σ}$ , $λ$ , $\hat{β}$ and $\hat{α}$ in turn:
$$\begin{array}{l} {\hat{σ}}^{(r + 1)} = \sqrt{\frac{1}{n} {(Y - W {\hat{α}}^{(r)} - X {\hat{β}}^{(r)})}^{T} Σ^{- 1} (Y - W {\hat{α}}^{(r)} - X {\hat{β}}^{(r)})} \\ λ^{(r + 1)} = {\hat{σ}}^{(r + 1)} λ_{0} \end{array}$$
Update the regression coefficients by the USR method (Cao et al., 2014):
$$({\hat{β}}^{(r + 1)}, {\hat{α}}^{(r + 1)}) = \underset{β \in R^{m}, α \in R^{L}}{\arg\min} \left\{ \frac{1}{2 n {\hat{σ}}^{(r + 1)}} {(Y - W α - X β)}^{T} Σ^{- 1} (Y - W α - X β) + λ^{(r + 1)} \| β \|_{p}^{p} \right\}$$
Step 4: If $\| {\hat{β}}^{(r + 1)} - {\hat{β}}^{(r)} \|_{2} < ε$ , stop; otherwise set $r \leftarrow r + 1$ and return to Step 3.
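The alternating structure of Steps 1 to 4 can be sketched in Python as follows. This is a minimal illustration under several loudly stated assumptions, not the authors' implementation: (i) the inner USR L_p solver is replaced by a plain L1 (Lasso) coordinate descent, because the USR algorithm is not described in this section; (ii) for a fixed noise level the inner problem is written in the usual scaled-Lasso form of Sun and Zhang (2012), $\frac{1}{2n} \mathrm{RSS} + \hat{σ} λ_{0} \| β \|_{1}$; (iii) covariates are omitted, i.e. Y is assumed to be covariate-adjusted and centered. The names `lasso_cd` and `scaled_sparse_regression` are hypothetical.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator, the proximal map of t * |.|."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def lasso_cd(X, y, lam, beta_init, n_sweeps=500, tol=1e-8):
    """Coordinate descent for (1/(2n)) * ||y - X b||^2 + lam * ||b||_1.

    Stand-in for the USR L_p solver (assumption: p = 1 instead of 0 < p < 1).
    """
    n, m = X.shape
    beta = beta_init.copy()
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y - X @ beta
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(m):
            if col_sq[j] == 0.0:
                continue
            old = beta[j]
            rho = X[:, j] @ resid / n + col_sq[j] * old
            new = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (old - new)   # keep the residual in sync
            beta[j] = new
            max_change = max(max_change, abs(new - old))
        if max_change < tol:
            break
    return beta


def scaled_sparse_regression(X, y, Sigma, eps=1e-4, max_iter=100):
    """Alternate between the noise level, the tuning parameter and beta."""
    n, m = X.shape
    # Whitening: ||L^{-1}(y - X b)||^2 = (y - X b)^T Sigma^{-1} (y - X b).
    L = np.linalg.cholesky(Sigma)
    Xw = np.linalg.solve(L, X - X.mean(axis=0))   # Step 1: center, then whiten
    yw = np.linalg.solve(L, y)
    lam0 = 2.0 * np.sqrt(np.log(m) / n)           # Step 2
    beta = np.zeros(m)
    for _ in range(max_iter):
        resid = yw - Xw @ beta
        sigma = np.sqrt(resid @ resid / n)        # Step 3: noise level
        lam = sigma * lam0                        # Step 3: tuning parameter
        beta_new = lasso_cd(Xw, yw, lam, beta)    # Step 3: coefficients
        converged = np.linalg.norm(beta_new - beta) < eps   # Step 4
        beta = beta_new
        if converged:
            break
    return beta, sigma, lam
```

Whitening by the Cholesky factor of $Σ$ turns the generalized least-squares term into an ordinary residual sum of squares, so any standard sparse solver can be plugged into the inner step. Swapping `lasso_cd` for an L_p solver would bring the sketch closer to the procedure described above.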

2.2 The bias correction of unified scaled L_p norm regularized sparse regression

Lasso, Ridge regression, and many other popular regression methods utilize a regularization term, in order to obtain a stable solution on an HDS. The L₁ norm regularization term used in Lasso typically shrinks many regression coefficients to zero. This, however, introduces a bias making the non-zero regression coefficients smaller in magnitude.

Adopting the idea of de-biased estimation (Javanmard and Montanari, 2014), we develop an unbiased estimator that removes the shrinkage bias from the regression coefficients and allows us to derive the corresponding asymptotic Gaussian distribution. A detailed algorithm is presented below.

The algorithm for the unbiased estimator

Step 1: Set $γ = \frac{\hat{λ}}{\hat{σ}}$ , where $\hat{λ}$ and $\hat{σ}$ are the estimated parameters of the unified scaled sparse regression (3).
Step 2: Set $Z = (X^{T} Σ^{- 1} X) / n$ .
Step 3: For i = 1,2, $\dots$ ,m, obtain $u_{i}$ by solving the following constrained convex program, where $e_{i}$ denotes the ith standard basis vector of $R^{m}$ :
$$\begin{array}{ll} \text{minimize} & u_{i}^{T} Z u_{i} \\ \text{subject to} & \| Z u_{i} - e_{i} \|_{\infty} \leq γ \end{array}$$
Because the $u_{i}$ can be computed independently of one another, we parallelize this step to increase the computation speed.
Step 4: Set
$$M = {(u_{1}, u_{2}, \dots, u_{m})}^{T} \tag{4}$$
If any of the above problems is not feasible, then set $M = I_{m \times m}$ .

Step 5: Define the unbiased estimator by
$${\hat{β}}^{u} = \hat{β} + \frac{1}{n} M X^{T} Σ^{- 1} (Y - X \hat{β}) \tag{5}$$

where $\hat{β}$ is the solution of formula (3).
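The correction in Equation (5) is a single matrix expression once M is available. The sketch below applies it and checks the Step 3 feasibility constraint for a given M; it does not solve the convex program itself, which would need a quadratic programming solver. The function names `debias` and `is_feasible` are hypothetical, and M is treated as an input (with the identity fallback of Step 4 as the default).

```python
import numpy as np


def debias(beta_hat, X, y, Sigma, M=None):
    """Equation (5): beta_u = beta_hat + (1/n) M X^T Sigma^{-1} (y - X beta_hat)."""
    n, m = X.shape
    if M is None:
        M = np.eye(m)   # fallback of Step 4 when the program is infeasible
    sigma_inv_resid = np.linalg.solve(Sigma, y - X @ beta_hat)
    return beta_hat + M @ (X.T @ sigma_inv_resid) / n


def is_feasible(M, X, Sigma, gamma):
    """Check ||Z u_i - e_i||_inf <= gamma for every row u_i of M (Step 3)."""
    n, m = X.shape
    Z = X.T @ np.linalg.solve(Sigma, X) / n
    # Row i of M @ Z equals (Z u_i)^T because Z is symmetric.
    return bool(np.max(np.abs(M @ Z - np.eye(m))) <= gamma)
```

As a sanity check on the formula: in a low-dimensional setting where Z is invertible, choosing $M = Z^{-1}$ and starting from $\hat{β} = 0$ makes Equation (5) reduce to the generalized least-squares estimate $(X^{T} Σ^{-1} X)^{-1} X^{T} Σ^{-1} Y$.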

2.3 Hypothesis tests and confidence intervals

To simplify the presentation, we assume that $Y$ is the covariate-adjusted phenotype. With the covariates removed, the true model becomes:

$$Y = X β_{0} + ε, \quad ε \sim N (0, Σ), \quad Σ = σ_{Φ}^{2} Φ + σ_{e}^{2} I_{n} \tag{6}$$

where $β_{0}$ is the vector of true regression coefficients, i.e. the true signal.

We define the support of $β_{0}$ as $S_{0} = \{ i \in \{ 1, 2, \dots, m \} : β_{0, i} \neq 0 \}$ and its sparsity level as $s_{0} = | S_{0} |$ . In this article, we impose a weak sparsity assumption, namely $s_{0} = o (\sqrt{n / \log (m)})$ . Unless stated otherwise, we assume that this condition holds. Although a sparse ground truth is preferred, our method is also robust in the non-sparse setting, according to the simulation results in Supplementary Figs 5.8S and 5.9S in the Appendix.

2.3.1 uFineMap test

For each predictor i, we need a significance test to determine whether the corresponding regression coefficient $β_{i}$ is significant or not. For a specific $i \in \{ 1, 2, \dots, m \}$ , we define the null hypothesis $H_{0}$: $β_{i} = 0$ versus the alternative hypothesis $H_{1}$: $β_{i} \neq 0$ .

Supposing that model (6) holds and considering the unbiased estimator (5), we prove that the following asymptotic distribution holds

$$n ({\hat{β}}^{u} - β_{0}) \overset{d}{\to} N (0, σ^{2} M X^{T} Σ^{- 1} X M^{T}), \tag{7}$$

where $M$ is defined by formula (4). The detailed proof is given in Theorem 1 in the Appendix.

With this theorem, we can directly derive the significance test for each marker, i.e. the uFineMap test. The p-value for each variable can be calculated as follows:

$$P (i) = 2 \left( 1 - Φ \left( \frac{n | {\hat{β}}_{i}^{u} |}{\hat{σ} \sqrt{{[M X^{T} Σ^{- 1} X M^{T}]}_{i, i}}} \right) \right), \quad i = 1, 2, \dots, m \tag{8}$$

where $Φ$ here denotes the cumulative distribution function of a standard normal distribution (not the kinship matrix of Section 2.1).
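Equation (8) can be evaluated directly once the de-biased estimate, the noise level and M are available. The following sketch is one way to do so with NumPy and the standard library; the function name `ufinemap_pvalues` and its argument layout are assumptions made for illustration. It uses the identity $2 (1 - Φ(z)) = \mathrm{erfc}(z / \sqrt{2})$ to avoid loss of precision for very small p-values.

```python
import math

import numpy as np


def ufinemap_pvalues(beta_u, sigma_hat, M, X, Sigma):
    """Two-sided marker wise p-values following Equation (8)."""
    n = X.shape[0]
    A = X.T @ np.linalg.solve(Sigma, X)             # X^T Sigma^{-1} X
    var_diag = np.einsum("ij,jk,ik->i", M, A, M)    # diagonal of M A M^T
    z = n * np.abs(beta_u) / (sigma_hat * np.sqrt(var_diag))
    # 2 * (1 - Phi(z)) written with the complementary error function.
    return np.array([math.erfc(zi / math.sqrt(2.0)) for zi in z])
```

Only the diagonal of $M X^{T} Σ^{-1} X M^{T}$ is needed here, so the `einsum` call avoids forming the full m by m product.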

2.3.2 uHDSet test

The next major question is how to control the family-wise error rate (FWER) when declaring a whole genetic region significant. The Bonferroni–Holm correction and other existing multiple testing corrections can be used to control the FWER or the false discovery rate (Benjamini and Hochberg, 1995, 2000). We instead aim to develop a powerful and efficient multiple testing adjustment that takes the dependence among markers into consideration, which should be more powerful than adjustments that ignore correlation.

For the uHDSet test, the null hypothesis is $H_{0}$: $β_{1} = β_{2} = \dots = β_{m} = 0$ , and the alternative hypothesis is $H_{1}$: $\exists β_{i} \neq 0, i \in \{ 1, 2, \dots, m \}$ .

Inspired by the idea of van de Geer et al. (2014), we develop a new statistic for the uHDSet significance test from the uFineMap statistics: $S = \max_{i \in \{ 1, 2, \dots, m \}} \frac{n | {\hat{β}}_{i}^{u} |}{\hat{σ} \sqrt{{[M X^{T} Σ^{- 1} X M^{T}]}_{i, i}}}$ .

For an arbitrary $z \in R$ , the following holds

$$P (S \leq z \mid X) - P \left( \max_{i \in \{ 1, 2, \dots, m \}} \frac{| W_{i} |}{\hat{σ} \sqrt{{[M X^{T} Σ^{- 1} X M^{T}]}_{i, i}}} \leq z \;\Big|\; X \right) \to 0$$

where $W \sim N (0, {\hat{σ}}^{2} M X^{T} Σ^{- 1} X M^{T})$ is a Gaussian random vector (not the covariate matrix of Section 2.1). The proof is presented in Theorem 2 in the Appendix.

Under the null hypothesis $H_{0}$: $β_{1} = β_{2} = \dots = β_{m} = 0$ , the statistic S is asymptotically equivalent to the maximum of the absolute values of a series of dependent standard normal variables (equivalently, $S^{2}$ behaves as the maximum of dependent $χ^{2} (1)$ variables), whose distribution depends on the design matrix through $X^{T} Σ^{- 1} X$ . For any fixed matrix $X^{T} Σ^{- 1} X$ , we simulate this null distribution and use its quantiles to estimate the p-value of the uHDSet statistic S.
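The simulation-based p-value for S can be sketched as a short Monte Carlo routine. The code below is an illustration under stated assumptions rather than the authors' implementation: the name `uhdset_pvalue`, the number of draws and the "+1" finite-sample correction of the Monte Carlo p-value are choices made here. Null draws are generated as $M X^{T} Σ^{-1/2} g$ with $g \sim N(0, I_{n})$, which has the required covariance $M X^{T} Σ^{-1} X M^{T}$ and works even when m > n makes that covariance rank deficient; the factor $\hat{σ}$ cancels in the standardized null statistic.

```python
import numpy as np


def uhdset_pvalue(beta_u, sigma_hat, M, X, Sigma, n_draws=10000, seed=0):
    """Monte Carlo p-value for S = max_i n|beta_u_i| / (sigma_hat * sd_i)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    L = np.linalg.cholesky(Sigma)
    B = M @ np.linalg.solve(L, X).T       # m x n, with B B^T = M X^T Sigma^{-1} X M^T
    sd = np.sqrt((B ** 2).sum(axis=1))    # square roots of the diagonal entries
    S = np.max(n * np.abs(beta_u) / (sigma_hat * sd))
    G = rng.standard_normal((n_draws, n))
    # Each row of G @ B.T is a draw of W / sigma_hat; standardize and take the max.
    null_max = np.max(np.abs(G @ B.T) / sd, axis=1)
    p = (1 + np.sum(null_max >= S)) / (1 + n_draws)
    return S, p
```

Because the null draws share the correlation structure of the markers, the resulting threshold is typically less conservative than a Bonferroni bound when markers are strongly correlated.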

3 Results

To validate our proposed tests, we conducted simulations under various types of pedigree structures to demonstrate their performances comprehensively, in terms of both Type I error rates control and statistical power.

3.1 Nuclear family simulation

We use the following linear model to generate simulation data with nuclear family structure (each family consists of two children and their parents):

$$Y = b X β_{0} + ε, \quad ε \sim N (0, Σ) \tag{9}$$

where b is the effect size of the causal markers and $Σ = \frac{1}{3} Φ + \frac{2}{3} I$ .
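For concreteness, the covariance $Σ = \frac{1}{3} Φ + \frac{2}{3} I$ and a trait drawn from model (9) can be constructed as in the sketch below. The relatedness values are an assumption: Φ is taken to hold twice the kinship coefficients (1 on the diagonal, 0.5 for parent-child and sibling pairs, 0 between the two parents), with each family ordered as father, mother, child 1, child 2. The text does not state which scaling of Φ was used, and the function names are hypothetical.

```python
import numpy as np


def nuclear_family_phi(n_families):
    """Block-diagonal relatedness matrix for (father, mother, child1, child2) families.

    Assumption: entries are twice the kinship coefficient, so the diagonal is 1.
    """
    block = np.array([
        [1.0, 0.0, 0.5, 0.5],
        [0.0, 1.0, 0.5, 0.5],
        [0.5, 0.5, 1.0, 0.5],
        [0.5, 0.5, 0.5, 1.0],
    ])
    return np.kron(np.eye(n_families), block)


def simulate_trait(X, beta0, b, Phi, rng):
    """Draw Y = b * X beta0 + eps with eps ~ N(0, Phi / 3 + 2 I / 3), as in model (9)."""
    n = X.shape[0]
    Sigma = Phi / 3.0 + 2.0 * np.eye(n) / 3.0
    eps = rng.multivariate_normal(np.zeros(n), Sigma)
    return b * (X @ beta0) + eps, Sigma
```

Under this scaling every individual has unit trait variance under the null, with one third of it attributable to the shared family component.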

We randomly assign 30% of the variables to be rare variants [minor allele frequency (MAF) < 1%], 20% to be low frequency variants (1% < MAF < 5%) and the remaining variables to be common variants (5% < MAF < 50%).

3.1.1 Data generation

The basic procedure of performing nuclear family simulation is as follows:

Step 1: Given the MAF of each variable, set the ground truth $β_{0}$ with 10 causal variants (five of them rare variants); set the correlation matrix $K_{i j} = ρ^{| i - j |}$ , where $i, j \in \{ 1, 2, \dots, m \}$ and the coefficient $ρ$ determines the correlation between each pair of variables. We set $ρ = 0.6$ throughout the simulation.
Step 2: Sample $E_{n \times m}^{1} \sim N (0, I \otimes K)$ and $E_{n \times m}^{2} \sim N (0, I \otimes K)$ .
Step 3: For each subject i and variable j, set the genotype to $X_{i j} = I (E_{i j}^{1} > Φ^{- 1} (\mathrm{MAF} (j))) + I (E_{i j}^{2} > Φ^{- 1} (\mathrm{MAF} (j)))$ , where $I (\cdot)$ is the indicator function and $Φ^{- 1}$ is the standard normal quantile function.
Step 4: Generate the vector of trait values of the n subjects according to model (9) for a given b. The selection of b is discussed in Section 3.1.3.
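Steps 1 to 3 amount to thresholding two correlated Gaussian draws per subject. The sketch below follows that recipe with one deliberate deviation that should be read as an assumption about the intended coding: taken literally, the indicator in Step 3 equals 1 with probability 1 − MAF(j), so the code thresholds at $Φ^{-1}(1 - \mathrm{MAF}(j))$ instead, which makes the expected minor allele frequency equal to MAF(j). The function name `simulate_genotypes` is hypothetical.

```python
from statistics import NormalDist

import numpy as np


def simulate_genotypes(n, maf, rho=0.6, seed=0):
    """Simulate an n x m genotype matrix with entries in {0, 1, 2}."""
    rng = np.random.default_rng(seed)
    maf = np.asarray(maf, dtype=float)
    m = maf.size
    idx = np.arange(m)
    K = rho ** np.abs(idx[:, None] - idx[None, :])   # Step 1: K_ij = rho^|i-j|
    L = np.linalg.cholesky(K)
    # Assumption: threshold chosen so that a latent draw exceeds it with probability MAF.
    thresh = np.array([NormalDist().inv_cdf(1.0 - f) for f in maf])
    X = np.zeros((n, m), dtype=int)
    for _ in range(2):                               # Step 2: two latent draws E^1, E^2
        E = rng.standard_normal((n, m)) @ L.T        # rows are N(0, K)
        X += (E > thresh).astype(int)                # Step 3: sum of two indicators
    return X
```

Rows are drawn independently, which matches the $I \otimes K$ covariance in Step 2: markers are correlated within a subject but subjects are independent at the genotype level.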

3.1.2 Type I error rates evaluation

To check whether the proposed significance tests control the Type I error rates, we generated genotype data by the procedure in Section 3.1.1, setting n = 500 and m = 1000. The trait value is generated by $Y = ε \sim N (0, Σ)$ . We replicated this simulation 1000 times and recorded the corresponding p-values to draw quantile–quantile (Q-Q) plots. Under the null hypothesis, the p-values should follow the uniform distribution U(0,1).

Figure 1 shows that most points lie near the diagonal line, as expected. The two dashed curves represent the 95% concentration band (CB). With all the points concentrated within the 95% CB, we conclude that the observed p-values follow the uniform distribution over the interval (0,1). The Q-Q plot confirms that the Type I error rate of the uFineMap test is appropriately controlled.

Fig. 1. — The Q-Q plot for uFineMap test. The x axis is negative log10 of expected p-values, and the y axis represents negative log10 of observed p-values

Figure 2 shows that the distribution of the uHDSet test’s p-values agrees with the uniform distribution, indicating the validity of the multiple testing adjustment. Therefore, we conclude that both the uFineMap test and the uHDSet test control the Type I error rates appropriately.

Fig. 2. — The Q-Q plot for uHDSet test. The x axis is negative log10 of expected p-values, while the y axis represents negative log10 of observed p-values

3.1.3 Statistical power analysis

The design matrix is simulated by the same procedure as in Section 3.1.1. As is typical, we set the nominal significance level at 0.05 and generated the trait values for various values of the heritability H. We define the heritability H as the ratio of the variance of the true signal to the total variance of the trait value, which can be written explicitly as:

$$H = \frac{b^{2} \mathrm{Var} (X β_{0})}{\mathrm{Var} (Y)} = \frac{b^{2} \mathrm{Var} (X β_{0})}{b^{2} \mathrm{Var} (X β_{0}) + \mathrm{Var} (ε)}$$

Solving for b, we have $b = \sqrt{\frac{H \, \mathrm{Var} (ε)}{(1 - H) \, \mathrm{Var} (X β_{0})}}$ .
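The relation between H and b is easy to verify numerically. The sketch below computes b from a target heritability and checks that plugging it back into the definition recovers H; the function name `effect_size` and the use of the sample variance of $X β_{0}$ are assumptions made for illustration.

```python
import numpy as np


def effect_size(H, signal, var_eps):
    """Return b = sqrt(H * Var(eps) / ((1 - H) * Var(signal))) for 0 <= H < 1."""
    if not 0.0 <= H < 1.0:
        raise ValueError("H must lie in [0, 1); b diverges as H approaches 1")
    return np.sqrt(H * var_eps / ((1.0 - H) * np.var(signal)))


rng = np.random.default_rng(0)
signal = rng.standard_normal(1000)          # hypothetical values of X @ beta0
b = effect_size(0.3, signal, var_eps=1.0)
implied_H = b ** 2 * np.var(signal) / (b ** 2 * np.var(signal) + 1.0)
print(b, implied_H)
```

Note that b grows without bound as H approaches 1, so a power curve over H can only be evaluated strictly below that endpoint.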

Let the ground truth signal be $β_{0} (i) = \begin{cases} 1 & i \in \{ 1, 3, 5, 7, 9, 11 \} \\ 0 & \text{otherwise} \end{cases}$ , i.e. the true marker set to be recovered. We set two of the causal variants to be rare variants and the other four to be common variants. We increased the heritability H from 0 to 1 and calculated the power at each point. To save computational time, we repeated the procedure only 2000 times for each given H.

The statistical power for the uFineMap test is defined as $\text{Power} = \frac{1}{T} \sum_{t = 1}^{T} s_{0}^{- 1} \sum_{i \in S_{0}} I [P (i) < 0.05]$ , where T is the number of simulation replicates, $P (i)$ is the uFineMap p-value of the ith marker in the t-th replicate and $I [\cdot]$ is the indicator function.

For the uHDSet test, $P (t)$ represents the p-value calculated in the t-th simulation. We define the empirical statistical power to be

$$\text{Power} = \frac{1}{T} \sum_{t = 1}^{T} I [P (t) < 0.05]$$
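Both power definitions are simple averages of indicator functions. The following sketch computes them from stored p-values; the function names and the toy inputs are hypothetical and only illustrate the arithmetic.

```python
def ufinemap_power(pvalue_reps, causal_idx, alpha=0.05):
    """Average over replicates of the fraction of causal markers with p < alpha."""
    total = 0.0
    for pvals in pvalue_reps:
        hits = sum(1 for i in causal_idx if pvals[i] < alpha)
        total += hits / len(causal_idx)
    return total / len(pvalue_reps)


def uhdset_power(set_pvalues, alpha=0.05):
    """Fraction of replicates whose set-level p-value falls below alpha."""
    return sum(1 for p in set_pvalues if p < alpha) / len(set_pvalues)


# Toy example: two replicates, three markers, markers 0 and 2 causal.
reps = [[0.01, 0.50, 0.20], [0.01, 0.50, 0.001]]
print(ufinemap_power(reps, causal_idx=[0, 2]))    # (1/2 + 2/2) / 2
print(uhdset_power([0.01, 0.20, 0.04, 0.90]))     # 2 of 4 replicates
```

The marker wise power averages over causal markers within each replicate, so it rewards detecting every causal variant, whereas the set-level power only asks whether the region as a whole is declared significant.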

To evaluate our method, we compare the uFineMap test with other high-dimensional inference methods [e.g. Scaled Lasso (Javanmard and Montanari, 2014), the single marker χ² test and GEMMA (Zhou and Stephens, 2012)]. For the uHDSet test comparison, we additionally consider a popular region-based association test, famSKAT (Chen et al., 2013), the family-data extension of SKAT (Wu et al., 2011). The results are shown in Figs 3 and 4, respectively.

Fig. 4. — Power versus sample size for uFineMap tests

In Fig. 3, the uFineMap test performs uniformly better than Scaled Lasso test, Gemma and the single marker test. It indicates that the uFineMap test has a noticeable power gain to identify both common and rare causal variants.

Figure 4 evaluates the performance of the different methods as the sample size changes. It shows that our uFineMap test overall outperforms the other methods, especially when the sample size is small. Meanwhile, all the competing methods show a similar pattern for large samples.

Similar to Figs 3 and 4, Figs 5 and 6 indicate that the statistical power of all regional tests increases with sample size and heritability, which is consistent with our expectation. In addition, at lower sample sizes, our uHDSet test performs much better than famSKAT and GEMMA. As the sample size increases, the powers of the three methods converge to the same value.

Fig. 5. — Power versus heritability for regional tests. The legend ‘uHDSet’ stands for our proposed method

Fig. 6. — Power versus sample size for uHDSet tests

In conclusion, our proposed tests have higher power than the competing methods regardless of heritability, while all methods perform almost equally well when the sample size is large.

3.2 Complex family simulation

To further compare different methods fairly, instead of using our own or over-simplified simulation data, we used the software SeqSIMLA2. SeqSIMLA2 can simulate sequence data in families under quantitative disease models.

Using SeqSIMLA2, we generate quantitative traits for 8 large families with 67 individuals (the family tree for each family is shown in Supplementary Appendix Fig. 5.1S) with 1000 SNPs in total.

3.2.1 Type I error rates evaluation

To verify the validity of our proposed tests, we need to evaluate whether the Type I error is well controlled under the null hypothesis. Supplementary Figs 5.1S and 5.2S (in the Appendix) show the Q-Q plots for the uFineMap test and the uHDSet test, respectively. The results indicate that the Type I error rates are appropriately controlled under complex family structure.

3.2.2 Power comparison

We randomly assign 50 causal variants (25 common, 25 rare) to generate the continuous phenotype. Additionally, we consider two simulation settings for the marker effects. In the same causal direction setting, all causal markers are positively related to the trait value. In the different causal direction setting, half of the causal markers are randomly given a negative relationship with the trait value.

Figures 7 and 8 present the comparison of the three competing methods under the same direction and different direction settings, respectively. Similar patterns also occurred in the comparison of the marker wise tests. To keep the presentation concise, we only show the results of the regional tests; the results of the marker wise tests can be found in the Appendix (Supplementary Figs 5.3S and 5.4S). We conclude that all three methods are robust with respect to the direction of the causal variants, but our uHDSet test is almost uniformly more powerful than GEMMA and famSKAT on the SeqSIMLA simulation data.

Fig. 7. — Power comparison with same causal direction

Fig. 8. — Power comparison with different causal direction

3.3 Analysis of sequence data from Framingham Heart Study

To demonstrate the effectiveness of our methods for detecting real genetic variants, we applied them to sequence data from the Framingham Heart Study. This dataset contains both GWAS and next generation sequencing (NGS) data from 4229 subjects with HipBMD data. We used the FISH (Zhang et al., 2014) method for genotype imputation and selected HipBMD as the phenotype. After quality control, we obtained 3322 individuals with 6 500 475 SNPs in total. We apply two kinds of data analysis strategies: whole genome analysis and pathway-based analysis.

3.3.1 Whole genome analysis

We divide each chromosome into genetic windows of 100 kb and then apply our uFineMap and uHDSet tests in each window. This yields a total of 16 514 sets of markers across the whole genome. The phenotype is adjusted for the covariates and the top 10 principal components of the genotype before the proposed methods are applied. Following the same process as in the simulation studies, we obtain the results and draw the Manhattan plots for the 22 chromosomes, as shown in Figs 9 and 10, respectively. Additional higher-resolution Manhattan plots for the whole genome (i.e. chromosomes 1 to 22) are presented in the Appendix.
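The windowing step can be sketched as a grouping of SNPs by chromosome and position. The sketch assumes fixed, non-overlapping 100 kb windows aligned at position 0 of each chromosome; the text does not specify how window boundaries were placed, and the function name `assign_windows` and the SNP records below are hypothetical.

```python
from collections import defaultdict


def assign_windows(snps, window_bp=100_000):
    """Group (chromosome, position, snp_id) records into fixed-width windows.

    Assumption: windows are non-overlapping and start at position 0.
    """
    windows = defaultdict(list)
    for chrom, pos, snp_id in snps:
        windows[(chrom, pos // window_bp)].append(snp_id)
    return dict(windows)


# Hypothetical SNP records: (chromosome, base-pair position, identifier).
snps = [
    (1, 15_000, "snpA"),
    (1, 99_999, "snpB"),
    (1, 100_000, "snpC"),
    (2, 15_000, "snpD"),
]
print(assign_windows(snps))
```

Each resulting group of SNP identifiers would then select the columns of the genotype matrix that form the design matrix for one window-level test.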

Fig. 9. — The Manhattan plot for uFineMap test of 22 chromosomes. Each point represents p-value of its corresponding SNP

Fig. 10. — The Manhattan plot for uHDSet test of 22 chromosomes. Each point represents p-value of a 100 kb window SNPs region

By combining the overlapping regions of Figs 9 and 10, the uHDSet test reports 68 regions of highest susceptibility that pass a p-value threshold of 0.001. The reported p-value is based on the whole regional test. According to the GeneCards website, 11 genes (Table 1) within the selected regions are associated with BMD or osteoporosis, which further supports our findings. However, these 11 genes are missed by famSKAT and GEMMA. The reported GEMMA p-value for a region is the minimal p-value after Bonferroni correction for the SNPs within the region.
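The region-level p-value described for the single-marker method can be written as a short helper: take the smallest marker p-value in the region, multiply by the number of markers and cap the result at 1. The name `bonferroni_min_p` is hypothetical and the sketch shows only the standard Bonferroni arithmetic, not the exact pipeline used in the study.

```python
def bonferroni_min_p(pvalues):
    """Region-level p-value: Bonferroni-corrected minimum of the marker p-values."""
    k = len(pvalues)
    if k == 0:
        raise ValueError("a region must contain at least one marker")
    return min(1.0, k * min(pvalues))


print(bonferroni_min_p([0.01, 0.20, 0.50]))   # 3 * 0.01
print(bonferroni_min_p([0.60, 0.90]))         # capped at 1.0
```

The cap at 1 means that regions with many markers and no strong single-marker signal receive a p-value of exactly 1.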

Table 1.

The selected susceptibility genes by uHDSet test

Gene	Chromosome	uHDSet p-value	famSKAT p-value	Gemma p-value
DNM3	1	2.47E-06	0.071107033	0.963871
APOB	2	7.43E-05	0.018075521	0.044156
ERC1	12	0.000154572	0.075876014	0.54554
SRD5A1	5	0.000267385	0.227392554	1
NR3C2	4	0.000317415	0.884812719	0.287339
PLCG1	20	0.000487724	0.022591921	1
INSIG2	2	0.00067805	0.73450689	0.29285
CYP24A1	20	0.000719511	0.132626874	1
ITGA1	5	0.000794757	0.143515502	1
BMPR2	2	0.000901023	0.762703102	0.729078
WNT4	1	0.000940191	0.602006435	0.718623

For the marker wise test, the uFineMap test reports 82 susceptible SNPs that pass a p-value threshold of 10⁻⁵. Table 2 shows the six reported SNPs that are associated with BMD or osteoporosis according to the GeneCards website.

Table 2.

The selected susceptibility SNPs by uFineMap test

SNP	Gene	Chromosome	uFineMap p-value	Gemma p-value
rs11571334	ALOX12	17	4.47E-07	4.68E-05
rs3136452	F2	11	5.39E-07	8.37E-05
rs1264891	OVGP1	1	2.36E-06	5.53E-05
rs10513003	ITGA1	5	4.38E-06	2.99E-05
rs1491717	GC	4	5.17E-06	7.43E-05
rs235766	BMP2	20	5.67E-06	2.99E-05

3.3.2 Pathway analysis

To further illustrate the benefit of the uHDSet test, we collect 880 pathways from the KEGG, REACTOME and BIOCARTA pathway databases. We first extract the genes belonging to each pathway, then select the corresponding SNPs. The selected SNPs of a specific pathway are combined to form the design matrix for the association tests. Table 3 lists the six most significant pathways that pass the p-value cut-off of 10⁻³ and that famSKAT fails to detect. The two P38/MAPK pathways were previously reported to play a critical role in BMD or osteoporosis (Kim et al., 2013; Lee et al., 2008). The Endogenous Sterols pathway was also reported to be related to BMD in another study (Warriner and Saag, 2013). The Chemokines pathway is an important regulator of development, homeostasis and pathophysiological processes associated with osteoporosis (Lazennec and Richmond, 2010).

Table 3.

The selected functional pathways by uHDSet test only

Pathway name	uHDSet p-value	famSKAT p-value
REACTOME_FACILITATIVE_NA_INDEPENDENT_GLUCOSE_TRANSPORTERS	5.00E-05	0.05809
REACTOME_ACTIVATED_TAK1_MEDIATES_P38_MAPK_ACTIVATION	7.00E-05	0.05635
REACTOME_P38MAPK_EVENTS	8.00E-05	0.09401
REACTOME_ENDOGENOUS_STEROLS	0.00016	0.00110
REACTOME_CHEMOKINE_RECEPTORS_BIND_CHEMOKINES	3.00E-04	0.07827
KEGG_GLYCOSPHINGOLIPID_BIOSYNTHESIS_GLOBO_SERIES	0.00065	0.13751

Each p-value in Table 3 is computed for a whole pathway-based region. The results show that our uHDSet method is more powerful than famSKAT in identifying significant pathways that contain a relatively large number of genetic markers.

4 Conclusion

Some promising association tests with the adjustment of family structure have been established on the LDSs (low dimensional sets). However, these tests would suffer power loss in high dimensional data. To overcome the limitations of these tests, we propose the uFineMap and uHDSet tests for assessing the significance of the HDSs with cryptic relatedness, which are based on novel scaled linear mixed sparse regressions. The proposed tests are designed to address the challenge of variants detection under complex pedigree structures, which implement an explicit way to properly control the Type I error rates at both single marker level and SNPs set level.

The results on both simulated and real data indicate that the uFineMap and uHDSet tests achieve considerably higher statistical power than competing methods, especially for high-dimensional data with cryptic relatedness. The uFineMap test can pinpoint individual susceptibility variants at higher resolution, even for rare functional variants. In addition, our methods maintain substantial power for detecting susceptibility variants in low-dimensional data from large samples. Finally, our methods can identify both rare and common variants efficiently.

One limitation of the proposed methods is that they assume a linear mixed relationship between phenotype and genotype, which might not hold in practice. In such cases, non-linear regression models with adjustment for relatedness and population stratification may be more suitable. In addition, the overall computational complexity is $O(n^{2} m^{3})$, which is much higher than that of simply solving the sparse linear mixed model or of other efficient methods designed for LDSs, and is particularly burdensome for extremely large data. To mitigate this issue, we implement parallel computing to reduce the total computational time for large-scale genetic data analyses.
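To give a feel for what the stated $O(n^{2} m^{3})$ complexity implies, the following back-of-the-envelope sketch compares operation counts relative to a reference problem size. It only evaluates the leading term and ignores constants and lower-order terms, so it is a rough scaling guide rather than a runtime prediction; the function name `relative_cost` is illustrative, and `n` and `m` are simply the two size parameters appearing in the complexity expression.

```python
def relative_cost(n, m, n_ref, m_ref):
    """Ratio of the leading-order cost n^2 * m^3 to that of a reference problem."""
    return (n / n_ref) ** 2 * (m / m_ref) ** 3

# Doubling m alone multiplies the leading-order cost by 2^3 = 8.
double_m = relative_cost(n=1000, m=2000, n_ref=1000, m_ref=1000)

# Doubling n alone multiplies it by 2^2 = 4.
double_n = relative_cost(n=2000, m=1000, n_ref=1000, m_ref=1000)

# Doubling both multiplies it by 4 * 8 = 32.
double_both = relative_cost(n=2000, m=2000, n_ref=1000, m_ref=1000)
```

Under this scaling, growth in `m` is more expensive than growth in `n`, which is consistent with the need for parallel computing on large marker sets.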

Supplementary Material

Supplementary Data

supp_32_3_330__index.html (896B, html)

supp_btv586_Appendix_for_Unified_tests.docx (15.6MB, docx)

Acknowledgement

Our work is partially supported by NIH R01 GM109068, R01 MH107354 and R01 MH104680. The Framingham Heart Study is conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with Boston University (Contract no. N01-HC-25195).

Conflict of Interest: none declared.

References

Amos C.I. (1994) Robust variance-components approach for assessing genetic linkage in pedigrees. Am. J. Hum. Genet., 54, 535.
Benjamini Y., Hochberg Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Methodol., 57, 289–300.
Benjamini Y., Hochberg Y. (2000) On the adaptive control of the false discovery rate in multiple testing with independent statistics. J. Educ. Behav. Stat., 25, 60–83.
Bühlmann P. (2013) Statistical significance in high-dimensional linear models. Bernoulli, 19, 1212–1242.
Cao S., et al. (2014) A unified sparse representation for sequence variant identification for complex traits. Genet. Epidemiol., 38, 671–679.
Chen H., et al. (2013) Sequence kernel association test for quantitative traits in family samples. Genet. Epidemiol., 37, 196–204.
Chen X., et al. (2010) Lower bound theory of nonzero entries in solutions of l2-lp minimization. SIAM J. Sci. Comput., 32, 2832–2852.
Devlin B., Roeder K. (1999) Genomic control for association studies. Biometrics, 55, 997–1004.
Endelman J.B. (2011) Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome, 4, 250–255.
Javanmard A., Montanari A. (2014) Confidence intervals and hypothesis testing for high-dimensional regression. J. Mach. Learn. Res., 15, 2869–2909.
Kim H.K., et al. (2013) Osteogenic activity of collagen peptide via ERK/MAPK pathway mediated boosting of collagen synthesis and its therapeutic efficacy in osteoporotic bone by back-scattered electron imaging and microarchitecture analysis. Molecules, 18, 15474–15489.
Lazennec G., Richmond A. (2010) Chemokines and chemokine receptors: new insights into cancer-related inflammation. Trends Mol. Med., 16, 133–144.
Lee H.W., et al. (2008) Berberine promotes osteoblast differentiation by Runx2 activation with p38 MAPK. J. Bone Miner. Res., 23, 1227–1237.
Meinshausen N., Bühlmann P. (2010) Stability selection. J. R. Stat. Soc. Ser. B Stat. Methodol., 72, 417–473.
Price A.L., et al. (2010) New approaches to population stratification in genome-wide association studies. Nat. Rev. Genet., 11, 459–463.
Rakitsch B., et al. (2013) A Lasso multi-marker mixed model for association mapping with population structure correction. Bioinformatics, 29, 206–214.
Sun T., Zhang C.-H. (2012) Scaled sparse linear regression. Biometrika, 99, 879–898.
Tibshirani R. (1996) Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Methodol., 58, 267–288.
van de Geer S., et al. (2014) On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Stat., 42, 1166–1202. doi: 10.1214/14-AOS1221.
Voight B.F., Pritchard J.K. (2005) Confounding from cryptic relatedness in case-control association studies. PLoS Genet., 1, e32.
Warriner A.H., Saag K.G. (2013) Glucocorticoid-related bone changes from endogenous or exogenous glucocorticoids. Curr. Opin. Endocrinol. Diabetes Obes., 20, 510–516.
Wu M.C., et al. (2011) Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet., 89, 82–93.
Xu Z.-B., et al. (2012) Representative of L1/2 regularization among Lq (0 < q ≤ 1) regularizations: an experimental study based on phase diagram. Acta Automatica Sinica, 38, 1225–1228.
Zhang L., et al. (2014) FISH: fast and accurate diploid genotype imputation via segmental hidden Markov model. Bioinformatics, 30, 1876–1883.
Zhou X., Stephens M. (2012) Genome-wide efficient mixed-model analysis for association studies. Nat. Genet., 44, 821–824.
Zou H., Hastie T. (2005) Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol., 67, 301–320.

