Screening properties of trend tests in genetic association studies

Zhenzhen Jiang; Hongping Guo; Jinjuan Wang

doi:10.1038/s41598-023-35929-4

2023 Jun 5;13:9139. doi: 10.1038/s41598-023-35929-4

PMCID: PMC10241885 PMID: 37277435

Abstract

In genome-wide association study, extracting disease-associated genetic variants among millions of single nucleotide polymorphisms is of great importance. When the response is a binary variable, the Cochran-Armitage trend tests and associated MAX test are among the most widely used methods for association analysis. However, the theoretical guarantees for applying these methods to variable screening have not been built. To fill this gap, we propose screening procedures based on adjusted versions of these methods and prove their sure screening properties and ranking consistency properties. Extensive simulations are conducted to compare the performances of different screening procedures and demonstrate the robustness and efficiency of MAX test-based screening procedure. A case study on a dataset of type 1 diabetes further verifies their effectiveness.

Subject terms: Genetics, Medical research

Introduction

With the development of high-throughput sequencing techniques, hundreds of thousands of single nucleotide polymorphisms (SNPs) in the genome are recorded, which enables researchers to investigate and treat diseases from the perspective of genetic variants. To identify disease-related genes or genetic markers among all these SNPs, the genome-wide association study (GWAS) is a widely used strategy. To date, more than one hundred thousand SNPs have been identified as related to many traits^1–7.

The commonly used GWAS approach tests the association between the phenotype and each SNP in turn, obtains a series of test statistics or p-values, and selects the associated SNPs by comparing these statistics or p-values with a given threshold. When the phenotype is binary, the Cochran-Armitage trend test (CATT)⁸ is commonly used to detect the associated SNPs. It has been shown that when the underlying genetic model is known (the commonly used ones being the recessive, additive and dominant models), CATT has an optimal form^9,10. However, the true genetic model is usually unknown and may be very complicated. For the sake of robustness, an omnibus test called MAX has been proposed^11,12, which uses the maximum of the CATTs under different genetic models as a measure of association. The asymptotic distribution of MAX is given in the work of Zheng et al.¹³. Since its introduction, MAX has been widely used and investigated. Li et al.¹⁴ introduced a selection procedure based on the rank of MAX. Kim et al.¹⁵ proposed a SNP selection method based on MAX and a penalized support vector machine strategy.

Though CATTs and MAX have concise forms and are extensively used, the theoretical properties of applying CATTs and MAX to GWAS have not been investigated. To control false positives in GWAS, the Bonferroni correction and false discovery rate (FDR) control procedures, such as the Benjamini–Hochberg procedure, are two widely used strategies. But they both assume that all the SNPs are independent, which is inappropriate because linkage disequilibrium usually exists among SNPs, and this may lead to the omission of related SNPs. Considering these drawbacks, feature screening methods are sensible alternatives. Rather than selecting the associated SNPs directly, feature screening approaches aim to first eliminate most of the irrelevant SNPs. After a screening procedure, only a small number of SNPs remain and researchers can concentrate on them, which saves much time and work.

In the last few years, feature screening methods have been proposed for various situations. Fan and Lv¹⁶ first proposed a screening method called the sure independence screening approach for Gaussian response and predictors under linear regressions. Since then, the sure screening property, under which all the important predictors are retained with high probability as the sample size goes to infinity, has been regarded as a feature screening criterion. Many screening procedures have been developed for diverse models, such as the generalized linear model¹⁷ and the additive model¹⁸, among others. Although many procedures can be directly applied to GWAS with corresponding models and data types, only PC-SIS, proposed in the work of Huang et al.¹⁹, is applicable to the situation considered here, where both the outcome and the predictors are categorical. However, PC-SIS does not take information on the genetic model into consideration. As mentioned above, CATTs and the MAX test use this information in the association analysis, but their screening properties have not been studied yet. To fill this gap, we propose feature screening methods based on CATTs under different genetic models and on the MAX test, and investigate their sure screening and ranking consistency properties.

The rest of the paper is organised as follows. In “Trend test”, we briefly describe the trend tests which can be used to evaluate the relationship between a binary variable and a genotype variable. “Independence screening procedure” introduces the independence screening procedures based on the adjusted trend test statistics, and presents their sure screening and ranking consistency properties. Simulation studies are conducted in “Simulation studies”, and a case study on type 1 diabetes is presented in “Application to a real dataset”. A conclusion for this work is presented in “Conclusion”. All proofs of theorems are provided in the Supplemental Materials.

Trend test

CATT evaluates the association between a binary variable and a SNP, and is widely used in case-control genetic data analysis. Compared with Pearson chi-square test, it makes use of the underlying genetic model. Its specific form is as follows. Suppose r cases and s controls are enrolled in the study. For a given SNP, the genotypes can be expressed as aa, Aa and AA, respectively, with A being a high risk candidate allele. In the sample of cases, the counts of aa, Aa and AA are $r_{0}, r_{1}$ and $r_{2}$ , respectively. And the corresponding counts in the control samples are $s_{0}, s_{1}$ and $s_{2}$ . Thus we have $r = r_{0} + r_{1} + r_{2}, s = s_{0} + s_{1} + s_{2}$ . Denote $n = r + s$ and $n_{i} = r_{i} + s_{i}$ for $i = 0, 1, 2$ . All these counts are displayed in Table 1. Then CATT can be written as

\begin{matrix} Z = \frac{\sqrt{n} \sum_{i = 0}^{2} X_{i} (s r_{i} - r s_{i})}{\sqrt{r s [n \sum_{i = 0}^{2} X_{i}^{2} n_{i} - {(\sum_{i = 0}^{2} X_{i} n_{i})}^{2}]}}, \end{matrix}

where $(X_{0}, X_{1}, X_{2})$ is a pre-defined genotype score vector. Note that the optimal score vector for CATT varies across different genetic models. Specifically, for the commonly encountered recessive genetic model, additive genetic model and dominant genetic model, the optimal genotype score vectors are (0, 0, 1), $(0, \frac{1}{2}, 1)$ and (0, 1, 1), respectively. And the respective corresponding CATT can be denoted as $Z_{0}, Z_{\frac{1}{2}}$ and $Z_{1}$ . Under the null hypothesis of no association, these three CATTs above are asymptotically normally distributed as N(0, 1).
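As a rough illustration of the formula above, the following Python sketch evaluates Z for a single 2×3 genotype table under the three score vectors. The function name `catt`, the `SCORES` dictionary and the example counts are illustrative assumptions, not part of the original study.

```python
import math

# Optimal genotype scores (X_0, X_1, X_2) quoted in the text.
SCORES = {
    "recessive": (0.0, 0.0, 1.0),
    "additive": (0.0, 0.5, 1.0),
    "dominant": (0.0, 1.0, 1.0),
}

def catt(case_counts, control_counts, scores):
    """Cochran-Armitage trend statistic Z for one SNP.

    case_counts = (r_0, r_1, r_2), control_counts = (s_0, s_1, s_2).
    """
    r = sum(case_counts)
    s = sum(control_counts)
    n = r + s
    totals = [ri + si for ri, si in zip(case_counts, control_counts)]
    numerator = math.sqrt(n) * sum(
        x * (s * ri - r * si)
        for x, ri, si in zip(scores, case_counts, control_counts)
    )
    sum_x2 = sum(x * x * ni for x, ni in zip(scores, totals))
    sum_x = sum(x * ni for x, ni in zip(scores, totals))
    variance_term = r * s * (n * sum_x2 - sum_x ** 2)
    if variance_term <= 0:
        raise ValueError("degenerate table: the scored genotype does not vary")
    return numerator / math.sqrt(variance_term)

cases = (30, 50, 20)      # made-up counts for aa, Aa, AA
controls = (50, 40, 10)
for model, scores in SCORES.items():
    print(model, round(catt(cases, controls, scores), 3))
```

When the genotype proportions are identical in cases and controls, every term $s r_{i} - r s_{i}$ vanishes and the sketch returns 0, in line with the null behaviour described above.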

Table 1.

Genotype distribution in sample.

	aa	Aa	AA	Total
Cases	$r_{0}$	$r_{1}$	$r_{2}$	r
Controls	$s_{0}$	$s_{1}$	$s_{2}$	s
Total	$n_{0}$	$n_{1}$	$n_{2}$	n


However, in practice, the true genetic model is unknown. Thus none of $Z_{0}, Z_{\frac{1}{2}}$ and $Z_{1}$ is robust in all situations. To tackle this issue, the statistic MAX is proposed as

\begin{matrix} Z_{\max} = \max \{ | Z_{0} |, | Z_{\frac{1}{2}} |, | Z_{1} | \} . \end{matrix}

By using the maximum of absolute values of $Z_{0}, Z_{\frac{1}{2}}$ and $Z_{1}$ , $Z_{\max}$ obtains robustness under diverse situations.
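A minimal sketch of the MAX statistic follows, assuming the CATT formula given earlier. The helper names `catt` and `max_statistic` and the example counts are hypothetical and serve only to show how the three trend statistics are combined.

```python
import math

# Score vectors for the recessive, additive and dominant models.
SCORE_VECTORS = ((0.0, 0.0, 1.0), (0.0, 0.5, 1.0), (0.0, 1.0, 1.0))

def catt(case_counts, control_counts, scores):
    """CATT statistic Z for one SNP and one score vector."""
    r, s = sum(case_counts), sum(control_counts)
    n = r + s
    totals = [ri + si for ri, si in zip(case_counts, control_counts)]
    numerator = math.sqrt(n) * sum(
        x * (s * ri - r * si)
        for x, ri, si in zip(scores, case_counts, control_counts)
    )
    sum_x2 = sum(x * x * ni for x, ni in zip(scores, totals))
    sum_x = sum(x * ni for x, ni in zip(scores, totals))
    return numerator / math.sqrt(r * s * (n * sum_x2 - sum_x ** 2))

def max_statistic(case_counts, control_counts):
    """Z_max: the largest absolute CATT over the three genetic models."""
    return max(
        abs(catt(case_counts, control_counts, scores))
        for scores in SCORE_VECTORS
    )

cases = (30, 50, 20)      # made-up counts for aa, Aa, AA
controls = (50, 40, 10)
print(round(max_statistic(cases, controls), 3))
```

For these made-up counts the additive score vector attains the maximum, so here $Z_{\max}$ coincides with $| Z_{\frac{1}{2}} |$.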

Independence screening procedure

Screening procedure

CATTs and the MAX test are designed for testing the relationship between a binary response and a SNP variable. We apply them to the feature screening task and establish their properties.

Suppose $G = {(G_{1}, G_{2}, \dots, G_{m})}^{⊤}$ is an m-dimensional SNP vector and Y is a binary response which is 1 for a case sample and 0 for a control sample. Denote $P (Y = 1) = p$ and $P (Y = 0) = q$ , where $p + q = 1$ . Our aim is to identify the SNPs among all the m SNPs that are related to Y. In accordance with practice, each SNP takes a value in $\{ 0, 1, 2 \}$ , corresponding to genotypes aa, Aa and AA, respectively.

For the kth $(k = 1, 2, \dots, m)$ predictor $G_{k}$ , we set probabilities for case population as $p_{ik} = P (G_{k} = i, Y = 1), i = 0, 1, 2$ and those for control population as $q_{ik} = P (G_{k} = i, Y = 0), i = 0, 1, 2$ , which are displayed in Table 2. Note that $p_{0 k} + p_{1 k} + p_{2 k} = p$ and $q_{0 k} + q_{1 k} + q_{2 k} = q$ for each k in ${1, 2, \dots, m}$ . Denote $f_{ik} = p_{ik} + q_{ik}, i = 0, 1, 2, k = 1, 2, \dots, m$ . Then $f_{0 k} + f_{1 k} + f_{2 k} = 1, k = 1, 2, \dots, m$ .

Table 2.

Genotype distribution in population.

	$G_{k} = 0$	$G_{k} = 1$	$G_{k} = 2$	Total
$Y = 1$	$p_{0 k}$	$p_{1 k}$	$p_{2 k}$	p
$Y = 0$	$q_{0 k}$	$q_{1 k}$	$q_{2 k}$	q
Total	$f_{0 k}$	$f_{1 k}$	$f_{2 k}$	1


Denote the pre-defined score vectors for the recessive, additive and dominant genetic model as $(X_{0, 0}, X_{1, 0}, X_{2, 0}) = (0, 0, 1)$ , $(X_{0, \frac{1}{2}}, X_{1, \frac{1}{2}}, X_{2, \frac{1}{2}}) = (0, \frac{1}{2}, 1),$ and $(X_{0, 1}, X_{1, 1}, X_{2, 1}) = (0, 1, 1)$ , respectively. Then define four measures for the association relationship between $G_{k} (k = 1, 2, \dots, m)$ and Y as

\begin{matrix} \begin{matrix} ω_{j, k} = \frac{\sum_{i = 0}^{2} X_{i, j} (q p_{ik} - p q_{ik})}{\sqrt{p q [\sum_{i = 0}^{2} X_{i, j}^{2} f_{ik} - {(\sum_{i = 0}^{2} X_{i, j} f_{ik})}^{2}]}}, j = 0, \frac{1}{2}, 1 ; k = 1, 2, \dots, m, \end{matrix} \end{matrix}

and

\begin{matrix} \begin{matrix} ν_{k} = max {| ω_{0, k} |, | ω_{\frac{1}{2}, k} |, | ω_{1, k} |}, k = 1, 2, \dots, m . \end{matrix} \end{matrix}

It is obvious that when $G_{k} (k = 1, 2, \dots, m)$ is independent of Y, $ω_{j, k} = 0 (j = 0, \frac{1}{2}, 1)$ and $ν_{k} = 0$ .
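The population measures can be evaluated directly from a table of joint probabilities. The sketch below is illustrative only: the function names `omega` and `nu` and both probability tables are assumptions chosen to show that the measures vanish under independence and are positive otherwise.

```python
import math

# Score vectors indexed by j = 0 (recessive), 1/2 (additive), 1 (dominant).
SCORES = {0: (0.0, 0.0, 1.0), 0.5: (0.0, 0.5, 1.0), 1: (0.0, 1.0, 1.0)}

def omega(p_k, q_k, scores):
    """omega_{j,k} from joint probabilities.

    p_k = (p_0k, p_1k, p_2k) with Y = 1, q_k = (q_0k, q_1k, q_2k) with Y = 0.
    """
    p, q = sum(p_k), sum(q_k)
    f_k = [a + b for a, b in zip(p_k, q_k)]
    numerator = sum(x * (q * a - p * b) for x, a, b in zip(scores, p_k, q_k))
    second_moment = sum(x * x * f for x, f in zip(scores, f_k))
    first_moment = sum(x * f for x, f in zip(scores, f_k))
    return numerator / math.sqrt(p * q * (second_moment - first_moment ** 2))

def nu(p_k, q_k):
    """nu_k: the largest absolute omega over the three score vectors."""
    return max(abs(omega(p_k, q_k, scores)) for scores in SCORES.values())

# Independent case: the joint probabilities factorise (made-up frequencies).
genotype_freq = (0.49, 0.42, 0.09)
p = 0.4
independent_p = tuple(p * g for g in genotype_freq)
independent_q = tuple((1 - p) * g for g in genotype_freq)
print(nu(independent_p, independent_q))   # zero up to rounding error

# Associated case: AA is over-represented among cases (made-up values).
print(nu((0.15, 0.18, 0.07), (0.34, 0.22, 0.04)))
```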

For $k \in {1, 2, \dots, m}$ , let ${(g_{lk}, y_{l}), l = 1, 2, \dots, n}$ be n pairs of observations of $(G_{k}, Y)$ . Denote $r_{k} = {(r_{0 k}, r_{1 k}, r_{2 k})}^{⊤}, s_{k} = {(s_{0 k}, s_{1 k}, s_{2 k})}^{⊤},$ where $r_{ik} (i = 0, 1, 2)$ are the counts of each genotype in case sample and $s_{ik} (i = 0, 1, 2)$ are the counts in control sample. Notice that $r_{0 k} + r_{1 k} + r_{2 k} = r$ and $s_{0 k} + s_{1 k} + s_{2 k} = s$ . Denote $n_{ik} = r_{ik} + s_{ik}, i = 0, 1, 2$ , then we have $n_{0 k} + n_{1 k} + n_{2 k} = n$ .

Given the above notations, the empirical estimators of $ω_{0, k}, ω_{\frac{1}{2}, k}, ω_{1, k}$ , and $ν_{k}$ for $k \in {1, 2, \dots, m}$ are

\begin{matrix} \begin{matrix} {\hat{ω}}_{j, k} = \frac{\sum_{i = 0}^{2} X_{i, j} (\hat{q} {\hat{p}}_{ik} - \hat{p} {\hat{q}}_{ik})}{\sqrt{\hat{p} \hat{q} [\sum_{i = 0}^{2} X_{i, j}^{2} {\hat{f}}_{ik} - {(\sum_{i = 0}^{2} X_{i, j} {\hat{f}}_{ik})}^{2}]}}, j = 0, \frac{1}{2}, 1, \end{matrix} \end{matrix}

and

\begin{matrix} \begin{matrix} {\hat{ν}}_{k} = max {| {\hat{ω}}_{0, k} |, | {\hat{ω}}_{\frac{1}{2}, k} |, | {\hat{ω}}_{1, k} |}, \end{matrix} \end{matrix}

where ${\hat{p}}_{ik}, {\hat{q}}_{ik}, \hat{p}, \hat{q}, {\hat{f}}_{ik}$ are the empirical estimators of $p_{ik}, q_{ik}, p, q, f_{ik}$ , and can be estimated as

\begin{matrix} \begin{matrix} {\hat{p}}_{ik} & = & \frac{1}{n} \sum_{l = 1}^{n} I (G_{lk} = i, Y_{l} = 1) = \frac{r_{ik}}{n} \\ {\hat{q}}_{ik} & = & \frac{1}{n} \sum_{l = 1}^{n} I (G_{lk} = i, Y_{l} = 0) = \frac{s_{ik}}{n}, \\ \hat{p} & = & \frac{1}{n} \sum_{l = 1}^{n} I (Y_{l} = 1) = \frac{r}{n}, \\ \hat{q} & = & \frac{1}{n} \sum_{l = 1}^{n} I (Y_{l} = 0) = \frac{s}{n}, \\ {\hat{f}}_{ik} & = & \frac{1}{n} \sum_{l = 1}^{n} I (G_{lk} = i) = \frac{n_{ik}}{n} . \end{matrix} \end{matrix}

Plugging these estimators into the expression above, ${\hat{ω}}_{j, k}$ takes the form

\begin{matrix} {\hat{ω}}_{j, k} = \frac{\sum_{i = 0}^{2} X_{i, j} (s r_{ik} - r s_{ik})}{\sqrt{r s [n \sum_{i = 0}^{2} X_{i, j}^{2} n_{ik} - {(\sum_{i = 0}^{2} X_{i, j} n_{ik})}^{2}]}} . \end{matrix}

Note that ${\hat{ω}}_{j, k} = \frac{Z_{j, k}}{\sqrt{n}}$ , where $Z_{0, k}, Z_{\frac{1}{2}, k}$ and $Z_{1, k}$ are the CATT statistics between $G_{k}$ and Y for the pre-defined score vector $(X_{0}, X_{1}, X_{2})$ being (0, 0, 1), $(0, \frac{1}{2}, 1)$ and (0, 1, 1), respectively. So ${\hat{ω}}_{j, k}$ is an adjusted version of $Z_{j, k}$ whose range is not affected by the sample size. Likewise, ${\hat{ν}}_{k} = Z_{\max, k} / \sqrt{n}$ preserves the ranking of $Z_{\max, k}$ across predictors. Large values of ${\hat{ν}}_{k}$ indicate the existence of an association between $G_{k}$ and Y. We denote ${\hat{ω}}_{j, k}$ as aCATT and ${\hat{ν}}_{k}$ as aMAX.
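The relation between the adjusted statistic and the CATT statistic can be checked numerically. In the plug-in form, both the numerator and the square root in the denominator carry a factor $1 / n^{2}$ , which cancels and leaves the count form. The sketch below, with hypothetical function names and made-up counts, computes the plug-in estimator and compares it with $Z / \sqrt{n}$ .

```python
import math

def catt(case_counts, control_counts, scores):
    """CATT statistic Z computed from genotype counts."""
    r, s = sum(case_counts), sum(control_counts)
    n = r + s
    totals = [ri + si for ri, si in zip(case_counts, control_counts)]
    numerator = math.sqrt(n) * sum(
        x * (s * ri - r * si)
        for x, ri, si in zip(scores, case_counts, control_counts)
    )
    sum_x2 = sum(x * x * ni for x, ni in zip(scores, totals))
    sum_x = sum(x * ni for x, ni in zip(scores, totals))
    return numerator / math.sqrt(r * s * (n * sum_x2 - sum_x ** 2))

def acatt_plugin(case_counts, control_counts, scores):
    """aCATT computed from the empirical probabilities (plug-in form)."""
    n = sum(case_counts) + sum(control_counts)
    p_hat_i = [ri / n for ri in case_counts]
    q_hat_i = [si / n for si in control_counts]
    p_hat, q_hat = sum(p_hat_i), sum(q_hat_i)
    f_hat = [a + b for a, b in zip(p_hat_i, q_hat_i)]
    numerator = sum(
        x * (q_hat * a - p_hat * b)
        for x, a, b in zip(scores, p_hat_i, q_hat_i)
    )
    second_moment = sum(x * x * f for x, f in zip(scores, f_hat))
    first_moment = sum(x * f for x, f in zip(scores, f_hat))
    return numerator / math.sqrt(
        p_hat * q_hat * (second_moment - first_moment ** 2)
    )

cases, controls = (120, 260, 120), (400, 480, 120)   # made-up counts
n = sum(cases) + sum(controls)
for scores in ((0.0, 0.0, 1.0), (0.0, 0.5, 1.0), (0.0, 1.0, 1.0)):
    adjusted = acatt_plugin(cases, controls, scores)
    rescaled = catt(cases, controls, scores) / math.sqrt(n)
    print(scores, round(adjusted, 6), round(rescaled, 6))
```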

Assume that only a small fraction of the SNPs are related to the response Y. We use the aCATT statistics $| {\hat{ω}}_{j, k} |$ and the aMAX statistics ${\hat{ν}}_{k}$ to identify them. The screening procedures based on $| {\hat{ω}}_{0, k} |$ , $| {\hat{ω}}_{\frac{1}{2}, k} |$ , $| {\hat{ω}}_{1, k} |$ and ${\hat{ν}}_{k}$ are named REC-SIS, ADD-SIS, DOM-SIS and MAX-SIS, respectively, and REC-SIS, ADD-SIS and DOM-SIS are collectively called CATT-SIS.
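To make the four procedures concrete, the following sketch computes the screening utilities for a handful of SNPs and ranks them. It is a minimal illustration under stated assumptions: the function names, the three made-up genotype tables, and the choice to return 0 for a degenerate table are not taken from the paper.

```python
import math

SCORE_VECTORS = {
    "REC-SIS": (0.0, 0.0, 1.0),
    "ADD-SIS": (0.0, 0.5, 1.0),
    "DOM-SIS": (0.0, 1.0, 1.0),
}

def acatt(case_counts, control_counts, scores):
    """Adjusted CATT (count form) for one SNP and one score vector."""
    r, s = sum(case_counts), sum(control_counts)
    n = r + s
    totals = [ri + si for ri, si in zip(case_counts, control_counts)]
    numerator = sum(
        x * (s * ri - r * si)
        for x, ri, si in zip(scores, case_counts, control_counts)
    )
    sum_x2 = sum(x * x * ni for x, ni in zip(scores, totals))
    sum_x = sum(x * ni for x, ni in zip(scores, totals))
    spread = n * sum_x2 - sum_x ** 2
    if spread <= 0:
        return 0.0   # assumption: a degenerate table carries no signal
    return numerator / math.sqrt(r * s * spread)

def screening_utilities(tables):
    """tables: list of (case_counts, control_counts), one entry per SNP."""
    utilities = {name: [] for name in SCORE_VECTORS}
    utilities["MAX-SIS"] = []
    for case_counts, control_counts in tables:
        values = [
            abs(acatt(case_counts, control_counts, scores))
            for scores in SCORE_VECTORS.values()
        ]
        for name, value in zip(SCORE_VECTORS, values):
            utilities[name].append(value)
        utilities["MAX-SIS"].append(max(values))
    return utilities

tables = [
    ((30, 50, 20), (50, 40, 10)),
    ((49, 42, 9), (98, 84, 18)),    # identical proportions in both groups
    ((40, 45, 15), (45, 43, 12)),
]
utilities = screening_utilities(tables)
for name, values in utilities.items():
    ranking = sorted(range(len(values)), key=lambda k: values[k], reverse=True)
    print(name, [round(v, 3) for v in values], ranking)
```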

Screening properties

We call a SNP an active SNP if it is associated with the response Y. Define the index sets of active SNPs based on the different measures by

\begin{matrix} A_{j}^{*} = \{ 1 \leq k \leq m : | ω_{j, k} | > 0 \}, j = 0, \frac{1}{2}, 1, \end{matrix}

\begin{matrix} A^{*} = \{ 1 \leq k \leq m : ν_{k} > 0 \} . \end{matrix}

Their estimated truncated active index sets can be expressed as

\begin{matrix} {\hat{A}}_{j}^{*} = \{ 1 \leq k \leq m : | {\hat{ω}}_{j, k} | \geq c_{0} n^{- τ} \}, j = 0, \frac{1}{2}, 1, \end{matrix}

\begin{matrix} {\hat{A}}^{*} = \{ 1 \leq k \leq m : {\hat{ν}}_{k} \geq c_{0} n^{- τ} \}, \end{matrix}

where $c_{0} > 0$ and $0 \leq τ < \frac{1}{2}$ are two pre-specified constants, as required by the conditions below.

Now we investigate the theoretical properties of the screening procedures based on ${\hat{A}}_{j}^{*}$ and ${\hat{A}}^{*}$ . We first list some conditions.

Condition 1

(C1): There exist constants $0 < ζ_{\min} \leq ζ_{\max} < 1$ such that for $i = 0, 1, 2$ and $k = 1, 2, \dots, m$ , if $p_{ik} \neq 0 (q_{ik} \neq 0)$ , then $p_{ik} \in (ζ_{\min}, ζ_{\max}) (q_{ik} \in (ζ_{\min}, ζ_{\max}))$ .
(C2): $min_{k \in A_{j}^{*}} ω_{j, k} \geq 2 c_{0} n^{- τ}$ for $j = 0, \frac{1}{2}, 1$ , where constant $c_{0} > 0$ and $0 \leq τ < \frac{1}{2}$ .
(C3): $min_{k \in A^{*}} ν_{k} \geq 2 c_{0} n^{- τ}$ , where constant $c_{0} > 0$ and $0 \leq τ < \frac{1}{2}$ .
(C4): For given constants $c_{0} > 0, 0 \leq τ < \frac{1}{2},$ and $log (m) = o (n^{1 - 2 τ} \land n^{\frac{1}{2}})$ where $a \land b = min {a, b}$ , $\underset{m \to \infty}{lim inf} (min_{k \in A_{j}^{*}} ω_{j, k} - max_{k \notin A_{j}^{*}} ω_{j, k}) > 2 c_{0} n^{- τ}$ for $j = 0, \frac{1}{2}, 1$ .
(C5): For given constants $c_{0} > 0, 0 \leq τ < \frac{1}{2},$ and $log (m) = o (n^{1 - 2 τ} \land n^{\frac{1}{2}})$ where $a \land b = min {a, b}$ , $\underset{m \to \infty}{lim inf} (min_{k \in A^{*}} ν_{k} - max_{k \notin A^{*}} ν_{k}) > 2 c_{0} n^{- τ}$ .

Then we present the sure screening properties based on aCATT and aMAX in Theorem 1 and 2, whose proofs are shown in Supplemental Materials.

Theorem 1

(Sure Screening Property of CATT-SIS):

(i) If Condition (C1) holds, then for $j = 0, \frac{1}{2}$ and 1 we have
$\begin{matrix} P (\max_{1 \leq k \leq m} | {\hat{ω}}_{j, k} - ω_{j, k} | \geq c_{0} n^{- τ}) < O (m \exp \{ - c_{1} n^{1 - 2 τ} - c_{2} n^{\frac{1}{2}} \}), \end{matrix}$ (13)
with $c_{1} > 0$ and $c_{2} > 0$ being two constants.
(ii) Furthermore, if both Conditions (C1) and (C2) are satisfied, for $j = 0, \frac{1}{2}$ and 1 we obtain that
$\begin{matrix} P (A_{j}^{*} \subseteq {\hat{A}}_{j}^{*}) \geq 1 - O (κ \exp \{ - c_{1} n^{1 - 2 τ} - c_{2} n^{\frac{1}{2}} \}), \end{matrix}$ (14)
where $κ$ is the cardinality of $A_{j}^{*}$ , and $c_{1}, c_{2} > 0$ are the same as those in inequality (13).

Theorem 2

(Sure Screening Property for MAX-SIS):

(i) If Condition (C1) holds, then we have
$\begin{matrix} P (\max_{1 \leq k \leq m} | {\hat{ν}}_{k} - ν_{k} | \geq c_{0} n^{- τ}) < O (m \exp \{ - c_{3} n^{1 - 2 τ} - c_{4} n^{\frac{1}{2}} \}), \end{matrix}$ (15)
where $c_{3} > 0$ and $c_{4} > 0$ are two constants.
(ii) Furthermore, if both Conditions (C1) and (C3) are satisfied, we have that
$\begin{matrix} P (A^{*} \subseteq {\hat{A}}^{*}) \geq 1 - O (κ \exp \{ - c_{3} n^{1 - 2 τ} - c_{4} n^{\frac{1}{2}} \}), \end{matrix}$ (16)
where $κ$ is the cardinality of $A^{*}$ , and $c_{3}, c_{4} > 0$ are the same as those in inequality (15).

Theorems 1 and 2 show that the screening procedures retain all the active SNPs with high probability. They also possess the ranking consistency property, as shown below.

Theorem 3

(Ranking Consistency Property for CATT-SIS): Suppose Conditions (C1) and (C4) are satisfied, then for $j = 0, \frac{1}{2}$ and 1, it follows that

\begin{matrix} \liminf_{n \to \infty} \{ \min_{k \in A_{j}^{*}} | {\hat{ω}}_{j, k} | - \max_{k \notin A_{j}^{*}} | {\hat{ω}}_{j, k} | \} \geq 0, \text{a.s.} \end{matrix}

Theorem 4

(Ranking Consistency Property for MAX-SIS): Suppose Conditions (C1) and (C5) are satisfied, then it follows that

\begin{matrix} \liminf_{n \to \infty} \{ \min_{k \in A^{*}} {\hat{ν}}_{k} - \max_{k \notin A^{*}} {\hat{ν}}_{k} \} \geq 0, \text{a.s.} \end{matrix}

In practice, it is hard to determine $c_{0}$ and $τ$ such that the estimated truncated active index sets contain the corresponding active index sets. So it is common to select the SNPs with the d largest statistic values as related SNPs, where d is a pre-defined constant. That is, the respective estimated active index sets have the following forms

\begin{matrix} {\hat{A}}_{j, d}^{*} = \{ 1 \leq k \leq m : | {\hat{ω}}_{j, k} | \text{ is among the first } d \text{ largest statistics} \}, \end{matrix}

and

\begin{matrix} {\hat{A}}_{d}^{*} = \{ 1 \leq k \leq m : {\hat{ν}}_{k} \text{ is among the first } d \text{ largest statistics} \} . \end{matrix}

We now explain why the index sets corresponding to the d largest statistics are used as estimated active index sets. Take MAX-SIS for example. Given $c_{0}$ and $τ$ , the cardinality of ${\hat{A}}^{*}$ is determined, which is denoted as $d_{0}$ . According to Theorem 4, MAX-SIS possesses the ranking consistency property. Provided Conditions (C1) and (C5) are satisfied, we have ${\hat{A}}^{*} \subseteq {\hat{A}}_{d}^{*}$ if $d \geq d_{0}$ . Hence, whenever ${\hat{A}}^{*}$ contains all the active predictors, so does ${\hat{A}}_{d}^{*}$ . Note that $P (A^{*} \subseteq {\hat{A}}_{d}^{*})$ is nondecreasing in d. As long as $d \geq d_{0}$ , we have $P (A^{*} \subseteq {\hat{A}}_{d}^{*}) \geq P (A^{*} \subseteq {\hat{A}}^{*}) \geq 1 - O (κ \exp \{ - c_{3} n^{1 - 2 τ} - c_{4} n^{\frac{1}{2}} \})$ based on Theorem 2 (ii). Therefore, estimating the active index set by the index set corresponding to the d largest statistics is reasonable.
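The two selection rules discussed above can be compared on a toy vector of utilities. The sketch below is illustrative: the function names, the utility values and the constants $c_{0} = 1$ and $τ = 0.4$ are assumptions, while the default $d = [n / log n]$ follows the choice used in the simulations of this paper.

```python
import math

def threshold_indices(utilities, c0, tau, n):
    """Indices whose utility is at least c0 * n^(-tau)."""
    cutoff = c0 * n ** (-tau)
    return {k for k, value in enumerate(utilities) if value >= cutoff}

def top_d_indices(utilities, d):
    """Indices of the d largest utilities (ties broken by index order)."""
    order = sorted(
        range(len(utilities)), key=lambda k: utilities[k], reverse=True
    )
    return set(order[:d])

def default_d(n):
    """Integer part of n / log(n)."""
    return int(n / math.log(n))

utilities = [0.30, 0.02, 0.25, 0.01, 0.05]   # made-up aMAX values
n = 1000
by_threshold = threshold_indices(utilities, c0=1.0, tau=0.4, n=n)
d0 = len(by_threshold)
print(by_threshold, top_d_indices(utilities, d0), top_d_indices(utilities, d0 + 1))
print(default_d(1500), default_d(4901))
```

In this toy example the thresholded set has $d_{0} = 2$ elements and is contained in the top-d set for every $d \geq d_{0}$ , which is the inclusion used in the argument above.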

Simulation studies

In this section, we conduct simulation studies to assess the performances of REC-SIS, ADD-SIS, DOM-SIS and MAX-SIS by comparing with PC-SIS¹⁹.

For each genetic model, the dimension of SNPs is $m = 10^{5}$ . Since the sample size, the case-to-control ratio and the minor allele frequency (MAF)²⁰ can affect the association analysis in a case-control study, we consider different settings for them. To be specific, we choose the sample size n from $\{ 1500, 3000, 4500 \}$ , the case-to-control ratio $w = p : q$ from $\{ 1, 1 / 3, 1 / 5 \}$ and the MAF $α$ from $\{ 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45 \}$ . Because only the genotype counts are needed to calculate the statistics of interest, there is no need to generate the original samples $\{ (g_{l}, y_{l}), l = 1, 2, \dots, n \}$ in the simulation studies. Instead, we generate the count data directly from trinomial distributions for each dataset. For the kth genetic variant (SNP), the count vector of the three genotypes for case samples $(r_{0 k}, r_{1 k}, r_{2 k})$ follows the trinomial distribution $Mul (n p, p_{0 k} / p, p_{1 k} / p, p_{2 k} / p)$ and that for control samples $(s_{0 k}, s_{1 k}, s_{2 k})$ follows the trinomial distribution $Mul (n q, q_{0 k} / q, q_{1 k} / q, q_{2 k} / q)$ , where $p_{0 k} + p_{1 k} + p_{2 k} = p, q_{0 k} + q_{1 k} + q_{2 k} = q$ .

In each dataset, the first six SNPs are set to be related to Y and the remaining SNPs are independent of Y. For the control sample, the count vector $(s_{0 k}, s_{1 k}, s_{2 k})$ of each SNP $G_{k}$ $(k \in \{ 1, 2, \dots, 10^{5} \})$ is generated from the trinomial distribution $Mul (n q, q_{0 k} / q, q_{1 k} / q, q_{2 k} / q)$ , where $q_{0 k} = q {(1 - α)}^{2}, q_{1 k} = 2 q α (1 - α), q_{2 k} = q α^{2}$ with $α$ being the MAF. For the case sample, the count vector $(r_{0 k}, r_{1 k}, r_{2 k})$ of each irrelevant SNP $G_{k}$ $(k \in \{ 7, 8, \dots, 10^{5} \})$ is generated from $Mul (n p, p_{0 k} / p, p_{1 k} / p, p_{2 k} / p)$ with $p_{ik} / p = q_{ik} / q, i = 0, 1, 2$ , while the count vector $(r_{0 k}, r_{1 k}, r_{2 k})$ of each relevant SNP $G_{k}$ $(k \in \{ 1, 2, \dots, 6 \})$ is generated from the trinomial distribution $Mul (n p, p_{0 k} / p, p_{1 k} / p, p_{2 k} / p)$ , where $(p_{0 k}, p_{1 k}, p_{2 k})$ are functions of $(q_{0 k}, q_{1 k}, q_{2 k})$ that differ across genetic models. Four genetic models are considered, namely the recessive genetic model, the additive genetic model, the dominant genetic model and a mixture of them, which are denoted as Model I, Model II, Model III and Model IV, respectively, and are specified below.

Under each genetic model, 500 repetitions are conducted to compare the performances of the different methods. We employ two criteria to measure the effectiveness of each screening approach. One is the proportion of the 500 repetitions in which a given relevant SNP $G_{k}$ is selected, denoted as $P_{s}^{k}$ . The other is the proportion of the 500 repetitions in which all the relevant SNPs are selected simultaneously, denoted as $P_{a}$ .
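Both criteria are simple proportions over repetitions. The following sketch shows one way to compute them, assuming each repetition is summarised by the set of selected SNP indices. The function name `selection_proportions` and the three toy repetitions are hypothetical.

```python
def selection_proportions(selected_sets, relevant):
    """Return (P_s, P_a) from a list of selected index sets.

    selected_sets: one set of selected SNP indices per repetition.
    relevant: indices of the truly relevant SNPs.
    """
    repetitions = len(selected_sets)
    # P_s^k: share of repetitions in which relevant SNP k is selected.
    p_s = {
        k: sum(k in selected for selected in selected_sets) / repetitions
        for k in relevant
    }
    # P_a: share of repetitions in which every relevant SNP is selected.
    p_a = sum(
        all(k in selected for k in relevant) for selected in selected_sets
    ) / repetitions
    return p_s, p_a

# Three toy repetitions with relevant SNPs 1 and 2 (made-up selections).
selected_sets = [{1, 2, 17}, {1, 40}, {1, 2, 3}]
p_s, p_a = selection_proportions(selected_sets, relevant=(1, 2))
print(p_s, p_a)
```

By construction $P_{a}$ can never exceed the smallest $P_{s}^{k}$ , which is why it is the more demanding of the two criteria.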

Model I.
Data are generated from the recessive genetic model. For the relevant SNPs $G_{k}, (k = 1, 2, \dots, 6)$ , $p_{0 k} = \frac{p q_{0 k}}{q_{0 k} + q_{1 k} + λ q_{2 k}}, p_{1 k} = \frac{p q_{1 k}}{q_{0 k} + q_{1 k} + λ q_{2 k}}, p_{2 k} = \frac{p λ q_{2 k}}{q_{0 k} + q_{1 k} + λ q_{2 k}}$ , with $λ = 1.8$ .
Model II.
Data are generated from the additive genetic model. For the relevant SNPs $G_{k}, (k = 1, 2, \dots, 6)$ , $p_{0 k} = \frac{p q_{0 k}}{q_{0 k} + λ q_{1 k} + (2 λ - 1) q_{2 k}}, p_{1 k} = \frac{p λ q_{1 k}}{q_{0 k} + λ q_{1 k} + (2 λ - 1) q_{2 k}}, p_{2 k} = \frac{p (2 λ - 1) q_{2 k}}{q_{0 k} + λ q_{1 k} + (2 λ - 1) q_{2 k}}$ , with $λ = 1.4$ .
Model III.
Data are generated from the dominant genetic model. For the relevant SNPs $G_{k}, (k = 1, 2, \dots, 6)$ , $p_{0 k} = \frac{p q_{0 k}}{q_{0 k} + λ q_{1 k} + λ q_{2 k}}, p_{1 k} = \frac{p λ q_{1 k}}{q_{0 k} + λ q_{1 k} + λ q_{2 k}}, p_{2 k} = \frac{p λ q_{2 k}}{q_{0 k} + λ q_{1 k} + λ q_{2 k}}$ , with $λ = 1.6$ .
Model IV.
Data are generated from the mixture of three genetic models. Relevant SNPs $G_{1}$ and $G_{2}$ are generated as those in Model I, relevant SNPs $G_{3}$ and $G_{4}$ are generated as those in Model II and relevant SNPs $G_{5}$ and $G_{6}$ are generated as those in Model III.
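The data-generating scheme of Models I to III can be sketched as follows. This is a simplified illustration using only the Python standard library: the function names and the random seed are assumptions, a single relevant SNP is drawn rather than the full $10^{5}$ , and the case sample size is taken as $n p$ with $p : q$ equal to the chosen ratio.

```python
import random

def control_genotype_probs(maf):
    """Conditional genotype probabilities q_ik / q for (aa, Aa, AA)."""
    return ((1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2)

def model_weights(model, lam):
    """Relative weights applied to (q_0k, q_1k, q_2k) in Models I to III."""
    if model == "recessive":
        return (1.0, 1.0, lam)
    if model == "additive":
        return (1.0, lam, 2 * lam - 1)
    if model == "dominant":
        return (1.0, lam, lam)
    raise ValueError(f"unknown model: {model}")

def case_genotype_probs(maf, model, lam):
    """Conditional genotype probabilities p_ik / p for a relevant SNP."""
    base = control_genotype_probs(maf)
    weighted = [w * b for w, b in zip(model_weights(model, lam), base)]
    total = sum(weighted)
    return tuple(value / total for value in weighted)

def sample_counts(size, probs, rng):
    """One trinomial draw, returned as counts of (aa, Aa, AA)."""
    draws = rng.choices((0, 1, 2), weights=probs, k=size)
    return tuple(draws.count(g) for g in (0, 1, 2))

rng = random.Random(0)                      # arbitrary seed
n, ratio, maf = 3000, 1 / 3, 0.30           # one setting from the grid
n_cases = round(n * ratio / (1 + ratio))    # n * p
n_controls = n - n_cases                    # n * q
case_counts = sample_counts(
    n_cases, case_genotype_probs(maf, "recessive", 1.8), rng
)
control_counts = sample_counts(n_controls, control_genotype_probs(maf), rng)
print(case_counts, control_counts)
```

For an irrelevant SNP, the case counts would instead be drawn with `control_genotype_probs(maf)`, matching the condition $p_{ik} / p = q_{ik} / q$ .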

For each model, the proportions $P_{s}^{k}, k = 1, 2, \dots, 6$ and $P_{a}$ are calculated with the constant $d = [n / log n]$ , where [a] denotes the integer part of a. The results are plotted in Figs. 1, 2, 3 and 4. Since in Models I, II and III the six relevant SNPs are generated from the same distribution, $P_{s}^{k}, k = 1, 2, \dots, 6$ are similar in these models. Therefore, we only plot the results for $P_{s}^{1}$ in Figs. 1, 2 and 3. In Model IV, the relevant SNPs are generated from different genetic models, so the results for $P_{s}^{k}, k = 1, 2, \dots, 6$ are plotted in Fig. 4. Besides, the results for $P_{a}$ are plotted in Figs. 1, 2, 3 and 4.

Figure 1.

Selection proportions of different methods in Model I. The left subplot is for $P_{s}^{1}$ among 500 repetitions. The right subplot is for $P_{a}$ among 500 repetitions.

Figure 2.

Selection proportions of different methods in Model II. The left subplot is for $P_{s}^{1}$ among 500 repetitions. The right subplot is for $P_{a}$ among 500 repetitions.

Figure 3.

Selection proportions of different methods in Model III. The left subplot is for $P_{s}^{1}$ among 500 repetitions. The right subplot is for $P_{a}$ among 500 repetitions.

Figure 5.

The Venn diagram for the results of all the five procedures.

Results in Fig. 1 correspond to the recessive genetic model. REC-SIS performs the best, MAX-SIS comes second, and DOM-SIS is the worst. As Fig. 1 shows, the ability of all the screening approaches to detect $G_{1}$ increases as the sample size, the case-to-control ratio and the MAF increase. In addition, PC-SIS almost fails to detect the relevant SNPs when the MAF is less than 0.3.

The simulation results for Model II are presented in Fig. 2. When the underlying genetic model is exactly the additive genetic model, ADD-SIS performs best and MAX-SIS ranks second. As Fig. 2 displays, the ability of all the screening approaches to detect relevant SNPs increases as the sample size and the case-to-control ratio increase. REC-SIS has low power when the MAF is small. The detection proportions of DOM-SIS first increase and then decrease slightly as the MAF increases. In general, the detection proportions of MAX-SIS and ADD-SIS increase as the MAF becomes larger, whereas the detection proportions of PC-SIS first increase slightly and then decrease dramatically as the MAF increases.

The results for Model III are exhibited in Fig. 3. When the underlying genetic model is exactly the dominant genetic model, DOM-SIS performs the best and REC-SIS hardly works. As shown in Fig. 3, the ability of all the screening approaches to detect $G_{1}$ increases as the sample size and the case-to-control ratio increase. Furthermore, when the MAF is greater than 0.25, the detection proportions of all the methods except REC-SIS decline as the MAF increases. From the right subplot of Fig. 3, we can see that the performances of ADD-SIS and PC-SIS are greatly influenced by the MAF, while those of DOM-SIS and MAX-SIS are robust against it.

As for Model IV, since the effects of sample size and case-to-control ratio on the performances of the different methods have been demonstrated in the above three models, we take $n = 3000, w = 0.2$ as a representative setting to demonstrate the effects of different genetic models. The results for $P_{s}^{k}, k = 1, 2, \dots, 6$ when $n = 3000, w = 0.2$ are illustrated in the left subplot of Fig. 4 and the results for $P_{a}$ under all scenarios are shown in the right subplot of Fig. 4. Since the six relevant SNPs follow different genetic models, none of REC-SIS, ADD-SIS and DOM-SIS outperforms MAX-SIS and PC-SIS uniformly across all the relevant SNPs. Consistent with the results shown before, REC-SIS has the highest detection proportion for $G_{1}$ and $G_{2}$ , ADD-SIS has the highest detection proportion for $G_{3}$ and $G_{4}$ , and DOM-SIS has the highest detection proportion for $G_{5}$ and $G_{6}$ . In contrast, no matter what the underlying genetic relationship is, MAX-SIS always performs well. As for $P_{a}$ , MAX-SIS outperforms all the other methods significantly.

Figure 4.

Selection proportions of different methods in Model IV. The left subplot is for $P_{s}^{k}, k = 1, 2, \dots, 6$ among 500 repetitions, when sample size is 3000 and case-to-control ratio is 0.2. The right subplot is for $P_{a}$ among 500 repetitions.

From the simulation results above, we can see that the sample size and the case-to-control ratio are two important factors that affect the association analysis. It is reasonable that an increase in sample size enhances the efficiency of identifying associated SNPs. As for the case-to-control ratio, all the methods perform better when the ratio is close to 1 than when the design is more unbalanced. Given the size of the case sample, increasing the size of the control sample contributes little to the performance of the methods. For example, when $w = p : q = 1 / 3, n = 3000$ and $w = p : q = 1 / 5, n = 4500$ , that is, when the case sample size is $r = 750$ and the control sample size is $s = 2250$ and 3750 respectively, the selection proportions of all the five screening methods are similar no matter how the MAF varies. The effect of the MAF is not monotonic. In the recessive model, the selection proportions of all the five methods increase as the MAF increases. However, in the other models, the selection proportions of some methods first increase and later decrease as the MAF increases. Under all the scenarios considered, MAX-SIS is the most robust method among these five screening methods.

Overall, we conclude that if all the candidate SNPs follow the same known genetic model, one of REC-SIS, ADD-SIS and DOM-SIS performs the best. However, the genetic model is usually complicated and unknown in practice. In this case, MAX-SIS is recommended for its robustness and efficiency.

Application to a real dataset

We apply the proposed screening procedures to a real case-control dataset of type 1 diabetes in a British population¹. The data contain 459,446 SNPs for 2938 controls and 1963 cases. Since there are some missing values in the genotype data, the number of observed genotypes varies across SNPs. We count the number of missing values for each partially observed SNP: the average and the largest of these counts are 16.72 and 503, respectively, and the 25%, 50% and 75% quantiles are 4, 7 and 13, respectively. To make the aCATT and aMAX statistics have similar consistency rates for all the SNPs, SNPs with a missing ratio larger than 1% are deleted. Besides, SNPs with only two genotypes observed are also removed from the dataset, resulting in 352,659 SNPs to be analyzed. For each SNP, the allele with the lower frequency is treated as the risk allele. We use REC-SIS, ADD-SIS, DOM-SIS, MAX-SIS and PC-SIS to screen out the redundant SNPs, with the parameter d being $[4901 / log (4901)] = 576$ . The screening results of all the five procedures are displayed in the Venn diagram in Fig. 5. It shows that 242 SNPs are selected by all the procedures. Among these SNPs, SNPs rs9272346 and rs9272346 have been reported to be associated with type 1 diabetes¹. This indicates that these SNPs may contain important association information which needs to be further investigated. We list these 242 SNPs in Table 3.
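The pre-processing rules and the choice of d described above can be sketched in a few lines. The code below is a hypothetical illustration: the function names, the encoding of missing genotypes as `None`, and the reading that a SNP is kept only when all three genotypes are observed are assumptions, not details taken from the original analysis.

```python
import math

def passes_filters(genotypes, max_missing=0.01):
    """Decide whether one SNP is kept.

    genotypes: list with entries 0, 1, 2, or None for a missing value.
    """
    observed = [g for g in genotypes if g is not None]
    missing_ratio = 1 - len(observed) / len(genotypes)
    if missing_ratio > max_missing:
        return False
    # Assumption: keep the SNP only if all three genotypes are observed.
    return set(observed) == {0, 1, 2}

def screening_size(n):
    """Integer part of n / log(n), the number of SNPs retained."""
    return int(n / math.log(n))

n = 2938 + 1963                 # controls plus cases
print(n, screening_size(n))     # 4901 and 576, matching the text

complete = [0, 1, 2] * 100      # made-up genotype vectors
print(passes_filters(complete))
print(passes_filters(complete + [None] * 10))   # about 3.2% missing
print(passes_filters([0, 1] * 150))             # only two genotypes
```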

Table 3.

The 242 SNPs selected by all the five screening procedures.

rs1113523	rs203888	rs12527415	rs3130532	rs2523467	rs574710	rs9268645	rs7768538
rs1217200	rs9393881	rs11758688	rs2248880	rs2395034	rs539703	rs3135393	rs7453920
rs1230666	rs203884	rs3094123	rs6906846	rs3131631	rs3132959	rs1051336	rs6902723
rs1230658	rs1233708	rs3130649	rs2524067	rs2736177	rs926591	rs9268831	rs6903130
rs1230649	rs406511	rs3095350	rs7382297	rs1046089	rs3129900	rs9268877	rs9296044
rs6679677	rs2523443	rs6924600	rs2523537	rs760293	rs4959093	rs9270986	rs2857212
rs1217396	rs1611350	rs6457282	rs2523535	rs3130048	rs910050	rs615672	rs2071474
rs2488457	rs1736913	rs9295931	rs2523534	rs805301	rs910049	rs9271208	rs241427
rs9467704	rs2517646	rs2074508	rs2596437	rs707918	rs3129932	rs3129768	rs241403
rs9379851	rs3130391	rs3218815	rs5025315	rs376510	rs3129934	rs9272219	rs3101942
rs6933583	rs3094055	rs2074512	rs5022119	rs805292	rs9268403	rs9272346	rs241400
rs4712980	rs1012411	rs4711247	rs2523638	rs707915	rs12201454	rs9272723	rs151719
rs9393708	rs3132644	rs3132581	rs3997982	rs3130484	rs2894254	rs9273363	rs3129304
rs9358932	rs3132636	rs2530710	rs2596571	rs3131379	rs3129953	rs9275134	rs3129303
rs9379855	rs3129818	rs1634717	rs2523485	rs480092	rs2076533	rs2856688	rs10947374
rs9393713	rs3129819	rs1634718	rs3099849	rs2763979	rs9268480	rs7775228	rs9296069
rs1977	rs970269	rs1632854	rs2507976	rs550513	rs3763307	rs9469220	rs13215059
rs10456045	rs3132625	rs2844670	rs9266774	rs406936	rs6930933	rs2647015	rs13215062
rs9358945	rs3094050	rs2523865	rs9266775	rs3130287	rs2001099	rs2858308	rs376877
rs7763910	rs3094703	rs3130544	rs4081552	rs1150753	rs2001097	rs9275418	rs2179920
rs7773938	rs3094045	rs9405050	rs2596517	rs204991	rs3135378	rs9275523	rs3117242
rs4634439	rs3094034	rs1265052	rs2596464	rs204990	rs3135377	rs6936863	rs3128923
rs12190473	rs3130112	rs3130975	rs3131622	rs2071278	rs3135376	rs3916765	rs3117230
rs201002	rs3130113	rs3095324	rs2844507	rs3131294	rs2395161	rs9461799	rs872956
rs200995	rs3094694	rs13200022	rs2244579	rs377763	rs2395164	rs2227127	rs2395351
rs200991	rs8233	rs3130564	rs2248459	rs3134926	rs2395167	rs9276429	rs3116985
rs149946	rs3095329	rs2106074	rs2248462	rs424232	rs2213580	rs9276431	rs3129248
rs149969	rs3094127	rs887464	rs2248617	rs3130311	rs3135366	rs9276432	rs10807124
rs149970	rs3094122	rs1265181	rs3099844	rs1265777	rs9268557	rs9276440	rs11171739
rs202906	rs10947091	rs12199773	rs2516422	rs9268230	rs9268560	rs9276490	rs1265566
rs17696736	rs9746695

Conclusion

Screening SNPs in case-control studies is a common task in modern biomedical research, and the CATT and MAX statistics are among the most widely used screening measures for it. However, the theoretical guarantees for applying CATT and MAX to SNP screening have not been investigated. We fill this gap by adjusting the CATTs and the MAX test and proposing screening procedures based on the adjusted statistics. We prove the sure screening and ranking consistency properties of these procedures. Simulation results show that when the underlying genetic model is unknown, which is often the case in practice, MAX-SIS performs best.

Despite the high efficiency of the proposed procedures, several factors affect their performance. First, numerical simulations show that when both the MAF and the sample size are small, REC-SIS, ADD-SIS, DOM-SIS, MAX-SIS and PC-SIS all perform poorly. This is because, in this situation, the number of samples carrying minor alleles is too small to provide enough information for the association analysis. Second, the value of the parameter d clearly influences the performance of the different methods. We determine the value of d based on the previous literature. Since choosing an optimal d is not the focus of this work, we leave a more detailed analysis to future work. Third, when covariates need to be adjusted for, new procedures must be developed; this will be studied in future work.

Supplementary Information

Supplementary Information (152.6 KB, PDF).

Author contributions

Conceptualization, Z.J. and J.W.; methodology, Z.J., H.G. and J.W.; validation, Z.J., H.G. and J.W.; formal analysis, Z.J.; writing (original draft preparation), Z.J. and J.W.; writing (review and editing), Z.J., H.G. and J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the China Postdoctoral Science Foundation (Grant No. 2021M700433), the National Natural Science Foundation of China (NSFC) (Grant No. 12101047), and the Natural Science Foundation of Hubei Province (Grant No. 2022CFB942).

Data availability

All data included in this study are available upon request from the corresponding author. To facilitate use of the proposed methods, the code is also available upon request from the corresponding author.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-023-35929-4.

References

1. Wellcome Trust Case Control Consortium (WTCCC). Genome-wide association study of 14,000 cases of seven common diseases and 3000 shared controls. Nature. 2007;447:661–678.
2. Easton DF, Pooley KA, Dunning AM, Pharoah PD, Thompson D, Ballinger DG, Struewing JP, Morrison J, Field H, Luben R, et al. Genome-wide association study identifies novel breast cancer susceptibility loci. Nature. 2007;447:1087–1093. doi: 10.1038/nature05887.
3. Zeggini E, Weedon MN, Lindgren CM, Frayling TM, Elliott KS, Lango H, Timpson NJ, Perry JR, Rayner NW, Freathy RM, et al. Replication of genome-wide association signals in UK samples reveals risk loci for type 2 diabetes. Science. 2007;316:1336–1341. doi: 10.1126/science.1142364.
4. Yue WH, Wang HF, Sun LD, Tang FL, Liu ZH, Zhang HX, Li WQ, Zhang YL, Zhang Y, Ma CC, et al. Genome-wide association study identifies a susceptibility locus for schizophrenia in Han Chinese at 11p11.2. Nat. Genet. 2011;43:1228–1232. doi: 10.1038/ng.979.
5. Li LC, et al. Transcriptome-wide association study of coronary artery disease identifies novel susceptibility genes. Basic Res. Cardiol. 2022;117:6. doi: 10.1007/s00395-022-00917-8.
6. Li ZT, Wang B, Luo W, Xu YY, Wang JJ, Xue ZH, Niu YD, Cheng ZK, Ge S, Zhang W, Zhang JY, Li Q, Chong K. Natural variation of codon repeats in COLD11 endows rice with chilling resilience. Sci. Adv. 2022;9:eabq5506. doi: 10.1126/sciadv.abq5506.
7. Thomas NJ, Walkey HC, Kaur A, Misra S, Oliver NS, Colclough K, Weedon MN, Johnston DG, Hattersley AT, Patel KA. The relationship between islet autoantibody status and the genetic risk of type 1 diabetes in adult-onset type 1 diabetes. Diabetologia. 2022;66:310–320. doi: 10.1007/s00125-022-05823-1.
8. Sasieni PD. From genotypes to genes: Doubling the sample size. Biometrics. 1997;53:1253–1261. doi: 10.2307/2533494.
9. Freidlin B, Zheng G, Li Z, Gastwirth JL. Trend tests for case–control studies of genetic markers: Power, sample size and robustness. Hum. Hered. 2002;53:146–152. doi: 10.1159/000064976.
10. Zheng G, Freidlin B, Li Z, Gastwirth JL. Choice of scores in trend tests for case–control studies of candidate-gene associations. Biometric. J. 2003;45:335–348. doi: 10.1002/bimj.200390016.
11. Sladek R, Rocheleau G, Rung J, Dina C, Shen L, Serre D, Boutin P, Vincent D, Belisle A, Hadjadj S, Balkau B, Heude B, Charpentier G, Hudson TJ, Montpetit A, Pshezhetsky AV, Prentki M, Posner BI, Balding DJ, Meyre D, Polychronakos C, Froguel P. A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature. 2007;445:881–885. doi: 10.1038/nature05616.
12. Li Q, Zheng G, Li Z, Yu K. Efficient approximation of p-value of the maximum of correlated tests, with applications to genome-wide association studies. Ann. Hum. Genet. 2008;72:397–406. doi: 10.1111/j.1469-1809.2008.00437.x.
13. Zheng G, Li Q, Yuan A. Some statistical properties of efficiency robust tests with applications to genetic association studies. Scand. J. Stat. 2014;41:762–774. doi: 10.1111/sjos.12060.
14. Li Q, Yu K, Li Z, Zheng G. MAX-rank: A simple and robust genome-wide scan for case–control association studies. Hum. Genet. 2008;123:617–623. doi: 10.1007/s00439-008-0514-8.
15. Kim J, Sohn I, Kim DDH, Jung SH. SNP selection in genome-wide association studies via penalized support vector machine with MAX test. Comput. Math. Methods Med. 2013;2013:340678. doi: 10.1155/2013/340678.
16. Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B. 2008;70:849–911. doi: 10.1111/j.1467-9868.2008.00674.x.
17. Fan J, Song R. Sure independence screening in generalized linear models with NP-dimensionality. Ann. Stat. 2010;38:3567–3604. doi: 10.1214/10-AOS798.
18. Fan J, Feng Y, Song R. Nonparametric independence screening in sparse ultra-high-dimensional additive models. J. Am. Stat. Assoc. 2011;106:544–557. doi: 10.1198/jasa.2011.tm09779.
19. Li HD, Wang RH. Feature screening for ultrahigh dimensional categorical data with applications. J. Bus. Econ. Stat. 2014;32:237–244. doi: 10.1080/07350015.2013.863158.
20. Emily M. Power comparison of Cochran-Armitage trend test against allelic and genotypic tests in large-scale case–control genetic association studies. Stat. Methods Med. Res. 2018;27:2657–2673. doi: 10.1177/0962280216683979.
