Scientific Reports. 2023 Jun 5;13:9139. doi: 10.1038/s41598-023-35929-4

Screening properties of trend tests in genetic association studies

Zhenzhen Jiang 1,2, Hongping Guo 3, Jinjuan Wang 4
PMCID: PMC10241885  PMID: 37277435

Abstract

In genome-wide association studies, extracting disease-associated genetic variants from among millions of single nucleotide polymorphisms is of great importance. When the response is a binary variable, the Cochran-Armitage trend tests and the associated MAX test are among the most widely used methods for association analysis. However, theoretical guarantees for applying these methods to variable screening have not been established. To fill this gap, we propose screening procedures based on adjusted versions of these methods and prove their sure screening and ranking consistency properties. Extensive simulations are conducted to compare the performance of the different screening procedures and to demonstrate the robustness and efficiency of the MAX test-based screening procedure. A case study on a type 1 diabetes dataset further verifies their effectiveness.

Subject terms: Genetics, Medical research

Introduction

With the development of high-throughput sequencing techniques, hundreds of thousands of single nucleotide polymorphisms (SNPs) in the genome can be recorded, which enables researchers to investigate and treat diseases from the perspective of genetic variants. To identify disease-related genes or genetic markers among all these SNPs, the genome-wide association study (GWAS) is a widely used strategy. To date, more than one hundred thousand SNPs have been identified as related to many traits1–7.

A typical GWAS tests the association between the phenotype and each SNP in turn, obtains a series of test statistics or p-values, and selects the associated SNPs by comparing these statistics or p-values with a given threshold. When the phenotype is binary, the Cochran-Armitage trend test (CATT)8 is commonly used to detect the associated SNPs. It has been shown that when the underlying genetic model is known, the commonly used ones being the recessive, additive and dominant models, CATT has an optimal form9,10. However, the true genetic model is usually unknown and may be very complicated. For the sake of robustness, an omnibus test called MAX has been proposed11,12, which uses the maximum of the CATTs under different genetic models as a measure of association. The asymptotic distribution of MAX is given by Zheng et al.13. Since its introduction, MAX has been widely used and investigated. Li et al.14 introduced a selection procedure based on the rank of MAX. Kim et al.15 proposed a SNP selection method based on MAX and a penalized support vector machine strategy.

Though CATTs and MAX have concise forms and are extensively used, the theoretical properties of their application to GWAS have not been investigated. To control false discoveries in GWAS, the Bonferroni correction and false discovery rate (FDR) control procedures, such as the Benjamini–Hochberg procedure, are two widely used strategies. But both assume that all the SNPs are independent, which is inappropriate since linkage disequilibrium usually exists among SNPs, and may lead to related SNPs being omitted. Considering these drawbacks, feature screening methods are sensible alternatives. Rather than selecting the associated SNPs directly, feature screening approaches aim to first eliminate most of the irrelevant SNPs. After a screening procedure, only a small number of SNPs remain and researchers can concentrate on them, which saves much time and work.

In the last few years, feature screening methods have been proposed for various situations. Fan and Lv16 first proposed a screening method called the sure independence screening approach for Gaussian response and predictors under linear regression. Since then, the sure screening property, under which all the important predictors are retained with high probability as the sample size goes to infinity, has been regarded as a criterion for feature screening. Many screening procedures have been developed for diverse models, such as the generalized linear model17 and the additive model18, among others. Although many procedures can be directly applied to GWAS with corresponding models and data types, only PC-SIS, proposed by Huang et al.19, is applicable to the situation considered here, where both the outcome and the predictors are categorical. However, PC-SIS does not take information on the genetic model into consideration. As mentioned above, CATTs and the MAX test use this information in the association analysis, but their screening properties have not yet been studied. To fill this gap, we propose feature screening methods based on CATTs under different genetic models and on the MAX test, and investigate their sure screening and ranking consistency properties.

The rest of the paper is organised as follows. In “Trend test”, we briefly describe the trend tests which can be used to evaluate the relationship between a binary variable and a genotype variable. “Independence screening procedure” introduces the independence screening procedures based on the adjusted trend test statistics, and presents their sure screening and ranking consistency properties. Simulation studies are conducted in “Simulation studies”, and a case study on type 1 diabetes is presented in “Application to a real dataset”. A conclusion is given in “Conclusion”. All proofs of theorems are provided in the Supplemental Materials.

Trend test

CATT evaluates the association between a binary variable and a SNP, and is widely used in case-control genetic data analysis. Compared with the Pearson chi-square test, it makes use of the underlying genetic model. Its specific form is as follows. Suppose $r$ cases and $s$ controls are enrolled in the study. For a given SNP, the genotypes can be expressed as aa, Aa and AA, with A being a high-risk candidate allele. In the sample of cases, the counts of aa, Aa and AA are $r_0$, $r_1$ and $r_2$, respectively, and the corresponding counts in the control sample are $s_0$, $s_1$ and $s_2$. Thus we have $r = r_0 + r_1 + r_2$ and $s = s_0 + s_1 + s_2$. Denote $n = r + s$ and $n_i = r_i + s_i$ for $i = 0, 1, 2$. All these counts are displayed in Table 1. Then CATT can be written as

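$$Z = \frac{\sqrt{n}\,\sum_{i=0}^{2} X_i (s r_i - r s_i)}{\sqrt{r s \left[ n \sum_{i=0}^{2} X_i^2 n_i - \left( \sum_{i=0}^{2} X_i n_i \right)^2 \right]}}, \tag{1}$$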

where $(X_0, X_1, X_2)$ is a pre-defined genotype score vector. Note that the optimal score vector for CATT varies across genetic models. Specifically, for the commonly encountered recessive, additive and dominant genetic models, the optimal genotype score vectors are $(0, 0, 1)$, $(0, 1/2, 1)$ and $(0, 1, 1)$, respectively, and the corresponding CATTs are denoted by $Z_0$, $Z_{1/2}$ and $Z_1$. Under the null hypothesis of no association, each of these three CATTs is asymptotically distributed as $N(0, 1)$.
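As a minimal sketch of Eq. (1), the statistic can be computed directly from the six genotype counts of Table 1. The function name `catt`, the dictionary `SCORES` and the example counts are illustrative assumptions, not part of the original work.

```python
import math

# Score vectors (X0, X1, X2) from the text: recessive, additive, dominant.
SCORES = {
    "recessive": (0.0, 0.0, 1.0),
    "additive": (0.0, 0.5, 1.0),
    "dominant": (0.0, 1.0, 1.0),
}


def catt(cases, controls, scores):
    """Cochran-Armitage trend statistic Z of Eq. (1).

    cases    -- (r0, r1, r2), counts of genotypes aa, Aa, AA among cases
    controls -- (s0, s1, s2), the same counts among controls
    scores   -- (X0, X1, X2), the genotype score vector
    """
    r, s = sum(cases), sum(controls)
    n = r + s
    totals = [ri + si for ri, si in zip(cases, controls)]  # n0, n1, n2
    numerator = math.sqrt(n) * sum(
        x * (s * ri - r * si) for x, ri, si in zip(scores, cases, controls)
    )
    spread = n * sum(x * x * ni for x, ni in zip(scores, totals)) - sum(
        x * ni for x, ni in zip(scores, totals)
    ) ** 2
    return numerator / math.sqrt(r * s * spread)


# Hypothetical counts, chosen only for illustration.
cases = (40, 90, 70)
controls = (80, 90, 30)
z = {name: catt(cases, controls, x) for name, x in SCORES.items()}
```

When the case and control counts are exactly proportional, the numerator vanishes and $Z = 0$; swapping the roles of cases and controls flips the sign of $Z$.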

Table 1.

Genotype distribution in sample.

aa Aa AA Total
Cases r0 r1 r2 r
Controls s0 s1 s2 s
Total n0 n1 n2 n

However, in practice, the true genetic model is unknown, so none of $Z_0$, $Z_{1/2}$ and $Z_1$ is robust in all situations. To tackle this issue, the statistic MAX is proposed as

$$Z_{\max} = \max\{|Z_0|, |Z_{1/2}|, |Z_1|\}. \tag{2}$$

By using the maximum of the absolute values of $Z_0$, $Z_{1/2}$ and $Z_1$, $Z_{\max}$ obtains robustness under diverse situations.

Independence screening procedure

Screening procedure

CATTs and the MAX test are designed for testing the relationship between a binary response and a SNP variable. We apply them to the feature screening task and present their properties.

Suppose $G = (G_1, G_2, \ldots, G_m)$ is an $m$-dimensional SNP vector and $Y$ is a binary response which is 1 for a case sample and 0 for a control sample. Denote $P(Y=1) = p$ and $P(Y=0) = q$, where $p + q = 1$. Our aim is to identify the SNPs among all the $m$ SNPs that are related to $Y$. In accordance with practice, each SNP takes values in $\{0, 1, 2\}$, corresponding to genotypes aa, Aa and AA, respectively.

For the $k$th ($k = 1, 2, \ldots, m$) predictor $G_k$, we set the probabilities for the case population as $p_{ik} = P(G_k = i, Y = 1)$, $i = 0, 1, 2$, and those for the control population as $q_{ik} = P(G_k = i, Y = 0)$, $i = 0, 1, 2$, which are displayed in Table 2. Note that $p_{0k} + p_{1k} + p_{2k} = p$ and $q_{0k} + q_{1k} + q_{2k} = q$ for each $k \in \{1, 2, \ldots, m\}$. Denote $f_{ik} = p_{ik} + q_{ik}$ for $i = 0, 1, 2$ and $k = 1, 2, \ldots, m$. Then $f_{0k} + f_{1k} + f_{2k} = 1$ for $k = 1, 2, \ldots, m$.

Table 2.

Genotype distribution in population.

Gk=0 Gk=1 Gk=2 Total
Y=1 p0k p1k p2k p
Y=0 q0k q1k q2k q
Total f0k f1k f2k 1

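Denote the pre-defined score vectors for the recessive, additive and dominant genetic models as $(X_{0,0}, X_{1,0}, X_{2,0}) = (0, 0, 1)$, $(X_{0,1/2}, X_{1,1/2}, X_{2,1/2}) = (0, 1/2, 1)$ and $(X_{0,1}, X_{1,1}, X_{2,1}) = (0, 1, 1)$, respectively. Then define four measures of the association between $G_k$ ($k = 1, 2, \ldots, m$) and $Y$ as

$$\omega_{j,k} = \frac{\sum_{i=0}^{2} X_{i,j} (q p_{ik} - p q_{ik})}{\sqrt{p q \left[ \sum_{i=0}^{2} X_{i,j}^2 f_{ik} - \left( \sum_{i=0}^{2} X_{i,j} f_{ik} \right)^2 \right]}}, \quad j = 0, 1/2, 1; \; k = 1, 2, \ldots, m, \tag{3}$$

and

$$\nu_k = \max\{|\omega_{0,k}|, |\omega_{1/2,k}|, |\omega_{1,k}|\}, \quad k = 1, 2, \ldots, m. \tag{4}$$

It is obvious that when $G_k$ ($k = 1, 2, \ldots, m$) is independent of $Y$, $\omega_{j,k} = 0$ ($j = 0, 1/2, 1$) and $\nu_k = 0$.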
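The independence claim can be checked in one line. If $G_k$ is independent of $Y$, the joint probabilities factorize, so each term in the numerator of Eq. (3) vanishes for any score vector (assuming the denominator is positive, that is, the scored genotype is not constant):

```latex
p_{ik} = p\, f_{ik}, \qquad q_{ik} = q\, f_{ik}
\;\Longrightarrow\;
q\, p_{ik} - p\, q_{ik} = q p\, f_{ik} - p q\, f_{ik} = 0, \qquad i = 0, 1, 2 .
```

Hence $\omega_{j,k} = 0$ for $j = 0, 1/2, 1$, and therefore $\nu_k = 0$.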

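For $k \in \{1, 2, \ldots, m\}$, let $\{(g_{lk}, y_l), l = 1, 2, \ldots, n\}$ be $n$ pairs of observations of $(G_k, Y)$. Denote $r_k = (r_{0k}, r_{1k}, r_{2k})$ and $s_k = (s_{0k}, s_{1k}, s_{2k})$, where $r_{ik}$ ($i = 0, 1, 2$) are the counts of each genotype in the case sample and $s_{ik}$ ($i = 0, 1, 2$) are the counts in the control sample. Notice that $r_{0k} + r_{1k} + r_{2k} = r$ and $s_{0k} + s_{1k} + s_{2k} = s$. Denote $n_{ik} = r_{ik} + s_{ik}$, $i = 0, 1, 2$; then $n_{0k} + n_{1k} + n_{2k} = n$.

Given the above notation, the empirical estimators of $\omega_{0,k}$, $\omega_{1/2,k}$, $\omega_{1,k}$ and $\nu_k$ for $k \in \{1, 2, \ldots, m\}$ are

$$\hat{\omega}_{j,k} = \frac{\sum_{i=0}^{2} X_{i,j} (\hat{q} \hat{p}_{ik} - \hat{p} \hat{q}_{ik})}{\sqrt{\hat{p} \hat{q} \left[ \sum_{i=0}^{2} X_{i,j}^2 \hat{f}_{ik} - \left( \sum_{i=0}^{2} X_{i,j} \hat{f}_{ik} \right)^2 \right]}}, \quad j = 0, 1/2, 1, \tag{5}$$

and

$$\hat{\nu}_k = \max\{|\hat{\omega}_{0,k}|, |\hat{\omega}_{1/2,k}|, |\hat{\omega}_{1,k}|\}, \tag{6}$$

where $\hat{p}_{ik}$, $\hat{q}_{ik}$, $\hat{p}$, $\hat{q}$ and $\hat{f}_{ik}$ are the empirical estimators of $p_{ik}$, $q_{ik}$, $p$, $q$ and $f_{ik}$, given by

$$\begin{aligned}
\hat{p}_{ik} &= \frac{1}{n} \sum_{l=1}^{n} I(G_{lk} = i, Y_l = 1) = \frac{r_{ik}}{n}, &
\hat{q}_{ik} &= \frac{1}{n} \sum_{l=1}^{n} I(G_{lk} = i, Y_l = 0) = \frac{s_{ik}}{n}, \\
\hat{p} &= \frac{1}{n} \sum_{l=1}^{n} I(Y_l = 1) = \frac{r}{n}, &
\hat{q} &= \frac{1}{n} \sum_{l=1}^{n} I(Y_l = 0) = \frac{s}{n}, \\
\hat{f}_{ik} &= \frac{1}{n} \sum_{l=1}^{n} I(G_{lk} = i) = \frac{n_{ik}}{n}.
\end{aligned} \tag{7}$$

Plugging these into Eq. (5), $\hat{\omega}_{j,k}$ takes the form

$$\hat{\omega}_{j,k} = \frac{\sum_{i=0}^{2} X_{i,j} (s r_{ik} - r s_{ik})}{\sqrt{r s \left[ n \sum_{i=0}^{2} X_{i,j}^2 n_{ik} - \left( \sum_{i=0}^{2} X_{i,j} n_{ik} \right)^2 \right]}}. \tag{8}$$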

Note that $\hat{\omega}_{j,k} = Z_{j,k} / \sqrt{n}$, where $Z_{0,k}$, $Z_{1/2,k}$ and $Z_{1,k}$ are the CATT statistics between $G_k$ and $Y$ for the pre-defined score vector $(X_0, X_1, X_2)$ being $(0, 0, 1)$, $(0, 1/2, 1)$ and $(0, 1, 1)$, respectively. So $\hat{\omega}_{j,k}$ is an adjusted version of $Z_{j,k}$ whose range is not affected by the sample size, and $\hat{\nu}_k$ preserves the ranking of $Z_{\max,k}$ across predictors. Large values of $\hat{\nu}_k$ indicate the existence of association between $G_k$ and $Y$. We refer to $\hat{\omega}_{j,k}$ as aCATT and to $\hat{\nu}_k$ as aMAX.
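The count form in Eq. (8) vectorizes naturally over SNPs. The following NumPy sketch computes aCATT and aMAX for $m$ SNPs at once; the function name `acatt_amax`, the array layout and the example counts are assumptions made for illustration. A SNP whose scored genotype is constant in the sample (for example, no AA genotype under the recessive scores) would give a zero denominator, so such SNPs are assumed to have been removed beforehand.

```python
import numpy as np

# Rows: recessive, additive, dominant score vectors (X_{0,j}, X_{1,j}, X_{2,j}).
SCORES = np.array([[0.0, 0.0, 1.0],
                   [0.0, 0.5, 1.0],
                   [0.0, 1.0, 1.0]])


def acatt_amax(case_counts, control_counts):
    """Return (omega_hat, nu_hat) following Eqs. (8) and (6).

    case_counts, control_counts -- arrays of shape (m, 3) holding
    (r_0k, r_1k, r_2k) and (s_0k, s_1k, s_2k) for each SNP k.
    omega_hat has shape (m, 3), columns ordered as the rows of SCORES;
    nu_hat has shape (m,).
    """
    rk = np.asarray(case_counts, dtype=float)
    sk = np.asarray(control_counts, dtype=float)
    r = rk.sum(axis=1, keepdims=True)  # cases per SNP, shape (m, 1)
    s = sk.sum(axis=1, keepdims=True)  # controls per SNP, shape (m, 1)
    nk = rk + sk                       # genotype totals n_ik
    n = r + s
    numerator = (s * rk - r * sk) @ SCORES.T
    spread = n * (nk @ (SCORES ** 2).T) - (nk @ SCORES.T) ** 2
    omega_hat = numerator / np.sqrt(r * s * spread)
    nu_hat = np.abs(omega_hat).max(axis=1)
    return omega_hat, nu_hat


# Hypothetical counts: one associated SNP and one with proportional counts.
cases = np.array([[40, 90, 70], [10, 20, 30]])
controls = np.array([[80, 90, 30], [20, 40, 60]])
omega_hat, nu_hat = acatt_amax(cases, controls)
```

Multiplying every count by the same factor leaves `omega_hat` unchanged, which illustrates the remark that the range of aCATT does not depend on the sample size.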

Assume that only a small fraction of the SNPs are related to the response $Y$. We use the aCATT values $|\hat{\omega}_{j,k}|$ and the aMAX values $\hat{\nu}_k$ to identify them. The screening procedures based on $|\hat{\omega}_{0,k}|$, $|\hat{\omega}_{1/2,k}|$, $|\hat{\omega}_{1,k}|$ and $\hat{\nu}_k$ are named REC-SIS, ADD-SIS, DOM-SIS and MAX-SIS, respectively; REC-SIS, ADD-SIS and DOM-SIS are collectively called CATT-SIS.

Screening properties

We call a SNP active if it is associated with the response $Y$. Define the index sets of active SNPs under the different measures by

$$A_j = \{1 \le k \le m : |\omega_{j,k}| > 0\}, \quad j = 0, 1/2, 1, \tag{9}$$
$$A = \{1 \le k \le m : \nu_k > 0\}. \tag{10}$$

Their estimated truncated active index sets can be expressed as

$$\hat{A}_j = \{1 \le k \le m : |\hat{\omega}_{j,k}| \ge c_0 n^{-\tau}\}, \quad j = 0, 1/2, 1, \tag{11}$$
$$\hat{A} = \{1 \le k \le m : \hat{\nu}_k \ge c_0 n^{-\tau}\}, \tag{12}$$

where $c_0 > 0$ and $\tau > 0$ are two pre-specified constants satisfying certain conditions.

We now investigate the theoretical properties of the screening procedures based on $\hat{A}_j$ and $\hat{A}$. We first list some conditions.

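Condition 1

(C1) There exist constants $0 < \zeta_{\min} \le \zeta_{\max} < 1$ such that for $i = 0, 1, 2$ and $k = 1, 2, \ldots, m$, if $p_{ik} \ne 0$ ($q_{ik} \ne 0$), then $p_{ik} \in (\zeta_{\min}, \zeta_{\max})$ ($q_{ik} \in (\zeta_{\min}, \zeta_{\max})$).

(C2) $\min_{k \in A_j} \omega_{j,k} \ge 2 c_0 n^{-\tau}$ for $j = 0, 1/2, 1$, where the constants satisfy $c_0 > 0$ and $0 \le \tau < 1/2$.

(C3) $\min_{k \in A} \nu_k \ge 2 c_0 n^{-\tau}$, where the constants satisfy $c_0 > 0$ and $0 \le \tau < 1/2$.

(C4) For given constants $c_0 > 0$ and $0 \le \tau < 1/2$, and $\log(m) = o(n^{1-2\tau} \wedge n^{1/2})$ where $a \wedge b = \min\{a, b\}$, $\liminf_{m \to \infty} \left( \min_{k \in A_j} \omega_{j,k} - \max_{k \notin A_j} \omega_{j,k} \right) > 2 c_0 n^{-\tau}$ for $j = 0, 1/2, 1$.

(C5) For given constants $c_0 > 0$ and $0 \le \tau < 1/2$, and $\log(m) = o(n^{1-2\tau} \wedge n^{1/2})$ where $a \wedge b = \min\{a, b\}$, $\liminf_{m \to \infty} \left( \min_{k \in A} \nu_k - \max_{k \notin A} \nu_k \right) > 2 c_0 n^{-\tau}$.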

We now present the sure screening properties of the aCATT-based and aMAX-based procedures in Theorems 1 and 2, whose proofs are given in the Supplemental Materials.

Theorem 1

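(Sure Screening Property of CATT-SIS):

  • (i)
    If Condition (C1) holds, then for $j = 0, 1/2$ and $1$ we have
    $$P\left( \max_{1 \le k \le m} |\hat{\omega}_{j,k} - \omega_{j,k}| \ge c_0 n^{-\tau} \right) < O\left( m \exp\{-c_1 n^{1-2\tau} - c_2 n^{1/2}\} \right), \tag{13}$$
    with $c_1 > 0$ and $c_2 > 0$ being two constants.
  • (ii)
    Furthermore, if both Conditions (C1) and (C2) are satisfied, then for $j = 0, 1/2$ and $1$ we obtain
    $$P(A_j \subseteq \hat{A}_j) \ge 1 - O\left( \kappa \exp\{-c_1 n^{1-2\tau} - c_2 n^{1/2}\} \right), \tag{14}$$
    where $\kappa$ is the cardinality of $A_j$, and $c_1, c_2 > 0$ are the same as those in inequality (13).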

Theorem 2

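(Sure Screening Property for MAX-SIS):

  • (i)
    If Condition (C1) holds, then we have
    $$P\left( \max_{1 \le k \le m} |\hat{\nu}_k - \nu_k| \ge c_0 n^{-\tau} \right) < O\left( m \exp\{-c_3 n^{1-2\tau} - c_4 n^{1/2}\} \right), \tag{15}$$
    where $c_3 > 0$ and $c_4 > 0$ are two constants.
  • (ii)
    Furthermore, if both Conditions (C1) and (C3) are satisfied, we have
    $$P(A \subseteq \hat{A}) \ge 1 - O\left( \kappa \exp\{-c_3 n^{1-2\tau} - c_4 n^{1/2}\} \right), \tag{16}$$
    where $\kappa$ is the cardinality of $A$, and $c_3, c_4 > 0$ are the same as those in inequality (15).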

Theorems 1 and 2 show that the screening procedures retain the active SNPs with high probability. They also possess the ranking consistency property, as shown below.

Theorem 3

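(Ranking Consistency Property for CATT-SIS): Suppose Conditions (C1) and (C4) are satisfied. Then for $j = 0, 1/2$ and $1$, it follows that

$$\liminf_{n \to \infty} \left\{ \min_{k \in A_j} |\hat{\omega}_{j,k}| - \max_{k \notin A_j} |\hat{\omega}_{j,k}| \right\} \ge 0, \quad \text{a.s.} \tag{17}$$

Theorem 4

(Ranking Consistency Property for MAX-SIS): Suppose Conditions (C1) and (C5) are satisfied. Then it follows that

$$\liminf_{n \to \infty} \left\{ \min_{k \in A} \hat{\nu}_k - \max_{k \notin A} \hat{\nu}_k \right\} \ge 0, \quad \text{a.s.} \tag{18}$$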

In practice, it is hard to choose $c_0$ and $\tau$ so that the estimated truncated active index sets contain the corresponding active index sets. It is therefore common to select the SNPs with the $d$ largest statistic values as related SNPs, where $d$ is a pre-defined constant. That is, the respective estimated active index sets have the following forms

$$\hat{A}_{j,d} = \{1 \le k \le m : |\hat{\omega}_{j,k}| \text{ is among the first } d \text{ largest statistics}\},$$

and

$$\hat{A}_d = \{1 \le k \le m : \hat{\nu}_k \text{ is among the first } d \text{ largest statistics}\}.$$

We now explain why the index set of the first $d$ largest statistics is a reasonable estimate of the active index set. Take MAX-SIS as an example. Given $c_0$ and $\tau$, the cardinality of $\hat{A}$ is determined; denote it by $d_0$. According to Theorem 4, MAX-SIS possesses the ranking consistency property. Provided Conditions (C1) and (C5) are satisfied, we have $\hat{A} \subseteq \hat{A}_d$ if $d \ge d_0$, so every active predictor contained in $\hat{A}$ is also contained in $\hat{A}_d$. Note that $P(A \subseteq \hat{A}_d)$ is nondecreasing in $d$. As long as $d \ge d_0$, we have $P(A \subseteq \hat{A}_d) \ge P(A \subseteq \hat{A}) \ge 1 - O\left( \kappa \exp\{-c_3 n^{1-2\tau} - c_4 n^{1/2}\} \right)$ by Theorem 2 (ii). Therefore, estimating the active index set by the index set of the first $d$ largest statistics is reasonable.
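The two selection rules can be sketched as follows, together with a small check of the inclusion argument above. The function names, the 0-based indexing and the toy statistic values are assumptions made for illustration; `default_d` uses the natural logarithm, which is consistent with the value $[4901/\log(4901)] = 576$ reported in the real-data section.

```python
import math
import numpy as np


def threshold_set(stat, c0, tau, n):
    """Indices k with stat[k] >= c0 * n**(-tau), as in Eqs. (11)-(12)."""
    stat = np.asarray(stat, dtype=float)
    return set(np.flatnonzero(stat >= c0 * n ** (-tau)).tolist())


def top_d_set(stat, d):
    """Indices of the d largest entries of stat (ties broken by index order)."""
    stat = np.asarray(stat, dtype=float)
    order = np.argsort(-stat, kind="stable")
    return set(order[:d].tolist())


def default_d(n):
    """d = [n / log n], the integer part, with the natural logarithm."""
    return math.floor(n / math.log(n))


# Toy aMAX values for six SNPs (hypothetical numbers).
nu_hat = np.array([0.30, 0.02, 0.25, 0.01, 0.04, 0.20])
a_hat = threshold_set(nu_hat, c0=1.0, tau=0.3, n=1000)  # threshold ~ 0.126
d0 = len(a_hat)
a_hat_d = top_d_set(nu_hat, d=d0 + 1)
```

Here `a_hat` contains three indices and is a subset of `a_hat_d`, as the argument requires for any $d \ge d_0$.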

Simulation studies

In this section, we conduct simulation studies to assess the performance of REC-SIS, ADD-SIS, DOM-SIS and MAX-SIS by comparing them with PC-SIS19.

For each genetic model, the dimension of SNPs is $m = 10^5$. Since the sample size, the case-to-control ratio and the minor allele frequency (MAF)20 can affect the association analysis in a case-control study, we consider different settings for them. To be specific, we choose the sample size $n$ from $\{1500, 3000, 4500\}$, the case-to-control ratio $w = p : q$ from $\{1, 1/3, 1/5\}$ and the MAF $\alpha$ from $\{0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45\}$. Because only the genotype counts are needed to calculate the statistics of interest, there is no need to generate the original samples $\{(g_l, y_l), l = 1, 2, \ldots, n\}$ in the simulation studies. Instead, we can just generate the count data from the trinomial distribution for each dataset. For the $k$th genetic variant (SNP), the count vector of the three genotypes for the case sample, $(r_{0k}, r_{1k}, r_{2k})$, follows the trinomial distribution $\mathrm{Mul}(np, p_{0k}/p, p_{1k}/p, p_{2k}/p)$ and that for the control sample, $(s_{0k}, s_{1k}, s_{2k})$, follows the trinomial distribution $\mathrm{Mul}(nq, q_{0k}/q, q_{1k}/q, q_{2k}/q)$, where $p_{0k} + p_{1k} + p_{2k} = p$ and $q_{0k} + q_{1k} + q_{2k} = q$.

In each dataset, the first six SNPs are set to be related to $Y$ and the remaining SNPs are independent of $Y$. For the control sample, the count vector $(s_{0k}, s_{1k}, s_{2k})$ of each SNP $G_k$ ($k \in \{1, 2, \ldots, 10^5\}$) is generated from the trinomial distribution $\mathrm{Mul}(nq, q_{0k}/q, q_{1k}/q, q_{2k}/q)$, where $q_{0k} = q(1-\alpha)^2$, $q_{1k} = 2q\alpha(1-\alpha)$ and $q_{2k} = q\alpha^2$, with $\alpha$ being the MAF. For the case sample, the count vector $(r_{0k}, r_{1k}, r_{2k})$ of each irrelevant SNP $G_k$ ($k \in \{7, 8, \ldots, 10^5\}$) is generated from $\mathrm{Mul}(np, p_{0k}/p, p_{1k}/p, p_{2k}/p)$ with $p_{ik}/p = q_{ik}/q$, $i = 0, 1, 2$, while the count vector $(r_{0k}, r_{1k}, r_{2k})$ of each relevant SNP $G_k$ ($k \in \{1, 2, \ldots, 6\}$) is generated from the trinomial distribution $\mathrm{Mul}(np, p_{0k}/p, p_{1k}/p, p_{2k}/p)$, where $(p_{0k}, p_{1k}, p_{2k})$ are functions of $(q_{0k}, q_{1k}, q_{2k})$ that differ across genetic models. Four genetic models are considered, namely the recessive model, the additive model, the dominant model and a mixture of them, denoted below as Model I, Model II, Model III and Model IV, respectively.

Under each genetic model, 500 repetitions are conducted to compare the performance of the different methods. We employ two criteria to measure the effectiveness of each screening approach. One is the proportion of the 500 repetitions in which a given relevant SNP $G_k$, $k \in A$, is selected, denoted by Psk. The other is the proportion of the 500 repetitions in which all the relevant SNPs are simultaneously selected, denoted by Pa.

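  • Model I.

    Data are generated from the recessive genetic model. For the relevant SNPs $G_k$ ($k = 1, 2, \ldots, 6$),
    $$p_{0k} = \frac{p\, q_{0k}}{q_{0k} + q_{1k} + \lambda q_{2k}}, \quad p_{1k} = \frac{p\, q_{1k}}{q_{0k} + q_{1k} + \lambda q_{2k}}, \quad p_{2k} = \frac{p\, \lambda q_{2k}}{q_{0k} + q_{1k} + \lambda q_{2k}},$$
    with $\lambda = 1.8$.

  • Model II.

    Data are generated from the additive genetic model. For the relevant SNPs $G_k$ ($k = 1, 2, \ldots, 6$),
    $$p_{0k} = \frac{p\, q_{0k}}{q_{0k} + \lambda q_{1k} + (2\lambda - 1) q_{2k}}, \quad p_{1k} = \frac{p\, \lambda q_{1k}}{q_{0k} + \lambda q_{1k} + (2\lambda - 1) q_{2k}}, \quad p_{2k} = \frac{p\, (2\lambda - 1) q_{2k}}{q_{0k} + \lambda q_{1k} + (2\lambda - 1) q_{2k}},$$
    with $\lambda = 1.4$.

  • Model III.

    Data are generated from the dominant genetic model. For the relevant SNPs $G_k$ ($k = 1, 2, \ldots, 6$),
    $$p_{0k} = \frac{p\, q_{0k}}{q_{0k} + \lambda q_{1k} + \lambda q_{2k}}, \quad p_{1k} = \frac{p\, \lambda q_{1k}}{q_{0k} + \lambda q_{1k} + \lambda q_{2k}}, \quad p_{2k} = \frac{p\, \lambda q_{2k}}{q_{0k} + \lambda q_{1k} + \lambda q_{2k}},$$
    with $\lambda = 1.6$.

  • Model IV.

    Data are generated from a mixture of the three genetic models. Relevant SNPs $G_1$ and $G_2$ are generated as in Model I, relevant SNPs $G_3$ and $G_4$ as in Model II, and relevant SNPs $G_5$ and $G_6$ as in Model III.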
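The data-generating scheme for Models I to III can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function names are hypothetical, the number of cases is taken as $r = nw/(1+w)$ (rounded), and the example uses $m = 1000$ SNPs instead of $10^5$ only to keep it fast. Because the case probabilities are normalised, $p$ and $q$ cancel and only the conditional genotype frequencies are needed.

```python
import numpy as np


def control_freqs(alpha):
    """Genotype frequencies (aa, Aa, AA) among controls, i.e. q_ik / q."""
    return np.array([(1 - alpha) ** 2, 2 * alpha * (1 - alpha), alpha ** 2])


def case_freqs(alpha, lam, model):
    """Genotype frequencies p_ik / p among cases for a relevant SNP.

    The weights multiply (q_0k, q_1k, q_2k) before normalising:
    recessive (1, 1, lam), additive (1, lam, 2*lam - 1), dominant (1, lam, lam).
    """
    weights = {
        "recessive": (1.0, 1.0, lam),
        "additive": (1.0, lam, 2.0 * lam - 1.0),
        "dominant": (1.0, lam, lam),
    }[model]
    unnormalised = np.array(weights) * control_freqs(alpha)
    return unnormalised / unnormalised.sum()


def simulate_counts(n, w, alpha, lam, model, m, n_relevant, rng):
    """Draw case and control genotype counts, each of shape (m, 3)."""
    r = round(n * w / (1 + w))  # number of cases, since w = p : q
    s = n - r
    null_freqs = control_freqs(alpha)
    controls = rng.multinomial(s, null_freqs, size=m)
    cases = rng.multinomial(r, null_freqs, size=m)
    # Overwrite the first n_relevant SNPs with draws from the disease model.
    cases[:n_relevant] = rng.multinomial(
        r, case_freqs(alpha, lam, model), size=n_relevant
    )
    return cases, controls


rng = np.random.default_rng(0)
cases, controls = simulate_counts(
    n=3000, w=1 / 3, alpha=0.3, lam=1.8, model="recessive",
    m=1000, n_relevant=6, rng=rng,
)
```

Setting `lam = 1` in any model recovers the control frequencies, which corresponds to an irrelevant SNP. Model IV would be obtained by calling `case_freqs` with a different `model` for each pair of relevant SNPs.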

For each model, the proportions Psk, $k = 1, 2, \ldots, 6$, and Pa are calculated with the constant $d = [n / \log n]$, where $[a]$ denotes the integer part of $a$. The results are plotted in Figs. 1, 2, 3 and 4. Since in Models I, II and III the six relevant SNPs are generated from the same distribution, Psk, $k = 1, 2, \ldots, 6$, are similar in these models. Therefore, we only plot the results for Ps1 in Figs. 1, 2 and 3. In Model IV, the relevant SNPs are generated from different genetic models, so the results for Psk, $k = 1, 2, \ldots, 6$, are plotted in Fig. 4. The results for Pa are plotted in Figs. 1, 2, 3 and 4.
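The two criteria reduce to column means and a row-wise "all" over a table of selection indicators. A minimal sketch, assuming a boolean array `selected` of shape (repetitions, number of relevant SNPs) whose entry is true when that SNP was kept in that repetition; the function name and the toy array are hypothetical.

```python
import numpy as np


def selection_proportions(selected):
    """Return (Ps, Pa) from a boolean array of shape (repetitions, SNPs).

    Ps[k] is the proportion of repetitions in which relevant SNP k is kept;
    Pa is the proportion in which all relevant SNPs are kept together.
    """
    selected = np.asarray(selected, dtype=bool)
    ps = selected.mean(axis=0)
    pa = selected.all(axis=1).mean()
    return ps, pa


# Toy example with 4 repetitions and 3 relevant SNPs.
selected = np.array([[1, 1, 1],
                     [1, 0, 1],
                     [1, 1, 1],
                     [0, 1, 1]], dtype=bool)
ps, pa = selection_proportions(selected)
```

By construction Pa can never exceed the smallest Psk, which is why the simultaneous criterion is the more demanding one.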

Figure 1. Selection proportions of different methods in Model I. The left subplot is for Ps1 among 500 repetitions. The right subplot is for Pa among 500 repetitions.

Figure 2. Selection proportions of different methods in Model II. The left subplot is for Ps1 among 500 repetitions. The right subplot is for Pa among 500 repetitions.

Figure 3. Selection proportions of different methods in Model III. The left subplot is for Ps1 among 500 repetitions. The right subplot is for Pa among 500 repetitions.

Figure 5. The Venn diagram for the results of all the five procedures.

The results in Fig. 1 correspond to the recessive genetic model. It can be seen that REC-SIS performs best, MAX-SIS comes second, and DOM-SIS is the worst. As Fig. 1 shows, the ability of all the screening approaches to detect $G_1$ increases as the sample size, the case-to-control ratio and the MAF increase. In addition, PC-SIS almost fails to detect the relevant SNPs when the MAF is less than 0.3.

The simulation results for Model II are presented in Fig. 2. When the underlying genetic model is exactly the additive model, ADD-SIS performs best and MAX-SIS ranks second. As Fig. 2 shows, the ability of all the screening approaches to detect relevant SNPs increases as the sample size and the case-to-control ratio increase. REC-SIS has low power when the MAF is small. The detection proportions of DOM-SIS first increase and then decrease slightly as the MAF increases. In general, the detection proportions of MAX-SIS and ADD-SIS increase as the MAF becomes larger. In contrast, the detection proportions of PC-SIS first increase slightly and then decrease dramatically as the MAF increases.

The results for Model III are shown in Fig. 3. When the underlying genetic model is exactly the dominant model, DOM-SIS performs best and REC-SIS hardly works. As shown in Fig. 3, the ability of all the screening approaches to detect $G_1$ increases as the sample size and the case-to-control ratio increase. Furthermore, when the MAF is greater than 0.25, the detection proportions of all the methods except REC-SIS decline as the MAF increases. From the right subplot of Fig. 3, we can see that the performance of ADD-SIS and PC-SIS is greatly influenced by the MAF, while that of DOM-SIS and MAX-SIS is robust against the MAF.

As for Model IV, since the effects of sample size and case-to-control ratio on the performance of the different methods have been demonstrated in the above three models, we take $n = 3000$, $w = 0.2$ as a representative setting to demonstrate the effects of different genetic models. The results for Psk, $k = 1, 2, \ldots, 6$, when $n = 3000$ and $w = 0.2$ are shown in the left subplot of Fig. 4, and the results for Pa under all scenarios are shown in the right subplot of Fig. 4. Since the six relevant SNPs follow different genetic models, REC-SIS, ADD-SIS and DOM-SIS cannot outperform MAX-SIS and PC-SIS uniformly for all the relevant SNPs. Consistent with the results shown before, REC-SIS has the highest detection proportion for $G_1$ and $G_2$, ADD-SIS has the highest detection proportion for $G_3$ and $G_4$, and DOM-SIS has the highest detection proportion for $G_5$ and $G_6$. None of REC-SIS, ADD-SIS and DOM-SIS performs best uniformly. However, whatever the underlying genetic relationship is, MAX-SIS performs well throughout. As for Pa, MAX-SIS clearly outperforms all the other methods.

Figure 4. Selection proportions of different methods in Model IV. The left subplot is for Psk, $k = 1, 2, \ldots, 6$, among 500 repetitions, when the sample size is 3000 and the case-to-control ratio is 0.2. The right subplot is for Pa among 500 repetitions.

From the simulation results above, we can see that the sample size and the case-to-control ratio are two important factors that affect the association analysis. As expected, an increase in sample size enhances the efficiency of identifying associated SNPs. As for the case-to-control ratio, all the methods perform better when the ratio is close to 1 than under more unbalanced designs (smaller $w$). Given the size of the case sample, increasing the size of the control sample contributes little to the performance of any of the methods. For example, when $w = p : q = 1/3$, $n = 3000$ and when $w = p : q = 1/5$, $n = 4500$, that is, when the case sample size is $r = 750$ and the control sample size is $s = 2250$ and $3750$, respectively, the selection proportions of all five screening methods are similar no matter how the MAF varies. The effect of the MAF is not monotonic. In the recessive model, the selection proportions of all five methods increase as the MAF increases. However, in the other models, the selection proportions of some methods first increase and later decrease as the MAF increases. Under all the scenarios considered, MAX-SIS is the most robust of these five screening methods.
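The case and control sizes quoted in this example follow from $r = nw/(1+w)$ and $s = n - r$. A short check with exact fractions, where the helper name `case_control_sizes` is an assumption for illustration:

```python
from fractions import Fraction


def case_control_sizes(n, w):
    """Split a total sample size n into (cases, controls) for ratio w = p : q."""
    w = Fraction(w)
    r = n * w / (1 + w)
    if r.denominator != 1:
        raise ValueError("n and w do not give an integer number of cases")
    return int(r), n - int(r)


sizes_a = case_control_sizes(3000, Fraction(1, 3))
sizes_b = case_control_sizes(4500, Fraction(1, 5))
```

Both settings contain 750 cases and differ only in the number of controls (2250 versus 3750), which is what makes them a direct comparison of the effect of adding controls.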

Overall, we conclude that if all the candidate SNPs follow the same known genetic model, the corresponding one of REC-SIS, ADD-SIS and DOM-SIS performs best. However, the genetic model is usually complicated and unknown in practice. In this case, MAX-SIS is recommended for its robustness and efficiency.

Application to a real dataset

We apply the proposed screening procedures to a real case-control dataset on type 1 diabetes in British people1. The data contain 459,446 SNPs for 2938 controls and 1963 cases. Since there are some missing values in the genotype data, the number of observed genotypes varies across SNPs. Counting the number of missing values for each partially observed SNP, we find that the mean and the maximum of these counts are 16.72 and 503, respectively, and that the 25%, 50% and 75% quantiles are 4, 7 and 13, respectively. To ensure that the aCATT and aMAX statistics have similar consistency rates across SNPs, SNPs with a missing rate larger than 1% are deleted. In addition, SNPs with only two observed genotypes are removed from the dataset, leaving 352,659 SNPs to be analyzed. For each SNP, the allele with the lower frequency is treated as the risk allele. We use REC-SIS, ADD-SIS, DOM-SIS, MAX-SIS and PC-SIS to screen out the redundant SNPs, with the parameter $d = [4901/\log(4901)] = 576$. The screening results of all five procedures are displayed in the Venn diagram in Fig. 5. It shows that 242 SNPs are selected by all the procedures. Among these SNPs, SNPs rs9272346 and rs9272346 have been reported to be associated with type 1 diabetes1. This indicates that these SNPs may contain important association information that needs to be further investigated. We list these 242 SNPs in Table 3.
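The two quality-control filters described above can be sketched on a genotype matrix as follows. The layout (samples in rows, SNPs in columns, genotypes coded 0/1/2), the missing-value code `-1` and the function name `keep_snps` are assumptions for illustration; the sketch also drops SNPs with a single observed genotype, which the text does not mention explicitly.

```python
import numpy as np


def keep_snps(genotypes, max_missing=0.01, missing_code=-1):
    """Boolean mask over SNPs (columns) that pass both filters.

    A SNP is kept if its missing rate is at most max_missing and all three
    genotypes 0, 1 and 2 are observed at least once.
    """
    g = np.asarray(genotypes)
    missing_rate = (g == missing_code).mean(axis=0)
    all_three = np.all([(g == v).any(axis=0) for v in (0, 1, 2)], axis=0)
    return (missing_rate <= max_missing) & all_three


# Toy matrix: 4 samples, 4 SNPs, with -1 marking a missing genotype.
toy = np.array([[0, 0, 0, -1],
                [1, 1, 1, 0],
                [2, 1, 2, 1],
                [2, 0, -1, 2]])
mask_strict = keep_snps(toy, max_missing=0.2)
mask_loose = keep_snps(toy, max_missing=0.25)
```

In the toy matrix the second SNP is dropped under both settings because only two genotypes are observed, while the third and fourth SNPs pass only under the looser missing-rate bound.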

Table 3.

The 242 SNPs selected by all the five screening procedures.

rs1113523 rs203888 rs12527415 rs3130532 rs2523467 rs574710 rs9268645 rs7768538
rs1217200 rs9393881 rs11758688 rs2248880 rs2395034 rs539703 rs3135393 rs7453920
rs1230666 rs203884 rs3094123 rs6906846 rs3131631 rs3132959 rs1051336 rs6902723
rs1230658 rs1233708 rs3130649 rs2524067 rs2736177 rs926591 rs9268831 rs6903130
rs1230649 rs406511 rs3095350 rs7382297 rs1046089 rs3129900 rs9268877 rs9296044
rs6679677 rs2523443 rs6924600 rs2523537 rs760293 rs4959093 rs9270986 rs2857212
rs1217396 rs1611350 rs6457282 rs2523535 rs3130048 rs910050 rs615672 rs2071474
rs2488457 rs1736913 rs9295931 rs2523534 rs805301 rs910049 rs9271208 rs241427
rs9467704 rs2517646 rs2074508 rs2596437 rs707918 rs3129932 rs3129768 rs241403
rs9379851 rs3130391 rs3218815 rs5025315 rs376510 rs3129934 rs9272219 rs3101942
rs6933583 rs3094055 rs2074512 rs5022119 rs805292 rs9268403 rs9272346 rs241400
rs4712980 rs1012411 rs4711247 rs2523638 rs707915 rs12201454 rs9272723 rs151719
rs9393708 rs3132644 rs3132581 rs3997982 rs3130484 rs2894254 rs9273363 rs3129304
rs9358932 rs3132636 rs2530710 rs2596571 rs3131379 rs3129953 rs9275134 rs3129303
rs9379855 rs3129818 rs1634717 rs2523485 rs480092 rs2076533 rs2856688 rs10947374
rs9393713 rs3129819 rs1634718 rs3099849 rs2763979 rs9268480 rs7775228 rs9296069
rs1977 rs970269 rs1632854 rs2507976 rs550513 rs3763307 rs9469220 rs13215059
rs10456045 rs3132625 rs2844670 rs9266774 rs406936 rs6930933 rs2647015 rs13215062
rs9358945 rs3094050 rs2523865 rs9266775 rs3130287 rs2001099 rs2858308 rs376877
rs7763910 rs3094703 rs3130544 rs4081552 rs1150753 rs2001097 rs9275418 rs2179920
rs7773938 rs3094045 rs9405050 rs2596517 rs204991 rs3135378 rs9275523 rs3117242
rs4634439 rs3094034 rs1265052 rs2596464 rs204990 rs3135377 rs6936863 rs3128923
rs12190473 rs3130112 rs3130975 rs3131622 rs2071278 rs3135376 rs3916765 rs3117230
rs201002 rs3130113 rs3095324 rs2844507 rs3131294 rs2395161 rs9461799 rs872956
rs200995 rs3094694 rs13200022 rs2244579 rs377763 rs2395164 rs2227127 rs2395351
rs200991 rs8233 rs3130564 rs2248459 rs3134926 rs2395167 rs9276429 rs3116985
rs149946 rs3095329 rs2106074 rs2248462 rs424232 rs2213580 rs9276431 rs3129248
rs149969 rs3094127 rs887464 rs2248617 rs3130311 rs3135366 rs9276432 rs10807124
rs149970 rs3094122 rs1265181 rs3099844 rs1265777 rs9268557 rs9276440 rs11171739
rs202906 rs10947091 rs12199773 rs2516422 rs9268230 rs9268560 rs9276490 rs1265566
rs17696736 rs9746695

Conclusion

Screening SNPs in case-control studies is a commonly encountered task in modern biomedical research, and the CATT and MAX statistics are among the most widely used screening measures for it. However, theoretical guarantees for the application of CATT and MAX to SNP screening have not been investigated. We fill this gap by adjusting CATTs and the MAX test and proposing screening procedures based on the adjusted statistics. The sure screening and ranking consistency properties of these screening procedures are proved. Simulation results show that when the underlying genetic model is unknown, which is often the case in practice, MAX-SIS performs best.

Despite the high efficiency of the proposed procedures, some factors affect their performance. First, numerical simulations show that when both the MAF and the sample size are small, REC-SIS, ADD-SIS, DOM-SIS, MAX-SIS and PC-SIS all perform poorly. This is because, in this situation, the number of samples carrying minor alleles is too small to provide enough information for the association analysis. Second, the value of the parameter $d$ clearly influences the performance of the different methods. We determine the value of $d$ following the previous literature. Since how to choose an optimal $d$ is not the focus of this work, we leave a more detailed analysis to future work. Third, when there are covariates to be adjusted for, new procedures need to be developed, which will be studied in future work.

Author contributions

Conceptualization, Z.J. and J.W.; methodology, Z.J., H.G. and J.W.; validation, Z.J., H.G. and J.W.; formal analysis, Z.J.; writing original draft preparation, Z.J. and J.W.; and writing review and editing, Z.J., H.G. and J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been supported by China Postdoctoral Science Foundation funded project (Grant No. 2021M700433), National Natural Science Foundation of China (NSFC) (Grant No. 12101047), and Natural Science Foundation of Hubei Province (Grant No. 2022CFB942).

Data availability

All data included in this study are available upon request from the corresponding author. To facilitate the use of the proposed methods, the code is also available upon request from the corresponding author.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-023-35929-4.

References

  • 1.Wellcome Trust Case Control Consortium (WTCCC). Genome-wide association study of 14,000 cases of seven common diseases and 3000 shared controls. Nature447, 661–678 (2007). [DOI] [PMC free article] [PubMed]
  • 2.Easton DF, Pooley KA, Dunning AM, Pharoah PD, Thompson D, Ballinger DG, Struewing JP, Morrison J, Field H, Luben R, et al. Genome-wide association study identifies novel breast cancer susceptibility loci. Nature. 2007;447:1087–1093. doi: 10.1038/nature05887.
  • 3.Zeggini E, Weedon MN, Lindgren CM, Frayling TM, Elliott KS, Lango H, Timpson NJ, Perry JR, Rayner NW, Freathy RM, et al. Replication of genome-wide association signals in UK samples reveals risk loci for type 2 diabetes. Science. 2007;316:1336–1341. doi: 10.1126/science.1142364.
  • 4.Yue WH, Wang HF, Sun LD, Tang FL, Liu ZH, Zhang HX, Li WQ, Zhang YL, Zhang Y, Ma CC, et al. Genome-wide association study identifies a susceptibility locus for schizophrenia in Han Chinese at 11p11.2. Nat. Genet. 2011;43:1228–1232. doi: 10.1038/ng.979.
  • 5.Li LC, et al. Transcriptome-wide association study of coronary artery disease identifies novel susceptibility genes. Basic Res. Cardiol. 2022;117:6. doi: 10.1007/s00395-022-00917-8.
  • 6.Li ZT, Wang B, Luo W, Xu YY, Wang JJ, Xue ZH, Niu YD, Cheng ZK, Ge S, Zhang W, Zhang JY, Li Q, Chong K. Natural variation of codon repeats in COLD11 endows rice with chilling resilience. Sci. Adv. 2022;9:eabq5506. doi: 10.1126/sciadv.abq5506.
  • 7.Thomas NJ, Walkey HC, Kaur A, Misra S, Oliver NS, Colclough K, Weedon MN, Johnston DG, Hattersley AT, Patel KA. The relationship between islet autoantibody status and the genetic risk of type 1 diabetes in adult-onset type 1 diabetes. Diabetologia. 2022;66:310–320. doi: 10.1007/s00125-022-05823-1.
  • 8.Sasieni PD. From genotypes to genes: Doubling the sample size. Biometrics. 1997;53:1253–1261. doi: 10.2307/2533494.
  • 9.Freidlin B, Zheng G, Li Z, Gastwirth JL. Trend tests for case–control studies of genetic markers: Power, sample size and robustness. Hum. Hered. 2002;53:146–152. doi: 10.1159/000064976.
  • 10.Zheng G, Freidlin B, Li Z, Gastwirth JL. Choice of scores in trend tests for case–control studies of candidate-gene associations. Biometric. J. 2003;45:335–348. doi: 10.1002/bimj.200390016.
  • 11.Sladek R, Rocheleau G, Rung J, Dina C, Shen L, Serre D, Boutin P, Vincent D, Belisle A, Hadjadj S, Balkau B, Heude B, Charpentier G, Hudson TJ, Montpetit A, Pshezhetsky AV, Prentki M, Posner BI, Balding DJ, Meyre D, Polychronakos C, Froguel P. A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature. 2007;445:881–885. doi: 10.1038/nature05616.
  • 12.Li Q, Zheng G, Li Z, Yu K. Efficient approximation of p-value of the maximum of correlated tests, with applications to genome-wide association studies. Ann. Hum. Genet. 2008;72:397–406. doi: 10.1111/j.1469-1809.2008.00437.x.
  • 13.Zheng G, Li Q, Yuan A. Some statistical properties of efficiency robust tests with applications to genetic association studies. Scand. J. Stat. 2014;41:762–774. doi: 10.1111/sjos.12060.
  • 14.Li Q, Yu K, Li Z, Zheng G. MAX-rank: A simple and robust genome-wide scan for case–control association studies. Hum. Genet. 2008;123:617–623. doi: 10.1007/s00439-008-0514-8.
  • 15.Kim J, Sohn I, Kim DDH, Jung SH. SNP selection in genome-wide association studies via penalized support vector machine with MAX test. Comput. Math. Methods Med. 2013;2013:340678. doi: 10.1155/2013/340678.
  • 16.Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B. 2008;70:849–911. doi: 10.1111/j.1467-9868.2008.00674.x.
  • 17.Fan J, Song R. Sure independence screening in generalized linear models with NP-dimensionality. Ann. Stat. 2010;38:3567–3604. doi: 10.1214/10-AOS798.
  • 18.Fan J, Feng Y, Song R. Nonparametric independence screening in sparse ultra-high-dimensional additive models. J. Am. Stat. Assoc. 2011;106:544–557. doi: 10.1198/jasa.2011.tm09779.
  • 19.Li HD, Wang RH. Feature screening for ultrahigh dimensional categorical data with applications. J. Bus. Econ. Stat. 2014;32:237–244. doi: 10.1080/07350015.2013.863158.
  • 20.Emily M. Power comparison of Cochran-Armitage trend test against allelic and genotypic tests in large-scale case–control genetic association studies. Stat. Methods Med. Res. 2018;27:2657–2673. doi: 10.1177/0962280216683979.

Data Availability Statement

All data included in this study are available from the corresponding author upon request. To facilitate use of the proposed methods, the code is also available from the corresponding author upon request.


Articles from Scientific Reports are provided here courtesy of Nature Publishing Group
