Abstract
Short insertions, deletions (INDELs) and larger structural variants have been increasingly employed in genetic association studies, but few improvements over SNP-based association have been reported. In order to understand why this might be the case, we analysed two publicly available datasets and observed that 63% of INDELs called in A. thaliana and 64% in D. melanogaster populations are misrepresented as multiple alleles with different functional annotations, i.e. where the same underlying variant is represented by inconsistent alignments leading to different variant calls. To address this issue, we have developed the software Irisas to reclassify and re-annotate these variants, which we then used for single-locus tests of association. We also integrated them to predict the functional impact of SNPs, INDELs, and structural variants for burden testing. Using both approaches, we re-analysed the genetic architecture of complex traits in A. thaliana and D. melanogaster. Heritability analysis using SNPs alone explained on average 27% and 19% of phenotypic variance for A. thaliana and D. melanogaster respectively. Our method explained an additional 11% and 3%, respectively. We also identified novel trait loci that previous SNP-based association studies failed to map, and which contain established candidate genes. Our study shows the value of the association test with INDELs and integrating multiple types of variants in association studies in plants and animals.
Author summary
In this study, we develop a method for testing multiple types of variants in genome-wide association studies. We show that a multi-allelic artefact caused by inconsistent alignments was a key obstacle for testing association of insertion and deletion polymorphisms (INDELs) and for integrated association methods, such as burden testing. To address this problem, we developed the software Irisas that synchronizes variants and integrates the impact of SNPs, INDELs and structural variants for burden testing. We re-analysed two publicly available datasets with multiple traits in A. thaliana and D. melanogaster. We identified novel loci that contain well-established candidate genes that SNP-based GWAS failed to detect; INDEL-specific QTLs generally have weak local linkage disequilibrium (LD) with nearby SNPs. We showed by simulation that multiple independent loss-of-function common-allele SNPs/INDELs were challenging for direct association tests but demonstrated that they could be recovered by integrated burden testing.
Introduction
Identifying the causal loci underlying phenotypic variance is a fundamental biological challenge. Genome-wide association studies (GWAS) are a widely used and effective methodology that tests a genome-wide set of genetic variants in different individuals for association with trait variation. Typically, single-nucleotide polymorphisms (SNPs) are genotyped, either by array, re-sequencing or imputation, and then used as markers to identify loci associated with the trait. However, the total variance explained by summing mapped quantitative trait loci (QTLs) is usually much less than the overall heritability estimated from genome-wide genetic relationships [1–3]. Many explanations have been proposed, such as attributing it to variants of small effect or of low allele frequency [4].
SNP-based GWAS rely on linkage disequilibrium (LD) to tag nearby causal variants, which might include other SNPs, insertions or deletions (INDELs) or structural variants (SVs). Though INDELs and SVs have been increasingly employed for GWAS [5–7], it is unclear whether explicitly testing those variants imperfectly tagged by SNPs will increase power. For example, in A. thaliana, a 16bp insertion and 345bp complex deletion (where 376bp have been replaced with 31bp) in the FRIGIDA (FRI) gene both have been linked to flowering time [8], an important adaptive trait in plants. Interestingly, when these two variants were genotyped using dideoxy sequencing and added to a panel of array-genotyped SNPs for genome-wide significance test, neither of them reached genome-wide significance in a traditional GWAS [9]. It is thus important to understand why such established causal INDELs and SVs were missed by GWAS.
To date, there has been a lack of methodological investigations of INDELs and SVs in GWAS. These types of variants have usually been treated and encoded in the same way as SNPs, although it is well known that it is more difficult to call INDELs and SVs accurately from short-read sequence data [10]. It is unclear whether hidden factors or different methods could improve power and thus recover novel loci. However, it has been observed that even for very short INDELs, where current sequence technology can provide fairly good accuracy in variant calling, encoding them in a consistent way for GWAS is not trivial [11]. A related issue is that of complex substitutions, i.e. where a segment of DNA is replaced by another segment; these substitutions can be represented either as integrated complex events or by combinations of many smaller atomic SNPs and INDELs. In some cases their phenotypic effects will depend on their integrated context rather than on their constituent parts. This line of argument leads to the idea of an integrated burden test, which may be particularly relevant within protein coding sequence. This is supported by our earlier observations in Arabidopsis thaliana, that changes around and within coding sequences should be considered holistically in order to avoid erroneously predicting deleterious changes in which the effect of one variant was compensated by another change nearby [12].
In this paper, we show that many INDELs are misclassified as multi-allelic with potentially different functional annotations, which undermines the association. We have developed the software Irisas (Integrated region-based variant synchronization and annotation for association studies) to reclassify and re-annotate these variants. In addition, we propose a robust measure that integrates the predicted functional impact of SNPs, INDELs, and SVs for burden testing. The single-locus association test using our synchronized INDELs, together with the integrated burden test, explained a large additional proportion of phenotypic variance. We thereby map novel loci that SNP-based GWAS have failed to associate and which contain established candidate genes. Collectively, our work demonstrates a reliable framework to leverage INDELs for GWAS, and establishes the value of integrated analysis of multiple types of variants in association studies in plants and animals.
Results
Variant synchronization and integrated burden test for GWAS
We re-analysed 106 A. thaliana phenotypes [9] and 153 D. melanogaster phenotypes measured in DGRP inbred lines [13, 14]. The A. thaliana dataset contained 177 inbred accessions from the 1001 genomes project [15] (S1 and S2 Figs and S1 Table). The DGRP population consisted of 212 strains for which Illumina short-read sequences were available. We genotyped each sample using Illumina short-reads with IMR/DENOM [12], an algorithm we developed previously which combines iterative short-read mapping and de novo assembly for reliable variant calling and reassembly. By comparing with a long read (Pacific Biosciences) based de novo assembly of the A. thaliana accession Ler-0 [16], we estimated that 3.1% of variants were incorrectly called and a further 2.3% of variants were mistakenly called as reference for accessible regions [15, 17] (S1 Text). We then compared IMR/DENOM’s calls around the well-characterised FRI complex of variants to the dideoxy sequences from 18 A. thaliana accessions. We confirmed that both a 16bp insertion and a 345bp complex deletion were correctly identified, including 31bp novel sequence within the complex deletion [12]. The high accuracy of our INDEL calls allowed us to determine these complex variants accurately across large populations, and thus directly test them for association. The local densities of SNPs and INDELs were correlated with each other (p-value < 2.2e-16 with Kendall rank correlation test; Fig 2A inner cycles and Fig 2B). In addition, the INDEL/SNP ratio in CDS regions was significantly lower than that on the whole genome level (p-value < 2.2e-16, paired Wilcoxon signed rank test, Fig 2C).
When aligning a divergent sequence to a reference genome, alignment isomorphs frequently occur, where essentially the same sequence is aligned in different ways (Fig 1A). In the context of variant calling for the A. thaliana or D. melanogaster genomes under study, this ambiguity results in false multi-allelic calls for the same allele, even if the surrounding sequence is the same between strains. In support of this, we observed that ~63.18% and ~64.49% of INDELs from the population (including insertions and deletions >1kbp, which are usually categorized as SVs) had more than one alignment isomorph among the 177 A. thaliana ecotypes and the 212 inbred lines of D. melanogaster respectively. We characterised common scenarios where false multi-allelic calls occur due to alignment isomorphism (Fig 1A and S3 Fig). These ambiguities would be expected to reduce the power of association testing since unnecessary degrees of freedom are used to estimate the effects of apparently distinct but, in reality, identical alleles.
To mitigate this problem, we developed the software Irisas to synchronize INDEL and SNP variant calls. Irisas aligned haplotypes of samples using multiple sequence alignment (MSA) with overlapped windows, and re-called variants in a consistent and synchronized manner. We found that on average 39.43% of INDELs in A. thaliana and 36.30% in D. melanogaster per line were reassigned to a shared allele, compared to the original variants. In contrast, variant calling ambiguities affected only 2.40% of SNPs in A. thaliana and 3.53% in D. melanogaster per sample, the majority of those having at least one INDEL within 10bp (Fig 1B).
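The alignment-isomorph problem can be illustrated with a toy example. The sketch below is not the Irisas algorithm (which relies on MSA); it only shows, for a hypothetical reference and hypothetical deletion calls, that calls at different coordinates inside a repeat can yield the same haplotype and would be counted as distinct alleles if compared by coordinate alone.

```python
from collections import defaultdict

def apply_deletion(ref, start, length):
    """Return the haplotype obtained by deleting `length` bases at 0-based `start`."""
    return ref[:start] + ref[start + length:]

# Hypothetical reference containing a CA repeat, and hypothetical
# per-line deletion calls given as (start, length).
REF = "GATCACACAGTT"
calls = {
    "line_A": (3, 2),   # deletes "CA" at the start of the repeat
    "line_B": (5, 2),   # deletes "CA" one repeat unit later
    "line_C": (4, 2),   # deletes "AC", shifted by one base
    "line_D": (9, 2),   # a genuinely different deletion outside the repeat
}

# Comparing calls by coordinate treats every distinct (start, length) as an allele.
naive_alleles = set(calls.values())

# Comparing the resulting haplotypes groups calls that are the same variant.
groups = defaultdict(list)
for line, (start, length) in calls.items():
    groups[apply_deletion(REF, start, length)].append(line)

print(len(naive_alleles), "naive alleles;", len(groups), "distinct haplotypes")
```

In this toy case three of the four coordinate-level alleles describe one underlying deletion, which is the kind of redundancy that synchronization is meant to remove.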
This analysis prompted us to create a robust burden test, in order to evaluate the joint effects of INDELs and SNPs in coding regions [18]. Traditional burden testing is based on combining the annotated effects of SNPs. Each SNP is classified (e.g. nonsynonymous, nonsense and benign) based on functional annotation of the reference genome. Our previous work [12] had shown that this approach overestimates the numbers of deleterious coding variants because gene models vary, particularly around splice sites. Irisas integrated all the variants (i.e. SNPs, INDELs and SVs) so that the functional impact of each variant was evaluated collectively and conservatively. It focused on variants or groups of variants which change open reading frames (by INDELs/SVs), destroy splice sites or cause premature stop codons (by INDELs/SVs or SNPs). We call these events “open reading frame state” (ORFS) changes hereafter. The existence of an ORFS change in a gene implies a change in protein sequence and possibly of its function. The results of our ORFS algorithm on 18 A. thaliana accessions were 98.9% concordant when compared to the previous annotation (S2 Table) which integrated RNA-seq and ab initio gene prediction [12].
INDELs and ORFSs explain a large proportion of phenotypic variance not explained by SNPs
We performed GWAS for A. thaliana and D. melanogaster [13] using linear mixed models. To adjust for the population structure, two kinship matrices were constructed, one using SNPs only and the other using all variants. Since only trivial differences for the association tests were observed using either matrix, we present only the SNP-only kinship matrix-based results (S8 Fig). As phenotypes were measured on different subsets of lines and the numbers of variants used for association tests changed accordingly, we calculated genome-wide significance thresholds for each phenotype separately using Bonferroni correction and by permutation tests, with 1000 permutations per phenotype (S1 Text). Thus, four thresholds were calculated for each phenotype: 1) a genome-wide significance threshold derived from Bonferroni correction and a genome-wide significance threshold from permutation tests using SNPs only, which serve as the benchmark for comparison, 2) an integrated Bonferroni significance threshold and an integrated permutation threshold using all Irisas variant calls. We found the integrated Bonferroni thresholds were only slightly higher than the SNP-only Bonferroni thresholds in all phenotypes, as the total numbers of tests increased by only 11–13% in both species. Integrated permutation thresholds were also slightly higher than SNP-only permutation test thresholds. In the subsequent analyses, the integrated permutation thresholds were used unless expressly stated otherwise. The association results for A. thaliana and D. melanogaster are summarized in Table 1. Manhattan and quantile-quantile plots for the 34 phenotypes in A. thaliana and 21 phenotypes in D. melanogaster with at least one QTL passing the permutation threshold are presented in S15–S69 Figs. For simplicity, the genome-wide significant loci detected with SNPs, INDELs and ORFSs are referred to as snpQTL, indelQTL and orfsQTL respectively hereafter.
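The two kinds of threshold described above can be sketched as follows. This is a simplified illustration, not the procedure in S1 Text: it uses the maximum absolute marker-phenotype correlation as the test statistic instead of a mixed-model p-value, and the genotype matrix, phenotype, and function names are all hypothetical.

```python
import numpy as np

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p-value threshold under Bonferroni correction."""
    return alpha / n_tests

def max_abs_correlation(genotypes, phenotype):
    """Largest |Pearson r| between the phenotype and any variant (columns)."""
    g = genotypes - genotypes.mean(axis=0)
    y = phenotype - phenotype.mean()
    denom = np.sqrt((g ** 2).sum(axis=0) * (y ** 2).sum())
    r = (g * y[:, None]).sum(axis=0) / denom
    return float(np.abs(r).max())

def permutation_threshold(genotypes, phenotype, n_perm=1000, alpha=0.05, seed=0):
    """(1 - alpha) quantile of the genome-wide maximum statistic under permutation."""
    rng = np.random.default_rng(seed)
    null_max = [
        max_abs_correlation(genotypes, rng.permutation(phenotype))
        for _ in range(n_perm)
    ]
    return float(np.quantile(null_max, 1 - alpha))

# Hypothetical data: 60 inbred lines, 200 biallelic variants coded 0/1.
rng = np.random.default_rng(1)
genotypes = (rng.random((60, 200)) < 0.3).astype(float)
phenotype = rng.normal(size=60)

snp_only = bonferroni_threshold(1000)
integrated = bonferroni_threshold(1120)   # 12% more tests after adding INDELs
perm = permutation_threshold(genotypes, phenotype, n_perm=200)
```

On the -log10(p) scale, adding 12% more tests raises the Bonferroni threshold by only about 0.05 units, which matches the observation that the integrated thresholds are only slightly higher.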
Table 1. Summary of GWAS results with three types of genotypic variants.
Species | Phenotypes with at least one significant variant | Phenotypes with significant SNP | Phenotypes with significant INDEL | Phenotypes with significant ORFS | Mean variance explained by SNPs | Mean extra variance explained by INDELs | Mean extra variance explained by ORFSs |
---|---|---|---|---|---|---|---|
A. thaliana | 34 | 24 | 17 | 12 | ~26.09% | ~4.53% | ~6.40% |
D. melanogaster | 21 | 18 | 5 | 0 | ~18.52% | ~2.60% | ~0% |
Fig 2A plots the snpQTLs, indelQTLs and orfsQTLs of all traits (outer three layers, respectively). Approximately 25.0% of snpQTLs are also genome-wide significant with INDELs or ORFSs (Fig 2D and 2E and S5A and S5B Fig). These included the association loci from 5 phenotypes (AvrPphB, AvrRpm1, avrB, FRI, LES) highlighted in the original publication [9] in A. thaliana and one phenotype in D. melanogaster (5-pentacosene concentration in males). All these loci showed very strong LD (S14–S19 and S49 Figs). No genome-wide significant association for long INDELs (SVs) was found, concordant with the low power of SVs for GWAS analysis observed in mouse and human studies [19, 20].
We suspected that loci identified only by SNPs or only by INDELs were due to incomplete LD. To this end, we analysed the LD patterns of SNPs and INDELs with MAF >10% and missing rate <50%. Our analysis was solely based on A. thaliana because the very rapid LD decay in D. melanogaster made it difficult to discern any effect. LD among SNPs in A. thaliana was on average stronger than LD among INDELs but the LD half-decay distances, where LD falls to half of its maximum value, were roughly the same (~2-3kbp). LD was stronger at loci where a cluster of significant INDELs or SNPs was detected. LD around isolated indelQTLs was usually weak (Fig 2F). This indicated that weak LD regions, such as the boundaries of haplotype blocks or at recombination hotspots, could be the cause of failure for genome-wide significant association for a particular phenotype in SNP-based GWAS.
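As a concrete reference for the LD measure used here, the sketch below computes pairwise r2 between two biallelic variants in inbred lines, coded 0/1 with missing calls as NaN. The function name and the coding scheme are assumptions for illustration and are not taken from the analysis pipeline.

```python
import numpy as np

def r_squared(a, b):
    """Squared Pearson correlation between two 0/1 genotype vectors.

    Lines with a missing call (NaN) in either variant are ignored.
    Returns NaN if either variant is monomorphic among the remaining lines.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    keep = ~(np.isnan(a) | np.isnan(b))
    a, b = a[keep], b[keep]
    if a.size == 0 or a.std() == 0 or b.std() == 0:
        return float("nan")
    return float(np.corrcoef(a, b)[0, 1] ** 2)

# Hypothetical genotypes for an INDEL and two nearby SNPs.
indel = [0, 0, 1, 1]
snp_linked = [0, 0, 1, 1]      # perfectly tags the INDEL
snp_unlinked = [0, 1, 0, 1]    # carries no information about the INDEL
```

An isolated indelQTL in a weak-LD region corresponds to the second case: no nearby SNP has a high r2 with the causal INDEL, so a SNP-only scan cannot tag it.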
We assessed the additional contribution of INDELs and ORFSs to the phenotypic variance. Since genotypic variants used for the association are not totally independent of each other, we estimated the effect size (i.e. variance explained), using a mixed model [21] with a SNP-based kinship matrix. We first estimated the effect size of snpQTLs and then added indelQTLs and orfsQTLs as additional independent variables (Fig 2G and Table 1). Compared to SNP-based association studies which explained on average 26.09% and 18.52% of variance for multiple traits in A. thaliana and D. melanogaster, INDELs and ORFSs explained an additional 10.93% and 2.60% phenotypic variance, respectively. Here we did not evaluate the heritability using a restricted maximum likelihood (REML) model suggested by GCTA [22]. The model showed inflated log-likelihood values for many phenotypes for SNP-based analysis, probably due to the small sample size we used [23, 24] or “synthetic” effects, which might be responsible for “ghost” associations in A. thaliana [25] and rice [26].
Analysis of INDEL-specific and ORFS-specific genome-wide significant loci
We next inspected certain INDEL-specific QTLs in more detail. For the phenotype “days before the bolt reach 5cm” (S24 Fig), the highest-scoring locus contains an isolated variant, a 1bp insertion on chromosome 5 (Fig 3A, B). The insertion, with allele frequency 0.125 (9 of 72 genotyped samples), shows very low average LD with nearby variants (Fig 3D). The most strongly linked SNP within 20kbp of the insertion is located 206bp upstream with R2 = 0.64 (p-value = 0.433 for the association test). The insertion is 249bp upstream of the start codon of TERMINAL FLOWER 1 (TFL1), a region implicated in the regulation of the expression of TFL1 [27–29] (Fig 3E and 3F, S6A Fig and S4 Table). Those accessions containing the insertion originate mainly from a small region in Sweden (S7 Fig).
In D. melanogaster, a 3 base-pair deletion was significantly associated with the concentration of 11- & 9-methyltricosane (11- & 9-Me-C23) in females (S51 Fig). These are pheromones in the epicuticular wax layer of abdominal tergites [30]. Only the role of 7-methyltricosane has been determined while those of 11- & 9-methyltricosane are unclear [31]. The associated deletion we identified is in the first intron or promoter region of G-oα47A that encodes G protein α o subunit (Gαo), which may affect pheromone signalling [32].
We next hypothesized that either collective sets of mutations, or loss of function via independent mutations on the same gene [33–36], might underlie orfsQTLs. We developed an integrated burden analysis to test this hypothesis. In A. thaliana, 2481 genes containing at least two distinct loss-of-function variants were identified using a graph theoretic algorithm (Materials and Methods), a large potential gene set for integrated burden testing. Among them is FRI with a 16bp insertion and a 345bp complex deletion both affecting the protein’s functionality [8, 37]. FRI regulates FLOWERING LOCUS C (FLC) and thereby flowering time in multiple plant species [8, 38]. Interestingly, FRI did not reach genome-wide significance in SNP-based GWAS for flowering time even when these two INDELs were genotyped and tested specifically [9, 39]. Since both variants were common (allele frequency 14.12% and 16.38% respectively within the population), the loss of power here was not due to many rare variants with small effects. The distribution of loss-of-function FRI alleles in the population showed no obvious pattern (S7 Fig). Further simulations confirmed that independent loss-of-function alleles pose a big challenge to GWAS (S1 Text) especially for small population samples, but have a higher power of detection by burden testing. In our study, FRI achieved genome-wide significance for many flowering time phenotypes (S20–S29 Figs).
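The loss of power from independent loss-of-function alleles, and its recovery by collapsing them, can be shown with a small deterministic example. This is only a sketch with hypothetical genotypes and phenotype values; it uses a plain squared correlation, not the mixed-model burden test or the simulations in S1 Text.

```python
import numpy as np

def r2(x, y):
    """Squared Pearson correlation, used here as a stand-in for association strength."""
    return float(np.corrcoef(x, y)[0, 1] ** 2)

# Hypothetical population of 12 inbred lines with two independent
# loss-of-function (LoF) alleles in the same gene (1 = allele present).
lof_a = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
lof_b = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0])

# Burden genotype: the gene is disrupted if it carries either LoF allele.
burden = lof_a | lof_b

# Hypothetical noise-free phenotype that depends only on whether the gene is functional.
phenotype = np.where(burden == 1, 10.0, 20.0)

single_a = r2(lof_a, phenotype)
single_b = r2(lof_b, phenotype)
collapsed = r2(burden, phenotype)
```

Each allele tested alone explains only a third of the variance here, because lines carrying the other allele sit in the "wild-type" group with the same phenotype, whereas the collapsed genotype explains all of it.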
INDELs and ORFSs contribute to expression variation
Next, we investigated whether INDELs and ORFSs associate with gene expression variation [40, 41]. To this end, 628 A. thaliana natural accessions with RNA-seq from the 1001 Epigenomes Project [42] were chosen. The genomic variants were called using genomic sequence data from the 1001 genomes project and assembled with IMR/DENOM, and then synchronized and ORFS called using Irisas. Overall, ~3.87 million SNPs, 1.96 million INDELs and ORFSs of 14 thousand transcripts passed our quality checks for further analysis. We tested the association for the expression of 19,844 genes using linear mixed models. Among them, 10,508 genes had at least one eQTL that reached the integrated permutation test threshold (FDR ≤ 0.05).
Inspection of these eQTLs revealed that 16.28% of expression variance was explained by eQTLs from SNPs, 13.90% by eQTLs from INDELs and 2.02% by eQTLs from ORFSs. A large portion of the explained expression variance was shared. INDELs and ORFSs explained 0.81% extra variance on average (Fig 4A, S9 Table, S11C–S11H Fig). This indicated the pronounced effect of the strong LD between SNPs and INDELs. SNPs, INDELs and ORFSs were all predominantly associated as cis-eQTLs (i.e. within 30kbp of the gene) (Fig 4B and 4C). Additionally, the majority of eQTLs were identified outside of coding regions (S11I Fig). The ratio of the number of eQTLs inside exons to those outside is significantly higher than genome-wide levels for both SNPs and INDELs (Fisher's exact test, p-value<0.001) (Fig 4D), indicating that sequence polymorphisms inside a gene were also playing a role in gene expression regulation, possibly through nonsense-mediated decay [43] or sRNA pathways [44].
Discussion
We devised the software Irisas to perform GWAS based on INDELs and ORFSs for a large number of phenotypes in both plant and animal models. We focused on sequenced inbred populations in both species in which variant-calling is more robust. Our results show that INDELs and burden testing using ORFSs are both capable of revealing associations with causal variants that would not have been detected by SNPs alone. There are two main reasons for this successful recovery of missing heritability. The first is that we removed multi-allelic artefacts caused by inconsistent alignment isomorphs, as shown in the comparison of INDEL-association with or without the proposed variant synchronization procedure (S1 Text, S13 Fig). The second stems from our novel ORFS calling procedure, which ameliorated the lack of power of independent tests of multiple common alleles that cause loss-of-function of the same transcript and thus have the same functional effect (S1 Text).
Testing for INDELs as well as SNPs increases the number of tests by only 11–13%. Since many INDELs are in LD with SNPs, the effective number of tests will increase modestly, resulting in only slightly higher thresholds for significance, and a correspondingly larger sample size is required to maintain power. This is generally accepted and the trend across association studies in all species is to increase sample sizes and to test more exotic variants including INDELs. An alternative, where additional samples are unavailable, would be to perform an integrated analysis of certain candidate regions that are marginally significant with SNPs at a certain threshold. This strategy could uncover the loci where SNPs are only partially linked with causal INDELs, with the possibility of losing the loci containing isolated causal INDELs.
Burden analysis is often regarded as a useful supplement to independent SNP association. While SNP burden tests have been under intensive investigation, the effect of INDELs has been largely overlooked. In our GWAS, ORFSs explained a significant additional fraction of phenotypic diversity either by identifying novel loci or by increasing the power of detected loci. Notably, our tests also revealed that the loss-of-function alleles in the reference genome could seriously affect the ORFS-based association analysis. For example, in FRI, where the reference genome Col-0 contains a non-functional version of the gene (S10 Table), the locus cannot achieve genome-wide significance unless the functional version of the gene is used to determine ORFSs. This indicates that expert knowledge of gene functionality is important for burden testing. We anticipate that more advanced burden analyses will further help understand the contribution of INDELs to phenotypic diversity.
Materials and methods
Datasets used
(a) We chose a set of Arabidopsis thaliana accessions whose phenotypes had been investigated previously [9] and whose genomes had been sequenced through the 1001 genomes project [15]. Specifically, we checked the overlap of those two datasets by comparing the ecotype IDs. For those lines whose phenotypes had been measured but whose genome sequences were not available in the 1001 genomes project, we used the genome sequence from the plants with the same accession name if the SNPs identified from the SNP array were consistent with the Illumina sequence data. The final list of samples contains 177 accessions (S1 Table) from worldwide regions (S1 Fig). The public “TAIR10” genome [45] was used as the reference. (b) For the expression quantitative trait loci (eQTL) experiment, the leaf transcriptomes of 728 accessions were obtained from the public database [42].
The Drosophila Genetic Reference Panel (DGRP) consists of 212 inbred D. melanogaster (fruit fly) lines whose whole genomes were sequenced using Illumina shotgun sequencing. A wide range of phenotypes have been measured and released for subsets of those lines. The FB6.11 genome sequence was used as the reference.
Genomic variant calling with IMR-DENOM
The variant calling was performed with IMR-DENOM v 0.5, which integrates iterative read mapping and de novo variant calling [12]. Different k-mers were chosen for the de novo assembly part by setting the parameter “-k” based on the read length and coverage. For A. thaliana, the default parameter was used when the read length was smaller than 75; “-k 65” when the read length was larger than 100 and the coverage was larger than 30X; and “-k 45” otherwise. For Drosophila melanogaster, “-k 45” was used.
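The k-mer selection rule above can be written as a small helper. This is a sketch only: the function name and the species labels are hypothetical, and only the “-k” values and the read-length and coverage cut-offs come from the text. An empty list stands for "use the tool's default".

```python
def choose_kmer_option(species, read_length, coverage):
    """Return the "-k" arguments implied by the rule in the text.

    An empty list means the default k-mer setting is left unchanged.
    """
    if species == "D. melanogaster":
        return ["-k", "45"]
    # A. thaliana rules
    if read_length < 75:
        return []
    if read_length > 100 and coverage > 30:
        return ["-k", "65"]
    return ["-k", "45"]
```

For example, `choose_kmer_option("A. thaliana", 101, 40)` selects “-k 65”, while the same read length at 20X coverage falls back to “-k 45”.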
Integrated region-based variant synchronization
As demonstrated in Fig 1 and S3 Fig, the same underlying clustered variants could be presented as inconsistent alignments. To synchronize variants, we used multiple sequence alignment (MSA). However, the computational complexity of multiple sequence alignment increases exponentially with the sequence length and number of samples. It is difficult, if not impossible, to directly apply MSA to a whole chromosome in a large population study.
To alleviate the computational load of MSA, we propose a sliding-window-based scheme. In detail, our algorithm consists of three steps: 1) the reference genome was split into windows of 50,000 bases overlapping by 1,000 bases (parameters can be tuned in Irisas), and within each window each haplotype sequence was assembled with its variants; 2) multiple sequence alignment was performed for all haplotypes and the reference genome, and variants were called for each haplotype based on the MSA. Irisas used MAFFT v7.213 [46] (parameter --auto) by default for the MSA, though other MSA software is also supported; 3) conflicts within the overlapped regions were resolved and the variants were integrated. Below we explain step 3 in detail.
We called the variants from the MSA base pair by base pair; for example, a 10 base-pair deletion was treated as 10 atomic one base-pair deletions. This allowed us to resolve conflicts within the overlapped regions easily based on the coordinates. When two atomic variants within the overlapped region of two consecutive windows were in conflict, the one closer to the center of its window was chosen. The atomic variants were then linked together. For example, two 1bp deletions at adjacent positions were merged as a single 2bp deletion. Overlapping INDELs were treated independently as two variants.
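The windowing and merging steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the Irisas source: coordinates are 0-based and half-open, only deletions are merged, ties between windows go to the left window, and all function names are hypothetical.

```python
def make_windows(chrom_length, size=50_000, overlap=1_000):
    """Split [0, chrom_length) into windows of `size` that overlap by `overlap` bases."""
    step = size - overlap
    windows = []
    start = 0
    while True:
        end = min(start + size, chrom_length)
        windows.append((start, end))
        if end == chrom_length:
            break
        start += step
    return windows

def pick_call(pos, call_left, win_left, call_right, win_right):
    """Of two conflicting atomic calls at `pos`, keep the one nearer its window center."""
    center_left = (win_left[0] + win_left[1]) / 2
    center_right = (win_right[0] + win_right[1]) / 2
    if abs(pos - center_left) <= abs(pos - center_right):
        return call_left
    return call_right

def merge_atomic_deletions(positions):
    """Link adjacent 1bp deletions into (start, length) deletions."""
    runs = []
    for pos in sorted(set(positions)):
        if runs and pos == runs[-1][0] + runs[-1][1]:
            runs[-1][1] += 1          # extends the current run by one base
        else:
            runs.append([pos, 1])     # starts a new deletion
    return [tuple(run) for run in runs]

windows = make_windows(120_000)
merged = merge_atomic_deletions([100, 101, 105, 106, 107])
```

Calling variants per base makes the overlap resolution a simple per-coordinate choice; the atomic calls are linked back into INDELs only after every position has a single, consistent call.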
Genes’ ORF state annotation for association studies
As described previously in our publication [12], alternative gene models often restore protein-coding in a natural accession. In addition, insertions and deletions can be called in different forms at different genomic positions, as shown in Fig 1A. Evaluating a gene’s ORFS based on each mutation independently would cause serious problems. The de novo annotation method suggested in the previous study [12] was computationally intensive and required additional transcriptome sequence data, which could be unavailable in certain cases. As re-annotation indicated that the conservation of the protein coding had a dominant effect [12], we implemented a light-weight algorithm in Irisas to detect the change of ORFS in a gene. The algorithm includes three steps: (a) obtaining the assembled haplotype of the gene; (b) aligning both the CDS and the protein sequences of the reference to the haplotype sequence and annotating the gene structure accordingly; (c) comparing the two gene models from step (b) with the direct lift-over result and choosing the most conserved version for the ORFS. The details are as follows:
- CDS based orthologous annotation
The genome sequences of each gene from each accession were extracted from the assembled sequence including 1kbp upstream and 1kbp downstream regions. The CDS sequences of reference annotation were mapped to the genome sequence using exonerate (V2.2.0) with parameters “--maxintron 30000 --model est2genome -i -10 --score 10 --bestn 1 --minintron 10” (intron length not less than 10 and not more than 30000, penalty -10 for introduction of an intron, and only report the best result with minimum score 10).
- Protein based orthologous annotation
The protein sequences of TAIR10 were mapped to haplotype sequences using exonerate with parameters: “--bestn 1 --maxintron 30000 --intronpenalty -10 --model protein2genome --percent 10 --score 10 --minintron 10”.
- Final annotations for the ORFS
The gene annotations generated with the above two methods and the annotation from lift over were checked against the following criteria:
- Are the start codon and the stop codon intact?
- Has any of the splice sites been interrupted? A splice site is regarded as intact if its sequence is the same as the reference sequence, or follows the GT-AG rule [47], or is one of the combinations GC-AG, GG-AG, GT-TG, GT-CG, CT-AG.
- Is there any premature stop codon?
- Is the full length of the CDS (protein coding sequence) of the gene divisible by 3?
A gene’s ORFS was regarded as interrupted if each of its three annotations failed at least one of these criteria; otherwise it was regarded as ORFS-conserved.
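The checks listed above can be sketched as a small function. This is an illustration under stated assumptions rather than the Irisas implementation: an intact start codon is taken to mean ATG, introns are represented only by their (donor, acceptor) dinucleotides, and the function and argument names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
# GT-AG rule plus the additional donor/acceptor combinations listed in the text.
ALLOWED_SPLICE = {
    ("GT", "AG"), ("GC", "AG"), ("GG", "AG"),
    ("GT", "TG"), ("GT", "CG"), ("CT", "AG"),
}

def model_is_intact(cds, introns=(), ref_introns=()):
    """Return True if one gene model passes all four checks."""
    cds = cds.upper()
    if len(cds) < 6 or len(cds) % 3 != 0:          # CDS length divisible by 3
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:   # start and stop intact
        return False
    if any(codon in STOP_CODONS for codon in codons[1:-1]):   # premature stop
        return False
    for i, pair in enumerate(introns):             # splice sites
        same_as_ref = i < len(ref_introns) and pair == ref_introns[i]
        if not (same_as_ref or pair in ALLOWED_SPLICE):
            return False
    return True

def orfs_state(models):
    """A gene is ORFS-conserved if any of its candidate gene models is intact."""
    if any(model_is_intact(**model) for model in models):
        return "conserved"
    return "interrupted"

# Hypothetical models for one gene: CDS-based, protein-based and lift-over annotation.
models = [{"cds": "ATGGCTAA"}, {"cds": "ATGGCCTAA"}, {"cds": "ATGTAAGCCTAA"}]
state = orfs_state(models)
```

Requiring every annotation to fail before calling a gene interrupted is the conservative choice: a frameshift in the lift-over model does not count if an alternative gene model restores the reading frame.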
For any gene whose haplotype in the reference genome is not functional, a working version from another natural accession was chosen. For example, because FRI in the reference genome Col-0 is not functional, we chose accession Eden-1 (ecotype id 6009) as the reference.
Genomic variants encoding and filtering
SNPs that were biallelic within the population were used for association. A SNP was treated as missing if (1) the number of reads supporting it was less than 2; (2) it was heterozygous; or (3) it was physically overlapped by an INDEL.
INDELs from different accessions could occur at the same position or overlap with each other (S12C Fig). Here only INDELs/SVs with the same position and same length were encoded as the same variant. An INDEL would be genotyped as missing in an accession if a different INDEL were found to occupy a part of its positions in that accession. In total, 3,820,962 INDELs in A. thaliana and 2,290,193 INDELs in D. melanogaster were encoded.
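The INDEL encoding rule can be sketched for the simplest case. The example below handles deletions only, represented as hypothetical (start, length) pairs in 0-based coordinates; insertions and the actual Irisas data structures are not modelled. Genotypes are 1 (carries the variant), 0 (does not), or None (missing).

```python
def encode_deletions(calls):
    """Build a genotype table from per-accession deletion calls.

    calls: dict mapping accession -> set of (start, length) deletions.
    Returns dict mapping each distinct deletion -> {accession: 1, 0 or None}.
    """
    variants = sorted(set().union(*calls.values()))
    table = {}
    for variant in variants:
        v_start, v_end = variant[0], variant[0] + variant[1]
        row = {}
        for accession, deletions in calls.items():
            if variant in deletions:
                row[accession] = 1        # same position and same length
            elif any(s < v_end and v_start < s + l for s, l in deletions):
                row[accession] = None     # a different deletion overlaps: missing
            else:
                row[accession] = 0
        table[variant] = row
    return table

# Hypothetical calls for four accessions.
calls = {
    "acc1": {(100, 5)},
    "acc2": {(100, 5)},
    "acc3": {(102, 4)},
    "acc4": set(),
}
genotypes = encode_deletions(calls)
```

Setting overlapped accessions to missing avoids counting them as carriers of the reference allele at a site where their sequence is in fact altered.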
Test for the existence of homogeneous lines
Homogeneous lines could affect the power of association analysis. To filter out homogeneous lines, the identity by state (IBS) matrix was constructed for the population using PLINK [48] (v 1.9) with SNPs. In A. thaliana, SNPs passing the following procedures would be used to construct IBS matrix:
- SNPs located within centromere regions were removed, as recombination is rare and variant calling is less reliable in these regions due to a large proportion of duplications;
- SNPs with minor allele frequency (MAF) less than 5% or missing rate higher than 10% were dropped;
- Nearby highly linked SNPs were pruned with the option --indep-pairwise 2000 1000 0.9 (window size 2000bp, window overlap 1000bp; of each pair of SNPs with r2 > 0.9, only one was kept).
For D. melanogaster, we used the following procedures:
- SNPs within 2L:0.4Mb-14.9Mb, 2R:9Mb-18Mb and 3R:6Mb-27Mb were excluded because these regions contain major inversions;
- SNPs with MAF less than 5% or missing rate higher than 20% were dropped;
- Nearby highly linked SNPs were pruned with the option --indep-pairwise 2000 1000 0.9.
In A. thaliana, within each group of accessions with pairwise IBS > 0.9, we ranked the samples first by ecotype ID and then by sequencing quality. That is, in a homogeneous group, an accession that shared its accession name with another accession but had a different ecotype ID was removed preferentially. D. melanogaster lines were ranked by sequencing quality only. The accession with the highest rank in each homogeneous group was kept for the subsequent GWAS analysis.
We used the number of trustable ORFSs as an indicator of the sequencing quality of each sample. Accessions sharing a high IBS index are expected to share a very similar number of trustable ORFSs. An ORFS was treated as trustable if it met the following criteria:
Coverage of every base pair of the CDS region was equal to or larger than 1
All INDELs physically falling into the CDS region could be confirmed by both de novo assembly and reference-guided assembly
A larger number of trustable ORFSs was taken as an indication of higher sequencing quality.
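The selection of one representative per homogeneous group can be sketched as below. This is an illustration under stated assumptions: a homogeneous group is taken to be a connected component of the graph linking accessions with pairwise IBS > 0.9, only the sequencing-quality ranking (number of trustable ORFSs) is modelled, and the function names and input layout are hypothetical. The ecotype-ID ranking used for A. thaliana is not included.

```python
from typing import Dict, List, Tuple


def homogeneous_groups(ibs: Dict[Tuple[str, str], float],
                       threshold: float = 0.9) -> List[List[str]]:
    """Group accessions connected by pairwise IBS above the threshold."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (a, b), value in ibs.items():
        root_a, root_b = find(a), find(b)
        if value > threshold and root_a != root_b:
            parent[root_a] = root_b

    groups: Dict[str, List[str]] = {}
    for accession in list(parent):
        groups.setdefault(find(accession), []).append(accession)
    return [sorted(group) for group in groups.values()]


def keep_representatives(groups: List[List[str]],
                         trustable_orfs: Dict[str, int]) -> List[str]:
    """Keep, per group, the accession with the most trustable ORFSs."""
    return [max(group, key=lambda acc: trustable_orfs[acc]) for group in groups]


# Toy pairwise IBS values and trustable-ORFS counts (made-up numbers).
ibs = {("a", "b"): 0.95, ("a", "c"): 0.50, ("b", "c"): 0.60}
trustable_orfs = {"a": 100, "b": 120, "c": 90}
kept = keep_representatives(homogeneous_groups(ibs), trustable_orfs)
print(sorted(kept))
```

Here `a` and `b` form one homogeneous group and `b` is retained because it has more trustable ORFSs; `c` is a group of its own.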
After filtering, 207 D. melanogaster lines remained for GWAS analysis. Since different sets of A. thaliana accessions were used in different phenotyping experiments, filtering was performed for each phenotype separately with the same IBS matrix. For each phenotype, variants were filtered from the original full variant dataset using the phenotype-specific accession list.
GWAS analysis
For A. thaliana, the phenotypic values were transformed according to the original report [9]. The genotypic variants were filtered by MAF (≥ 0.1, as in Atwell et al. [9]), missing rate (≤ 0.5) and minor allele number (≥ 5). Variants inside centromere regions were also excluded from the subsequent association tests. We used the EMMAX pipeline for association tests. To correct for population structure, the kinship matrices were generated with the EMMAX-BN (Balding-Nichols) method. A modified version of EMMAX was used when analysing INDELs. We excluded an accession from the test of a variant if it was genotyped as missing at that variant.
In D. melanogaster, the genotypic variants were filtered with MAF ≥ 0.05, missing rate ≤ 0.2 and minor allele number ≥ 8 before being used for association analysis. To estimate the kinship matrix, the SNP dataset was further filtered by LD pattern (--indep-pairwise 1000 500 0.2 in PLINK), and SNPs within 2L:0.4Mb-14.9Mb, 2R:9Mb-18Mb and 3R:6Mb-27Mb were excluded because these regions are covered by major inversions. The phenotypes were adjusted for the effects of Wolbachia infection and major inversions (In(2L)t, In(2R)NS, In(3R)P, In(3R)K, and In(3R)Mo) according to the DGRP2 paper [13]. The adjusted values were then transformed with the WarpedLMM [49] package. We trained WarpedLMM (with default settings) using the same SNP dataset used for kinship matrix construction. The transformed phenotype values were analysed with the association algorithm FAST-LMM (v 2.0) [50]. The Manhattan and QQ plots were drawn with custom R [51] scripts.
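The variant filters for the two species can be illustrated with a short sketch. It assumes biallelic genotypes of inbred lines coded as 0, 1 or `None` (missing) and computes MAF among non-missing calls; this coding and the function name `passes_filters` are assumptions made for illustration, not part of the original pipeline.

```python
from typing import List, Optional


def passes_filters(genotypes: List[Optional[int]],
                   min_maf: float,
                   max_missing: float,
                   min_minor_count: int) -> bool:
    """Apply MAF, missing-rate and minor-allele-number filters to one variant."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return False
    missing_rate = 1 - len(called) / len(genotypes)
    alt_count = sum(called)
    minor_count = min(alt_count, len(called) - alt_count)
    maf = minor_count / len(called)
    return (maf >= min_maf
            and missing_rate <= max_missing
            and minor_count >= min_minor_count)


# Thresholds as stated in the text for each species.
A_THALIANA = dict(min_maf=0.1, max_missing=0.5, min_minor_count=5)
D_MELANOGASTER = dict(min_maf=0.05, max_missing=0.2, min_minor_count=8)

# Toy variant: 6 carriers, 30 non-carriers, 4 missing calls.
variant = [1] * 6 + [0] * 30 + [None] * 4
print(passes_filters(variant, **A_THALIANA))
print(passes_filters(variant, **D_MELANOGASTER))
```

The toy variant passes the A. thaliana thresholds but fails the D. melanogaster ones, because only 6 lines carry the minor allele.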
Proportion of phenotypic variance explained by significant genotypic variants
To assess how much of the phenotypic variance can be explained by detected genome-wide significant loci, we estimated the effect size of those variants with regression analysis.
Suppose that n measurements of a phenotype were collected across t inbred strains. A linear mixed model in model organism association mapping is typically expressed as
$$y = \mu + x\beta + Zu + e \tag{1}$$
where y is an n × 1 vector of observed phenotypes, and μ is the intercept. x is an n × q matrix of fixed effects. β is a q × 1 vector representing coefficients of the fixed effects. Z is an n × t incidence matrix mapping each observed phenotype to one of t inbred strains. In this study Z is always an identity matrix and could be ignored. u is the random effect of the mixed model with Var(u) = $\sigma_g^2 K$, where K is the t × t kinship matrix and here t = n. e is an n × 1 vector of residual effects such that Var(e) = $\sigma_e^2 I_n$, where $I_n$ is the n × n identity matrix.
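Let S be the genotype matrix of the significant SNPs, included as fixed effects (Eq 3).
$$x = S \tag{2}$$
$$y = \mu + S\beta_s + u + e \tag{3}$$
where $\beta_s$ is the vector of coefficients of the significant SNPs, which can be estimated with the mixed model.
The proportion of phenotypic variance explained by significant SNPs, denoted by $h_S^2$, was estimated as (Eq 4)
$$h_S^2 = \frac{\sum_{i=1}^{n} (\hat{y}_{S,i} - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, \qquad \hat{y}_S = \hat{\mu} + S\hat{\beta}_s \tag{4}$$
where $\bar{y}$ is the mean phenotype value of all the accessions. The contributions of INDELs were then added, and the additional phenotypic variance they explained was estimated as:
$$x = [S, I] \tag{5}$$
$$y = \mu + S\beta_s + I\beta_I + u + e \tag{6}$$
$$h_{S+I}^2 = \frac{\sum_{i=1}^{n} (\hat{y}_{S+I,i} - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, \qquad \hat{y}_{S+I} = \hat{\mu} + S\hat{\beta}_s + I\hat{\beta}_I \tag{7}$$
$$h_I^2 = h_{S+I}^2 - h_S^2 \tag{8}$$
where I is the contribution of INDELs, $h_{S+I}^2$ is the variance explained by significant SNPs and INDELs together and $h_I^2$ is the additional phenotypic variance explained by significant INDELs.
We then evaluated the contribution from ORFSs with
$$x = [S, I, O] \tag{9}$$
$$y = \mu + S\beta_s + I\beta_I + O\beta_O + u + e \tag{10}$$
$$h_{S+I+O}^2 = \frac{\sum_{i=1}^{n} (\hat{y}_{S+I+O,i} - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, \qquad \hat{y}_{S+I+O} = \hat{\mu} + S\hat{\beta}_s + I\hat{\beta}_I + O\hat{\beta}_O \tag{11}$$
$$h_O^2 = h_{S+I+O}^2 - h_{S+I}^2 \tag{12}$$
where O is the contribution from ORFSs and $h_O^2$ is the additional phenotypic variance explained by orfsQTLs.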
Because some significant genotypic variants are highly correlated with each other, we performed a marginal association analysis each time a variant was added. Only variants that explained a significant proportion of variance (p-value < 1e-4, F-test on the residual sum of squares; this method is expected to outperform the extended Bayesian information criterion [21]) were kept as covariates for the subsequent analysis.
The above analysis was performed within the statistical framework of EMMA [52] using the Python version of MLMM [21]. The (additional) variance explained by a specific type of variant was reported as the average over the association studies in which at least one QTL was detected.
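The bookkeeping behind the reported numbers can be sketched as follows. The sketch assumes that the explained proportion is the ratio of the explained to the total sum of squares around the mean phenotype, and that fitted values from the fixed effects are already available; both are assumptions, and the mixed-model fit itself (EMMA/MLMM) is not shown. All function names and numbers are hypothetical.

```python
from typing import Dict, List


def explained_fraction(y: List[float], fitted: List[float]) -> float:
    """Explained sum of squares over total sum of squares around the mean of y."""
    mean_y = sum(y) / len(y)
    total = sum((value - mean_y) ** 2 for value in y)
    explained = sum((value - mean_y) ** 2 for value in fitted)
    return explained / total


def additional_fraction(y: List[float],
                        fitted_base: List[float],
                        fitted_extended: List[float]) -> float:
    """Extra proportion explained when more variants join the fixed effects."""
    return explained_fraction(y, fitted_extended) - explained_fraction(y, fitted_base)


def mean_over_traits_with_qtl(values: Dict[str, float],
                              qtl_counts: Dict[str, int]) -> float:
    """Average a per-trait quantity over traits with at least one detected QTL."""
    kept = [values[trait] for trait in values if qtl_counts.get(trait, 0) > 0]
    if not kept:
        raise ValueError("no trait has a detected QTL")
    return sum(kept) / len(kept)


# Toy phenotype and fitted values (made-up numbers).
y = [1.0, 2.0, 3.0, 4.0, 5.0]
fitted_snps = [2.0, 2.0, 3.0, 4.0, 4.0]
fitted_snps_indels = [1.5, 2.0, 3.0, 4.0, 4.5]
h2_snps = explained_fraction(y, fitted_snps)
h2_extra = additional_fraction(y, fitted_snps, fitted_snps_indels)
average_extra = mean_over_traits_with_qtl(
    {"trait1": 0.25, "trait2": 0.05, "trait3": 0.0},
    {"trait1": 2, "trait2": 1, "trait3": 0},
)
print(h2_snps, h2_extra, average_extra)
```

Traits without any detected QTL (`trait3` in the toy input) are left out of the average, mirroring the reporting rule described above.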
Detection of independent ORFS-shift mutations
We used a graph-theoretic model to detect independent ORFS-shift events. Each ORFS-shifting transcript was regarded as a node. An edge was created between two nodes if their variants overlapped in genomic position (S9A Fig). If all ORFS-shift transcripts shared the same ancestry, every node in the graph would be linked to every other node and a complete graph would be formed (S9B Fig).
We detected the independent ORFS-shift transcripts using all sequences released by the 1001 Genomes Project with the following criteria:
Only trustable transcripts were used.
The ORFSs were shifted in more than 130 accessions (analogous to the MAF of ORFS).
The constructed graph is at least one edge short of being complete.
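The graph criterion can be sketched as below. This is an illustrative reading, not the authors' code: each ORFS-shifting transcript is represented by a list of variant intervals `(start, end)`, an edge joins two transcripts when any of their intervals overlap, and the last criterion is interpreted as "at least one edge of the complete graph is absent". The data layout and function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # inclusive genomic start and end of one variant


def intervals_overlap(a: Interval, b: Interval) -> bool:
    """Return True if two inclusive intervals share a genomic position."""
    return a[0] <= b[1] and b[0] <= a[1]


def linked(variants_a: List[Interval], variants_b: List[Interval]) -> bool:
    """An edge exists if any variant of one transcript overlaps one of the other."""
    return any(intervals_overlap(x, y) for x in variants_a for y in variants_b)


def missing_edges(transcripts: Dict[str, List[Interval]]) -> int:
    """Number of edges absent from the complete graph on all transcripts."""
    nodes = list(transcripts)
    edges = sum(1 for a, b in combinations(nodes, 2)
                if linked(transcripts[a], transcripts[b]))
    complete = len(nodes) * (len(nodes) - 1) // 2
    return complete - edges


def has_independent_events(transcripts: Dict[str, List[Interval]]) -> bool:
    """True if the graph is at least one edge short of being complete."""
    return missing_edges(transcripts) >= 1


shared = {"acc1": [(100, 104)], "acc2": [(102, 103)], "acc3": [(104, 110)]}
independent = {"acc1": [(100, 104)], "acc2": [(102, 103)], "acc3": [(250, 250)]}
print(missing_edges(shared), has_independent_events(shared))
print(missing_edges(independent), has_independent_events(independent))
```

In the first toy gene every pair of transcripts is linked through `acc1`, except `acc2` and `acc3`, so even there one edge is missing; the second gene has an isolated node and two missing edges.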
The GO enrichment analysis was performed using agriGO [53] V1.2 with default parameters.
Supporting information
Acknowledgments
We thank M. Tsiantis, M. Nordborg, D. Weigel and three anonymous reviewers for their comments on the work, N. Lachezar, J. Lempe and B. Pieper for helpful assistance, and other members of the Department of Comparative Development and Genetics, Max Planck Institute for Plant Breeding Research, for discussions.
Data Availability
The Irisas software and its source code are available from https://github.com/baoxingsong/Irisas. The tables of synchronized variants for Arabidopsis thaliana and Drosophila melanogaster are available from our website http://chi.mpipz.mpg.de/gwas. The scripts and numerical data that underlie the graphs are available from https://github.com/baoxingsong/GWAS/tree/master/pg.
Funding Statement
This work was funded by a Max Planck Society core grant to the Department of Comparative Development and Genetics. BS is supported by a China Scholarship Council Fellowship, no. 201306300026. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1. Gudbjartsson DF, Walters GB, Thorleifsson G, Stefansson H, Halldorsson BV, Zusmanovich P, et al. Many sequence variants affecting diversity of adult human height. Nature Genetics. 2008;40:609–15. doi:10.1038/ng.122
- 2. Consortium tDGRAM-aD. Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. Nature Genetics. 2012;44:981–90. doi:10.1038/ng.2383
- 3. van Rheenen W, Shatunov A, Dekker AM, McLaughlin RL, Diekstra FP, Pulit SL, et al. Genome-wide association analyses identify new risk variants and the genetic architecture of amyotrophic lateral sclerosis. Nature Genetics. 2016;48:1043–8. doi:10.1038/ng.3622
- 4. Gibson G. Rare and common variants: twenty arguments. Nature Reviews Genetics. 2012;13:135–45. doi:10.1038/nrg3118
- 5. Gymrek M, Willems T, Guilmatre A, Zeng H, Markus B, Georgiev S, et al. Abundant contribution of short tandem repeats to gene expression variation in humans. Nat Genet. 2016;48(1):22–9. doi:10.1038/ng.3461
- 6. Montgomery SB, Goode DL, Kvikstad E, Albers CA, Zhang ZD, Mu XJ, et al. The origin, evolution, and functional impact of short insertion-deletion variants identified in 179 human genomes. Genome Res. 2013;23(5):749–61. doi:10.1101/gr.148718.112
- 7. Liu X, Geng X, Zhang H, Shen H, Yang W. Association and Genetic Identification of Loci for Four Fruit Traits in Tomato Using InDel Markers. Front Plant Sci. 2017;8:1269. doi:10.3389/fpls.2017.01269
- 8. Johanson U, West J, Lister C, Michaels S, Amasino R, Dean C. Molecular Analysis of FRIGIDA, a Major Determinant of Natural Variation in Arabidopsis Flowering Time. Science. 2000;290:344–7. doi:10.1126/science.290.5490.344
- 9. Atwell S, Huang YS, Vilhjálmsson BJ, Willems G, Horton M, Li Y, et al. Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature. 2010;465:627–31. doi:10.1038/nature08800
- 10. Narzisi G, Schatz MC. The challenge of small-scale repeats for indel discovery. Front Bioeng Biotechnol. 2015;3:8. doi:10.3389/fbioe.2015.00008
- 11. Tan A, Abecasis GR, Kang HM. Unified representation of genetic variants. Bioinformatics. 2015;31(13):2202–4. doi:10.1093/bioinformatics/btv112
- 12. Gan X, Stegle O, Behr J, Steffen JG, Drewe P, Hildebrand KL, et al. Multiple reference genomes and transcriptomes for Arabidopsis thaliana. Nature. 2011;477:419–23. doi:10.1038/nature10414
- 13. Huang W, Massouras A, Inoue Y, Peiffer J, Ràmia M, Tarone AM, et al. Natural variation in genome architecture among 205 Drosophila melanogaster Genetic Reference Panel lines. Genome Research. 2014;24:1193–208. doi:10.1101/gr.171546.113
- 14. Dembeck LM, Böröczky K, Huang W, Schal C, Anholt RRH, Mackay TFC. Genetic architecture of natural variation in cuticular hydrocarbon composition in Drosophila melanogaster. eLife. 2015;4:e09861. doi:10.7554/eLife.09861
- 15. Alonso-Blanco C, Andrade J, Becker C, Bemm F, Bergelson J, Borgwardt KM, et al. 1,135 Genomes Reveal the Global Pattern of Polymorphism in Arabidopsis thaliana. Cell. 2016;0. doi:10.1016/j.cell.2016.05.063
- 16. Pacific Biosciences of California I. Sequel System Data Release: Arabidopsis Dataset and Genome Assembly 2016 [https://downloads.pacbcloud.com/public/SequelData/ArabidopsisDemoData/Assembly/Arabidopsis_assembly.fasta].
- 17. Keane TM, Goodstadt L, Danecek P, White MA, Wong K, Yalcin B, et al. Mouse genomic variation and its effect on phenotypes and gene regulation. Nature. 2011;477(7364):289–94. doi:10.1038/nature10413
- 18. Lee S, Abecasis GR, Boehnke M, Lin X. Rare-variant association analysis: study designs and statistical tests. Am J Hum Genet. 2014;95(1):5–23. doi:10.1016/j.ajhg.2014.06.009
- 19. Yalcin B, Wong K, Agam A, Goodson M, Keane TM, Gan X, et al. Sequence-based characterization of structural variation in the mouse genome. Nature. 2011;477(7364):326–9. doi:10.1038/nature10432
- 20. Sudmant PH, Rausch T, Gardner EJ, Handsaker RE, Abyzov A, Huddleston J, et al. An integrated map of structural variation in 2,504 human genomes. Nature. 2015;526:75–81. doi:10.1038/nature15394
- 21. Segura V, Vilhjálmsson BJ, Platt A, Korte A, Seren Ü, Long Q, et al. An efficient multi-locus mixed-model approach for genome-wide association studies in structured populations. Nature Genetics. 2012;44:825–30. doi:10.1038/ng.2314
- 22. Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, et al. Common SNPs explain a large proportion of the heritability for human height. Nature Genetics. 2010;42:565–9. doi:10.1038/ng.608
- 23. Visscher PM, Hemani G, Vinkhuyzen AAE, Chen GB, Lee SH, Wray NR, et al. Statistical Power to Detect Genetic (Co)Variance of Complex Traits Using SNP Data in Unrelated Samples. Plos Genetics. 2014;10(4). doi:10.1371/journal.pgen.1004269
- 24. Yang J, Bakshi A, Zhu Z, Hemani G, Vinkhuyzen AA, Lee SH, et al. Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index. Nat Genet. 2015;47(10):1114–20. doi:10.1038/ng.3390
- 25. Kerdaffrec E, Filiault DL, Korte A, Sasaki E, Nizhynska V, Seren U, et al. Multiple alleles at a single locus control seed dormancy in Swedish Arabidopsis. Elife. 2016;5. doi:10.7554/eLife.22502
- 26. Huang XH, Wei XH, Sang T, Zhao QA, Feng Q, Zhao Y, et al. Genome-wide association studies of 14 agronomic traits in rice landraces. Nature Genetics. 2010;42(11):961–U76. doi:10.1038/ng.695
- 27. Serrano-Mislata A, Fernández-Nohales P, Doménech MJ, Hanzawa Y, Bradley D, Madueño F. Separate elements of the TERMINAL FLOWER 1 cis-regulatory region integrate pathways to control flowering time and shoot meristem identity. Development. 2016:dev.135269. doi:10.1242/dev.135269
- 28. Bradley D, Ratcliffe O, Vincent C, Carpenter R, Coen E. Inflorescence commitment and architecture in Arabidopsis. Science. 1997;275(5296):80–3. doi:10.1126/science.275.5296.80
- 29. Valverde F, Mouradov A, Soppe W, Ravenscroft D, Samach A, Coupland G. Photoreceptor regulation of CONSTANS protein in photoperiodic flowering. Science. 2004;303(5660):1003–6. doi:10.1126/science.1091761
- 30. Gnatzy W, Volknandt W, Schulz S. Dufour gland of the digger wasp Liris niger: structure and developmental and biochemical aspects. Cell Tissue Res. 2004;315(1):125–38. doi:10.1007/s00441-003-0813-2
- 31. Olaniran OA, Sudhakar AVS, Drijfhout FP, Dublon IAN, Hall DR, Hamilton JGC, et al. A Male-Predominant Cuticular Hydrocarbon, 7-Methyltricosane, is used as a Contact Pheromone in the Western Flower Thrips Frankliniella occidentalis. J Chem Ecol. 2013;39(4):559–68. doi:10.1007/s10886-013-0272-5
- 32. Blackwell E, Halatek IM, Kim HJ, Ellicott AT, Obukhov AA, Stone DE. Effect of the pheromone-responsive G(alpha) and phosphatase proteins of Saccharomyces cerevisiae on the subcellular localization of the Fus3 mitogen-activated protein kinase. Mol Cell Biol. 2003;23(4):1135–50. doi:10.1128/MCB.23.4.1135-1150.2003
- 33. Xiang Y, Song B, Née G, Kramer K, Finkemeier I, Soppe W. Sequence Polymorphisms at the Reduced Dormancy 5 Pseudophosphatase Underlie Natural Variation in Arabidopsis Dormancy. Plant Physiology. 2016:pp.005252016. doi:10.1104/pp.16.00525
- 34. Barboza L, Effgen S, Alonso-Blanco C, Kooke R, Keurentjes JJB, Koornneef M, et al. Arabidopsis semidwarfs evolved from independent mutations in GA20ox1, ortholog to green revolution dwarf alleles in rice and barley. Proceedings of the National Academy of Sciences. 2013;110:15818–23. doi:10.1073/pnas.1314979110
- 35. Alcázar R, García AV, Kronholm I, de Meaux J, Koornneef M, Parker JE, et al. Natural variation at Strubbelig Receptor Kinase 3 drives immune-triggered incompatibilities between Arabidopsis thaliana accessions. Nature Genetics. 2010;42:1135–9. doi:10.1038/ng.704
- 36. Rivas MA, Beaudoin M, Gardet A, Stevens C, Sharma Y, Zhang CK, et al. Deep resequencing of GWAS loci identifies independent rare variants associated with inflammatory bowel disease. Nat Genet. 2011;43(11):1066–73. doi:10.1038/ng.952
- 37. Schmalenbach I, Zhang L, Ryngajllo M, Jimenez-Gomez JM. Functional analysis of the Landsberg erecta allele of FRIGIDA. Bmc Plant Biology. 2014;14. doi:10.1186/s12870-014-0218-2
- 38. Shindo C, Aranzana MJ, Lister C, Baxter C, Nicholls C, Nordborg M, et al. Role of FRIGIDA and FLOWERING LOCUS C in Determining Variation in Flowering Time of Arabidopsis. Plant Physiology. 2005;138:1163–73. doi:10.1104/pp.105.061309
- 39. Grimm DG, Roqueiro D, Salome P, Kleeberger S, Greshake B, Zhu W, et al. easyGWAS: A Cloud-based Platform for Comparing the Results of Genome-wide Association Studies. Plant Cell. 2016. doi:10.1105/tpc.16.00551
- 40. Cannavo E, Koelling N, Harnett D, Garfield D, Casale FP, Ciglar L, et al. Genetic variants regulating expression levels and isoform diversity during embryogenesis. Nature. 2017;541(7637):402–6. doi:10.1038/nature20802
- 41. Kilpinen H, Goncalves A, Leha A, Afzal V, Alasoo K, Ashford S, et al. Common genetic variation drives molecular heterogeneity in human iPSCs. Nature. 2017;546(7658):370–5. doi:10.1038/nature22403
- 42. Kawakatsu T, Huang S-sC, Jupe F, Sasaki E, Schmitz RJ, Urich MA, et al. Epigenomic Diversity in a Global Collection of Arabidopsis thaliana Accessions. Cell. 2016;166:492–505. doi:10.1016/j.cell.2016.06.044
- 43. Brogna S, Wen JK. Nonsense-mediated mRNA decay (NMD) mechanisms. Nat Struct Mol Biol. 2009;16(2):107–13. doi:10.1038/nsmb.1550
- 44. Borsani O, Zhu JH, Verslues PE, Sunkar R, Zhu JK. Endogenous siRNAs derived from a pair of natural cis-antisense transcripts regulate salt tolerance in Arabidopsis. Cell. 2005;123(7):1279–91. doi:10.1016/j.cell.2005.11.035
- 45. Lamesch P, Berardini TZ, Li D, Swarbreck D, Wilks C, Sasidharan R, et al. The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Research. 2012;40:D1202–D10. doi:10.1093/nar/gkr1090
- 46. Katoh K, Standley DM. MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Molecular Biology and Evolution. 2013;30:772–80. doi:10.1093/molbev/mst010
- 47. Benoist C, O'Hare K, Breathnach R, Chambon P. The ovalbumin gene-sequence of putative control regions. Nucleic Acids Research. 1980;8:127–42. doi:10.1093/nar/8.1.127
- 48. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses. The American Journal of Human Genetics. 2007;81:559–75. doi:10.1086/519795
- 49. Fusi N, Lippert C, Lawrence ND, Stegle O. Warped linear mixed models for the genetic analysis of transformed phenotypes. Nature Communications. 2014;5:4890. doi:10.1038/ncomms5890
- 50. Lippert C, Listgarten J, Liu Y, Kadie CM, Davidson RI, Heckerman D. FaST linear mixed models for genome-wide association studies. Nature Methods. 2011;8:833–5. doi:10.1038/nmeth.1681
- 51. Team RC. R: A language and environment for statistical computing. R Foundation for Statistical Computing.
- 52. Kang HM, Zaitlen NA, Wade CM, Kirby A, Heckerman D, Daly MJ, et al. Efficient Control of Population Structure in Model Organism Association Mapping. Genetics. 2008;178:1709–23. doi:10.1534/genetics.107.080101
- 53. Du Z, Zhou X, Ling Y, Zhang ZH, Su Z. agriGO: a GO analysis toolkit for the agricultural community. Nucleic Acids Research. 2010;38:W64–W70. doi:10.1093/nar/gkq310