Understanding the biology of infections requires knowledge about the intricate molecular dialogue between plants and pathogens. Some components of this molecular dialogue are well‐conserved across taxa and the interacting molecules can often be inferred by homology. Yet in all specialized plant‐pathogen interactions, a substantial portion of the molecular dialogue is based on proteins that are unique to the pathosystem (e.g. most effector proteins). Identifying these proteins is very challenging because the pathogen often gained the genes encoding these proteins in the recent past and the proteins often share only minor similarities among pathogens. However, the speed of pathogen evolution can be exploited to identify these crucial components of the molecular dialogue. Pathogen populations often harbor both virulent and avirulent strains, because the adaptation to exploit a new host genotype is not yet fixed within the species. Hence, genomic analyses can point to the genetic differences between the evolved (i.e. virulent) and the ancestral (i.e. avirulent) genotypes. The complication, however, is that most genetic differences between such groups of strains are unrelated to the actual gain in virulence. A technique that was recently invented to identify the mutations responsible for human genetic diseases provides a solution to this dilemma.
Genome‐wide association studies (GWASs) of pathosystems have focused predominantly on the resistance mechanisms to a variety of pathogens by the analysis of diverse host populations, yet the virulence factors in pathogens remain largely unexplored (Bartoli and Roux, 2017). Advances in whole‐genome analyses of large pathogen populations have recently enabled a series of effector discoveries. The gene encoding the avirulence effector AvrStb6 of Zymoseptoria tritici, which is recognized by wheat cultivars encoding the Stb6 resistance gene, was identified by a combination of GWAS and classic linkage mapping (Zhong et al., 2017). GWAS precisely pointed to a small genic island encoding the AvrStb6 gene, surrounded by large blocks of transposable elements. The precision of the GWAS association was important because the gene was located in proximity to the telomeric end of a chromosome. The region contained no previously annotated gene and classic linkage mapping efforts were unable to narrow down the locus. A different GWAS on Z. tritici discovered the gene 8_609, which encodes an effector that is probably recognized by the wheat cultivar Toronit (Hartmann et al., 2017). In this study, the GWAS did not directly identify a single nucleotide polymorphism (SNP) within the effector gene 8_609 but, rather, an SNP at a distance of 1.5 kb. Analyses of linkage disequilibrium showed that the SNP was in perfect association with sequence rearrangements that caused the effector gene deletion. The sequence rearrangements were most likely triggered by the presence of a large block of transposable elements (Fig. 1). These recent studies highlight the power of GWAS to efficiently discover effector genes and to simultaneously provide insights into the evolutionary dynamics at effector loci.
Figure 1.
Stages of a genome‐wide association study (GWAS). (A) The basic principle of GWAS is to associate phenotypic variation with single nucleotide polymorphisms (SNPs) in a population. Blue and red circles represent different alleles at SNP loci. Phenotypic traits can be of discrete or continuous nature. At the highlighted locus, the blue allele is associated with higher phenotypic trait values. (B) An example of a phenotypic trait used for GWAS in Zymoseptoria tritici. Pycnidia density on infected wheat leaves is highly heritable and controlled by polymorphism at effector genes. (C) Statistical significance of individual SNP loci in a GWAS of pycnidia density on wheat (Hartmann et al., 2017). (D) Comparative genomic analyses of two complete genomes. Gene and transposable element regions are highlighted in black and orange, respectively. The GWAS locus revealed the loss of an effector gene in virulent strains (Hartmann et al., 2017). Parts of this figure were reproduced with permission.
Our aims with this Opinion piece are to introduce how GWAS can be implemented in different pathosystems, to draw on the lessons from the first applications to fungal pathogens and to highlight the promise to significantly speed up the discovery of effectors.
The basic principle of GWAS is to screen either one or multiple populations for phenotypic differences between individuals (Weigel and Nordborg, 2015). This phenotypic variation needs to have a genetic basis, meaning that the variation is reproducible under standardized conditions, such as in a glasshouse infection assay. In this sense, GWAS shares similarities with a forward random mutagenesis screen that artificially introduces mutations in a population originating from the same clone. The major difference from a random mutagenesis screen is that many observable phenotypic differences in a natural population may have adaptive value to the pathogen. Hence, by design, GWAS is likely to map loci under selection in pathogen populations. There are constraints on the range of phenotypic traits that are suitable for analysis. For example, the ability to surmount resistance triggered by a specific host resistance protein or the capacity to gain resistance to a particular fungicide (such as strobilurins) may be fixed in populations of a pathogen, in which case no phenotypic variation remains to be mapped. GWAS is particularly powerful for phenotypes that are determined by one or a few loci of major effects (mono‐ or oligogenic traits). However, given a sufficiently large and diverse GWAS panel, nearly any phenotypic trait is amenable to GWAS.
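The core idea of associating a phenotype with the alleles at one SNP can be sketched in a few lines. The example below is only an illustration for haploid isolates with alleles coded 0/1: the function names, the permutation test and its defaults are assumptions made here, and the sketch ignores population structure, which the dedicated GWAS tools discussed later account for.

```python
import random


def mean(values):
    return sum(values) / len(values)


def allele_effect(genotypes, phenotypes):
    """Difference in mean phenotype between isolates with allele 1 and allele 0.

    genotypes: list of 0/1 alleles (haploid), one per isolate.
    phenotypes: list of trait values in the same isolate order.
    Both alleles must be present among the isolates.
    """
    group1 = [p for g, p in zip(genotypes, phenotypes) if g == 1]
    group0 = [p for g, p in zip(genotypes, phenotypes) if g == 0]
    return mean(group1) - mean(group0)


def permutation_p_value(genotypes, phenotypes, n_perm=2000, seed=1):
    """Naive permutation test for one SNP (hypothetical helper).

    Phenotypes are shuffled among isolates to estimate how often an allele
    effect at least as large as the observed one arises by chance.
    """
    rng = random.Random(seed)
    observed = abs(allele_effect(genotypes, phenotypes))
    shuffled = list(phenotypes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point rounding on ties.
        if abs(allele_effect(genotypes, shuffled)) >= observed - 1e-12:
            hits += 1
    # Add-one correction so that the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

Repeating such a test for every SNP in the genome yields the per‐locus significance values that are typically displayed along the chromosomes (Fig. 1C).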
A successful GWAS depends on a set of key requirements that must be satisfied (Bergelson and Roux, 2010). First, GWAS has no power to detect loci contributing to virulence if the analysed pathogen populations are (mostly) clonal. Hence, GWAS is most successful if applied to pathogens that undergo regular sexual cycles. Second, the selection of isolates to build a GWAS panel for genotyping and phenotyping has a crucial impact on the ability of GWAS to detect genes contributing to a phenotypic trait. The GWAS panel should maximize the phenotypic diversity amongst individuals (i.e. virulence on a specific host genotype), but must avoid spanning highly differentiated pathogen populations. Hence, an excellent GWAS panel could be made from a single field population as long as there is significant phenotypic variation amongst isolates (Bartoli and Roux, 2017). Conversely, GWAS based on a global pathogen collection will be successful if the populations are well connected by gene flow. Everything else being equal, the power to identify loci contributing to a phenotypic trait increases with the size of the GWAS panel. A phenotypic trait with a simple genetic architecture (e.g. based on gene‐for‐gene interactions) can be mapped with small GWAS panels of 100 isolates or fewer (Hartmann et al., 2017; Zhong et al., 2017). More complex traits, such as colony growth rates or variation in aggressiveness, will probably require larger GWAS panels.
The advances in whole‐genome sequencing technologies have made the genotyping of a GWAS panel relatively straightforward. The most rigorous approach is to sequence genomic DNA to sufficient coverage that SNPs can be called reliably against a high‐quality reference genome. Reduced representation sequencing, such as genotyping‐by‐sequencing (GBS), restriction site‐associated DNA sequencing (RAD‐seq) or exome capture, may reduce costs, but can lead to problematic gaps in markers. The major issue with reduced representation sequencing is that an important locus can be missed if no SNP is in sufficient physical proximity. The key parameter is the decay of linkage disequilibrium in the pathogen populations. For a GWAS to successfully associate a phenotype with genetic variation, the causal mutation needs to be in strong linkage disequilibrium with a genotyped SNP. The ideal scenario is when the causal mutation is included in the SNP dataset, but this may not be the case if the reference genome contains deletions compared with other isolates of the same species.
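Linkage disequilibrium between two SNPs is commonly summarized as the squared allelic correlation, r². A minimal sketch for haploid isolates is given below, assuming alleles coded 0/1 and no missing genotypes; the function name is chosen here for illustration only.

```python
def r_squared(snp_a, snp_b):
    """Squared allelic correlation (r^2) between two haploid SNPs.

    Each argument is a list of 0/1 alleles, one entry per isolate,
    given in the same isolate order. Missing genotypes are not handled.
    """
    n = len(snp_a)
    if n == 0 or n != len(snp_b):
        raise ValueError("SNP vectors must be non-empty and of equal length")
    p_a = sum(snp_a) / n   # frequency of allele 1 at SNP A
    p_b = sum(snp_b) / n   # frequency of allele 1 at SNP B
    p_ab = sum(a * b for a, b in zip(snp_a, snp_b)) / n  # frequency of haplotype 1-1
    d = p_ab - p_a * p_b   # coefficient of linkage disequilibrium (D)
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")  # r^2 is undefined if either SNP is monomorphic
    return d * d / denom
```

Plotting r² for pairs of SNPs against their physical distance gives the decay curve that indicates how dense the marker set needs to be.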
SNP genotypes are ideally identified using joint genotyping of all isolates at once. A large number of tools are available for SNP calling from whole‐genome sequencing data, including VarScan, SAMtools, FreeBayes and the Genome Analysis Toolkit (GATK; McKenna et al., 2010). GATK is one of the best established pipelines and includes a highly accurate variant caller (GATK HaplotypeCaller) that performs local de novo assemblies in regions with high sequence variation. The GATK GenotypeGVCFs tool can jointly genotype all isolates at once to ensure consistency. Furthermore, GATK provides a set of handy tools to filter variants (GATK VariantFiltration). After standard quality filters, SNP loci should also be filtered for high genotyping rate amongst samples. Similarly, loci with low minor allele frequencies should be excluded before GWAS analyses. VCFtools (https://vcftools.github.io) provides convenient options for variant call format (VCF) file manipulation.
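The two locus‐level filters described above (genotyping rate and minor allele frequency) can be expressed compactly. The sketch below assumes haploid genotypes coded 0/1 with None for missing calls; the function name and the default thresholds (90% genotyping rate, 5% minor allele frequency) are illustrative assumptions rather than recommendations from the studies cited here.

```python
def passes_filters(genotypes, min_call_rate=0.9, min_maf=0.05):
    """Decide whether one SNP locus is retained for GWAS (hypothetical helper).

    genotypes: list with 0 (reference allele), 1 (alternative allele)
    or None (missing genotype), one entry per haploid isolate.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return False  # no isolate was genotyped at this locus
    call_rate = len(called) / len(genotypes)
    alt_freq = sum(called) / len(called)
    maf = min(alt_freq, 1 - alt_freq)  # frequency of the rarer allele
    return call_rate >= min_call_rate and maf >= min_maf
```

In practice, the same filters are usually applied directly to the VCF file with tools such as VCFtools rather than re‐implemented.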
It is essential to identify and define precisely the traits of interest before a GWAS panel is established, because different sets of isolates will vary for different traits depending on their demographic history and past selection pressures. For example, a population of pathogens will often vary for a specific (a)virulence factor only when a host with a matching resistance gene is grown in sympatry. The choice of an appropriate phenotypic trait to conduct GWAS necessitates a detailed knowledge of the biological cycle and of the infection process of the pathogen (Fig. 1B). We generally distinguish (a)virulence as a qualitative trait describing the capacity of the pathogen to cause disease on a host carrying a specific resistance gene, and aggressiveness as a quantitative trait describing the efficiency of the pathogen to infect a susceptible host. Although GWAS is typically a quantitative genetics approach, it can be used successfully for both categories of phenotypes (Fig. 1A). The phenotyping efforts of a GWAS panel do not have to be limited to pathogenicity traits, but can include resistance to pesticides, tolerance to low/high temperatures and other abiotic stress factors, such as pH or reactive oxygen species. The performance of a pre‐screen of the GWAS panel for the phenotypic traits of interest is essential for success.
In recent years, automation of plant phenotyping has developed exponentially in terms of throughput and precision of the acquired data (Mutka et al., 2016). The facilities and methods established for high‐throughput plant phenotyping can be adopted to screen pathogen panels for GWAS. However, the main constraint in the analysis of large pathogen panels resides in the fact that the handling of multiple isolates is very labour intensive, as it requires the preparation of calibrated inoculum for each isolate and the individual inoculation of plants. Even greater challenges would be posed by the phenotyping of pathogen panels under natural field conditions because of the high levels of environmental stochasticity. Yet, the development of phenotyping techniques under field conditions will be a highly rewarding challenge.
GWAS analysis tools are numerous and many are very user‐friendly. Highly recommended tools include TASSEL (Bradbury et al., 2007), which has a graphical user interface (GUI) and conveniently integrates a range of statistical models. TASSEL can also perform basic analyses of population structure and linkage disequilibrium decay, which are essential components to validate the outcome of GWAS analyses. TASSEL also provides data visualization options to quickly explore different sets of analyses. Given some basic proficiency in the R language, GAPIT (Lipka et al., 2012) can be used to perform a similarly complete set of GWAS analyses and to produce different output formats and basic plots. Regardless of the choice of software, special attention should be paid to whether the tools can deal with haploid data (if applicable) and whether SNPs with a fraction of unknown genotypes are accepted.
To control for false positive associations, genetic substructure in the GWAS panel needs to be accounted for. Substructure arises if highly related (or even clonal) isolates are accidentally included or if the panel covers multiple populations that show genetic differentiation. The most frequently used models to control for non‐random relatedness are mixed linear models (MLMs; Zhang et al., 2010). As the computation of MLMs can be time consuming for very large mapping populations (≫1000 isolates), compression approaches can be used. However, if compression is applied for small mapping populations, the impact of compression should be carefully analysed. TASSEL provides convenient models to control for non‐random relatedness. In addition, GAPIT can perform a model selection procedure to select the appropriate number of covariates.
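For orientation, the general form of a mixed linear model used in association mapping can be written as follows. The notation is chosen here for illustration and individual software packages may parameterize the model differently: y is the vector of phenotypes, X the design matrix of fixed effects (the tested SNP and, optionally, population structure covariates) with coefficients β, Z the incidence matrix linking isolates to the random polygenic effects u, K the kinship matrix estimated from genome‐wide markers, and e the residual error.

```latex
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{u} + \mathbf{e},
\qquad
\mathbf{u} \sim \mathcal{N}\left(\mathbf{0},\, \sigma_g^{2}\mathbf{K}\right),
\qquad
\mathbf{e} \sim \mathcal{N}\left(\mathbf{0},\, \sigma_e^{2}\mathbf{I}\right)
```

Because related isolates share correlated values of u through K, phenotypic similarity that is a result of relatedness is absorbed by the random term instead of being attributed to the tested SNP.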
Finally, because one statistical test is performed for every SNP in the genome, chance associations of phenotype and genotype (false positives) must be controlled. The most stringent control is to apply a Bonferroni correction by dividing the threshold for statistical significance (e.g. P = 0.05) by the number of markers. False discovery rate (FDR) thresholds are more lenient. Such corrections can be performed directly by the GWAS tools mentioned above. It is important to note that the significance of an association reflects both the importance of a particular locus for the analysed phenotype and how many additional loci contribute to the same phenotype. In general, individual associations for a highly polygenic trait are less significant (i.e. have higher P values) than the association obtained for a monogenic trait.
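The difference between the two corrections can be illustrated with a short sketch. The Bonferroni threshold follows the description above; for the FDR, the Benjamini–Hochberg step‐up procedure is used here as one common choice, which is an assumption because the text does not name a specific FDR method. Function names are illustrative.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of tests declared significant at FDR level alpha.

    Implements the Benjamini-Hochberg step-up procedure: find the largest
    rank k such that the k-th smallest P value is <= alpha * k / m, and
    declare the k tests with the smallest P values significant.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])
```

For four tests with P values of 0.5, 0.03, 0.01 and 0.02, the Bonferroni threshold is 0.0125 and only the test with P = 0.01 passes, whereas the Benjamini–Hochberg procedure at an FDR of 0.05 retains the three tests with P ≤ 0.03.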
A successful GWAS provides a number of SNPs in the genome of a pathogen that are robustly associated with variation in a phenotypic trait. However, gaining a full understanding of how the genotype in a region identified by GWAS impacts the phenotype can be challenging. The simplest scenarios arise when an SNP falls within the coding sequence of a gene and can be linked to a non‐synonymous substitution in the encoded protein. However, it is important to note that any other polymorphic locus in high linkage disequilibrium with the SNP could also be the causative mutation. This could include either other SNPs in coding sequences or SNPs that could underlie regulatory differences. A careful inspection of linkage disequilibrium surrounding each SNP obtained from GWAS is therefore indispensable. Fortunately, linkage disequilibrium decays rapidly in many microorganisms with large population sizes and frequent sexual reproduction. Nevertheless, the distance over which it decays can range from a few hundred base pairs to tens of thousands of base pairs, depending on the study organism.
Rapidly evolving pathogenicity loci often display complex patterns of sequence rearrangement within the species. This could be the case if a gene that encoded a recognized avirulence factor was recently lost in the genome of some isolates. As SNPs per se are uninformative with regard to sequence rearrangements, additional analyses may be necessary to identify the causal mutation. Although the analyses of sequence rearrangements can be challenging, the relative simplicity of many pathogen genomes is often amenable to more detailed analyses. The starting point of a more detailed analysis of a locus is always that the SNP identified by GWAS is in linkage disequilibrium with rearrangements in the chromosomal sequence. Gene losses in proximity to an SNP are detectable by the lack of sequence coverage in specific isolates. The inspection of read alignment files in genome browsers or the generation of draft genome assemblies of some isolates will probably provide strong hints on the nature of the sequence rearrangements. Yet, the most complex sequence rearrangements can only be detected reliably by the assembly of additional genomes using long read technologies (e.g. PacBio or Oxford Nanopore sequencing).
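The coverage‐based detection of gene losses mentioned above can be sketched as follows. The example assumes that mean read depths over the candidate gene and over the whole genome have already been extracted for each isolate (e.g. from read alignment files); the function names and the cut‐off ratio of 0.1 are hypothetical and would need to be tuned to the data.

```python
def gene_absent(window_depth, genome_depth, max_ratio=0.1):
    """Return True if normalized read depth over a gene suggests a deletion.

    window_depth: mean read depth across the candidate gene in one isolate.
    genome_depth: mean read depth across the whole genome of that isolate.
    max_ratio: hypothetical cut-off for calling the gene absent.
    """
    if genome_depth <= 0:
        raise ValueError("genome_depth must be positive")
    return window_depth / genome_depth < max_ratio


def deletion_matches_allele(snp_alleles, window_depths, genome_depths):
    """Fraction of isolates in which SNP allele 1 coincides with gene absence.

    A value of 1.0 (or 0.0, if allele 0 tags the deletion) indicates that
    the SNP and the presence/absence polymorphism are perfectly associated.
    """
    flags = [gene_absent(w, g) for w, g in zip(window_depths, genome_depths)]
    agree = sum(1 for a, f in zip(snp_alleles, flags) if (a == 1) == f)
    return agree / len(flags)
```

Such a check only indicates presence or absence of sequence; the structure of the rearrangement itself still requires assembled genomes.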
Loci identified by GWAS can be used to significantly improve our understanding of infection biology. GWAS loci probably encode proteins (or regulate the transcription thereof) that underlie major differences in virulence amongst isolates of a pathogen population. Therefore, functional validation of the candidate genes is an excellent strategy to pursue the analyses of their role in host interactions. Major mutant phenotypes are expected for genes underlying qualitative differences in virulence. The disruption of an avirulence gene, for instance, will normally lead to a gain of virulence and its heterologous expression will lead to an incompatible interaction. AvrStb6 of Z. tritici has been shown to be specifically recognized by wheat cultivars carrying the resistance gene Stb6 by ectopic expression of the avirulence allele in an otherwise virulent isolate of Z. tritici (Zhong et al., 2017). A major strength of GWAS is to directly identify functional differences amongst alleles segregating at a major virulence locus. Hence, analyses of natural variation at these loci by allele swaps or gene editing will enable the quantification of the contribution of individual SNPs to virulence and reveal alternative infection strategies of the pathogen. Furthermore, GWAS can directly guide functional analyses to identify the exact residues recognized by the host (in the case of avirulence genes).
Finally, the selection of isolates with which functional validation will be pursued can be crucial under some circumstances. Such a scenario might arise when recognition of the avirulence factor involves more than one host component. For example, in wheat powdery mildew, Leptosphaeria maculans and Fusarium oxysporum f.sp. lycopersici, recognition of specific pathogen races involves an avirulence factor, a corresponding resistance protein and a suppressor of avirulence. In such cases, the avirulence gene must be heterologously expressed in the absence of the suppressor in order to detect the contribution to virulence.
Recent effector discoveries in a major crop pathogen have demonstrated how powerful the contribution of GWAS can be (Hartmann et al., 2017; Zhong et al., 2017). In these studies, GWAS pointed to very narrow chromosomal regions that harboured single genes. Hence, functional validation of the effectors was relatively straightforward. However, GWAS also holds the promise to more broadly identify loci contributing to pathogenicity. Carefully designed studies may even be able to point to genes contributing to an individual developmental stage of an infection. Nevertheless, the most powerful aspect of GWAS may be that the technique is designed to capture adaptive mutations still segregating in pathogen populations. Mutations identified by GWAS can be traced back to their origin in the species and probable evolutionary scenarios can be tested. A theme emerging from such analyses is that chromosomal rearrangements play a major role in the generation of novel virulence loci. GWASs that expand beyond pathogenicity traits will also open the door to a better understanding of how abiotic factors influence epidemics and point to agronomic practices that could help to manage diseases.
Acknowledgements
This work benefited from interactions promoted by the SUSTAIN COST Action FA1208 (https://www.cost-sustain.org). FEH was funded by a European Prestige ‐ Marie Curie Fellowship incoming grant 2016‐4‐0013 and the University Paris‐Sud. TM was supported by a Short‐Term Scientific Mission grant awarded by the SUSTAIN COST Action. DC was supported by the Swiss National Science Foundation (grant 31003A_173265).
References
- Bartoli, C. and Roux, F. (2017) Genome‐wide association studies in plant pathosystems: toward an ecological genomics approach. Front. Plant Sci. 8, 763.
- Bergelson, J. and Roux, F. (2010) Towards identifying genes underlying ecologically relevant traits in Arabidopsis thaliana. Nat. Rev. Genet. 11, 867–879.
- Bradbury, P.J., Zhang, Z., Kroon, D.E., Casstevens, T.M., Ramdoss, Y. and Buckler, E.S. (2007) TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics, 23, 2633–2635.
- Hartmann, F.E., Sanchez‐Vallet, A., McDonald, B.A. and Croll, D. (2017) A fungal wheat pathogen evolved host specialization by extensive chromosomal rearrangements. ISME J. 11, 1189–1204.
- Lipka, A.E., Tian, F., Wang, Q., Peiffer, J., Li, M., Bradbury, P.J., Gore, M.A., Buckler, E.S. and Zhang, Z. (2012) GAPIT: genome association and prediction integrated tool. Bioinformatics, 28, 2397–2399.
- McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., Garimella, K., Altshuler, D., Gabriel, S., Daly, M. and DePristo, M.A. (2010) The genome analysis toolkit: a MapReduce framework for analyzing next‐generation DNA sequencing data. Genome Res. 20, 1297–1303.
- Mutka, A.M., Fentress, S.J., Sher, J.W., Berry, J.C., Pretz, C., Nusinow, D.A. and Bart, R. (2016) Quantitative, image‐based phenotyping methods provide insight into spatial and temporal dimensions of plant disease. Plant Physiol. 172, 650–660.
- Weigel, D. and Nordborg, M. (2015) Population genomics for understanding adaptation in wild plant species. Annu. Rev. Genet. 49, 315–338.
- Zhang, Z., Ersoz, E., Lai, C.‐Q., Todhunter, R.J., Tiwari, H.K., Gore, M.A., Bradbury, P.J., Yu, J., Arnett, D.K., Ordovas, J.M. and Buckler, E.S. (2010) Mixed linear model approach adapted for genome‐wide association studies. Nat. Genet. 42, 355–360.
- Zhong, Z., Marcel, T.C., Hartmann, F.E., Ma, X., Plissonneau, C., Zala, M., Ducasse, A., Confais, J., Compain, J., Lapalu, N. and Amselem, J. (2017) A small secreted protein in Zymoseptoria tritici is responsible for avirulence on wheat cultivars carrying the Stb6 resistance gene. New Phytol. 214, 619–631.