Abstract
Powerful approaches to inferring recent or current population structure based on nearest neighbor haplotype “coancestry” have so far been inaccessible to users without high quality genome-wide haplotype data. With a boom in nonmodel organism genomics, there is a pressing need to bring these methods to communities without access to such data. Here, we present RADpainter, a new program designed to infer the coancestry matrix from restriction-site-associated DNA sequencing (RADseq) data. We combine this program together with a previously published MCMC clustering algorithm into fineRADstructure—a complete, easy to use, and fast population inference package for RADseq data (https://github.com/millanek/fineRADstructure; last accessed February 24, 2018). Finally, with two example data sets, we illustrate its use, benefits, and robustness to missing RAD alleles in double digest RAD sequencing.
Keywords: population structure, RAD-seq, inference
Understanding of shared ancestry in genetic data sets is often key to their interpretation. The chromoPainter/fineSTRUCTURE package (Lawson et al. 2012) represents a powerful model-based approach to investigating population structure using genetic data. It offers especially high resolution in inference of recent shared ancestry, as shown for example in the investigations of worldwide human population history (Hellenthal et al. 2014) and of genetic structure of the British population (Leslie et al. 2015). Further advantages, when compared with other model-based methods, such as STRUCTURE (Pritchard et al. 2000) and ADMIXTURE (Alexander et al. 2009), include the ability to deal with a very large number of populations, explore relationships between them, and to quantify ancestry sources in each population.
The high resolution of chromoPainter/fineSTRUCTURE and related methods derives from utilizing haplotype linkage information and from focusing on the most recent coalescence (common ancestry) among the sampled individuals. This approach derives a “coancestry matrix,” a summary of nearest neighbor haplotype relationships in the data set; that is, of the cases where pairs of individuals had the most similar haplotypes one to another. However, the existing pipeline for coancestry matrix inference was designed for large scale human genetic single nucleotide polymorphism (SNP) data sets, where chromosomal locations of the markers are known, haplotypes are typically assumed to be correctly phased (although it is possible to perform the analysis without this assumption), and missing data also need to have been imputed. Therefore, these methods have so far been generally inaccessible for investigations beyond model organisms.
With no requirements for prior genomic information (e.g., no need for a reference genome) and relatively low cost, RADseq data are fuelling a boom in ecological and evolutionary genomics, especially for nonmodel organisms (Andrews et al. 2016), where questions on population structure and relative ancestry are among the most commonly asked. Therefore, we have developed RADpainter, a simple method designed to infer the coancestry matrix from RADseq data. RADpainter is designed to take full advantage of the sequence of all the SNPs from each RAD locus, finding (one or more) closest relatives for each allele. The information about the nearest neighbors of each individual is then summed up into the coancestry similarity matrix. We package RADpainter together with the fineSTRUCTURE Markov chain Monte Carlo (MCMC) clustering algorithm into an easy to use population inference package for RADseq data called fineRADstructure.
New Approaches
Briefly, the coancestry matrix is calculated as follows: for each RAD locus and each individual (a recipient), we calculate the number of sequence differences (i.e., SNPs) between that individual’s allele(s) and the alleles in all other individuals (potential donors). The closest relative (donor) for each allele is its nearest neighbor allele, that is, the allele with the least number of differences. In the case of multiple equally distant nearest neighbors, an equal proportion of coancestry is assigned to each “donor.” Finally, we sum these local coancestry values across all loci to obtain the coancestry matrix for the full data set. A basic outline of the coancestry estimation procedure for haploid individuals is shown in supplementary Algorithm S1, Supplementary Material online.
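The haploid procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the RADpainter implementation: the function names (`n_differences`, `coancestry_haploid`) and the input layout (one dictionary per locus, mapping individual name to an allele string of concatenated SNPs) are assumptions made for this example.

```python
def n_differences(a, b):
    """Count positions at which two equal-length allele strings differ."""
    return sum(x != y for x, y in zip(a, b))


def coancestry_haploid(loci):
    """Sum nearest-neighbor coancestry over loci for haploid individuals.

    loci: list of dicts, each mapping individual name -> allele string
    (hypothetical input layout chosen for this sketch).
    """
    names = sorted({name for locus in loci for name in locus})
    matrix = {r: {d: 0.0 for d in names if d != r} for r in names}
    for locus in loci:
        for recipient, r_allele in locus.items():
            # Distance from the recipient allele to every potential donor
            dists = {d: n_differences(r_allele, a)
                     for d, a in locus.items() if d != recipient}
            if not dists:
                continue
            best = min(dists.values())
            donors = [d for d, v in dists.items() if v == best]
            # Equally distant nearest neighbors share the coancestry equally
            for d in donors:
                matrix[recipient][d] += 1.0 / len(donors)
    return matrix


example_loci = [
    {"A": "ACGT", "B": "ACGT", "C": "TCGA"},
    {"A": "GG", "B": "GC", "C": "GC"},
]
matrix = coancestry_haploid(example_loci)
```

In this toy example each row of the matrix sums to the number of loci (two), because every recipient distributes exactly one unit of coancestry per locus.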
Differences in ploidy are handled by averaging coancestry across alleles in the same individual. Figure 1 provides a further illustration of the method, showing coancestry matrix calculation at a single locus for four diploid individuals. However, we have implemented RADpainter to handle an arbitrary number of alleles per individual (i.e., any ploidy level), depending on the input; ploidy can even vary across individuals in a single analysis. We expect these features to be of particular use to the plant research community.
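As a rough illustration of averaging across alleles, the sketch below computes coancestry at a single locus for individuals of mixed ploidy. The function name `locus_coancestry` and the input layout are hypothetical. The tie rule is also an assumption of this sketch: each equally distant donor allele receives an equal share, so a donor with two tied alleles receives two shares. The text does not specify how RADpainter resolves such ties.

```python
def locus_coancestry(locus):
    """Coancestry at one locus for individuals with any number of alleles.

    locus: dict mapping individual name -> list of allele strings
    (hypothetical layout; list length is that individual's ploidy).
    """
    out = {r: {d: 0.0 for d in locus if d != r} for r in locus}
    for recipient, r_alleles in locus.items():
        for r_allele in r_alleles:
            # Distances from this recipient allele to every donor allele
            dists = [(sum(x != y for x, y in zip(r_allele, d_allele)), donor)
                     for donor, d_alleles in locus.items() if donor != recipient
                     for d_allele in d_alleles]
            best = min(dist for dist, _ in dists)
            nearest = [donor for dist, donor in dists if dist == best]
            for donor in nearest:
                # Split among tied donor alleles, then average over the
                # recipient's own alleles
                out[recipient][donor] += 1.0 / len(nearest) / len(r_alleles)
    return out


mixed = {"ind1": ["AA", "AA"], "ind2": ["AA", "CC"], "ind3": ["CC"]}
result = locus_coancestry(mixed)
```

Here `ind1` and `ind2` are diploid and `ind3` is haploid; each recipient still distributes exactly one unit of coancestry at the locus.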
Concatenating all the variable sites (SNPs) at each locus to define alleles increases the resolution of RADpainter when compared with methods that only use a single SNP from each locus. This can be seen clearly by considering the fact that a single biallelic SNP splits the alleles at a locus into only two groups, whereas multiple linked SNPs typically enable more refined assignment of nearest neighbor relationships, as illustrated in figure 1. Given the short length of each RAD locus (typically < 250 bp), we assume that all SNPs within the locus are in almost complete linkage disequilibrium (LD) (i.e., that D′ ≈ 1) and, therefore, historical recombination can be disregarded.
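The gain in resolution from using all SNPs at a locus can be seen in a toy comparison. The helper `nearest_donors` and the example alleles are invented for illustration only.

```python
def n_differences(a, b):
    """Count positions at which two equal-length allele strings differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_donors(recipient, donors):
    """Return the names of all donors at minimum distance from the recipient."""
    dists = {name: n_differences(recipient, allele)
             for name, allele in donors.items()}
    best = min(dists.values())
    return sorted(name for name, d in dists.items() if d == best)


# Alleles written as the concatenated variable sites of one RAD locus
recipient = "ACT"
donors = {"X": "ACT", "Y": "ACG", "Z": "GTT"}

# Using all three SNPs: a single nearest neighbor
full = nearest_donors(recipient, donors)

# Using only the first SNP: X and Y become indistinguishable
first_snp_only = nearest_donors(
    recipient[:1], {name: allele[:1] for name, allele in donors.items()}
)
```

With all three sites the recipient has one nearest neighbor (`X`), whereas the first SNP alone cannot separate `X` from `Y`, so the coancestry would be split between them.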
On the other hand, the summation across loci assumes frequent recombination between different RAD sequences—each RAD locus is counted as if providing independent evidence with regards to coancestry. This assumption is not always realistic, in particular when the data set contains pairs of loci from the two sides of each restriction site. However, we account for any linkage between RAD loci by adjusting the normalizing constant c which is passed on to the fineSTRUCTURE clustering algorithm together with the co-ancestry matrix and determines its sensitivity (Lawson et al. 2012).
Briefly, the fineSTRUCTURE clustering algorithm (Lawson et al. 2012) uses a MCMC scheme to explore the space of population configurations (sample “partitions”) by proposing merging or splitting populations, merging then resplitting, or moving individuals. A proposed population configuration is accepted with a probability derived from the ratio of the likelihood with the previous configuration, a likelihood that in turn depends on the terms of the coancestry matrix scaled by the c value. Given the final fineSTRUCTURE output, we can infer the number of clusters, deal with a very large number of potential clusters, quantify ancestry sources in each group, and also explore relationships between groups, including with a simple tree-building algorithm (Lawson et al. 2012). Our new fineRADstructure package includes a set of R scripts that can be used to plot the results, including the clustered coancestry matrix, the tree with posterior population assignment probabilities, and a matrix with coancestry averages per population.
As in the original chromoPainter tool, the aim of estimating the c parameter is to correct for the true underlying variance of the entries of the coancestry matrix ($x_{ij}$), so that the multinomial likelihood used in the clustering step matches the true statistical uncertainty in the matrix. In order to estimate an appropriate c value for each data set, RADpainter calculates the empirical variance ($\hat{\sigma}^2_{ij}$) for each entry of the coancestry matrix by jackknife, by default in blocks of 100 consecutive RAD loci. These empirical variances are divided by the theoretical variance under the multinomial distribution for the coancestry matrix, that is, assuming that an element $x_{ij}$ follows the binomial distribution $x_{ij} \sim \mathrm{Binomial}(N, p_{ij})$, where $N$ is the total number of RAD loci in the data set and $p_{ij}$ is the probability of individual j being the closest relative of individual i at any particular locus.
If LD between RAD loci is weak or if the data have been mapped and sorted according to genome coordinates (so that loci in LD are grouped together within the jackknife blocks), then RADpainter correctly estimates the effective number of independent loci. However, when loci are not sorted and LD between them is strong, the estimation procedure may underestimate c and so be overconfident in population splits. To combat this, we provide a script which reorders the RAD loci according to LD. We recommend that users with unmapped data run this script before the RADpainter procedure to ensure they obtain a conservative upper bound on the number of statistically identifiable clusters. To test this approach, we constructed a RAD data set with pairs of loci in perfect LD by duplicating all loci in a RAD data set and randomly shuffling the positions of the duplicates. We found that c doubled after LD-based reordering of loci, thereby correctly estimating the same number of effectively independent loci.
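The jackknife variance ratio and the duplicated-loci check can be sketched as follows for a single entry of the coancestry matrix. This is an illustration under stated assumptions, not the RADpainter code: the function name, the delete-one-block jackknife formula, and the input (a list of per-locus coancestry contributions for one recipient and donor pair) are choices made here. How RADpainter combines per-entry ratios into one c value is not described in the text and is not attempted.

```python
def block_jackknife_variance_ratio(per_locus, block_size=100):
    """Empirical (block jackknife) variance over binomial variance.

    per_locus: per-locus coancestry contributions for one matrix entry
    (hypothetical input); loci beyond the last full block are dropped.
    """
    n_blocks = len(per_locus) // block_size
    if n_blocks < 2:
        raise ValueError("need at least two complete blocks")
    n_used = n_blocks * block_size
    block_sums = [sum(per_locus[b * block_size:(b + 1) * block_size])
                  for b in range(n_blocks)]
    total = sum(block_sums)
    # Delete-one-block estimates of the total, rescaled to all blocks
    loo = [(total - s) * n_blocks / (n_blocks - 1) for s in block_sums]
    mean_loo = sum(loo) / n_blocks
    empirical = (n_blocks - 1) / n_blocks * sum((v - mean_loo) ** 2 for v in loo)
    # Theoretical variance assuming total ~ Binomial(n_used, p)
    p = total / n_used
    if p <= 0.0 or p >= 1.0:
        raise ValueError("binomial variance is zero for this entry")
    theoretical = n_used * p * (1.0 - p)
    return empirical / theoretical


contributions = [1, 1, 0, 0, 1, 0, 0, 0]
ratio = block_jackknife_variance_ratio(contributions, block_size=2)

# Duplicate every locus and keep each duplicate next to its original, as an
# LD-based reordering would; blocks are doubled to hold the same loci
duplicated = [v for v in contributions for _ in range(2)]
ratio_dup = block_jackknife_variance_ratio(duplicated, block_size=4)
```

In this toy case the ratio for the duplicated data is exactly twice the original ratio, mirroring the doubling of c reported for the duplicated RAD data set.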
Unknown nucleotides (Ns) can be present within alleles and their positions are ignored in all pairwise sequence comparisons. Where the entire donor and/or the recipient alleles are missing, their coancestry is assumed to be proportional to the amount of coancestry observed between them in the rest of the data (i.e., “missing data” coancestry is shared proportionally to “observed data” coancestry). However, it is well known that one of the causes of missing alleles in RAD data is the presence of genetic polymorphisms in the restriction sites, referred to as allele dropout. Because allele dropout can lead to nonrandom missingness and thus influence inferences of population divergence (Arnold et al. 2013; Gautier et al. 2013), we suggest caution. The most problematic loci may be removed by filtering loci with a large excess of null alleles. In addition, we recommend that users check for large systematic differences in missingness between putative populations, and urge caution in interpreting the results if such differences occur. In some cases, it may be appropriate to remove outlier individuals with a large amount of missingness prior to a rerun of the analysis pipeline. To assist these steps, each run of RADpainter outputs missingness (the proportion of missing alleles) per individual.
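Two of the steps above, ignoring N positions in pairwise comparisons and reporting the proportion of missing alleles per individual, are simple enough to sketch. The function names and the use of `None` to mark a missing allele are assumptions of this example and do not reflect the RADpainter input format.

```python
def distance_ignoring_n(a, b):
    """Count differences, skipping positions where either allele has an N."""
    return sum(1 for x, y in zip(a, b)
               if x != "N" and y != "N" and x != y)


def missingness_per_individual(loci):
    """Proportion of missing alleles for each individual.

    loci: list of dicts mapping individual name -> list of alleles,
    with None marking a missing allele (hypothetical layout).
    """
    missing, total = {}, {}
    for locus in loci:
        for name, alleles in locus.items():
            total[name] = total.get(name, 0) + len(alleles)
            missing[name] = missing.get(name, 0) + sum(a is None for a in alleles)
    return {name: missing[name] / total[name] for name in total}


d_partial = distance_ignoring_n("ACNT", "ACGA")
d_all_n = distance_ignoring_n("NNNN", "ACGT")

example = [
    {"a": ["AC", "AC"], "b": [None, None]},
    {"a": ["AC", None], "b": ["AC", "AT"]},
]
props = missingness_per_individual(example)
```

Comparing such per-individual proportions between putative populations is one way to carry out the check for systematic differences in missingness recommended above.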
To facilitate the use of fineRADstructure with existing RAD-seq processing pipelines, the input file can be generated by the widely used Stacks RAD-seq tool set (Catchen et al. 2013). This can be done directly, via output from the export_sql.pl Stacks program. In addition, for users who do not use the Stacks SQL database, we provide a data conversion and filtering script Stacks2fineRAD.py for processing the output from the core Stacks populations program. A third party utility script (https://github.com/edgardomortiz/fineRADstructure-tools; last accessed February 24, 2018) also enables conversion from the format of the pyRAD and ipyrad toolkits (Eaton 2014). Finally, our input format is a simple flat text file and we provide an example data set to enable the users of other pipelines to prepare their data.
Results
We applied fineRADstructure to a single-digestion RAD data set including 120 individuals from 12 populations of the alpine plant species complex Heliosperma pusillum. The data set comprises 1,097 loci which have been assembled through the Stacks pipeline without a reference genome (Trucchi et al. 2017). The complicated network of relationships among these twelve populations, which belong to two phylogenetically intertwined species (H. pusillum and Heliosperma veselskyi) with contrasting ecology and a postglacial history of divergence in some of the six sampled localities, makes this data set an excellent case to study the performance of our approach (fig. 2).
The fineRADstructure results (a clustered coancestry matrix; fig. 2A) make the presence of twelve populations immediately clear, with substructure suggested in some of the populations. The relationships between some of the populations at localities A, B, C, and D are clearly not tree-like, with strong evidence of heterogeneous gene flow between the species (Trucchi et al. 2017). A variable level of intrapopulation coancestry, likely related to different degrees of isolation, is also visible across the populations (e.g., the two populations sampled at locality F are highly isolated from each other and also the most distinct from populations sampled at all the other localities).
A number of benefits of our method can be seen in comparison against other approaches commonly used for population inference. In particular, the analysis using fastSTRUCTURE (Raj et al. 2014) at K = 6 supported a clustering mainly by locality, with the exception of populations in B and C which clustered by ecology (fig. 2B). The choice of model complexity K = 6 was based on the rate of decrease in the value of the Bayesian Information Criterion (BIC; fig. 2B) as returned by the find.clusters function in the R package adegenet (Jombart and Ahmed 2011) and follows the methodology used in the original manuscript that presented the data (Trucchi et al. 2017). Principal Component Analysis (PCA) implemented by the glPCA function in adegenet (Jombart and Ahmed 2011) was also unable to clearly partition the genetic diversity in twelve populations (fig. 2C). The genetic structure was partly resolved by PCA by rerunning the analysis including only samples from populations in localities A, B, C, D (fig. 2C; inset), but substantial overlaps remained between the population clusters, thus still providing lower resolution than fineRADstructure.
Overall, in terms of biological insight, a major improvement of our approach over previous results in this data set is the clear visibility, at the same time, of both a global structure produced by a postglacial history of recolonization of newly ice-free mountain areas and a local structure related to parallel ecological divergence; previously these two phenomena were inferred only by applying a combination of multiple different analyses (Trucchi et al. 2017).
Next, we used a test data set produced using the double digest RAD (ddRAD) approach (Peterson et al. 2012) to assess the robustness of fineRADstructure to nonrandom data missingness, specifically to the batch effect that is especially common in ddRAD data sets, and to allele dropout. The data set includes 76 samples from two different subspecies and three hybrid individuals. The ddRAD protocol is very sensitive to inconsistent size selection across different libraries, as shifts in the size range directly influence which loci are included and sequenced in each library. Five different libraries (L1–L5), each of which underwent size selection (on a Thermo Fisher E-Gel) and sequencing on an Ion Torrent PGM machine separately, are included in this data set. We preprocessed the data using the common Stacks RADseq pipeline (Catchen et al. 2013). Briefly, after trimming to 200 bp, raw reads were demultiplexed and quality filtered using the process_radtags.pl program, RAD loci were then assembled without a reference genome and SNPs called using the denovo_map.pl program (parameters -m 5, -M 6, -n 8) and, finally, the populations program was used to filter and export the loci (parameters -r 0.8, -p 1, --min_maf 0.05, --vcf_haplotypes).
We then used our script Stacks2fineRAD.py (part of our software package) to remove samples with >20% missing loci (fig. 3A) and loci with more than five SNPs. To assess the effect of missing data, our script transforms the resulting genotype matrix into a presence/absence matrix (replacing any genotype with 1 and missing data with 0) and produces a PCA ordination of the samples (using the sklearn.decomposition package in Python).
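The missingness PCA can be approximated with a short sketch. It is not the Stacks2fineRAD.py code: the function name and input layout are hypothetical, and the PCA is computed here with a NumPy singular value decomposition instead of the sklearn.decomposition package named above, to keep the example dependency-light.

```python
import numpy as np


def missingness_pca(genotypes, max_missing=0.2, n_components=2):
    """PCA of a presence/absence matrix after dropping sparse samples.

    genotypes: rows are samples, columns are loci, None marks missing data
    (hypothetical layout). Returns a boolean mask of kept samples and the
    PCA scores of the kept samples.
    """
    # Any genotype becomes 1, missing data becomes 0
    presence = np.array([[0.0 if g is None else 1.0 for g in row]
                         for row in genotypes])
    missing_fraction = (presence == 0).mean(axis=1)
    keep = missing_fraction <= max_missing  # drop samples with >20% missing
    kept = presence[keep]
    centered = kept - kept.mean(axis=0)
    # PCA via SVD of the column-centered matrix
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    return keep, scores


toy = [
    ["AA", "AC", "CC", "AA", "AC"],   # nothing missing
    [None, "AC", "CC", "AA", "AC"],   # 20% missing, kept
    [None, "AA", "AC", "AA", "CC"],   # same missingness pattern as above
    ["AA", None, None, "AA", "AC"],   # 40% missing, removed
]
keep, scores = missingness_pca(toy)
```

Samples with identical patterns of missing loci receive identical scores regardless of their genotypes, which is what makes this ordination a view of missingness alone.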
To evaluate the effect of systematic missingness biases on fineRADstructure results, we compared the “missingness” PCA with the co-ancestry matrix (fig. 3B and C). This comparison clearly shows that nonrandom missingness due to batch (i.e., library) effect is not at all mirrored in the genetic structure inferred by fineRADstructure; individual samples from the different sequencing libraries are intermixed within the inferred subspecies clusters (fig. 3C). Additionally, genetic hybrids between the two subspecies are clearly identified by fineRADstructure (as having more equal levels of coancestry with both of the subspecies; fig. 3C), despite their missing data profile being more similar to subspecies 1. Thus, despite considerable nonrandom missingness due to genetic divergence (i.e., allele dropout), the true genetic relationship is reflected in the coancestry matrix and the missingness signal does not dominate.
Discussion
In this manuscript, we have described software that enables fine population structure inference based on nearest neighbor (first coalescence) relationship between haplotypes inferred from RAD-seq data. Thus, our software brings the benefits of this approach especially to genomics of nonmodel organisms, which currently relies heavily on RAD-seq. In this context, it is useful to emphasize two considerations related to the use of fineRADstructure across a range of species.
First, because we assume independence between RAD loci, the coancestry matrix entries do not depend on RAD loci being ordered along the genome. Therefore, analyses with and without a reference genome will produce identical coancestry matrices. Second, our method benefits from linkage (LD) between multiple polymorphisms within the same RAD locus, enabling a more specific assignment of nearest neighbor relationships between individuals, as shown in figure 1. Therefore, unlike many methods which assume markers are unlinked and require removal of polymorphisms that are in high LD with each other (e.g., PCA based methods, STRUCTURE), fineRADstructure gains additional power from their presence within a RAD locus, as will be the case especially in species with high nucleotide polymorphism levels.
fineRADstructure aims to detect and report all excess sample relatedness that rises above statistical noise in the coancestry matrix, that is, a population is defined as a group of individuals with indistinguishable genetic ancestry in the data set. This approach resulted in biologically meaningful clustering for all the data sets we tested. However, we would caution against interpreting the number of clusters as a definite estimate of the “correct” number of populations. As seen for example in figure 2A, species often contain different levels of population substructure. Therefore, we suggest that especially in complex data sets users may sometimes obtain a better intuition about the relationships between their samples by repeating the clustering step several times with varying sensitivity (c parameter values) and interpreting the results jointly.
Finally, we would like to point out that although the ad hoc tree building approach generally performs well, it is intended to be illustrative of the relationships between the populations, rather than to represent the true population history. Therefore, we suggest that where the true historical relationships are known (e.g., as a result of independent phylogenetic analysis), the clustering step can be skipped and the coancestry matrix ordered according to the pre-existing information. Under those circumstances, the entries of the coancestry matrix provide an independent source of information on recent sample relatedness, complementing the information about older historical relationships; this may be especially informative in cases of recent gene-flow between populations.
Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online.
Acknowledgments
This work was supported by the Medical Research Council (MR/M501608/1 to D.F.), the Austrian Climate Research Programme (ACRP5-EpiChange-KR12AC5K01286), the Austrian Science Fund (FWF, Y661-B16 to E.T.), a Wellcome Trust and Royal Society grant (WT104125MA to D.J.L.), and the Wellcome Trust (097677/Z/11/Z to M.M.). The idea for RADpainter and fineRADstructure arose in discussions at the 2016 Workshop on Population and Speciation Genomics in Cesky Krumlov (http://evomics.org; last accessed February 24, 2018); we would like to thank all the faculty and participants in the workshop.
References
- Alexander DH, Novembre J, Lange K. 2009. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19(9):1655–1664.
- Andrews KR, Good JM, Miller MR, Luikart G, Hohenlohe PA. 2016. Harnessing the power of RADseq for ecological and evolutionary genomics. Nat. Rev. Genet. 17(2):81–92.
- Arnold B, Corbett-Detig RB, Hartl D, Bomblies K. 2013. RADseq underestimates diversity and introduces genealogical biases due to nonrandom haplotype sampling. Mol Ecol. 22(11):3179–3190.
- Catchen J, Hohenlohe PA, Bassham S, Amores A, Cresko WA. 2013. Stacks: an analysis tool set for population genomics. Mol Ecol. 22(11):3124–3140.
- Eaton DAR. 2014. PyRAD: assembly of de novo RADseq loci for phylogenetic analyses. Bioinformatics 30(13):1844–1849.
- Gautier M, Gharbi K, Cezard T, Foucaud J, Kerdelhué C, Pudlo P, Cornuet J-M, Estoup A. 2013. The effect of RAD allele dropout on the estimation of genetic variation within and between populations. Mol Ecol. 22(11):3165–3178.
- Hellenthal G, Busby GBJ, Band G, Wilson JF, Capelli C, Falush D, Myers S. 2014. A genetic atlas of human admixture history. Science 343(6172):747–751.
- Jombart T, Ahmed I. 2011. adegenet 1.3-1: new tools for the analysis of genome-wide SNP data. Bioinformatics 27(21):3070–3071.
- Lawson DJ, Hellenthal G, Myers S, Falush D. 2012. Inference of population structure using dense haplotype data. PLoS Genet. 8(1):e1002453.
- Leslie S, Winney B, Hellenthal G, Davison D, Boumertit A, Day T, Hutnik K, Royrvik EC, Cunliffe B, Wellcome Trust Case Control Consortium 2, et al. 2015. The fine-scale genetic structure of the British population. Nature 519(7543):309–314.
- Peterson BK, Weber JN, Kay EH, Fisher HS, Hoekstra HE. 2012. Double digest RADseq: an inexpensive method for de novo SNP discovery and genotyping in model and non-model species. PLoS One 7(5):e37135.
- Pritchard JK, Stephens M, Donnelly P. 2000. Inference of population structure using multilocus genotype data. Genetics 155(2):945–959.
- Raj A, Stephens M, Pritchard JK. 2014. fastSTRUCTURE: variational inference of population structure in large SNP data sets. Genetics 197(2):573–589.
- Trucchi E, Frajman B, Haverkamp THA, Schönswetter P, Paun O. 2017. Genomic analyses suggest parallel ecological divergence in Heliosperma pusillum (Caryophyllaceae). New Phytol. 216(1):267–278.