Abstract
Despite hundreds of sequenced Arabidopsis genomes, very little is known about the degree of genomic collinearity within a single species, due to the low number of chromosome-level assemblies. Here, we report chromosome-level reference-quality assemblies of seven Arabidopsis thaliana accessions selected across its global range. Each genome reveals 13–17 Mb of rearranged and 5–6 Mb of non-reference sequence, introducing copy-number changes in ~5000 genes, including ~1900 non-reference genes. Quantifying the collinearity between the genomes reveals ~350 euchromatic regions where accession-specific tandem duplications destroy the collinearity between the genomes. These hotspots of rearrangements are characterized by reduced meiotic recombination in hybrids and by genes implicated in biotic stress response. This suggests that hotspots of rearrangements undergo altered evolutionary dynamics, as compared to the rest of the genome, which are mostly based on the accumulation of new mutations and not on the recombination of existing variation, and thereby enable a quick response to biotic stress.
Subject terms: Evolutionary genetics, Genetic variation, Genome evolution, Plant evolution
Despite tremendous genomic resources in the Arabidopsis community, only a few whole genome de novo assemblies are available. Here, the authors report chromosome-level reference-quality assemblies of seven A. thaliana accessions and reveal hotspots of rearrangements with altered evolutionary dynamics.
Introduction
The individual genomes of sexually reproducing species are typically highly collinear to enable physical exchange of alleles during meiosis. This exchange ensures the generation of diversity and the removal of deleterious alleles1 and at the same time protects the offspring from major mutations changing the karyotype of a genome2. Despite the obvious importance of preserving a common karyotype, the presence of genomic rearrangements suggests that the genomes are in fact not entirely collinear. Genomic rearrangements (and the resulting lack of allelic exchange) have been shown to contribute to population diversification including the evolution of different sexes3 or life-history traits4.
But even though the absence of collinearity can have drastic effects, hardly anything is known about the actual degree of collinearity within populations, as most current genome studies are not based on chromosome-level assemblies. The first complete assembly of a plant genome was the reference sequence of A. thaliana (Col-0), which was based on a minimal tiling path of BACs sequenced with Sanger technology5. Since then, several hundred Arabidopsis genomes have been studied; however, most of these studies relied on short-read based resequencing or reference-guided assembly, where the identification of genomic rearrangements remained challenging6–12. In contrast, reference-independent, chromosome-level assemblies with almost complete reconstruction of the nucleotide sequence enable accurate identification of all sequence differences and would therefore reveal the degree of synteny across the genome13. So far, however, only a few whole-genome de novo assemblies are available for A. thaliana, including a re-assembly of the reference accession Col-0 as well as assemblies of four other accessions (Cvi-0, KBS-Mac-74, Ler, and Nd-1), which were generated in different studies and have not been compared against each other14–18.
Here we release chromosome-level assemblies of seven Arabidopsis thaliana accessions. We identify 13–17 Mb of genomic rearrangements and 5–6 Mb of non-reference sequence in each genome. We find genic copy-number variations in around 5000 genes, including ~1900 non-reference genes. We develop a metric called synteny diversity to quantify the collinearity between the genomes and identify 350 euchromatic hotspots of rearrangements, regions where collinearity between the genomes is strongly impaired. Further evolutionary analysis suggests that these regions are undergoing different evolutionary dynamics as compared to the rest of the genome, which contributes to a rapid response to biotic stress.
Results
Chromosome-level assemblies of seven A. thaliana genomes
Using deep PacBio (45–71×) and Illumina (56–78×) whole-genome shotgun sequencing, we assembled the genomes of seven accessions from geographically diverse regions including An-1 (Antwerpen, Belgium), C24 (Coimbra, Portugal), Cvi-0 (Cape Verde Islands), Eri-1 (Eringsboda, Sweden), Kyo (Kyoto, Japan), Ler (Gorzów Wielkopolski, Poland) and Sha (Shahdara, Tadjikistan) (Supplementary Table 1; see Methods). The assembly of Ler was already described in the context of the development of a whole-genome comparison tool used in this study13. The seven accessions (together with the reference accession Col-0) were initially used as the founder lines of the Arabidopsis Multi-parent Recombination Inbreeding Lines (AMPRIL)19 population and were selected to maximize the genetic diversity in this set. The contig assemblies featured N50 values from 4.8 to 11.2 Mb and were thus similar to other long-read assemblies of A. thaliana genomes. Chromosome-normalized L50 (CL50)20 values were 1 or 2, indicating that nearly all chromosomes were assembled into only a few contigs (Fig. 1, Table 1 and Supplementary Table 2). In comparison with the reference sequence, we found fewer collapsed repeat regions in each of the assemblies as well as 41 (out of 70) reference sequence gaps that could be bridged with contigs of the other assemblies, suggesting that the reference sequence could be improved using long-read assembly (Supplementary Table 3).
Table 1.
Statistic | Col-0a | An-1 | C24 | Cvi-0 | Eri-1 | Kyo | Ler | Sha |
---|---|---|---|---|---|---|---|---|
Contigs | – | 151 | 167 | 140 | 200 | 230 | 149 | 143 |
Pseudomolecules | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
Contig N50 (Mbp) | – | 8.2 | 4.8 | 7.4 | 4.8 | 9.1 | 11.2 | 7.0 |
Contig CL50b | – | 2 | 2 | 2 | 2 | 2 | 1 | 1 |
Chr. length (Mbp) | 119.1 | 118.4 | 117.7 | 118.3 | 117.7 | 118.8 | 118.5 | 118.4 |
Genes | 27,445 | 27,342 | 27,214 | 27,098 | 27,285 | 27,574 | 27,376 | 27,293 |
aReference sequence.
bChromosome number normalized L5020.
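The contig N50 values in Table 1 follow the usual definition: the length of the contig at which the cumulative length of the size-sorted contigs first reaches half of the total assembly length. L50 is the number of contigs needed to get there. The following is a minimal sketch of that definition; the function name and the toy lengths are hypothetical, and the chromosome-number normalization that turns L50 into CL50 is not reproduced here.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths in bp."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # The first contig that pushes the running sum to half the
        # total defines N50; its rank is L50.
        if running >= half:
            return length, count
    raise ValueError("no contigs given")


# Hypothetical toy contig lengths (bp), not taken from any assembly.
toy_contigs = [11_200_000, 9_000_000, 8_500_000, 7_000_000, 3_000_000, 1_300_000]
n50, l50 = n50_l50(toy_contigs)
print(n50, l50)
```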
We arranged 43–73 contigs of each assembly to chromosome-level pseudomolecules based on homology to the reference sequence. Even though these assemblies do rely on the reference sequence, we would like to point out that the sequence assembly itself was independent of the reference sequence, and that the contigs were large in general, implying that it is unlikely that we misplaced any of the contigs. To confirm this, we compared two of the chromosome-level assemblies with three different genetic maps, where we did not find even a single misplaced contig (Supplementary Table 4). The seven chromosome-level assemblies reached a total length of 117.7–118.8 Mb, which is very similar to the 119.1 Mb of the reference sequence (Table 1) and even included parts of the highly complex regions of centromeres, telomeres and rDNA clusters (Supplementary Data 1 and Supplementary Table 5). The remaining unanchored contigs had a total length of 1.5–3.3 Mb and consisted almost entirely of repeats. This agrees with gaps between the contigs, which were mostly introduced due to repetitive regions (Supplementary Table 6). Overall, we annotated 27,098–27,574 protein-coding genes in each of the assemblies, which is similar to the 27,445 genes annotated in the reference sequence21 (Table 1, Supplementary Data 2 and Supplementary Tables 7–8) (see Methods).
Identification of syntenic and rearranged regions
By comparing each of the new assemblies against the reference sequence using the whole-genome comparison tool SyRI (V1.1)13, we found 102.2–106.6 Mb of collinear regions and 12.6–17.0 Mb of rearranged regions in each of the genomes (Fig. 2a). The rearrangements included 1.5–4.2 Mb (33–46) inversions, 1.8–2.9 Mb (729–1192) translocations, and—most abundantly—polymorphic duplications, which comprised 7.2–8.7 Mb (4288–5150) within each of the individual genomes (Supplementary Table 9). Similar to small-scale sequence variation22, rearrangements were not evenly distributed along the chromosomes, but were enriched in pericentromeres (Supplementary Table 10). Their lengths ranged from a few dozen bp to hundreds of kb and even Mb scale (Fig. 2b), including a 2.48 Mb inversion specific to chromosome 3 of Sha (Supplementary Fig. 1 and Supplementary Table 11), which explains the suppression of meiotic recombination in this region in hybrids including the Sha haplotype23–25. Sequence divergence in rearranged regions was generally higher as compared to collinear regions mostly due to an excess of local copy-gain and copy-loss variation in rearranged regions (Fig. 2a, Supplementary Fig. 2 and Supplementary Table 12).
Gene copy-number variations and pan-genome
Genomic rearrangements have the potential to delete, create or duplicate genes, resulting in gene copy number variation (CNV). Based on the clustering of orthologous genes across all eight accessions26 we found 22,040 gene families with conserved copy number, while 4957 gene families showed differences in gene copy numbers in at least one accession (Fig. 2c and Supplementary Table 13). Almost 99% of these copy-variable gene families had a maximum copy number of 5 or less, while fewer than 10% of them showed more than two different copy numbers across the eight accessions (Supplementary Fig. 3). Among the copy-variable genes we found 1941 non-reference gene families including 891 gene families present in at least two of the other accessions (Fig. 2c). Around 23% of the non-reference gene families featured orthologs in the closely related genome of Arabidopsis lyrata and, according to RNA-seq read mapping, 26–40% of them showed evidence of expression (Supplementary Table 14). The remaining 1050 non-reference (accession-specific) gene families were evenly distributed across the accessions (Fig. 2c), with the exception of Cvi-0, where we found nearly twice as many (214) accession-specific genes, which is in agreement with the divergent ancestry of this relict accession8,27.
Based on all possible pairwise genome comparisons, we identified 5.1–6.5 Mb accession-specific sequence and used this to estimate a pan-genome size of ~135 Mb including ~30,000 genes and a core-genome size of ~105 Mb with ~24,000 genes (Fig. 2d)28 illustrating that one reference genome is not sufficient to capture the entire sequence diversity within A. thaliana29. Deeper sampling including accessions from other populations (for example by including more of the highly divergent accessions from Africa27) could lead to higher estimates of the pan-genome. This has been observed in short-read sequencing-based pan-genome analyses of rice and tomatoes30,31, even though such comparisons are difficult not only due to the different samplings (even including the integration of subspecies), but also due to the high-contiguity of chromosome-level assemblies, which will reveal more of the hidden genes in complex genomic regions.
Quantification of genome collinearity
As only a few chromosome-level assemblies are available, hardly anything is known about genome collinearity within A. thaliana. In contrast, our chromosome-level assemblies allow for an analysis of the conservation of the genome collinearity between multiple individuals. To quantify collinearity we developed a parameter called synteny diversity πsyn, which is similar to nucleotide diversity32; however, instead of measuring average sequence differences, it measures the degree of collinearity between the genomes of a population (see Methods). πsyn values can range from 0 to 1, where 1 refers to the complete absence of collinearity between any of the genomes and 0 to regions where all genomes are collinear. πsyn can be calculated in any given region; however, the annotation of collinearity still needs to be established within the context of the whole genomes to avoid false assignments of homologous but non-allelic regions.
We calculated πsyn in 5-kb sliding windows across the genome using pairwise comparisons of all eight accessions (Fig. 3a). As expected, πsyn was generally high in pericentromeric regions and low in chromosome arms. Overall, this revealed around 90 Mb (76% of the genome) where all genomes were collinear to each other, while for the remaining 29 Mb (24%) the collinearity between the genomes was not conserved. This, for example, included a region on chromosome 3 (ranging from Mb ~2.8–5.3), where πsyn was increased to ~0.25 (i.e., one genome is not collinear to all other seven genomes) due to the 2.48 Mb inversion in the Sha genome (Fig. 3a, arrow labelled with (A)).
Hotspots of rearrangements
Unexpectedly, however, some regions featured πsyn values even larger than 0.5. This implied that not only two, but also multiple independent, non-collinear haplotypes segregate in these regions. In turn, this suggests that these regions are more likely to undergo or conserve complex mutations as compared to the rest of the genomes and thereby create hotspots of rearrangements (HOT regions) where multiple accessions independently evolved diverse haplotypes. Overall, we found 576 such HOT regions with a total size of 10.2 Mb, including 351 HOT regions in the gene-rich chromosome arms with a total length of 4.1 Mb (or 4% of the euchromatic genome) (Supplementary Data 3).
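The step from πsyn > 0.5 to more than two haplotypes can be made explicit under one reading of the definition given in Methods. As assumptions that are not stated in this form in the text, suppose that the genomes in a window fall into exactly two haplotype classes with frequencies p and 1 − p, that two genomes of the same class are fully collinear (πij = 0), that two genomes of different classes are fully non-collinear (πij = 1), and that the sum runs over all ordered pairs of genomes. Then

```latex
\pi_{syn} \;=\; \sum_{ij} x_i x_j \pi_{ij} \;=\; 2p(1-p) \;\le\; \tfrac{1}{2},
\qquad \text{with equality only at } p = \tfrac{1}{2}.
```

Under these assumptions, two segregating haplotypes cannot push a window above 0.5, so larger values require at least three mutually non-collinear haplotypes.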
Even though HOT regions in euchromatic regions included more transposable elements and fewer genes as compared to the collinear regions, they still contained substantial numbers of genes, many of which occurred at high and variable copy-number between the accessions (Fig. 3b, c). For example, a HOT region on chromosome 4, which overlapped with the RPP4/RPP5 R gene cluster33, displayed 5–15 intact or truncated copies of the RPP5 gene across the eight genomes (Fig. 3d and Supplementary Table 15). The different gene copies were primarily introduced by an accumulation of forward tandem duplications and large indels (Fig. 3e).
This remarkable pattern of forward tandem duplications and large indels was shared by many of the HOT regions (Fig. 3c and Supplementary Fig. 4). The clear pattern of almost exclusively forward tandem duplications suggested higher mutation (duplication) rates, which are specific to these regions in each of the accessions. In contrast, the borders of the HOT regions were surprisingly well conserved across the accessions (Fig. 3f). This suggested that either different selection regimes introduced clear-cut borders between the HOT regions and their vicinity, or that HOT regions are specific targets of increased tandem duplication rates. Such a local increase of mutation rates could potentially be mediated by non-allelic homologous recombination, which could be triggered by the high number of local repeats in these regions34. Figure 4 shows two more examples of these complex regions.
In contrast, meiotic recombination in Arabidopsis was shown to be suppressed by structural diversity35. To test if HOTs are indeed depleted for meiotic recombination, we overlapped rearranged regions with 15,683 crossover (CO) sites previously identified within Col-0/Ler F2 progenies35,36. Only 64 of them partially overlapped with non-syntenic regions while all other COs were found in syntenic regions (Fig. 5a), suggesting that HOT regions are almost completely silenced for COs (χ2 test, p < 0.001). In consequence, this would imply that HOT regions are segregating as large non-recombining regions. To test this, we analysed the linkage disequilibrium (LD) within 1135 genomes of the 1001 Genomes Project8 around and across the HOT regions. LD increased with proximity to the HOT regions, implying reduced recombination in the regions surrounding them. Likewise, LD was also high within HOT regions, corroborating the recombination suppression in HOT regions. However, when calculated across the border of these regions, LD was significantly lower (one-sided U test, p < 0.001), supporting the idea that HOT regions are not strongly linked to the surrounding haplotypes and that they hardly exchange alleles (Fig. 5b).
Reduced meiotic recombination has been linked to the accumulation of new (deleterious) mutations37. In agreement with this, HOT regions showed an accumulation of SNPs with low allele frequencies and potentially deleterious variation (one-sided U test, p < 0.001) as compared to other regions in the genome (Fig. 5c, d and Supplementary Fig. 5). Moreover, reduced recombination combined with geographic isolation can provide the basis for the development of alleles that are incompatible with distantly related haplotypes, leading to intra-species incompatibilities38. To test this, we searched the location of nine recently reported genetic incompatibility loci39 (DM1-9) and found that all except one overlapped with HOT regions, while DM3, the locus which did not overlap with a HOT region, was closely flanked by two HOT regions (Figs. 3d, 4a and Supplementary Figs. 6–11). In addition, we also checked the locus of a recently published single-locus genetic incompatibility40 and found that it also resided in a HOT region (Supplementary Fig. 12).
The high structural diversity of the HOT regions was reminiscent of the patterns that have been described for R gene clusters41–44. In fact, the 808 reference genes in HOT regions were significantly enriched for genes involved in defense response, signal transduction and secondary metabolite biosynthesis (Fig. 5e), suggesting a recurring role of HOT regions in the adaptation to biotic stress. Further comparison with the outcrossing sister species A. lyrata showed that 504 (87.5%) of 576 HOT regions actually have no homologous sequences in A. lyrata. The flanking regions of nearly one third of the HOT regions remained collinear with A. lyrata, while the flanking regions of the other HOT regions were structurally rearranged, suggesting that HOT regions are likely to evolve in regions that are not conserved between the two species.
Discussion
As biotic stress is an evolving environmental challenge, the Red Queen hypothesis suggests that the genomes of A. thaliana are in constant need to diversify their offspring45. It has been proposed that in response to this, meiotic recombination might increase, thereby generating diversified offspring46. However, exclusively shuffling existing variation might not be sufficient to respond to the evolution of pathogens. Instead, it has been proposed that the accumulation of new gene duplicates could enable a rapid genomic response of plants against pathogens34,47,48. The hotspots of rearrangements have the potential to build the basis for such a response, as frequent gene duplications could provide an evolutionary playground to evolve a quick response to the challenges of biotic stress and overcome fitness valleys during the evolution of more complex functions. This, in turn, comes at the cost of loss of synteny and the loss of meiotic recombination between distant haplotypes. Though it still needs to be analyzed whether local populations show the same level of diversity or if their haplotypes in HOT regions are more similar and still exchange alleles, we have observed the negative consequences of reduced meiotic recombination in this small world-wide population, including the accumulation of deleterious alleles and incompatible epistatic effects between distant genotypes.
Taken together, using chromosome-level genome assemblies of a small, highly diverse population of A. thaliana, we have identified regions where genome collinearity was lost through genome-specific accumulation of mutations. These quickly evolving sequences do not spread through the population by meiotic recombination-based exchange between haplotypes (as recombination is suppressed by structural variation) or by haplotype-specific drift or selection (similar to an inversion allele), as the haplotypes change more rapidly than they are distributed through the population. Instead, these regions appear to evolve through rapid mutation. We propose that these regions, which we call hotspots of rearrangements or HOT regions, facilitate evolutionary responses to rapidly changing environmental challenges and that these regions are thus undergoing different evolutionary dynamics as compared to the rest of the genome, where each region segregates with only a few haplotypes. Future genome-wide screens for selection patterns should take such regions and their specific characteristics into account, in particular as they might be missed with conventional marker-based selection scans.
Methods
Plant material and whole-genome sequencing
We received the seeds of all seven accessions from Maarten Koornneef (MPI for Plant Breeding Research), and grew them under normal greenhouse conditions. The stock center IDs of the seeds are listed in Supplementary Table 1. DNA preparation and next generation sequencing were performed by the Max Planck Genome center. DNA was extracted from multiple individuals using the NucleoSpin® Plant II Maxi Kit from Macherey-Nagel, prepared using SMRTbell Template Prep Kit 1.0-SPv3 with SMRTbell Damage Repair Kit -SPv3 and BluePippin size selection for fragments >9/10 kb, and sequenced with a PacBio Sequel system. For each accession, data from two SMRT cells were generated. In addition, Illumina paired-end libraries were prepared and sequenced on the Illumina HiSeq system.
Genome assembly
PacBio reads were filtered to remove short (<50 bp) or low quality (QV < 80) reads using the SMRTLink5 package. De novo assembly of each genome was initially performed using three different assembly tools including Falcon17, Canu49, and MECAT50. The resulting assemblies were polished with Arrow from the SMRTLink5 package and then further corrected by mapping Illumina short reads using BWA51 to remove small-scale assembly errors, which were identified with SAMTools52. For each genome, the final assembly was based on the Falcon assembly as these assemblies always showed the highest assembly contiguity. A few contigs were further connected or extended based on whole-genome alignments between Falcon and Canu or MECAT assemblies. Contigs were labelled as organellar contigs if they showed alignment identity and coverage both larger than 95% when aligned against the mitochondrial or chloroplast reference sequences. A few contigs aligned to multiple chromosomes and were split if no Illumina short-read alignments supported the conflicting regions. Assembly contigs larger than 20 kb were combined to pseudo-chromosomes according to their alignment positions when aligned against the reference sequence using MUMmer453. Contigs with consecutive alignments were concatenated with a stretch of 500 Ns. Note that the assembly of the Ler accession was already described in a recent study13.
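The last assembly step, joining reference-ordered contigs with a stretch of 500 Ns, can be sketched as follows. This is only an illustration under simplifying assumptions: each contig is represented by a hypothetical `(reference_start, sequence)` pair for a single chromosome, contig orientation and overlapping placements are ignored, and the function and parameter names are made up for this example.

```python
GAP = "N" * 500  # gap length stated in the text


def build_pseudochromosome(placements, min_len=20_000):
    """Join contigs of one chromosome into a single sequence.

    placements: list of (reference_start, contig_sequence) tuples,
                where reference_start is the alignment position of the
                contig on the reference chromosome.
    min_len:    contigs must be larger than this length (20 kb in the text).
    """
    kept = [p for p in placements if len(p[1]) > min_len]
    ordered = sorted(kept, key=lambda p: p[0])
    # Consecutive contigs are separated by a run of 500 Ns.
    return GAP.join(seq for _, seq in ordered)


# Toy usage with a tiny length cutoff so the short strings pass the filter.
toy = [(500, "CCCC"), (10, "AAAA"), (900, "GG")]
pseudo = build_pseudochromosome(toy, min_len=3)
print(len(pseudo))
```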
Assembly evaluation
We evaluated the assembly completeness by aligning the reference genes against each of the seven genomes using Blastn54. Reference genes which were not aligned or only partially aligned might reveal genes which were missed during the assembly. To examine whether they were really missed, we mapped Illumina short reads from each genome against the reference genome using BWA51 and checked the mapping coverage of these genes. Genes that are present in a genome but were missed in its assembly should show full read-alignment coverage (Supplementary Table 7).
Centromeric and telomeric tandem repeats were annotated by searching for the 178 bp tandem repeat unit55 and the 7 bp tandem repeat unit of TTTAGGG56. rDNA clusters were annotated with Infernal version 1.157.
The assembly contiguity of Cvi-0 and Ler was further tested using three previously published genetic maps24,58,59 (Supplementary Table 4). For this we aligned the marker sequences against the chromosome-level assemblies and checked the order of the markers in the assembly versus their order in the genetic map. The ordering of contigs to chromosomes was perfectly supported by all three maps. Overall, only six (out of 1156) markers showed conflicts between the genetic and physical map. In all six cases we found evidence that the conflict was likely caused by structural differences between the parental genomes.
We also searched for potentially collapsed regions in each assembly. For this, we checked the normalized mapping coverage in non-overlapping 100 bp windows based on Illumina short-read mapping (using BWA). Collapsed regions are expected to have significantly higher coverage than the correctly assembled regions. Here, windows with two-fold increase of mapping coverage were defined as collapsed regions. Continuous collapsed regions were merged.
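The window-based screen described above can be sketched as follows. The sketch assumes that coverage has already been normalized so that 1.0 corresponds to the expected single-copy coverage (how the normalization was done is not specified in the text), and it interprets a "two-fold increase" as a normalized value of at least 2. The function name and the toy values are hypothetical.

```python
def collapsed_regions(window_cov, window=100, fold=2.0):
    """Return merged (start, end) intervals of putatively collapsed sequence.

    window_cov: normalized coverage of consecutive, non-overlapping
                windows along one sequence (1.0 = expected coverage).
    window:     window size in bp (100 bp in the text).
    fold:       coverage threshold for calling a window collapsed.
    """
    regions = []
    for idx, cov in enumerate(window_cov):
        if cov < fold:
            continue
        start, end = idx * window, (idx + 1) * window
        # Merge with the previous region if the windows are adjacent.
        if regions and regions[-1][1] == start:
            regions[-1] = (regions[-1][0], end)
        else:
            regions.append((start, end))
    return regions


# Toy coverage track: windows 1-2 and window 4 exceed the threshold.
print(collapsed_regions([1.0, 2.1, 2.5, 0.9, 3.0]))
```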
Gene annotation
Protein-coding genes were annotated based on ab initio gene predictions, protein sequence alignments and RNA-seq data. Three ab initio gene prediction tools were used including Augustus60, GlimmerHMM61 and SNAP62. The reference protein sequences from the Araport 1121 annotation were aligned to each genome assembly using exonerate63 with the parameter setting “--percent 70 --minintron 10 --maxintron 60000”. For five accessions (An-1, C24, Cvi-0, Ler, and Sha) we downloaded a total of 155 RNA-seq datasets from the NCBI SRA database (Supplementary Data 2). RNA-seq reads were mapped to the corresponding genome using HISAT264 and then assembled into transcripts using StringTie65 (both with default parameters). All lines of evidence were integrated into consensus gene models using Evidence Modeler66.
The resulting gene models were further evaluated and updated using the Araport 1121 annotation. Firstly, for each of the seven genomes, the predicted gene and protein sequences were aligned to the reference sequence, while all reference gene and protein sequences were aligned to each of the other seven genomes using Blast54. Then, potentially mis-annotated genes including mis-merged (two or more genes are annotated as a single gene), mis-split (one gene is annotated as two or more genes) and unannotated genes were identified based on the alignments using in-house python scripts. Mis-annotated or unannotated genes were corrected or added by incorporating the open reading frames generated by ab initio predictions or protein sequence alignment using Scipio67.
Noncoding genes were annotated by searching the Rfam database68 using Infernal version 1.157. Transposon elements were annotated with RepeatMasker (http://www.repeatmasker.org). Disease resistance genes were annotated using RGAugury69. NB-LRR R gene clusters were defined based on the annotation from a previous study70.
Pan-genome analysis
Pan-genome analyses were performed at both sequence and gene level. To construct a pan-genome of sequences, we generated pairwise whole-genome sequence alignments of all possible pairs of the eight genomes using the nucmer in the software package MUMmer453. A pan-genome was initiated by choosing one of the genomes, followed by iteratively adding the non-aligned sequence of one of the remaining genomes. Here, non-aligned sequences were required to be longer than 100 bp without alignment with an identity of more than 90%. The core genome was defined as the sequence space shared by all sampled genomes. Like the pan-genome, the core-genome analysis was initiated with one genome. Then all other genomes were iteratively added, while excluding all those regions, which were not aligned against each of the other genomes. The pan- and core-genome of genes was built in a similar way. The pan-genome of genes was constructed by selecting the whole protein-coding gene set of one of the accessions followed by iteratively adding the genes of one of the remaining accessions. Likewise, the core-genome of genes was defined as the genes shared in all sampled genomes.
For each pan or core-genomes analysis, all possible combinations of integrating the eight genomes (or a subset of them) were evaluated (Eq. 1). The exponential regression model (Eq. 2) was then used to model the pan-genome/core-genomes by fitting medians using the least square method implemented in the nls function of R.
1 |
2 |
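The gene-level accumulation procedure described above can be sketched as follows: for every subset size, all combinations of genomes are evaluated and the median pan- and core-genome sizes are recorded. This is a simplified illustration, not the study's implementation. It assumes that each accession is represented by a set of gene-family identifiers, so that the pan-genome is a set union and the core-genome a set intersection regardless of the order in which genomes are added. The function name and toy data are hypothetical, and the subsequent exponential regression (fitted with `nls` in R) is not reproduced.

```python
from itertools import combinations
from statistics import median


def pan_core_medians(gene_sets):
    """Median pan- and core-genome sizes for every number of genomes.

    gene_sets: dict mapping accession name -> set of gene-family IDs.
    Returns {k: (median_pan_size, median_core_size)} for k = 1..n.
    """
    names = sorted(gene_sets)
    result = {}
    for k in range(1, len(names) + 1):
        pan_sizes, core_sizes = [], []
        for combo in combinations(names, k):
            sets = [gene_sets[name] for name in combo]
            pan_sizes.append(len(set.union(*sets)))          # found in any genome
            core_sizes.append(len(set.intersection(*sets)))  # shared by all genomes
        result[k] = (median(pan_sizes), median(core_sizes))
    return result


# Toy gene-family sets for three hypothetical accessions.
toy = {"A": {1, 2, 3}, "B": {2, 3, 4}, "C": {3, 4, 5}}
print(pan_core_medians(toy))
```

The medians per subset size are the values to which a saturation curve would then be fitted.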
Analysis of structural rearrangements and gene CNV
All assemblies were aligned to the reference sequence using nucmer from the MUMmer453 toolbox with parameter setting “-max -l 40 -g 90 -b 100 -c 200”. The resulting alignments were further filtered for alignment length (>100 bp) and identity (>90%). Structural rearrangements and local variations were identified using SyRI13. The functional effects of sequence variation were annotated with snpEff71. Gene CNVs were identified according to the gene family clustering using the tool OrthoFinder26 based on all protein sequences from the eight accessions.
Definition of synteny diversity
Synteny diversity was defined as the average fraction of non-syntenic sites found within all pairwise genome comparisons within a given population. Here we denote synteny diversity as (Eq. 3)
$$\pi_{syn} = \sum_{ij} x_i x_j \pi_{ij} \qquad (3)$$
where xi and xj refer to the frequencies of sequences i and j, and πij to the average probability of a position being non-syntenic between sequences i and j. Note that πsyn can be calculated for a given region or for the entire genome. However, even when calculated for small regions, the annotation of synteny still needs to be established within the context of the whole genomes to avoid false assignments of homologous but non-allelic sequence. Here we used the annotation of SyRI to define syntenic regions. πsyn values can range from 0 to 1, with higher values referring to a higher average degree of non-syntenic regions between the genomes.
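A minimal sketch of this calculation, taking the symbol definitions above literally: the sum runs over all ordered pairs (i, j), every genome is treated as its own sequence with equal frequency xi = 1/n, and πii = 0. These choices are assumptions made for illustration, and the function name and the toy matrix are hypothetical. With this reading, a single genome that is fully non-syntenic to seven otherwise collinear genomes yields 14/64 ≈ 0.22; averaging over the 28 unordered pairs instead would give 7/28 = 0.25.

```python
def synteny_diversity(nonsyn, freqs=None):
    """Synteny diversity of one region.

    nonsyn: square matrix where nonsyn[i][j] is the fraction of positions
            in the region that are non-syntenic between genomes i and j
            (the diagonal is expected to be 0).
    freqs:  sequence frequencies x_i; defaults to 1/n for each genome
            (an assumption of this sketch).
    """
    n = len(nonsyn)
    if freqs is None:
        freqs = [1.0 / n] * n
    return sum(
        freqs[i] * freqs[j] * nonsyn[i][j]
        for i in range(n)
        for j in range(n)
    )


# Toy case: eight genomes, the last one is non-syntenic to all others
# across the whole region (as for a private inversion).
n = 8
toy = [[1.0 if (i == n - 1) != (j == n - 1) else 0.0 for j in range(n)]
       for i in range(n)]
print(synteny_diversity(toy))
```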
Analysis of hotspots of rearrangements
For the analyses, we calculated πsyn in 5-kb sliding windows with 1 kb step-size across the entire genome. HOT regions were defined as regions with πsyn larger than 0.5. Neighboring regions were merged into one HOT region if their distance was shorter than 2 kb.
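The thresholding and merging steps described above can be sketched as follows, assuming that πsyn has already been computed per window. Windows are given as hypothetical `(start, end, pi_syn)` tuples sorted by start position on one chromosome; the function name is made up for this example.

```python
def call_hot_regions(windows, threshold=0.5, max_gap=2000):
    """Merge high-diversity windows into HOT regions.

    windows:   list of (start, end, pi_syn) tuples sorted by start.
    threshold: windows with pi_syn larger than this value are kept.
    max_gap:   kept windows closer than this distance (bp) are merged;
               overlapping windows have a negative distance and are
               therefore always merged.
    """
    hot = []
    for start, end, pi_syn in windows:
        if pi_syn <= threshold:
            continue
        if hot and start - hot[-1][1] < max_gap:
            hot[-1] = (hot[-1][0], max(hot[-1][1], end))
        else:
            hot.append((start, end))
    return hot


# Toy windows of 5 kb; the first, second and fourth merge into one region.
toy = [
    (0, 5000, 0.6),
    (1000, 6000, 0.7),
    (2000, 7000, 0.2),
    (7000, 12000, 0.8),
    (20000, 25000, 0.9),
]
print(call_hot_regions(toy))
```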
The nucleotide and haplotype diversity were calculated with the R package PopGenome72 using SNP markers (with MAF > 0.05) from the 1001 Genomes Project8. LD was calculated as the correlation coefficient r2 using SNP markers with MAF > 0.05. GO enrichment analysis was performed using the webtool DAVID73.
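For reference, r2 between two biallelic SNPs can be computed as the squared Pearson correlation of their allele vectors. The sketch below assumes that each accession is coded as 0 (reference allele) or 1 (alternative allele), which treats the inbred accessions as homozygous; this coding and the function names are illustrative assumptions and not taken from the analysis code of the study.

```python
def maf(alleles):
    """Minor allele frequency of a 0/1 allele vector."""
    p = sum(alleles) / len(alleles)
    return min(p, 1 - p)


def r_squared(a, b):
    """Squared Pearson correlation between two 0/1 allele vectors."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("vectors must be non-empty and of equal length")
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / n
    var_b = sum((y - mean_b) ** 2 for y in b) / n
    if var_a == 0 or var_b == 0:
        raise ValueError("monomorphic site")
    return cov * cov / (var_a * var_b)


# Toy allele vectors across four accessions.
snp1 = [0, 0, 1, 1]
snp2 = [0, 0, 1, 1]  # perfectly linked with snp1
snp3 = [0, 1, 0, 1]  # independent of snp1
print(r_squared(snp1, snp2), r_squared(snp1, snp3))
```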
We performed a synteny comparison between A. thaliana HOT regions and A. lyrata74. Although the two species have rearranged karyotypes, they share collinear regions, so-called Ancestral Crucifer Karyotype blocks (ACK blocks)75. The genome sequences were split into ACK blocks, and aligned with the tool nucmer. The syntenic regions were defined by the tool SyRI. We checked the alignment of each HOT region and its 5 kb flanking regions to see whether the regions were in syntenic or rearranged regions as compared to A. lyrata.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Acknowledgements
The authors would like to thank Beth A. Rowan (UC Davis) for providing the CO breakpoint list prior to publication, Bruno Hüttel (Max Planck Genome center) for support in genome sequencing, Sigi Effgen and Maarten Koornneef (Max Planck Institute for Plant Breeding Research) for providing seeds, Onur Dogan (Max Planck Institute for Plant Breeding Research) for help in the greenhouse, Angela M. Hancock (Max Planck Institute for Plant Breeding Research) for helpful discussions, and Raphael Mercier and Padraic J. Flood (Max Planck Institute for Plant Breeding Research) for helpful comments on the manuscript and the interpretation of HOT regions. K.S. gratefully acknowledges support from European Research Council (ERC) Grant “INTERACT” (802629).
Author contributions
W.-B.J. and K.S. designed the study. W.-B.J. performed all analysis. K.S. supervised the study. W.-B.J. and K.S. wrote the manuscript. All authors read and approved the final manuscript.
Data availability
Data supporting the findings of this work are available within the paper and its Supplementary Information files. The datasets generated and analyzed during the current study are available from the corresponding author upon request. Raw sequencing data, assemblies and annotations can be accessed in the European Nucleotide Archive under the project accession number PRJEB31147. Assemblies, annotations, variation and orthologs can be found on the 1001 Arabidopsis thaliana Genomes webpage (https://1001genomes.org/data/MPIPZ/MPIPZJiao2020/releases/current/). Previously reported RNA-seq data from five accessions (An-1, C24, Cvi-0, Ler, and Sha) were downloaded from the NCBI SRA database (the NCBI and ENA accession codes are included in Supplementary Data 2). The SNP markers from the 1001 Genomes Project can be downloaded from https://1001genomes.org/data/GMI-MPI/releases/v3.1/. The source data underlying Figs. 1–5 and Supplementary Figs. 1–12 are provided as a Source Data file.
Code availability
Custom code used in this study can be freely accessed at https://github.com/schneebergerlab/AMPRIL-genomes.
Competing interests
The authors declare no competing interests.
Footnotes
Peer review information: Nature Communications thanks Levi Yant, Christopher Town and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary information is available for this paper at https://doi.org/10.1038/s41467-020-14779-y.
References
- 1. McDonald MJ, Rice DP, Desai MM. Sex speeds adaptation by altering the dynamics of molecular evolution. Nature. 2016;531:233–236. doi: 10.1038/nature17143.
- 2. Heng HHQ. Elimination of altered karyotypes by sexual reproduction preserves species identity. Genome. 2007;50:517–524. doi: 10.1139/G07-039.
- 3. Lamichhaney S, et al. Structural genomic changes underlie alternative reproductive strategies in the ruff (Philomachus pugnax). Nat. Genet. 2015;48:84–88. doi: 10.1038/ng.3430.
- 4. Lowry DB, Willis JH. A widespread chromosomal inversion polymorphism contributes to a major life-history transition, local adaptation, and reproductive isolation. PLoS Biol. 2010;8:e1000500. doi: 10.1371/journal.pbio.1000500.
- 5. The Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature. 2000;408:796–815.
- 6. Cao J, et al. Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nat. Genet. 2011;43:956–965. doi: 10.1038/ng.911.
- 7. Long Q, et al. Massive genomic variation and strong selection in Arabidopsis thaliana lines from Sweden. Nat. Genet. 2013;45:884–890. doi: 10.1038/ng.2678.
- 8. Alonso-Blanco C, et al. 1,135 Genomes reveal the global pattern of polymorphism in Arabidopsis thaliana. Cell. 2016;166:481–491. doi: 10.1016/j.cell.2016.05.063.
- 9. Schneeberger K, et al. Reference-guided assembly of four diverse Arabidopsis thaliana genomes. Proc. Natl Acad. Sci. USA. 2011;108:10249–10254. doi: 10.1073/pnas.1107739108.
- 10. Gan X, et al. Multiple reference genomes and transcriptomes for Arabidopsis thaliana. Nature. 2011;477:419–423. doi: 10.1038/nature10414.
- 11. Schneeberger K, et al. Simultaneous alignment of short reads against multiple genomes. Genome Biol. 2009;10:R98. doi: 10.1186/gb-2009-10-9-r98.
- 12. Schmitz RJ, et al. Patterns of population epigenomic diversity. Nature. 2013;495:193–198. doi: 10.1038/nature11968.
- 13. Goel M, Sun H, Jiao W-B, Schneeberger K. SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies. Genome Biol. 2019;20:277. doi: 10.1186/s13059-019-1911-0.
- 14. Zapata L, et al. Chromosome-level assembly of Arabidopsis thaliana Ler reveals the extent of translocation and inversion polymorphisms. Proc. Natl Acad. Sci. USA. 2016;113:E4052–E4060. doi: 10.1073/pnas.1607532113.
- 15. Michael TP, et al. High contiguity Arabidopsis thaliana genome assembly with a single nanopore flow cell. Nat. Commun. 2018;9:1–8. doi: 10.1038/s41467-017-02088-w.
- 16. Pucker B, et al. A chromosome-level sequence assembly reveals the structure of the Arabidopsis thaliana Nd-1 genome and its gene set. PLoS One. 2019;14:e0216233. doi: 10.1371/journal.pone.0216233.
- 17. Chin C-S, et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods. 2016;13:1050–1054. doi: 10.1038/nmeth.4035.
- 18. Berlin K, et al. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat. Biotechnol. 2015;33:623–630. doi: 10.1038/nbt.3238.
- 19. Huang X, et al. Analysis of natural allelic variation in Arabidopsis using a multiparent recombinant inbred line population. Proc. Natl Acad. Sci. USA. 2011;108:4488–4493. doi: 10.1073/pnas.1100465108.
- 20. Jiao W-B, et al. Improving and correcting the contiguity of long-read genome assemblies of three plant species using optical mapping and chromosome conformation capture data. Genome Res. 2017;27:778–786. doi: 10.1101/gr.213652.116.
- 21. Cheng CY, et al. Araport11: a complete reannotation of the Arabidopsis thaliana reference genome. Plant J. 2017;89:789–804. doi: 10.1111/tpj.13415.
- 22. Clark RM, et al. Common sequence polymorphisms shaping genetic diversity in Arabidopsis thaliana. Science. 2007;317:338–342. doi: 10.1126/science.1138632.
- 23. Loudet O, Chaillou S, Camilleri C, Bouchez D, Daniel-Vedele F. Bay-0 x Shahdara recombinant inbred line population: a powerful tool for the genetic dissection of complex traits in Arabidopsis. Theor. Appl. Genet. 2002;104:1173–1184. doi: 10.1007/s00122-001-0825-9.
- 24. Simon M, et al. Quantitative trait loci mapping in five new large recombinant inbred line populations of Arabidopsis thaliana genotyped with consensus single-nucleotide polymorphism markers. Genetics. 2008;178:2253–2264. doi: 10.1534/genetics.107.083899.
- 25. Salomé PA, et al. The recombination landscape in Arabidopsis thaliana F2 populations. Heredity. 2012;108:447–455. doi: 10.1038/hdy.2011.95.
- 26. Emms DM, Kelly S. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol. 2015;16:157. doi: 10.1186/s13059-015-0721-2.
- 27. Durvasula A, et al. African genomes illuminate the early history and transition to selfing in Arabidopsis thaliana. Proc. Natl Acad. Sci. USA. 2017;114:5213–5218. doi: 10.1073/pnas.1616736114.
- 28. Tettelin H, et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: Implications for the microbial ‘pan-genome’. Proc. Natl Acad. Sci. USA. 2005;102:13950–13955. doi: 10.1073/pnas.0506758102.
- 29. Van de Weyer AL, et al. A Species-Wide Inventory of NLR Genes and Alleles in Arabidopsis thaliana. Cell. 2019;178:1260–1272.e14. doi: 10.1016/j.cell.2019.07.038.
- 30. Wang W, et al. Genomic variation in 3,010 diverse accessions of Asian cultivated rice. Nature. 2018;557:43–49. doi: 10.1038/s41586-018-0063-9.
- 31. Gao L, et al. The tomato pan-genome uncovers new genes and a rare allele regulating fruit flavor. Nat. Genet. 2019;51:1044–1051. doi: 10.1038/s41588-019-0410-2.
- 32. Nei M, Li WH. Mathematical model for studying genetic variation in terms of restriction endonucleases. Proc. Natl Acad. Sci. USA. 1979;76:5269–5273. doi: 10.1073/pnas.76.10.5269.
- 33. Parker JE. The Arabidopsis downy mildew resistance gene RPP5 shares similarity to the toll and interleukin-1 receptors with N and L6. Plant Cell. 1997;9:879–894. doi: 10.1105/tpc.9.6.879.
- 34. Leister D. Tandem and segmental gene duplication and recombination in the evolution of plant disease resistance genes. Trends Genet. 2004;20:116–122. doi: 10.1016/j.tig.2004.01.007.
- 35. Rowan BA, et al. An ultra high-density Arabidopsis thaliana crossover map that refines the influences of structural variation and epigenetic features. Genetics. 2019;213:771–787. doi: 10.1534/genetics.119.302406.
- 36. Serra H, et al. Massive crossover elevation via combination of HEI10 and recq4a recq4b during Arabidopsis meiosis. Proc. Natl Acad. Sci. USA. 2018;115:2437–2442. doi: 10.1073/pnas.1713071115.
- 37. Kondrashov AS. Deleterious mutations and the evolution of sexual reproduction. Nature. 1988;336:435–440. doi: 10.1038/336435a0.
- 38. Bomblies K, Weigel D. Hybrid necrosis: Autoimmunity as a potential gene-flow barrier in plant species. Nat. Rev. Genet. 2007;8:382–393. doi: 10.1038/nrg2082.
- 39. Chae E, et al. Species-wide genetic incompatibility analysis identifies immune genes as hot spots of deleterious epistasis. Cell. 2014;159:1341–1351. doi: 10.1016/j.cell.2014.10.049.
- 40. Smith LM, Bomblies K, Weigel D. Complex evolutionary events at a tandem cluster of Arabidopsis thaliana genes resulting in a single-locus genetic incompatibility. PLoS Genet. 2011;7:e1002164. doi: 10.1371/journal.pgen.1002164.
- 41. McDowell JM, et al. Intragenic recombination and diversifying selection contribute to the evolution of downy mildew resistance at the RPP8 locus of Arabidopsis. Plant Cell. 1998;10:1861–1874. doi: 10.1105/tpc.10.11.1861.
- 42. Botella MA, et al. Three genes of the Arabidopsis RPP1 complex resistance locus recognize distinct Peronospora parasitica avirulence determinants. Plant Cell. 1998;10:1847–1860. doi: 10.1105/tpc.10.11.1847.
- 43. Barragan CA, et al. RPW8/HR repeats control NLR activation in Arabidopsis thaliana. PLoS Genet. 2019;15:e1008313. doi: 10.1371/journal.pgen.1008313.
- 44. Guo YL, et al. Genome-wide comparison of nucleotide-binding site-leucine-rich repeat-encoding genes in Arabidopsis. Plant Physiol. 2011;157:757–769. doi: 10.1104/pp.111.181990.
- 45. Bell G. The Masterpiece of Nature: The Evolution and Genetics of Sexuality. Berkeley: University of California Press; 1982.
- 46. Singh ND, et al. Fruit flies diversify their offspring in response to parasite infection. Science. 2015;349:747–750. doi: 10.1126/science.aab1768.
- 47. Dangl JL, Jones JDG. Plant pathogens and integrated defence responses to infection. Nature. 2001;411:826–833. doi: 10.1038/35081161.
- 48. Kondrashov FA. Gene duplication as a mechanism of genomic adaptation to a changing environment. Proc. R. Soc. B Biol. Sci. 2012;279:5048–5057. doi: 10.1098/rspb.2012.1108.
- 49. Koren S, et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 2017;27:722–736. doi: 10.1101/gr.215087.116.
- 50. Xiao CLe, et al. MECAT: Fast mapping, error correction, and de novo assembly for single-molecule sequencing reads. Nat. Methods. 2017;14:1072–1074. doi: 10.1038/nmeth.4432.
- 51. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–1760. doi: 10.1093/bioinformatics/btp324.
- 52. Li H, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352.
- 53. Marçais G, et al. MUMmer4: a fast and versatile genome alignment system. PLoS Comput. Biol. 2018;14:e1005944. doi: 10.1371/journal.pcbi.1005944.
- 54. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J. Mol. Biol. 1990;215:403–410. doi: 10.1016/S0022-2836(05)80360-2.
- 55. Heslop-Harrison JS, Murata M, Ogura Y, Schwarzacher T, Motoyoshi F. Polymorphisms and genomic organization of repetitive DNA from centromeric regions of Arabidopsis chromosomes. Plant Cell. 1999;11:31–42. doi: 10.1105/tpc.11.1.31.
- 56. Richards EJ, Ausubel FM. Isolation of a higher eukaryotic telomere from Arabidopsis thaliana. Cell. 1988;53:127–136. doi: 10.1016/0092-8674(88)90494-1.
- 57. Nawrocki EP, Eddy SR. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics. 2013;29:2933–2935. doi: 10.1093/bioinformatics/btt509.
- 58. Singer T, et al. A high-resolution map of Arabidopsis recombinant inbred lines by whole-genome exon array hybridization. PLoS Genet. 2006;2:e144. doi: 10.1371/journal.pgen.0020144.
- 59. Giraut L, et al. Genome-wide crossover distribution in Arabidopsis thaliana meiosis reveals sex-specific patterns along chromosomes. PLoS Genet. 2011;7:e1002354. doi: 10.1371/journal.pgen.1002354.
- 60. Stanke M, Waack S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics. 2003;19:ii215–ii225. doi: 10.1093/bioinformatics/btg1080.
- 61. Majoros WH, Pertea M, Salzberg SL. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics. 2004;20:2878–2879. doi: 10.1093/bioinformatics/bth315.
- 62. Korf I. Gene finding in novel genomes. BMC Bioinforma. 2004;5:59. doi: 10.1186/1471-2105-5-59.
- 63. Slater GSC, Birney E. Automated generation of heuristics for biological sequence comparison. BMC Bioinforma. 2005;6:31. doi: 10.1186/1471-2105-6-31.
- 64. Kim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods. 2015;12:357–360. doi: 10.1038/nmeth.3317.
- 65. Pertea M, et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 2015;33:290–295. doi: 10.1038/nbt.3122.
- 66. Haas BJ, et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 2008;9:R7. doi: 10.1186/gb-2008-9-1-r7.
- 67. Keller O, Odronitz F, Stanke M, Kollmar M, Waack S. Scipio: using protein sequences to determine the precise exon/intron structures of genes and their orthologs in closely related species. BMC Bioinforma. 2008;9:1–12. doi: 10.1186/1471-2105-9-1.
- 68. Kalvari I, et al. Rfam 13.0: Shifting to a genome-centric resource for non-coding RNA families. Nucleic Acids Res. 2018;46:D335–D342. doi: 10.1093/nar/gkx1038.
- 69. Li P, et al. RGAugury: a pipeline for genome-wide prediction of resistance gene analogs (RGAs) in plants. BMC Genomics. 2016;17:852. doi: 10.1186/s12864-016-3197-x.
- 70. Choi K, et al. Recombination rate heterogeneity within Arabidopsis disease resistance genes. PLoS Genet. 2016;12:1–30. doi: 10.1371/journal.pgen.1006179.
- 71. Cingolani P, et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin). 2012;6:80–92. doi: 10.4161/fly.19695.
- 72. Pfeifer B, Wittelsbürger U, Ramos-Onsins SE, Lercher MJ. PopGenome: an efficient swiss army knife for population genomic analyses in R. Mol. Biol. Evol. 2014;31:1929–1936. doi: 10.1093/molbev/msu136.
- 73. Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 2009;4:44–57. doi: 10.1038/nprot.2008.211.
- 74. Hu TT, et al. The Arabidopsis lyrata genome sequence and the basis of rapid genome size change. Nat. Genet. 2011;43:476–481. doi: 10.1038/ng.807.
- 75. Schranz ME, Lysak MA, Mitchell-Olds T. The ABC’s of comparative genomics in the Brassicaceae: building blocks of crucifer genomes. Trends Plant Sci. 2006;11:535–542. doi: 10.1016/j.tplants.2006.09.002.