Abstract
Long-term climate change and periodic environmental extremes threaten food and fuel security1 and global crop productivity2–4. Although molecular and adaptive breeding strategies can buffer the effects of climatic stress and improve crop resilience5, these approaches require sufficient knowledge of the genes that underlie productivity and adaptation6—knowledge that has been limited to a small number of well-studied model systems. Here we present the assembly and annotation of the large and complex genome of the polyploid bioenergy crop switchgrass (Panicum virgatum). Analysis of biomass and survival among 732 resequenced genotypes, which were grown across 10 common gardens that span 1,800 km of latitude, jointly revealed extensive genomic evidence of climate adaptation. Climate–gene–biomass associations were abundant but varied considerably among deeply diverged gene pools. Furthermore, we found that gene flow accelerated climate adaptation during the postglacial colonization of northern habitats through introgression of alleles from a pre-adapted northern gene pool. The polyploid nature of switchgrass also enhanced adaptive potential through the fractionation of gene function, as there was an increased level of heritable genetic diversity on the nondominant subgenome. In addition to investigating patterns of climate adaptation, the genome resources and gene–trait associations developed here provide breeders with the necessary tools to increase switchgrass yield for the sustainable production of bioenergy.
Subject terms: Evolutionary genetics, Genetic variation, Genome evolution, Bioalcohols, Plant breeding
The genome of the biofuel crop switchgrass (Panicum virgatum) reveals climate–gene–biomass associations that underlie adaptation in nature and will facilitate improvements of the yield of this crop for bioenergy production.
Main
Switchgrass (P. virgatum) is both a promising biofuel crop and an important component of the North American tallgrass prairie. Historically, tallgrass prairies were one of the largest temperate biomes on Earth, and they remain important sinks for atmospheric carbon7,8. However, most extant natural switchgrass populations are restricted to ‘relic’ sites, which represent crucial but dwindling genetic resources for the future conservation and breeding of tallgrass prairie.
Biomass production is the principal breeding target for switchgrass as a forage and bioenergy crop9 and is a strong proxy for seed production and evolutionary fitness10. Since the US Department of Energy named switchgrass a model herbaceous biofuel feedstock, biomass yield trials have demonstrated the economic viability of switchgrass bioenergy production, and cultivars have been bred that substantially out-produce maize and other cellulosic feedstocks11. However, individual cultivars tend to be productive across only a narrow climatic niche. Therefore, to maximize gains, switchgrass breeding and biotechnology should focus on developing climate–genotype matches12,13 through the identification of the genomic basis of biomass accumulation and climate adaptation in breeding panels. This will bolster future yields14 and cement switchgrass as an economically and environmentally sustainable bioenergy product.
The tetraploid switchgrass genome
Although abundant quantitative genetic variation underlies climate-associated stress tolerance and biomass production15,16, the fragmented and incomplete nature of previous switchgrass genome sequences have impeded the discovery of candidate genes and other molecular breeding efforts. The genome of the AP13 switchgrass genotype is large (haploid genome size = 1,129.9 megabases (Mb)), repetitive (56.9% repeats) (Fig. 1a, Extended Data Fig. 1) and polyploid. In contrast to some other outcrossing species such as maize (which is represented by the inbred B73 reference genome), AP13 is outbred. Its genome retains a commensurate level of heterozygosity within the range of naturally outcrossing populations (Extended Data Fig. 1). Despite this complexity, our deep PacBio long-read sequencing coupled with deep short-read polishing and bacterial artificial chromosome (BAC) clone validation produced a highly contiguous ‘v5’ AP13 genome assembly (Extended Data Fig. 1; data are available from Phytozome at https://phytozome-next.jgi.doe.gov). We pruned the resulting large contigs (N50 = 5.5 Mb) to a single representative haplotype, and then oriented and ordered into chromosome pseudomolecules using the consensus of two high-density genetic maps (Supplementary Data 1). Chromosomes were assigned to subgenomes via genetic distance to Panicum rudgeii17 (the sister taxa to the K subgenome of P. virgatum), and via de novo repeat clustering. The final assembly contains only 0.4% gaps, a 75-fold decrease relative to a previous v4 release from 2016 (https://phytozome-next.jgi.doe.gov/info/Pvirgatum_v4_1). Importantly, the genome assembly was co-linear with three sources of genetic information, despite being assembled independently from all three: the assembly of a close diploid relative (Panicum hallii), the marker order of a pseudo-F2 genetic map and the gene order of the alternative subgenome (Fig. 1a, Extended Data Fig. 1, Supplementary Data 2). These co-linearities demonstrated that we have developed a single haploid assembly and annotation for each subgenome.
Crucially, we were able to distinguish gene and repeat sequences between the two subgenomes. The gene annotation—which is derived from Illumina RNA sequencing (nlibraries = 88, nconditions = 18, >3 billion reads) and PacBio Iso-Seq (nconditions = 9, > 4.5 million reads, Supplementary Data 3)—encompasses 80,278 primary and 49,664 alternative transcripts and is as complete as the genome assembly (BUSCO = 99.4%) (Extended Data Fig. 1). We leveraged these annotations to build multiple sequence alignments and time-scaled phylogenetic trees, which date subgenome–progenitor species divergence to about 6.7 million years ago (Ma). Long-terminal repeat sequence analysis of subgenome-specific proliferation of retrotransposons sets an upper bound of the polyploidy event that formed switchgrass at ≤4.6 Ma (Fig. 1b), which indicates that tetraploid switchgrass arose during the Pliocene, or the glacial–interglacial cycles of the early Pleistocene epoch.
Climate adaptation drives biomass yield
Although there are two reproductively isolated18 switchgrass cytotypes (tetraploid (4×) and octoploid (8×)), tetraploids represent the majority of cultivars19 and span a broader geographical range than octoploids20. To investigate the genetic basis of climate adaptation, stress tolerance and biomass production, we therefore developed a diversity panel of 732 exclusively tetraploid genotypes (Supplementary Data 4). We clonally propagated and transplanted this panel in up to 10 common gardens that spanned 1,862 km of latitude, from southern Texas to South Dakota (USA) (nplants = 5,521) (Fig. 2a) and resequenced each genotype via deep (median = 59×) coverage 2 × 150-bp paired-end PCR-free Illumina libraries. Importantly, resequencing coverage was not biased towards either subgenome (likelihood ratio test χ2 = 1.32, degrees of freedom = 1, P = 0.25). Our resequencing yielded 33.8 million single-nucleotide polymorphisms (SNPs) (minor allele frequency ≥ 0.5%) mapped against the genome. We also de novo-assembled a 252-genotype subset of these deeply resequenced libraries and called presence–absence and structural variants (for example, 100–1,500-bp insertions and deletions) on the resulting contigs. To connect trait and molecular variation with climate, we extracted 46 climate variables21,22 from the georeferenced collection location of each genotype and clustered these data into seven groups that explained the majority of climatic variation across the diversity panel (Extended Data Fig. 2).
Climate-associated adaptation in switchgrass has previously been hypothesized to underscore divergence between northern upland and southern lowland ecotypes and is exemplified by divergent leaf and whole-plant morphologies13,23–26. In silico classification from morphological data, coupled with ecotype assignments by experts across our diversity panel (Supplementary Data 5), revealed upland (n = 268), lowland (n = 99) and a third, coastal ecotype (n = 184). The coastal ecotype was broadly sympatric with the lowland ecotype but displayed upland leaf characters and lowland plant architecture (Fig. 2a, Extended Data Fig. 2).
We observed strong evidence that adaptive evolution has contributed to ecotype divergence. Whereas winter-kill mortality was rare among northern upland plants (2.4%), nearly half of all coastal (42.1%) and lowland (42.8%) genotypes perished during the winter of 2018–2019 across the 4 northernmost gardens (Fig. 2b). Winter kill was especially severe in the three northwestern plains sites, probably owing to a period of severe cold from late January to early March 2019 (Extended Data Fig. 2). In total, genotypes from the northern 30% of the panel were 218× (Fisher’s test odds ratio = 218.17, P < 1 × 10−15) more likely to survive the winter of 2018–2019 in the northern 4 sites than the southernmost 30% of the genotypes.
The latitude gradient across our common gardens also served as the major axis of biomass variation. Among the seven groups of correlated climatic variables, the strongest predictors of biomass variation were always related to temperature (Extended Data Fig. 2). We observed particularly strong signals of extreme 30-year-minimum temperature as a predictor of biomass in the winter-kill-susceptible lowland and coastal ecotypes (Fig. 2c). For both ecotypes, genotypes collected from sites with colder historical extreme minimum temperatures out-performed genotypes from sites with a milder climate in the northern gardens. However, no climate-of-origin-dependent trade-off was observed in the winter-kill-tolerant upland ecotype. It is possible that a more intensely cold winter than that of 2018–2019 could introduce differential survival in the upland genotypes and produce a trade-off similar to that observed within the two more southern ecotypes. These results add support to our observation that susceptibility to cold temperatures acts both as an agent of natural selection and as a limiter of northern range expansion.
Furthermore, biomass yield for each genotype was generally maximized in the gardens with climates that were most similar to their collection locations (Fig. 2d). As such, local adaptation is manifest not only through survival and stress tolerance, but also through higher biomass accumulation in climates similar to those in which each genotype evolved.
Ecotype convergence among gene pools
Knowledge of the structure and diversity of gene pools within switchgrass is critical to projecting future gains from molecular breeding and understanding the genetic basis of climate adaptation12,13. Several previous population genetic studies of switchgrass assumed that there should be strong correspondence between population genetic structure and the morphological clustering that is used to define ecotypes20,27,28. Analysis of our 33.8-million genome-wide SNP database revealed that our diversity panel is strongly subdivided into three major genetic subpopulations that are, in general, geographically distinct (which we refer to as Midwest, Atlantic and Gulf) (total FST = 0.27) (Fig. 3a). The clustering of presence–absence and structural variants largely recapitulates SNP-based subpopulation structure (Extended Data Fig. 3), providing consistent evidence of subpopulation differentiation that may include large-effect mutations at several molecular scales.
Population genetic structure was discordant with variation in morphological ecotype, which segregated strongly within genetic subpopulations. Plants with upland ecotype traits were present in both the Atlantic (37%) and Midwest (63%) gene pools. Similarly, 54% and 46% of coastal ecotype accessions were assigned to Atlantic and Gulf subpopulations, respectively (Fig. 3a). All plants with lowland morphology were clustered within the Gulf subpopulation. However, these Gulf lowland plants had approximately equal proportions of individuals that survived and perished during the northern winter (Fig. 2c). Thus, important genetic diversity for breeding was present within genetic subpopulations—a pattern that was validated through realized genetic gains of biomass and winter survival within several switchgrass breeding populations29,30.
Despite ecotypic convergence among subpopulations, coalescent simulations dated the divergence of the subpopulations to the mid-Pleistocene epoch (>358,000 generations (0.7–1.4 Ma, assuming a 2–4 year generation time)) (Extended Data Fig. 3). Thus, extant switchgrass gene pools have been diverging for nearly half of the evolutionary history of polyploid switchgrass. In contrast to the deep sequence divergence among subpopulations, we observed very little molecular genetic differentiation between upland and coastal ecotypes within the Atlantic subpopulation (FST = 0.03), or between lowland and coastal ecotypes within the Gulf subpopulation (FST = 0.03) (Extended Data Fig. 3).
Admixture appears to be common between the Gulf and Atlantic subpopulations; comparisons of plants with coastal ecotype traits from both of these subpopulations were molecularly more similar (FST = 0.19) than for noncoastal Gulf and Atlantic plants (FST = 0.24). By contrast, the plants with upland morphologies in the Midwest and Atlantic subpopulations were no more similar than other plants from those subpopulations (FST for both = 0.30). This convergence of upland morphologies in two highly differentiated genetic subpopulations could be the result of independent genetic origins of the upland ecotype or rare but evolutionarily important31 admixture events. We evaluate these hypotheses below.
Genetic targets for yield improvement
To detect the genetic basis of climate adaptation and fitness within the diversity panel, we conducted multivariate adaptive shrinkage32 on genome-wide association mapping (GWAS) results within and across genetic subpopulations. Multivariate adaptive shrinkage shares GWAS peak effect size and direction between univariate tests to improve power to detect significant, shared results. Multivariate adaptive shrinkage results were determined for both fitness GWAS (which mapped winter survival and biomass in the three largest common gardens (MI, MO and TX2)), and climate GWAS (which detected associations between SNP variation and the climate of origin (seven representative climate variables)). To make direct comparisons among subpopulations (which have different segregating SNPs), we summarized the 12,239 significant linkage-disequilibrium block ‘peaks’ of multivariate adaptive shrinkage (log10-transformed Bayes factor > 2)33 into 10,090 20-kb regions (20 kb represents the inflection point at which linkage disequilibrium decay flattens) (Extended Data Fig. 3) for climate (nregions = 9,856) and fitness (nregions = 332) GWAS (Supplementary Data 6). A weighted list of candidate genes—including putative SNP effects, the existence of presence–absence or structural variants, gene co-expression and physical proximity to the GWAS peaks—can be found in Supplementary Data 7.
GWAS peaks explained the majority of heritable phenotypic and climatic variation (SNP–heritability) both across and within gene pools (Fig. 3b). SNP–heritability of fitness (h2= 51.5 ± 15.4% (mean ± s.e.m.)) and climate-associated peaks (h2 = 70.5 ± 14.0%) collectively explained over threefold-more variation than the polygenic background (fitness = 19.5 ± 9.1%, climate = 18.2 ± 9.5%) (Extended Data Table 1). The high heritability of these climate and biomass associations indicated that relatedness at a small subset of all variants out-predicted overall relatedness and provides breeders with genetic diversity to target for switchgrass improvement in local environments.
Extended Data Table 1.
Heritability of traits and climate-of-origin variation was partitioned to SNPs in GWAS hits (‘SNP–h2’) and background or polygenic variation. SNP–heritability and standard errors are presented for each of seven climate variables, biomass in three gardens within and across (‘full’) each of the subpopulations. Response variable-by-subpopulation combinations marked with ‘NA’ indicate that the model did not converge.
Loci that are associated with both fitness and climate of origin are probably involved in local adaptation34, and are strong targets for the breeding of locally adapted cultivars. Overall, we observed nearly 2× more overlap of 20-kb regions associated with both climate and fitness than expected by chance (Fisher’s test odds ratio = 1.92, P < 1 × 10−6). This overlap was especially strong within the two northern subpopulations (Midwest, odds = 11.5× and P < 1 × 10−15; Atlantic, odds = 17.8× and P < 1 × 10−15) (Fig. 3c), where we expected to see the strongest effect of selection on survival during cold winters.
Many regions of climate and fitness overlap were polymorphic only within a single genetic subpopulation, which highlights several, possibly independent, genetic paths to climate adaptation in switchgrass. However, 9.5% (940) of the 20-kb climate intervals were polymorphic in several genetic subpopulations. Given the substantial evidence of admixture between the Gulf and Atlantic subpopulations (Fig. 3a), we expected that contemporary gene flow would be the major contributor to shared polymorphisms. Contrary to this hypothesis, the majority (511 regions) of all multi-subpopulation GWAS intervals were shared between the two most genetically distinct gene pools (Atlantic and Midwest). Given the deep divergence time between these subpopulations, rare or ancient gene flow35 may have created these shared adaptive polymorphic regions.
Evolutionary convergence via introgression
To explicitly address how introgressions may have shaped the distribution of climate–SNP associations, we investigated physically contiguous regions of admixture across the genome using a hidden Markov model36. Introgressions between subpopulations represented 2.98% of the content of our resequenced genomes (Fig. 4a), but were >1.5× more likely to contain shared GWAS intervals across subpopulations than expected by chance (Fisher’s test odds ratio = 1.55, P < 1 × 10−8), indicating that adaptive introgressions underlie at least a portion of heritable variants shared among subpopulations.
Of particular interest were a suite of introgressions from the Midwest to the Atlantic subpopulation that dated to about 8,700 generations before present (17–34 thousand years ago (ka)), which coincides with a northern range expansion after the Last Glacial Maximum (about 22 ka). Atlantic genotypes with higher levels of Midwest introgressions exhibited a more-upland suite of traits (Fig. 4b) and were overrepresented along the northern margin of the otherwise subtropical and temperate range of the Atlantic subpopulation (Fig. 4c). Consistent with adaptive roles for genomic introgressions in other systems31,37, these findings suggest that introgression of putatively northern-adapted alleles from the Midwest into the Atlantic subpopulation could have facilitated the post-glacial colonization by switchgrass of colder habitats in the northeastern coastal region of the USA. To test this hypothesis, we conducted redundancy analyses to relate the presence of introgression blocks with climatic, geographical and phenotypic factors. Overall, Midwest introgressions in the Atlantic subpopulation were over four times more strongly associated with climate (percentage of variance explained = 46.5%) than geography (11.5%). Although 532 and 651 introgressions from the Midwest to the Atlantic subpopulation were associated with climate of origin or biomass, respectively, 254 introgressions were outliers for both analyses—representing a nearly 7-fold enrichment over expectations of independence between each set (odds ratio = 6.99, P < 1 × 10−15). These results reinforce the hypothesis that Midwest introgressions have shaped the climatic niche and phenotypic distribution of the northern Atlantic genotypes and support a growing body of evidence that demonstrates that adaptive introgressions can facilitate both range expansion and ecotype evolution38,39.
Reduced heritability of dominant subgenomes
Polyploidy is common among lineages of flowering plants and can increase the genetic diversity available to selection40,41, which can lead to adaptive evolution or sorting that alters ecological niche characteristics42. This process may explain the generally greater prevalence of polyploids in poleward latitudes and higher elevations that were once covered by ice sheets during glacial cycles43.
Genes duplicated during the formation of a polyploid can subfunctionalize (divide ancestral gene functions among paralogous genes), neofunctionalize (evolve new gene function for paralogues) or simply be lost44. Following polyploid speciation, one subgenome commonly retains more genes and exhibits, on average, higher expression levels than the other subgenome, a phenomenon known as subgenome dominance45. As with other polyploids46–48, subgenome dominance and subfunctionalization were clear in switchgrass. Relative to the N subgenome, the K subgenome had higher gene density (77.4 versus 68.0 genes per Mb, binomial P < 1 × 10−15), more upregulated genes (5,445 versus 4,477, binomial P < 1 × 10−15) and lower rates of mutation accumulation (5,255 genes in the K subgenome with a synonymous mutation rate (Ks) greater than that in the N subgenome, versus 6,751 genes in the N subgenome with Ks greater than that in the K subgenome, binomial P < 1 × 10−15). Combined, all 11 of our subgenome statistics (Extended Data Fig. 4) point to stronger evolutionary constraint of and bias towards the K subgenome, which suggests that the potential for adaptive evolution may be differentially partitioned between subgenomes.
Given the evolutionary biases towards retention of the K subgenome, we expected to see stronger signals of climate adaptation44, biomass and survival among SNPs on the K subgenome. Instead, 75.9% of biomass SNP–heritability was attributable to the N subgenome, and only 24.1% to the K subgenome across the 10 common gardens (Extended Data Fig. 4). Furthermore, 54.3% of Midwest introgressions into the Atlantic subpopulation were found on the N subgenome, a significant enrichment (binomial test P < 1 × 10−7), even when correcting for the 7.5% expansion of the N subgenome (binomial test P = 0.0012). The abundance of introgressions and heritable biomass variation attributable to the N subgenome may appear to be at odds with subgenome evolutionary biases towards the K subgenome. One potential explanation for this counterintuitive finding is that relaxed evolutionary constraint (reduced purifying selection) on the N subgenome may have allowed for accumulation of adaptive genetic variation through directional or diversifying selection. As such, the N subgenome has accumulated heritable variation49 that future breeding regimes can target to shape natural switchgrass populations and improve biofuel yield.
Discussion
As the climate and the natural environment change, it is increasingly critical to qualify expectations of genetic improvements in domesticated species and the adaptive potential of wild populations50. Indeed, plant genomes offer glimpses into the past and future of crop and wild plant populations. Adaptation to glacial–interglacial cycles offers an instructive analogue for current and future environmental change, one that we explore here to investigate the past, present and future genomic mechanisms of climate adaptation and yield improvement in switchgrass.
However, the complexity of plant genomes has also presented a major barrier to the development of genetic resources that facilitate fast and effective molecular breeding. Our methodology and success in sequencing the complex genome of switchgrass will facilitate ecological and agricultural genomics in nearly any system. For example, our results demonstrate that adaptation to northern climates has been facilitated by introgressions between anciently diverged subpopulations, which provides further support for the hypothesis that admixture between divergent genomes can enhance adaptation to novel environments37. Such adaptive introgressions and heritable subgenome-specific genetic variation49 may provide the genetic paths of least resistance that permit colonization of novel habitats during periods of environmental variability. Combined, obligate outcrossing and polyploidy—traits that are often consciously avoided when selecting genomic study systems—are the primary drivers of switchgrass adaptation in nature and the sources of genetic variation available for selection to improve biofuel yield through a changing future.
Methods
No statistical methods were used to predetermine sample size. The experiments were completely randomized, and investigators were not aware of genotype identifiers while conducting experiments or sequencing.
Plant collections, propagation, cultivation and phenotyping
To form the diversity panel, seeds, rhizomes and clonal propagules from natural and common garden sources were collected from 2010 to 2018. Plants grown from seed followed a standard growth procedure16. In brief, 10–15 seeds were sown in 9-cm square pots containing a mixture of ProMix BX potting soil (Premier Tech Horticulture) and Turface MVP calcined clay (Turface Athletics) and vernalized for 7 days at 4 °C. Pots were then placed in a lit greenhouse with 14-h day length and 30-°C/22-°C day/night temperature. Seedlings were thinned at the 3-leaf stage to 1 plant per pot and allowed to grow until the 5-tiller stage. Rhizome propagules and 5-tiller seedlings were transferred to 5-gallon pots containing finely ground pine bark mulch (Lone Star Mulch) and time-release fertilizer (Osmocote 14-14-14, ScottsMiracleGro). All individual plants were propagated in Austin by clonal division from 2016 to 2018, targeting >10 clones per unique accession. Cleary 3336F systemic fungicide (Cleary Chemicals) was applied to the plants as necessary to control fungal pathogens. Plants were placed in 1-gallon pots for the final propagation.
Planting in the field sites occurred from 15 May to 10 July 2018 and followed previously published methods16. In brief, plants were transported to each site by truck, where each field was covered with one layer of DeWitt weed cloth. Plants were placed in holes that were cut into the weed cloth into a honeycomb design in which each plant had four nearest neighbours, all located 1.56 m from one another. To prevent edge effects, the lowland Blackwell cultivar was planted at every edge position. Plants were hand-watered following transplantation. Aboveground portions of all plants were left to stand over the winter of 2018–2019 and removed in the spring of 2019 before spring tiller emergence. At the end of the 2019 season, plants were tied upright as a bunch and harvested with sickle bar mowers.
We generated two measures of fitness for the 2019 growing season: log-transformed biomass (kg) and proportion of winter survival (Supplementary Data 8). Biomass data were obtained from all living individuals during harvest in October and November 2019. Plants with an estimated mass <750 g were placed in paper bags and dried whole at 60 °C until no additional moisture loss occurred, then weighed for total dry biomass. Plants with an estimated mass >750 g were weighed in the field for wet biomass on a hanging scale with a ±5-g resolution. To determine biomass of these plants, approximately 500 g of whole tillers were subsampled from each plant, weighed, dried as above and reweighed. The wet biomass of the whole-plant sample was then multiplied by the per cent moisture in the subsample to approximate total dry biomass. Plants were considered to have experienced winter mortality during the 2018–2019 winter season when no new growth was seen from plant crowns by 1 June 2019. The dead plant crowns were excised from the experiment and replaced with plants of the Blackwell cultivar in July or September 2019.
Genome assembly and polishing
We sequenced the Alamo switchgrass genotype AP13 using a whole genome shotgun sequencing strategy and standard sequencing protocols at the Department of Energy Joint Genome Institute and the HudsonAlpha Institute for Biotechnology. The genome was assembled and polished from 4,520,785 PacBio reads (121.66× raw sequence coverage from a total of 59 P6C4 2.0 and 2.1 chemistry cells with 10-h movie times and a p-read yield of 91.76 Gb) (Extended Data Fig. 1) using the MECAT assembler52 and ARROW polisher53. Final genome polishing and error correction was conducted with one 400 bp insert 2 × 150 bp Illumina HiSeq fragment library (177.1×). Reads with >95% simple sequence repeats and reads <50 bp after trimming for adaptor and quality (q < 20, 5-bp window average) were removed. The final read set consisted of 1,259,053,614 reads for a total of 168× coverage of high-quality Illumina bases. This produced an initial diploid assembly of 6,600 scaffolds (6,600 contigs), with a contig N50 of 1.1 Mb, 3,489 scaffolds larger than 100 kb and a total 2C (diploid) genome size of 2,013.4 Mb.
Assembling a haploid genome in an outbred individual, such as AP13, will generally yield both haploid copies in heterozygous regions, necessitating computational steps to represent each chromosome as a single-copy haplotype without duplicate copies being unnecessarily repeated. Our initial assembly was approximately double the expected haploid (1C) genome size of 1.2 Gb. Therefore, to detect putative meiotically homologous haplotypes, we identified and counted shared 24-mers that occurred exactly twice in the assembly and binned contigs accordingly. A total of 3,152 shorter and redundant alternative haplotypes and 2,387 overlapping contig ends were identified, comprising a total sequence of 871.2 Mb. The remaining 1,142.2 Mb of sequence was ordered and oriented into 18 chromosomes by aligning genetic markers from 2 available maps (Supplementary Data 1) to the MECAT assembly; 563 joins and 57 breaks were made, with 10,000 Ns representing the unsized gap sequence. Overall, 97.2% of the assembled sequence was contained in the chromosomes. Telomeric sequence was identified using the (TTTAGGG)n repeat and properly oriented. The remaining scaffolds were screened against GenBank bacterial proteins and organelle sequences and removed if found to match these sequences. To resolve minor overlapping regions on contig ends, adjacent contig ends were aligned to one another using BLAT54; a total of 47 adjacent duplicate contig pairs were collapsed.
We conducted two rounds of error correction. First, we corrected homozygous SNPs and insertions and/or deletions (indels) by aligning the Illumina 2 × 150 bp library to the release consensus sequence using bwa mem55 and identifying homozygous SNPs and indels with the UnifiedGenotyper tool of GATK56. A total of 690 homozygous SNPs and 80,199 homozygous indels were corrected in the release. Second, we computationally finished 11,343 assembled contigs sequenced from BAC clones with a combination of ABI 3730XL capillary sequencers57 and single index Illumina clone pools and aligned this set of switchgrass clones to the SNP-fixed genome to find heterozygous SNPs that were out of phase with their neighbours. To resolve these phase-switched alleles, the full set of the raw PacBio reads was aligned to the assembly. For each read, the phase of each heterozygous site was determined and 62,732 out-of-phase heterozygous sites were corrected.
To distinguish the N and K subgenomes, we used a de novo repeat-clustering method and validated this with phylogenetic distances to a related species. We searched for ‘diagnostic’ 15-mers via Jellyfish58 in LTR regions of Gypsy, Copia and Pao insertions (identified by RepeatMasker59 and LTRHarvest60) that distinguished each set of homologous chromosomes (≤1 hit in one homologue and ≥100 in the other). The LTR sequences that shared common 15-mers were grouped as superfamilies and were aligned within each superfamily by BLAST. Superfamily members with significant BLAST hits (e < 0.01, ≥90% length) were assigned into families and aligned by Mafft61. Jukes–Cantor distances between LTR families were computed by the R ape package62, and clustered into two distinct sets of subgenomes. Clustering was identical between LTRs and alignments to P. rudgei (K.M.D. and E. Kellogg, unpublished data), which is an ancient relative of the K subgenome17, giving high confidence that we have effectively assigned all chromosomes to the correct subgenomes. Finally, we assigned chromosome identifiers and oriented each chromosome pseudomolecule via synteny with Setaria italica63. The final haploid version 5.0 release contained 1,125.2 Mb of sequence, consisting of 626 contigs with a contig N50 of 5.5 Mb and a total of 97.2% of assembled bases in chromosomes.
Gene annotation
Transcript assemblies were made from about 2 billion pairs of 2 × 150-bp stranded paired-end Illumina RNA-seq reads, about 1 billion pairs of 2 × 100-bp paired-end Illumina RNA-seq reads and 454 reads (Supplementary Data 3) using PERTRAN (details of which have previously been published64). In brief, PERTRAN conducts genome-guided transcriptome short-read assembly via GSNAP65 and builds splice alignment graphs after alignment validation, realignment and correction. In total, around 4.5 million PacBio Iso-Seq circular consensus sequences66 were corrected and collapsed, resulting in approximately 677,000 putative full-length transcript assemblies. Subsequently, 668,176 transcript assemblies were constructed using PASA67 from RNA-seq reads, full-length cDNA, Sanger expressed sequence tags, and corrected and collapsed PacBio circular consensus sequence reads. Loci were determined by EXONERATE68 alignments of switchgrass transcript assemblies and proteins from Arabidopsis thaliana69, soybean70, Kitaake rice71, Setaria viridis72, P. hallii var. hallii64, Sorghum bicolor73, Brachypodium distachyon74, grape and Swiss-Prot75 proteomes. These alignments were accomplished against a repeat-soft-masked switchgrass genome using RepeatMasker59 (repeat library from RepeatModeler76 and RepBase77) with up to 2,000-bp extension on both ends unless extending into another locus on the same strand. Incomplete gene models, which had low homology support without full transcriptome support, or short single exon genes (<300-bp coding DNA sequences (CDS)) without protein domain or good expression were removed.
Comparative genomics
Syntenic orthologues and paralogues were inferred for the two switchgrass subgenomes via the GENESPACE pipeline64, using default parameters and two outgroups: P. hallii var. hallii64 and S. bicolor73. In brief, GENESPACE parses protein similarity scores into syntenic blocks and runs orthofinder78 on synteny-constrained blast results. The resulting block coordinates and syntenic orthology networks give high-confidence anchors for evolutionary inference.
To calculate the ancestral states of CDS regions, we first determined sequences that share common ancestry using genomes from Phytozome79. The final number of hits to the switchgrass genome were 38,960 and 33,772 for P. hallii, and S. bicolor, respectively. For any given orthology network, we built two multiple sequence alignments in mafft61, one excluding the focal switchgrass sequence (msa0) and one forcing msa0 to align to the coordinate system of the focal sequence via the --keeplength parameter. We then extracted marginal character states with the maximum likelihood algorithm in Phangorn80. For each reconstruction, only the internal node closest to the switchgrass branch was used as the ancestral state. Overall, we analysed 40,943 switchgrass gene models (216,157 exons) covering 54.95 Mb (Supplementary Data 9).
Subgenome evolution and dating
To infer the ages of the subgenomes and tetraploid switchgrass, we took a conservative set of orthologues with simple 2:1:1 networks between P. virgatum, P. hallii and S. italica. This yielded 45,045 switchgrass proteins aligning to 24,549 P. hallii proteins, resulting in 20,496 homologue pairs and 4,053 singletons (2,396 for K subgenome and 1,660 for N subgenome) from the cross-species analysis. We aligned the translated CDS of these sequences using Dialign-TX81. The aligned CDS sequences were concatenated and fed to Gblocks82 using default parameters. Gblocks filtered the alignment of 18,044,244 CDS nucleotides to 16,321,302 positions, in 50,334 blocks. The resulting alignment was then used in PhyML83 to build a maximum-likelihood tree using the general-time reversible model. This tree was used as an input to r8s84, to compute a time tree and calibrate the Panicum–Setaria node of the tree to 13.1 Ma63. To date subgenome divergence and therefore the timing of polyploid switchgrass speciation, we leveraged burst distances, which refer to all distances within an LTR family (whereas pairwise distances refer to the distance between the 5′ and 3′ LTRs of the same insertion). The 5′ versus 3′ distances of the N- or K-subgenome-specific retrotransposons were used to date the insertion times of those elements. This method cannot be used for the P. virgatum-specific or Panicum-specific families because the more recent expansions of those elements dominate the distributions. Instead, we relied on comparing the best cross-species alignments to estimate the LTR distances of the P. virgatum–P. hallii and Panicum–Setaria nodes. This way, we have calibration points to compare the LTR distances to the more confident protein-coding gene divergences between species.
Subfunctionalization and gene expression analyses
To assess whether the subgenome evolution biases observed at the protein-coding sequence scale were manifest in phenotypes, we explored gene expression biases between homologues from biologically replicated AP13 leaf tissue (n ≥ 5) collected at two sites (TX2 and MI). Illumina paired-end RNA-seq 150-bp reads were quality trimmed (Q ≥ 25) and reads shorter than 50 bp after trimming were discarded. High-quality sequences were aligned to P. virgatum v5.1 reference genome using GSNAP65 and counts of reads uniquely mapping to annotated genes were obtained using HTSeq v.0.11.285. The test for differential expression was conducted through a likelihood ratio test in DESeq286. Library sizes were calculated before splitting the reads by subgenome; these sizes were used as the size factors in the analysis of differential expression. Subfunctionalization was defined as a significant subgenome-by-environment interaction from the likelihood ratio test. Subgenome expression bias was tested for both the field gardens and annotation libraries using post hoc Wald-test contrasts between subgenomes within conditions. Significant bias was defined as differential expression false-discovery-rate-adjusted P < 0.05. Weighted gene coexpression clustering of AP13 gene annotation RNA-seq libraries was conducted with WGCNA87 with a power of 6. Raw counts can be found in Supplementary Data 10.
Ploidy assessment
We used a LSRFortessa SORP Flow Cytometer (BD Biosciences) to determine ploidy levels of the resequenced accessions. For each plant, 200–300 mg of young leaf tissue was macerated in a Petri dish with a razor blade and treated for 15 min with 1 ml Cystain PI Absolute P nuclei extraction buffer (Sysmex Flow Cytometry) mixed with 1 μl 2-mercaptoethanol. Samples were filtered to isolate free nuclei with a CellTrics 30-μm filter (Sysmex) and treated for 20 min on wet ice with 2 ml of Cystain PI Absolute P staining buffer (Sysmex), 12 μl of propidium iodide and 6 μl of RNase A. Samples were run on the flow cytometer to determine nuclei size with a minimum of 10,000 nuclei analysed per sample. Output from the flow cytometer was analysed with FlowJo software (BD Biosciences) and samples were binned into three categories on the basis of the average units of fluorescence per nuclei (Supplementary Fig. 1). Ploidy level of the sample was considered 4× if the cell population had 40,000–80,000 units of fluorescence, 6× for 80,000–100,000 units and 8× for 100,000–140,000 units. The binning parameters were established with flow cytometry data from several P. virgatum accessions of known ploidy.
We also assessed ploidy of the samples via the distribution of variant allele frequency at biallelic SNPs (as described in ‘Variant calling’). This method assumes that tetraploids and octoploids follow different allele frequency distribution patterns, with tetraploids having 0.5/0.5 (reference and variant depths) and octoploids having a mixture of 0.75/0.25 and 0.5/0.5. If the proportion of hits with 0.48 ≤ x ≤ 0.52 was <0.035, the library was considered octoploid and if it was ≥0.035, tetraploid; 837 out of 870 samples (96.2%) that had flow cytometry data matched with these results.
Variant calling
A total of 789 tetraploid diversity samples were resequenced at a median depth of 59× (range 20×–140×). Of these, 732 were used for further analysis after filtering for missing data, outlier elevated heterozygosity and collection site discrepancies. The samples were sequenced using Illumina HiSeq X10 and Illumina NovaSeq 6000 paired-end sequencing (2 × 150 bp) at HudsonAlpha Institute for Biotechnology and the Joint Genome Institute. To account for different library sizes, reads were pruned to ≤50× coverage, then mapped to the v5 assembly using bwa-mem55.
SNPs were called by aligning Illumina reads to the AP13 reference with BWA-mem. The resulting .bam file was filtered for duplicates using Picard (http://broadinstitute.github.io/picard) and realigned around indels using GATK 3.056. Multi-sample SNP calling was done using SAMtools mpileup88 and Varscan V2.4.089 with a minimum coverage of eight and a minimum alternate allele count of four. Genotypes were called via a binomial test. SNPs within 25 bp of a 24-mer repeat were removed from further analyses. Only SNPs with ≤20% missing data and minor allele frequencies >0.005 were retained, resulting in 33,905,042 SNPs across 75% of the genome at a coverage depth between 8× and 500×. Phasing was performed using SHAPEIT390. FST calculations were accomplished via vcftools91. We tested for subgenome read-mapping bias by generating mean coverage per Mb for each of the 732 libraries and 18 chromosomes. We then fit a mixed effects linear model to these data in lme492 in which the chromosome number (1–9) was a random effect, to test the main effect of subgenome. Models with and without the main effect term were compared via a likelihood ratio test.
Individual de novo assemblies for the 732 short read libraries were constructed using HipMer93 with a k-mer size of 101 to maximize haplotype splitting among contigs. As the assemblies varied in quality and contiguity, the sample set considered for gene presence–absence and structural variant detection was narrowed to 251 samples (pan-genome set) based on total assembly size, contig N50 length and total gene alignments per library.
To assess presence–absence variation of genes across the pan-genome, we aligned all AP13 proteins and a unique set of 6,161 proteins from Oropetium thomaeum (nproteins = 1,476)94, S. italica (n = 1,085)63, Setaria viridis (n = 891)72, P. hallii var. filipes (n = 1,048)64, S. bicolor (n = 878)95 and P. hallii var. hallii (n = 772)64. These unique genes were extracted from single-copy orthology networks inferred via orthofinder78 and selection owing to a lack of orthology to switchgrass. All proteins (≥100 amino acids) were aligned to all de novo assemblies using BLAT54. Gene alignments from AP13 proteins were considered present if they aligned with greater than or equal to 80% identity and 75% coverage, whereas other grass proteins were considered present with alignments greater than 70% identity and 75% coverage (to allow greater divergence among species). Variable (pan-genome shell) genes (considered present across 40–60% of the population; n = 5,432) were extracted from the presence–absence variation matrix and used to visualize differences among non-admixed individuals from the Atlantic, Gulf and Midwest subpopulations. Testing genes that were significantly over- or under-represented within each subpopulation was conducted with a χ2 test with a Benjamini–Hochberg multiple testing correction (P ≤ 0.05).
To detect structural variants across the pan-genome, contigs (≥2 kb) from each library were aligned to the AP13 reference genome using ngmlr96 with default settings for PacBio reads. The resulting .bam file was sorted using samtools88 and used for calling structural variants with sniffles96. Individual structural variant calls were merged across samples using SURVIVOR97, with a maximum allowed distance of 1 kb. The resulting .vcf file was filtered using bcftools88 using a minimum minor allele frequency of 0.1, and considering only insertions and deletions between 100 and 1,500 bp in length.
Population genomics
To assess the genetic population structure of the 732 tetraploid libraries (Supplementary Data 4), we extracted all fourfold degenerate sites (putatively neutral) with ancestral state calls (Supplementary Data 9) from the ancestral state alignments. This list of sites, which represents our highest confidence neutral loci, was then linkage-disequilibrium-pruned using a threshold of |r| ≤ 0.6, resulting in 59,789 sites for downstream analyses in the R package SNPRelate98.
The extent of linkage disequilibrium for the population was determined from SNPs99 in PLINK100. Linkage disequilibrium (r2) was calculated using plink (--ld-window 500--ld-window-kb 2000). The r2 value was averaged every 500 bp. A nonlinear model was fit for this data in R using the nls function, and the extent was determined as to when the linkage disequilibrium (r2) nonlinear curve stabilized.
Population genetic structure was assessed hierarchically. Given the presence of highly divergent ecotypes across the study range, we first analysed the broadest genetic population structure using discriminant analysis of principal components (DAPC)101 in adegenet v.2.0.1102. This method does not rely on common assumptions (for example, Hardy–Weinberg equilibrium and linkage disequilibrium) that underlie many population clustering approaches and therefore provides a valuable tool to look at broad structural divisions. DAPC demonstrated a strong set of gene pools and separated Midwest genotypes from all others. We then evaluated the genetic population structure and potential admixture of the remaining non-Midwest individuals using a Bayesian clustering algorithm implemented in STRUCTURE v.2.3.4103 via the admixture model with correlated allele frequencies. The analysis consisted of 20,000 burn‐in steps and 30,000 replicates of 1–6 genotypic groups, each of which was run 10 times. Ancestry coefficients across all subpopulations were assigned post hoc through eigenvector decomposition in SNPRelate.
We inferred the demographic history of the switchgrass samples using Multiple Sequentially Markovian Coalescent (MSMCv.2.0104), which is a population genetic method used to infer demographic history and population structure through time from sequence data. This method models an approximate version of the coalescent under recombination, and produces tests of both population size and divergence time. MSMC was run using four haplotypes for each subpopulation, skipping ambiguous sites, an estimated rhoOverMu of 0.25 and a time segment pattern of 10 × 2 + 20 × 5 + 10 × 2. We estimated rhoOverMu as 0.25 as the mean value from 100 iterations without the fixed recombination parameter for 5 sets of 4 haplotypes in each subpopulation and averaged them. To estimate scaled divergence time in generations, we assumed a mutation rate of 6.5 × 10−8. To make estimates of initial divergence time, we compared adjacent relative cross-coalescence rate (RCCR) values (past to present) (Supplementary Data 11). If there was a decline, either at a single time segment or within contiguous segments or within two interleaved time segments (>0.01; observed range 0.01–0.28), and the following neighbours were nearly zero (≤0.009; observed range: −0.1–0.009), we considered that to be a starting point for population separation. However, if there was another decline within five time segments, we considered the latter as the start of population separation. We replicated the analyses with 16 sets of different individuals for each subpopulation contrast.
Population structure was visualized across SNPs, structural variants and presence–absence variants via eigenvector decomposition of a distance matrix. First, a Euclidean distance matrix was calculated among 0/1/2 (reference homozygote, heterozygous, alternative homozygote) library × marker matrices for each of the three variant call types. The Euclidean matrix was then scaled and centred to remove among-library coverage variance via Gower’s centred similarity matrix, implemented in the R package MDMR105.
Ecotype classification
Mature switchgrass accessions at or near anthesis were surveyed for 16 plant traits (leaf: length, width, length/width ratio, area, lamina thickness and lamina/midrib thickness ratio; whole plant: number of tillers, tiller height, product of tiller height × number, tiller height/count ratio, panicle height, panicle height/count ratio, leaf canopy height and tiller/leaf height ratio; phenology: date of green-up and date of panicle emergence) to determine ecotype identity during the summer of 2019 at the University of Texas J. J. Pickle Research Campus (PKLE; or TX2 (Austin, Texas, USA) and Michigan State University Kellogg Biological Station (KBSM; or MI (Hickory Corners, Michigan, USA)) common gardens (see Supplementary Data 5 for detailed descriptions of these variables). The phenology measurements, including green-up (when the first green vegetative structures emerge from the rhizome crown) and panicle emergence (when the first reproductive structures emerge from the tiller), were assayed daily. Detailed leaf morphology was assessed on a representative leaf of each plant by measuring length and width (in mm), midrib and lamina thickness (in μm) (Mitutoyo 547-500S caliper) and leaf area (in mm2) (Licor 3100C leaf area meter). In addition to these quantitative traits, we also generated a qualitative upland–lowland index for both the leaf and whole-plant appearance, collected at the end of the summer 2019 in Austin (TX2 site). Each plant characteristic was assessed on a 1–5 scale from most lowland-like to most upland-like. The established cultivars Alamo and Dacotah were used for baseline measurements of lowland and upland characters, respectively. Plant characters assessed included: tiller appearance, from thickest and most lowland-like to thinnest and most upland-like; leaf appearance, from widest, longest and most lowland-like to shortest, thinnest and most upland-like; canopy colour from bluest and most typically lowland to darkest green and most typically upland. This visual approach is akin to basic selection criteria often used by switchgrass breeders.
To assess phenotypic structure in these data, we used a DAPC101. Prior groups were determined by first transforming the phenotypic data using principal component analysis (PCA), then the first 10 principal components were used in a k-means algorithm to classify individuals into 3 possible groupings aiming to maximize the variation between groups. Next, DAPC was implemented on the 10 retained principal components to provide an efficient description of the ecotypic clusters using two synthetic variables, which are linear combinations of the original phenotypic variables that have the largest between-group variance and the smallest within-group variance (that is, the discriminant functions).
We classified each of the 651 tetraploid genotypes surveyed for the 16 traits at the MI and TX2 gardens (34 total features, 32 quantitative and 2 qualitative ordinal traits) to 1 of the 3 ecotypes through a low-capacity neural network with 1 hidden layer and 5 units (Supplementary Data 5). The neural network was implemented in caret106 and was trained on seven cultivars with known ecotypes (lowland: Kanlow and Alamo; coastal: High Tide and Stuart; upland: Summer, Dacotah and Sunburst) and 78 additional genotypes that were in the same SNP-based genetic cluster (Extended Data Fig. 3), collected in the same states and clustered most closely in phenotypic PCA space with the exemplar cultivars. These high-affinity exemplar genotypes are printed in Supplementary Data 5. Ecotypes for the remaining 582 genotypes that were phenotyped for the ecotype classification traits were predicted with caret106. By using traits collected at gardens representing both the northern and southern switchgrass range, we hoped to avoid local climate bias on plant phenotype and subsequent ecotype classification. Furthermore, the neural network classification approach offers one notable advantage over both DAPC and expert’s qualification: because the neural network is anchored to known and published genotypes, experimentation that includes these common cultivars will be able to more effectively recapitulate our assignments.
Admixture and introgression block calculation and dating
We built a database of admixture-informative SNPs through a two-step pipeline. First, ancestry coefficients were calculated as in ‘Population Genomics’ from fourfold degenerate sites that had associated ancestral-state calls. The 30 samples with the least missing data and proportion of genome-wide admixture ≤ 0.001 for each subpopulation were used to define subpopulation-specific allele frequencies. These libraries were used to find SNPs with at least one pairwise FST value >0.4, as calculated with the ‘W&C84’ method in the snpRelate function snpGdsFst. Second, these global ancestry-informative sites were parsed within each subpopulation to those with minor allele frequencies > 0.05 and missingness < 0.05. These sites were further pruned within subpopulations first to sites with |r| < 0.9 (10 SNPs or 1,000-bp windows), then to |r| < 0.95 (1,000 SNPs or 10,000-bp windows) in snpRelate. This process resulted in the following SNP and library counts for each subpopulation: Atlantic, 579,468 SNPs and 284 libraries; GULF, 641,975 SNPs and 215 libraries; and Midwest, 481,563 SNPs and 196 libraries.
To test for the physical locations of admixture blocks between each pair of subpopulations, we used Ancestry_HMM36,107. This approach leverages allele frequencies in putative parental populations to determine regions of likely introgressions in a test population. For each of the three subpopulations, we sought to determine the timing, extent and current positions of admixture block introgressions. In each case, we permitted two pulses from each of the other two subpopulations. Ancestry_HMM can optimize the number of generations before present when an ancestry pulse occurred and the proportion of individuals involved in the admixture pulse. However, 8-parameter optimization with >480,000 sites and >150 libraries was not computationally feasible. Therefore, we optimized parameters using 40 randomly sampled libraries with admixture coefficients within the 0.2–0.8 quantiles of the admixture proportion distribution and SNPs only on chromosome 4 of the N subgenome. We chose this chromosome as representative of others because of a lack of obvious large high-frequency introgressions. The resulting ancestry pulse parameter optimizations were founded on an initially unadmixed population 10,000 generations before present, and two subsequent admixture pulses for each of the other two subpopulations; the optimized pulses are as follows (source–reference): Midwest–Atlantic (ngenerations = 8,658 and Padmixed = 0.001%; 67 and 0.7%), Gulf–Atlantic (85 and 1.1%; 17 and 0.25%), Atlantic–Gulf (79 and 1.9%; 11 and 0.38%), Midwest–Gulf (79 and 0.86%; 11 and 0.14%), Atlantic–Midwest (66 and 0.27%; 14 and 0.036%), and Gulf–Midwest (71 and 0.15%; 14 and 0.033%). These pulses were supplied to the full model with all individuals and chromosomes, along with an error probability of 0.001, maximum number of generations before present of 10,000 and effective population size of 100,000. Posterior ancestry probabilities were decoded into haplotype blocks and blocks were binned into clusters of similarly positioned blocks.
Landscape genomics
Geographical maps were made with publicly available layers downloaded from Natural Earth (https://www.naturalearthdata.com/). Various plotting routines rely on the sf108 and raster109 packages in the R environment for statistical computing110. Climate data were downloaded from WorldClim22 (19 bioclimatic variables, 0.5-arcmin resolution 1960–2000) and ClimateNA21. The distribution of climate variables across collections sites was explored via dynamic clustering111 followed by partitioning around medoids clustering112 with k = 7. The most representative climate variables were defined as those most correlated with the first eigenvector of variation within each cluster. Six of the seven clusters included WorldClim variables.
Weather data were downloaded from the NOAA portal for the most proximate weather station to each garden site that had complete daily temperature (minimum–maximum), and precipitation data from 1 September 2018 to 31 October 2019. The NOAA weather station identifiers used for each garden are as follows: IL (USC00110338), MI (USW00014815), MO (USW00003945), NE (USC00255362), OK (USW00053926), SD (USC00391076), TX1 (USC00414810), TX2 (USC00410433), TX3 (USC00418862) and TX4 (USW00003901).
Climate–phenotype associations across gardens were conducted on both raw data and imputed data. Latitude–survival associations (Fig. 2b) were accomplished on raw data with logistic regressions via glm with a binomial family in R. Imputations, which were accomplished in base R using nearest neighbours across all available phenotypes (k = 5), were used exclusively for tests of the rank order of gardens (Fig. 2c, d). Climate similarity–biomass associations were accomplished in mixed linear models via lmer92, comparing the full model (fixed = climate distance + intercept, random = genotype identifier) to a reduced model without the climate distance fixed effect using a likelihood ratio test.
Species distribution modelling (SDM) was used to simulate modern-day potential ranges for all ecotypes (upland, lowland and coastal) of P. virgatum. The final datasets used to build the SDMs comprised 277 (upland), 199 (coastal) and 121 (lowland) occurrence records. Six environmental predictors were used in our final SDM modelling (BIO1 = annual mean temperature, BIO2 = mean diurnal range, BIO4 = temperature seasonality, BIO5 = maximum temperature of warmest month, BIO16 = precipitation of wettest quarter and BIO17 = precipitation of driest quarter). SDMs were then generated with BIOMOD2 v.3.3113 with seven modelling algorithms: generalized linear models, boosted regression trees, artificial neural networks, flexible discriminant analysis, random forest, classification tree analysis and multivariate adaptive regression splines. For each model, the occurrence data were coupled with 500 pseudo-absence data generated randomly within the modelled study area with equal weighting for presences and pseudo-absences114. Models were trained with 80% of the coupled occurrences and pseudo-absence data and tested with the remaining 20%. Each modelling algorithm was run 100 times for a total of 700 models, which were evaluated via true skill statistics (TSS)115. TSS values ranging from 0.2 to 0.5 were considered poor, from 0.6 to 0.8 useful, and >0.8 good to excellent116. Unique ensemble SDMs were computed from approximately the 50 best SDMs out of 700 models for the three ecotypes on the basis of TSS threshold values (upland TSS threshold = 0.96, lowland TSS threshold = 0.93 and coastal TSS threshold = 0.965). The final ensemble SDMs were projected onto present climate layers to visualize modern-day potential ranges (Supplementary Data 12).
We examined how the presence of Midwest introgressions in the Atlantic subpopulation were associated with the independent and joint influences of climate, geography and kinship, by implementing redundancy analysis in vegan117–120. To partition explainable variance in introgression presence attributable to climate, kinship and geography, we ran four models: one full model with introgression presence (potential introgression blocks were coded as 0 for Atlantic inheritance or 1 for Midwest introgression) explained by climate (that is, the seven representative climate variables), kinship (the first two principal components calculated from the set of putatively neutral markers) and geography (latitude and longitude), and three models for each of these three factors conditioned on the other two. The inertia (that is, variance) values from the constrained matrix of each model were compared to determine the relative importance of climate, kinship, geography and their joint effect. Furthermore, to find introgression regions strongly linked to climate and survival-corrected biomass, we extracted the loadings for the redundancy analysis axes from two additional models: (1) one predicted by only climate and (2) one predicted only by survival-corrected biomass. Both models were significant according to permutation tests (n = 999; P < 0.001 for both), and all axes were approximately normally distributed. SNPs loading at the tails of each axis were more likely to indicate selection related to the predictors (that is, climate or survival-corrected biomass), so we identified all markers that were at least 2.5 s.d. (two-tailed P = 0.012) from the centre as introgressions putatively under selection119.
GWAS
Owing to the large sizes of our common garden datasets, we developed a pipeline—the switchgrassGWAS R package (https://github.com/Alice-MacQueen/switchgrassGWAS)—to allow fast, less-memory-intensive GWAS on the diversity panel, and to analyse the extent to which SNP effects were similar or different for phenotypes measured at different sites. This package leverages bigsnpr121 to perform fast (>300× faster than TASSEL) statistical analysis of massive SNP arrays encoded as matrices. It also incorporates current gold standards in the human genetics literature for SNP quality control, pruning and imputation, as well as population structure correction in GWAS. To test the significance of many effects in many conditions (for example, multiple sites, climate variables and so on), we used mashr32, a flexible, data-driven method that shares information on patterns of effect size and sign in any dataset for which effects can be estimated on a condition-by-condition basis for many conditions and SNPs. We determined which SNPs had evidence of significant phenotypic effects using local false sign rates, which are analogous to false discovery rates but more conservative (in that they also reflect the uncertainty in the estimation of the sign of the effect)122. We used these values to find SNPs with log10-transformed Bayes factors > 2. Here, the Bayes factor was the ratio of the likelihood of one or more significant phenotypic effects at a SNP to the likelihood that the SNP had only null effects. Following previous work33, a Bayes factor of >102 is considered decisive evidence in favour of the hypothesis that a SNP has one or more significant phenotypic effects.
To calculate regional heritability for climate- and fitness-associated SNPs we followed a previously described two-step method123. Variance component analysis was accomplished with ASReml (VSN International), using genomic relationship matrices calculated using the van Raden method124. Genomic relationship matrices were calculated within each subpopulation and for the full diversity panel. A kinship matrix based on all SNPs used in the univariate GWAS was calculated (G), as well as a kinship matrix based on SNPs significantly associated with climate in that subpopulation (log10-transformed Bayes factor > 2; Qclimate) and a kinship matrix based on SNPs significantly associated with biomass or winter survival in that subpopulation (log10-transformed Bayes factor > 2, or >1.385 for Gulf subpopulation; Qfitness). These kinship matrices were used for regional heritability mapping123 as in a previous publication125, using mixed models of the form:
in which the vector y represents the biomass values, Z is the design matrix for random effects, u is the whole genomic additive genetic effect, v is the regional genomic additive genetic effect and e is the residual. Matrix G is the whole genomic relationship matrix using all SNPs for the whole genome additive effect. Matrix Q is the regional genomic relationship obtained as above: one of Qclimate or Qfitness. I is the rank-y identity matrix, in which y is equal to the number of biomass values. Whole genomic, regional genomic and residual variances are , and , respectively. Phenotypic variance () is + + . Whole genomic heritability, regional heritability and total heritability are = (/), = (/) and = ( + /), respectively.
These models were run for the three locations where subpopulation GWAS were conducted: Columbia, Missouri; Hickory Corners, Michigan; and Austin, Texas. This resulted in 80 models: 4 sets of populations (the full diversity panel and 3 subpopulations), 2 model types (one model with G only and a G + Q model), for 10 phenotypes (biomass at 3 sites and 7 environmental variables).
Variance component analyses were also used to partition variance between the K- and N-subgenomes. Only SNPs with ancestral state calls (Supplementary Data 9) were used in this analysis, resulting in 460,429 SNPs used for each population subset. Kinship matrices based on all SNPs on a particular chromosome were calculated (QChr01K to QChr09K, and QChr01N to QChr09N), resulting in 18 kinship matrices. These kinship matrices were used for regional heritability mapping, using mixed models of the form:
in which the vector y represents the biomass values, Z is the design matrix for random effects, v1K (to v9K) or v1N (to v9N) (collectively designated vi) are the chromosome-specific genomic additive genetic effects and e is the residual. Matrices Qi are the chromosome-specific genomic relationship matrices for the nine chromosomes of the N and K subgenomes. Chromosome-specific and residual variances are and , respectively. Chromosome-specific heritability is = ( /), and subgenome-specific heritability is the sum of these variances across the nine chromosomes within each subgenome.
Candidate gene exploration
We integrated multiple data structures to rank and provide meaningful culling criteria for candidate genes within introgression intervals and physical proximity to quantitative trait loci peaks. In the case of GWAS peaks, candidate genes were defined as those loci within a 20-kb interval surrounding the mashr peak. Candidate genes for genomic introgressions must have at least partially overlapped the introgression interval. As inference of GWAS and introgressions were conducted within genetic subpopulations, all statistics reported in Supplementary Data 7 (candidate gene lists) are also subpopulation-specific, with the exception of gene co-expression analysis (which was conducted only on AP13 RNA-sequencing libraries used for annotation purposes (Supplementary Data 3)). For a given interval, we present a set of statistics. First, the physical proximity to the peak location was calculated as the midpoint of the gene to the midpoint of the interval (introgression) or GWAS peak position. Second, as the causal locus underlying GWAS peaks within a subpopulation must necessarily be variable within that subpopulation, we extracted all SNPs within and proximate to candidate gene models. These variants were annotated with SNPeff126 and the weighted sum of three main categories of variants (high, moderate and low; a description of these can be found at https://pcingola.github.io/SnpEff/se_inputoutput/#effect-prediction-details) for each gene were calculated as SNPeff_score = high × 20 + moderate × 5 + low × 1. Third, for each gene, we calculated the minor allele frequency of structural and presence–absence variants. Fourth, we include a vector of the identity of the WGCNA clusters for each gene. Finally, if the candidate was a homologue of flowering-time GWAS candidate genes from a previous publication127, the identity of the overlapping interval or gene is included.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this paper.
Online content
Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at 10.1038/s41586-020-03127-1.
Supplementary information
Source data
Acknowledgements
Plant collecting was conducted in collaboration with J. Randall (North Carolina Botanical Garden) through the Seeds for Success programme, A. Stottlemeyer (OSU and the USDA-NIFA Biotechnology Risk Assessment Grant Program, no. 2010-33522-21703), T. Quedensley, M. Donahue, D. Schemske and J. M. M. Reyna. We thank the Brackenridge Field laboratory, the Ladybird Johnson Wildflower Center and the Juenger laboratory for support with plant care and propagation. M. Donahue led the curation, propagation and maintenance of the diversity panel. Fieldwork was also conducted by P. Duberney, S. Reeder, K. Turner, M. Carey, T. Arredondo, N. Ryan, B. Watson, B. Battershell, N. Albert, H. Wilson, L. Simon, J. Sanley, L. Vormwald, T. Bortnem, S. Hofmann, M. Iceberg, C. Lamb and T. Vugteveen. Advice from J. G. Monroe, D. Hoover, P. Edger, J. Lasky, E. Kellogg, J. Vogel, G. Sarath and J. Tuskan helped to craft experimental designs, sequencing strategies and earlier versions of this text. R. VanBuren, P. Edger, H. Zheng, D. Ware and L. Cattivelli provided genome comparison information. We thank the HudsonAlpha Genomic Services Lab for loading Illumina X10 sequencing runs. This research was supported by the US Department of Energy Awards DESC0014156 to T.E.J., DE-SC0017883 to D.B.L. and DE-SC0010743 to K.M.D., the Great Lakes Bioenergy Research Center (Awards DESC0018409 and DE-FC02-07ER64494) and the Center for Bioenergy Innovation (Award DE-AC05-000R22725). Funding was provided by National Science Foundation PGRP Awards IOS0922457 and IOS1444533 to T.E.J. and IOS1402393 to J.T.L. The work conducted by the US Department of Energy Joint Genome Institute is supported by the Office of Science of the US Department of Energy under Contract No DE-AC02-05CH11231. The work conducted by the Joint BioEnergy Institute is supported by the Office of Science of the US Department of Energy under contract no. DE-AC02-05CH11231. The work conducted by Argonne National Laboratory is supported by the Office of Science of the US Department of Energy under contract DE-AC02-06CH11357. J.S. thanks T. Marsh for transferring his passion for ecological science to J.S. T.E.J. thanks K. Robertson for introducing him to prairie habitats and plant diversity.
Extended data figures and tables
Author contributions
K.D.B., P.R., M. Saha, L.E.B., M.C., K.M.D., D.B.L., D.S.R., J.G., T.E.J. and J.S. designed research. S.B., M.H., J. Jiang, T.H.P. IV, S.T., J.Z., J.M.M.-R., P.R., C.M.T., M.U., M.C., K.M.D. and D.B.L. contributed plant material and resources. J. Jenkins, C.P. and S.S. assembled and annotated the genome. J.B., A.R.B., P.A.F., F.B.F., J.D.J., D.B.L., J.L.-R., R.M., R.B.M., F.M.R. Jr, M. Saha, Y.W. and T.E.J. designed and executed field experiments. K.B., L.B., C.D., S.D., A.E., D.K., A.L., E.V.S., D.S., M. Sharma, R.S., A. Stewart, V.R.S., Y.T., J.W., X.W., M.W., Y.Y., M.Z. and R.A.W. conducted sequencing and data acquisition. J.T.L., A.H.M., S.M., J.D.N., A. Session, A. Sreedasyam, P.P.G., T.H., A.H., P.Q., C.A.S., G.A.W. and L.Z. conducted statistical and computational analyses. The manuscript was written by J.T.L., A.H.M., T.E.J. and J.S. with contributions from all authors.
Data availability
Sequence Read Archive accession codes for all RNA and DNA sequencing libraries can be found in Supplementary Data 3 and 4, respectively. The v5 AP13 genome has been deposited at DDBJ/ENA/GenBank under the accession JABWAI000000000. The genome, gene and repeat annotations can also be downloaded directly from Phytozome at https://phytozome-next.jgi.doe.gov/info/Pvirgatum_v5_1. Whenever possible, plant material will be shared upon request. Source data are provided with this paper.
Code availability
Custom pipelines for GWAS and other analyses are available from dataverse at 10.18738/T8/J377KE.
Competing interests
The authors declare no competing interests.
Footnotes
Peer review information Nature thanks Emily Heaton, Todd Michael, Ian Stavness and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: John T. Lovell, Alice H. MacQueen, Sujan Mamidi, Jason Bonnette, Jerry Jenkins
Contributor Information
John T. Lovell, Email: jlovell@hudsonalpha.org
Thomas E. Juenger, Email: tjuenger@mail.utexas.edu
Jeremy Schmutz, Email: jschmutz@hudsonalpha.org.
Extended data
is available for this paper at 10.1038/s41586-020-03127-1.
Supplementary information
The online version contains supplementary material available at 10.1038/s41586-020-03127-1.
References
- 1.Lobell DB, Schlenker W, Costa-Roberts J. Climate trends and global crop production since 1980. Science. 2011;333:616–620. doi: 10.1126/science.1204531. [DOI] [PubMed] [Google Scholar]
- 2.Challinor AJ, et al. A meta-analysis of crop yield under climate change and adaptation. Nat. Clim. Chang. 2014;4:287–291. doi: 10.1038/nclimate2153. [DOI] [Google Scholar]
- 3.Rosenzweig C, et al. Assessing agricultural risks of climate change in the 21st century in a global gridded crop model intercomparison. Proc. Natl Acad. Sci. USA. 2014;111:3268–3273. doi: 10.1073/pnas.1222463110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Porter, J. R. et al. in Climate Change 2014: Impacts, Adaptation, and Vulnerability. Part A: Global and Sectoral Aspects (Contribution of Working Group II to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change) (eds Field, C. B. et al.) 485–533 (Cambridge Univ. Press, 2014).
- 5.Bevan MW, et al. Genomic innovation for crop improvement. Nature. 2017;543:346–354. doi: 10.1038/nature22011. [DOI] [PubMed] [Google Scholar]
- 6.Nelson R, Wiesner-Hanks T, Wisser R, Balint-Kurti P. Navigating complexity to breed disease-resistant crops. Nat. Rev. Genet. 2018;19:21–33. doi: 10.1038/nrg.2017.82. [DOI] [PubMed] [Google Scholar]
- 7.Risser, P. G., Birney, E. C. & Blocker, H. D. The True Prairie Ecosystem (Dowden, Hutchinson and Ross, 1981).
- 8.Suyker AE, Verma SB. Year‐round observations of the net ecosystem exchange of carbon dioxide in a native tallgrass prairie. Glob. Change Biol. 2001;7:279–289. doi: 10.1046/j.1365-2486.2001.00407.x. [DOI] [Google Scholar]
- 9.Schmer MR, Vogel KP, Mitchell RB, Perrin RK. Net energy of cellulosic ethanol from switchgrass. Proc. Natl Acad. Sci. USA. 2008;105:464–469. doi: 10.1073/pnas.0704767105. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Palik DJ, Snow AA, Stottlemyer AL, Miriti MN, Heaton EA. Relative performance of non-local cultivars and local, wild populations of switchgrass (Panicum virgatum) in competition experiments. PLoS ONE. 2016;11:e0154444. doi: 10.1371/journal.pone.0154444. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.McLaughlin, S. et al. in Perspectives on New Crops and New Uses (ed. Janick, J.) 282–299 (ASHS, 1999).
- 12.Vogel KP, Schmer MR, Mitchell RB. Plant adaptation regions: ecological and climatic classification of plant materials. Rangeland Ecol. Manag. 2005;58:315–319. doi: 10.2111/1551-5028(2005)58[315:PAREAC]2.0.CO;2. [DOI] [Google Scholar]
- 13.Casler MD, et al. Latitudinal and longitudinal adaptation of switchgrass populations. Crop Sci. 2007;47:2249–2260. doi: 10.2135/cropsci2006.12.0780. [DOI] [Google Scholar]
- 14.Lipka AE, et al. Accelerating the switchgrass (Panicum virgatum L.) breeding cycle using genomic selection approaches. PLoS ONE. 2014;9:e112227. doi: 10.1371/journal.pone.0112227. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Poudel HP, Sanciangco MD, Kaeppler SM, Buell CR, Casler MD. Genomic prediction for winter survival of lowland switchgrass in the northern USA. G3. 2019;9:1921–1931. doi: 10.1534/g3.119.400094. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Lowry DB, et al. QTL × environment interactions underlie adaptive divergence in switchgrass across a large latitudinal gradient. Proc. Natl Acad. Sci. USA. 2019;116:12933–12941. doi: 10.1073/pnas.1821543116. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Triplett JK, Wang Y, Zhong J, Kellogg EA. Five nuclear loci resolve the polyploid history of switchgrass (Panicum virgatum L.) and relatives. PLoS ONE. 2012;7:e38702. doi: 10.1371/journal.pone.0038702. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Martínez-Reyna JM, Vogel KP. Incompatibility systems in switchgrass. Crop Sci. 2002;42:1800–1805. doi: 10.2135/cropsci2002.1800. [DOI] [Google Scholar]
- 19.Casler MD, Vogel KP, Harrison M. Switchgrass germplasm resources. Crop Sci. 2015;55:2463–2478. doi: 10.2135/cropsci2015.02.0076. [DOI] [Google Scholar]
- 20.Evans J, et al. Extensive genetic diversity is present within North American switchgrass germplasm. Plant Genome. 2018;11:1–16. doi: 10.3835/plantgenome2017.06.0055. [DOI] [PubMed] [Google Scholar]
- 21.Wang T, Hamann A, Spittlehouse D, Carroll C. Locally downscaled and spatially customizable climate data for historical and future periods for North America. PLoS ONE. 2016;11:e0156720. doi: 10.1371/journal.pone.0156720. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Fick SE, Hijmans RJ. WorldClim 2: new 1-km spatial resolution climate surfaces for global land areas. Int. J. Climatol. 2017;37:4302–4315. doi: 10.1002/joc.5086. [DOI] [Google Scholar]
- 23.Lowry DB, et al. Adaptations between ecotypes and along environmental gradients in Panicum virgatum. Am. Nat. 2014;183:682–692. doi: 10.1086/675760. [DOI] [PubMed] [Google Scholar]
- 24.Casler MD, Vogel KP, Taliaferro CM, Wynia RL. Latitudinal adaptation of switchgrass populations. Crop Sci. 2004;44:293–303. doi: 10.2135/cropsci2004.2930. [DOI] [Google Scholar]
- 25.Porter CL. An analysis of variation between upland and lowland switchgrass Panicum virgatum L in central Oklahoma. Ecology. 1966;47:980–992. doi: 10.2307/1935646. [DOI] [Google Scholar]
- 26.McMillan C. Ecotypic differentiation within four North American prairie grasses. I. Morphological variation within transplanted community fractions. Am. J. Bot. 1964;51:1119–1128. doi: 10.1002/j.1537-2197.1964.tb06743.x. [DOI] [Google Scholar]
- 27.Grabowski PP, Morris GP, Casler MD, Borevitz JO. Population genomic variation reveals roles of history, adaptation and ploidy in switchgrass. Mol. Ecol. 2014;23:4059–4073. doi: 10.1111/mec.12845. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Lu F, et al. Switchgrass genomic diversity, ploidy, and evolution: novel insights from a network-based SNP discovery protocol. PLoS Genet. 2013;9:e1003215. doi: 10.1371/journal.pgen.1003215. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Casler MD, et al. 30 years of progress toward increased biomass yield of switchgrass and big bluestem. Crop Sci. 2018;58:1242–1254. doi: 10.2135/cropsci2017.12.0729. [DOI] [Google Scholar]
- 30.Casler MD, Vogel KP. Selection for biomass yield in upland, lowland, and hybrid switchgrass. Crop Sci. 2014;54:626–636. doi: 10.2135/cropsci2013.04.0239. [DOI] [Google Scholar]
- 31.Suarez-Gonzalez A, Lexer C, Cronk QCB. Adaptive introgression: a plant perspective. Biol. Lett. 2018;14:20170688. doi: 10.1098/rsbl.2017.0688. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Urbut SM, Wang G, Carbonetto P, Stephens M. Flexible statistical methods for estimating and testing effects in genomic studies with multiple conditions. Nat. Genet. 2019;51:187–195. doi: 10.1038/s41588-018-0268-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Kass RE, Raftery AE. Bayes factors. J. Am. Stat. Assoc. 1995;90:773–795. doi: 10.1080/01621459.1995.10476572. [DOI] [Google Scholar]
- 34.Fournier-Level A, et al. A map of local adaptation in Arabidopsis thaliana. Science. 2011;334:86–89. doi: 10.1126/science.1209271. [DOI] [PubMed] [Google Scholar]
- 35.Zhang Y, et al. Post-glacial evolution of Panicum virgatum: centers of diversity and gene pools revealed by SSR markers and cpDNA sequences. Genetica. 2011;139:933–948. doi: 10.1007/s10709-011-9597-6. [DOI] [PubMed] [Google Scholar]
- 36.Corbett-Detig R, Nielsen R. A hidden Markov model approach for simultaneously estimating local ancestry and admixture time using next generation sequence data in samples of arbitrary ploidy. PLoS Genet. 2017;13:e1006529. doi: 10.1371/journal.pgen.1006529. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Todesco M, et al. Massive haplotypes underlie ecotypic differentiation in sunflowers. Nature. 2020;584:602–607. doi: 10.1038/s41586-020-2467-6. [DOI] [PubMed] [Google Scholar]
- 38.Lucek K, Lemoine M, Seehausen O. Contemporary ecotypic divergence during a recent range expansion was facilitated by adaptive introgression. J. Evol. Biol. 2014;27:2233–2248. doi: 10.1111/jeb.12475. [DOI] [PubMed] [Google Scholar]
- 39.Whitney KD, et al. Quantitative trait locus mapping identifies candidate alleles involved in adaptive introgression and range expansion in a wild sunflower. Mol. Ecol. 2015;24:2194–2211. doi: 10.1111/mec.13044. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Comai L. The advantages and disadvantages of being polyploid. Nat. Rev. Genet. 2005;6:836–846. doi: 10.1038/nrg1711. [DOI] [PubMed] [Google Scholar]
- 41.Mattenberger F, Sabater-Muñoz B, Toft C, Fares MA. The phenotypic plasticity of duplicated genes in Saccharomyces cerevisiae and the origin of adaptations. G3. 2017;7:63–75. doi: 10.1534/g3.116.035329. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Clark JW, Donoghue PCJ. Whole-genome duplication and plant macroevolution. Trends Plant Sci. 2018;23:933–945. doi: 10.1016/j.tplants.2018.07.006. [DOI] [PubMed] [Google Scholar]
- 43.Stebbins GL. Polyploidy, hybridization, and the invasion of new habitats. Ann. Mo. Bot. Gard. 1985;72:824. doi: 10.2307/2399224. [DOI] [Google Scholar]
- 44.Bird KA, VanBuren R, Puzey JR, Edger PP. The causes and consequences of subgenome dominance in hybrids and recent polyploids. New Phytol. 2018;220:87–93. doi: 10.1111/nph.15256. [DOI] [PubMed] [Google Scholar]
- 45.Flagel LE, Wendel JF. Evolutionary rate variation, genomic dominance and duplicate gene expression evolution during allotetraploid cotton speciation. New Phytol. 2010;186:184–193. doi: 10.1111/j.1469-8137.2009.03107.x. [DOI] [PubMed] [Google Scholar]
- 46.Edger PP, et al. Origin and evolution of the octoploid strawberry genome. Nat. Genet. 2019;51:541–547. doi: 10.1038/s41588-019-0356-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Chen ZJ, et al. Genomic diversifications of five Gossypium allopolyploid species and their impact on cotton improvement. Nat. Genet. 2020;52:525–533. doi: 10.1038/s41588-020-0614-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Session AM, et al. Genome evolution in the allotetraploid frog Xenopus laevis. Nature. 2016;538:336–343. doi: 10.1038/nature19840. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Nieto Feliner G, Casacuberta J, Wendel JF. Genomics of evolutionary novelty in hybrids and polyploids. Front. Genet. 2020;11:792. doi: 10.3389/fgene.2020.00792. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Davis MB, Shaw RG. Range shifts and adaptive responses to Quaternary climate change. Science. 2001;292:673–679. doi: 10.1126/science.292.5517.673. [DOI] [PubMed] [Google Scholar]
- 51.South, A. rnaturalearthdata: World Vector Map Data from Natural Earth Used in ‘rnaturalearth’. R package version 0.1.0. https://CRAN.R-project.org/package=rnaturalearthdata (2017).
- 52.Xiao C-L, et al. MECAT: fast mapping, error correction, and de novo assembly for single-molecule sequencing reads. Nat. Methods. 2017;14:1072–1074. doi: 10.1038/nmeth.4432. [DOI] [PubMed] [Google Scholar]
- 53.Chin C-S, et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat. Methods. 2013;10:563–569. doi: 10.1038/nmeth.2474. [DOI] [PubMed] [Google Scholar]
- 54.Kent WJ. BLAT—the BLAST-like alignment tool. Genome Res. 2002;12:656–664. doi: 10.1101/gr.229202. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25:1754–1760. doi: 10.1093/bioinformatics/btp324. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.McKenna A, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–1303. doi: 10.1101/gr.107524.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Sharma MK, et al. Targeted switchgrass BAC library screening and sequence analysis identifies predicted biomass and stress response-related genes. Bioenerg. Res. 2016;9:109–122. doi: 10.1007/s12155-015-9667-1. [DOI] [Google Scholar]
- 58.Marçais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011;27:764–770. doi: 10.1093/bioinformatics/btr011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59.Smit, A. F., Hubley, R. & Green, P. RepeatMasker, http://www.repeatmasker.org/ (1996).
- 60.Ellinghaus D, Kurtz S, Willhoeft U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics. 2008;9:18. doi: 10.1186/1471-2105-9-18. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61.Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 2013;30:772–780. doi: 10.1093/molbev/mst010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62.Paradis E, Schliep K. ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics. 2019;35:526–528. doi: 10.1093/bioinformatics/bty633. [DOI] [PubMed] [Google Scholar]
- 63.Bennetzen JL, et al. Reference genome sequence of the model plant Setaria. Nat. Biotechnol. 2012;30:555–561. doi: 10.1038/nbt.2196. [DOI] [PubMed] [Google Scholar]
- 64.Lovell JT, et al. The genomic landscape of molecular responses to natural drought stress in Panicum hallii. Nat. Commun. 2018;9:5213. doi: 10.1038/s41467-018-07669-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65.Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics. 2010;26:873–881. doi: 10.1093/bioinformatics/btq057. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 66.Zuo C, et al. Revealing the transcriptomic complexity of switchgrass by PacBio long-read sequencing. Biotechnol. Biofuels. 2018;11:170. doi: 10.1186/s13068-018-1167-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 67.Haas BJ, et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 2003;31:5654–5666. doi: 10.1093/nar/gkg770. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 68.Slater GSC, Birney E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics. 2005;6:31. doi: 10.1186/1471-2105-6-31. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 69.Lamesch P, et al. The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Res. 2012;40:D1202–D1210. doi: 10.1093/nar/gkr1090. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 70.Schmutz J, et al. Genome sequence of the palaeopolyploid soybean. Nature. 2010;463:178–183. doi: 10.1038/nature08670. [DOI] [PubMed] [Google Scholar]
- 71.Jain R, et al. Genome sequence of the model rice variety KitaakeX. BMC Genomics. 2019;20:905. doi: 10.1186/s12864-019-6262-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 72.Mamidi S, et al. A genome resource for green millet Setaria viridis enables discovery of agronomically valuable loci. Nat. Biotechnol. 2020;38:1203–1210. doi: 10.1038/s41587-020-0681-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 73.Paterson AH, et al. The Sorghum bicolor genome and the diversification of grasses. Nature. 2009;457:551–556. doi: 10.1038/nature07723. [DOI] [PubMed] [Google Scholar]
- 74.Gordon SP, et al. Extensive gene content variation in the Brachypodium distachyon pan-genome correlates with population structure. Nat. Commun. 2017;8:2184. doi: 10.1038/s41467-017-02292-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 75.UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 2019;47:D506–D515. doi: 10.1093/nar/gky1049. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 76.Smit, A. & Hubley, R. RepeatModeler Open-1.0, http://www.repeatmasker.org/ (2010).
- 77.Bao W, Kojima KK, Kohany O. Repbase update, a database of repetitive elements in eukaryotic genomes. Mob. DNA. 2015;6:11. doi: 10.1186/s13100-015-0041-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 78.Emms DM, Kelly S. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol. 2015;16:157. doi: 10.1186/s13059-015-0721-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 79.Goodstein DM, et al. Phytozome: a comparative platform for green plant genomics. Nucleic Acids Res. 2012;40:D1178–D1186. doi: 10.1093/nar/gkr944. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 80.Schliep KP. phangorn: phylogenetic analysis in R. Bioinformatics. 2011;27:592–593. doi: 10.1093/bioinformatics/btq706. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 81.Subramanian AR, Kaufmann M, Morgenstern B. DIALIGN-TX: greedy and progressive approaches for segment-based multiple sequence alignment. Algorithms Mol. Biol. 2008;3:6. doi: 10.1186/1748-7188-3-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 82.Talavera G, Castresana J. Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst. Biol. 2007;56:564–577. doi: 10.1080/10635150701472164. [DOI] [PubMed] [Google Scholar]
- 83.Guindon S, et al. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 2010;59:307–321. doi: 10.1093/sysbio/syq010. [DOI] [PubMed] [Google Scholar]
- 84.Sanderson MJ. r8s: inferring absolute rates of molecular evolution and divergence times in the absence of a molecular clock. Bioinformatics. 2003;19:301–302. doi: 10.1093/bioinformatics/19.2.301. [DOI] [PubMed] [Google Scholar]
- 85.Anders S, Pyl PT, Huber W. HTSeq—a Python framework to work with high-throughput sequencing data. Bioinformatics. 2015;31:166–169. doi: 10.1093/bioinformatics/btu638. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 86.Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15:550. doi: 10.1186/s13059-014-0550-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 87.Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. 2008;9:559. doi: 10.1186/1471-2105-9-559. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 88.Li H, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 89.Koboldt DC, et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 2012;22:568–576. doi: 10.1101/gr.129684.111. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 90.O’Connell J, et al. Haplotype estimation for biobank-scale data sets. Nat. Genet. 2016;48:817–820. doi: 10.1038/ng.3583. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 91.Danecek P, et al. The variant call format and VCFtools. Bioinformatics. 2011;27:2156–2158. doi: 10.1093/bioinformatics/btr330. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 92.Bates, D., Mächler, M., Bolker, B. & Walker, S. Fitting linear mixed-effects models using lme4. J. Stat. Softw. 67, 10.18637/jss.v067.i01 (2015).
- 93.Azad A, Pavlopoulos GA, Ouzounis CA, Kyrpides NC, Buluç A. HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks. Nucleic Acids Res. 2018;46:e33. doi: 10.1093/nar/gkx1313. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 94.VanBuren R, et al. Single-molecule sequencing of the desiccation-tolerant grass Oropetium thomaeum. Nature. 2015;527:508–511. doi: 10.1038/nature15714. [DOI] [PubMed] [Google Scholar]
- 95.McCormick RF, et al. The Sorghum bicolor reference genome: improved assembly, gene annotations, a transcriptome atlas, and signatures of genome organization. Plant J. 2018;93:338–354. doi: 10.1111/tpj.13781. [DOI] [PubMed] [Google Scholar]
- 96.Sedlazeck FJ, et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat. Methods. 2018;15:461–468. doi: 10.1038/s41592-018-0001-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 97.Jeffares DC, et al. Transient structural variations have strong effects on quantitative traits and reproductive isolation in fission yeast. Nat. Commun. 2017;8:14061. doi: 10.1038/ncomms14061. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 98.Zheng X, et al. A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics. 2012;28:3326–3328. doi: 10.1093/bioinformatics/bts606. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 99.Remington DL, et al. Structure of linkage disequilibrium and phenotypic associations in the maize genome. Proc. Natl Acad. Sci. USA. 2001;98:11479–11484. doi: 10.1073/pnas.201394398. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 100.Purcell S, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 2007;81:559–575. doi: 10.1086/519795. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 101.Jombart T, Devillard S, Balloux F. Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genet. 2010;11:94. doi: 10.1186/1471-2156-11-94. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 102.Jombart T, Ahmed I. adegenet 1.3-1: new tools for the analysis of genome-wide SNP data. Bioinformatics. 2011;27:3070–3071. doi: 10.1093/bioinformatics/btr521. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 103.Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000;155:945–959. doi: 10.1093/genetics/155.2.945. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 104.Schiffels S, Durbin R. Inferring human population size and separation history from multiple genome sequences. Nat. Genet. 2014;46:919–925. doi: 10.1038/ng.3015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 105.McArtor DB, Lubke GH, Bergeman CS. Extending multivariate distance matrix regression with an effect size measure and the asymptotic null distribution of the test statistic. Psychometrika. 2017;82:1052–1077. doi: 10.1007/s11336-016-9527-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 106.Kuhn, M. et al. Caret: Classification and Regression Training. R package version 6.0-78. https://CRAN.R-project.org/package=caret (2016).
- 107.Medina P, Thornlow B, Nielsen R, Corbett-Detig R. Estimating the timing of multiple admixture pulses during local ancestry inference. Genetics. 2018;210:1089–1107. doi: 10.1534/genetics.118.301411. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 108.Pebesma E. Simple features for R: standardized support for spatial vector data. R J. 2018;10:439. doi: 10.32614/RJ-2018-009. [DOI] [Google Scholar]
- 109.Hijmans, R. J. et al. raster: Geographic Data Analysis and Modeling. R package version 3.4-5. https://CRAN.R-project.org/package=raster (2015).
- 110.R Core Team. R: A Language and Environment for Statistical Computing, https://www.r-project.org/ (R Foundation for Statistical Computing, 2013).
- 111.Langfelder, P., Zhang, B. & Horvath, S. dynamicTreeCut: Methods for Detection of Clusters in Hierarchical Clustering Dendrograms. R package version 1.63-1. https://CRAN.R-project.org/package=dynamicTreeCut (2014).
- 112.Maechler, M. et al. Cluster: Cluster Analysis Basics and Extensions. R package version 1-56. https://CRAN.R-project.org/package=cluster (2012).
- 113.Thuiller, W., Georges, D., Engler, R., & Breiner, F. biomod2: Ensemble Platform for Species Distribution Modeling. R package version 3.3-7. https://CRAN.R-project.org/package=biomod2 (2016).
- 114.Barbet-Massin M, Jiguet F, Albert CH, Thuiller W. Selecting pseudo-absences for species distribution models: how, where and how many? Methods Ecol. Evol. 2012;3:327–338. doi: 10.1111/j.2041-210X.2011.00172.x. [DOI] [Google Scholar]
- 115.Allouche O, Tsoar A, Kadmon R. Assessing the accuracy of species distribution models: prevalence, kappa and the true skill statistic (TSS) J. Appl. Ecol. 2006;43:1223–1232. doi: 10.1111/j.1365-2664.2006.01214.x. [DOI] [Google Scholar]
- 116.Coetzee BWT, Robertson MP, Erasmus BFN, van Rensburg BJ, Thuiller W. Ensemble models predict Important Bird Areas in southern Africa will become less effective for conserving endemic birds under climate change. Glob. Ecol. Biogeogr. 2009;18:701–710. doi: 10.1111/j.1466-8238.2009.00485.x. [DOI] [Google Scholar]
- 117.Oksanen, J., Blanchet, F., Kindt, R., Legendre, P. & Minchin, R. vegan: Community Ecology Package. R package version 2.0-10. https://CRAN.R-project.org/package=vegan (2013).
- 118.Gugger PF, Ikegami M, Sork VL. Influence of late Quaternary climate change on present patterns of genetic variation in valley oak, Quercus lobata Née. Mol. Ecol. 2013;22:3598–3612. doi: 10.1111/mec.12317. [DOI] [PubMed] [Google Scholar]
- 119.Napier JD, de Lafontaine G, Hu FS. Exploring genomic variation associated with drought stress in Picea mariana populations. Ecol. Evol. 2020;10:9271–9282. doi: 10.1002/ece3.6614. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 120.Forester BR, Lasky JR, Wagner HH, Urban DL. Comparing methods for detecting multilocus adaptation with multivariate genotype–environment associations. Mol. Ecol. 2018;27:2215–2233. doi: 10.1111/mec.14584. [DOI] [PubMed] [Google Scholar]
- 121.Privé F, Aschard H, Ziyatdinov A, Blum MGB. Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr. Bioinformatics. 2018;34:2781–2787. doi: 10.1093/bioinformatics/bty185. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 122.Stephens M. False discovery rates: a new deal. Biostat. 2016;60:kxw041. doi: 10.1093/biostatistics/kxw041. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 123.George AW, Visscher PM, Haley CS. Mapping quantitative trait loci in complex pedigrees: a two-step variance component approach. Genetics. 2000;156:2081–2092. doi: 10.1093/genetics/156.4.2081. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 124.VanRaden PM, et al. Reliability of genomic predictions for North American Holstein bulls. J. Dairy Sci. 2009;92:16–24. doi: 10.3168/jds.2008-1514. [DOI] [PubMed] [Google Scholar]
- 125.Santantonio N, Jannink J-L, Sorrells M. A low resolution epistasis mapping approach to identify chromosome arm interactions in allohexaploid wheat. G3. 2018;9:675–684. doi: 10.1534/g3.118.200646. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 126.Cingolani P, et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff. Fly (Austin) 2012;6:80–92. doi: 10.4161/fly.19695. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 127.Grabowski PP, et al. Genome-wide associations with flowering time in switchgrass using exome-capture sequencing data. New Phytol. 2017;213:154–169. doi: 10.1111/nph.14101. [DOI] [PubMed] [Google Scholar]
- 128.Schnable PS, et al. The B73 maize genome: complexity, diversity, and dynamics. Science. 2009;326:1112–1115. doi: 10.1126/science.1178534. [DOI] [PubMed] [Google Scholar]
- 129.Maccaferri M, et al. Durum wheat genome highlights past domestication signatures and future improvement targets. Nat. Genet. 2019;51:885–895. doi: 10.1038/s41588-019-0381-3. [DOI] [PubMed] [Google Scholar]
- 130.Zou C, et al. The genome of broomcorn millet. Nat. Commun. 2019;10:436. doi: 10.1038/s41467-019-08409-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 131.VanBuren R, et al. Exceptional subgenome stability and functional divergence in the allotetraploid Ethiopian cereal teff. Nat. Commun. 2020;11:884. doi: 10.1038/s41467-020-14724-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 132.Hofmeister BT, et al. A genome assembly and the somatic genetic and epigenetic mutation rate in a wild long-lived perennial Populus trichocarpa. Genome Biol. 2020;21:259. doi: 10.1186/s13059-020-02162-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 133.Marrano A, et al. High-quality chromosome-scale assembly of the walnut (Juglans regia L.) reference genome. Gigascience. 2020;9:giaa050. doi: 10.1093/gigascience/giaa050. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
Sequence Read Archive accession codes for all RNA and DNA sequencing libraries can be found in Supplementary Data 3 and 4, respectively. The v5 AP13 genome has been deposited at DDBJ/ENA/GenBank under the accession JABWAI000000000. The genome, gene and repeat annotations can also be downloaded directly from Phytozome at https://phytozome-next.jgi.doe.gov/info/Pvirgatum_v5_1. Whenever possible, plant material will be shared upon request. Source data are provided with this paper.
Custom pipelines for GWAS and other analyses are available from dataverse at 10.18738/T8/J377KE.