Abstract
Genetic diversity is key to crop improvement. Owing to pervasive genomic structural variation, a single reference genome assembly cannot capture the full complement of sequence diversity of a crop species (known as the ‘pan-genome’1). Multiple high-quality sequence assemblies are an indispensable component of a pan-genome infrastructure. Barley (Hordeum vulgare L.) is an important cereal crop with a long history of cultivation that is adapted to a wide range of agro-climatic conditions2. Here we report the construction of chromosome-scale sequence assemblies for the genotypes of 20 varieties of barley—comprising landraces, cultivars and a wild barley—that were selected as representatives of global barley diversity. We catalogued genomic presence/absence variants and explored the use of structural variants for quantitative genetic analysis through whole-genome shotgun sequencing of 300 gene bank accessions. We discovered abundant large inversion polymorphisms and analysed in detail two inversions that are frequently found in current elite barley germplasm; one is probably the product of mutation breeding and the other is tightly linked to a locus that is involved in the expansion of geographical range. This first-generation barley pan-genome makes previously hidden genetic variation accessible to genetic studies and breeding.
Subject terms: Structural variation, Genomics, Plant genetics
Chromosome-scale sequence assemblies of 20 diverse varieties of barley are used to construct a first-generation pan-genome, revealing previously hidden genetic variation that can be used by studies aimed at crop improvement
Main
A staple food of ancient civilizations, today barley is used mainly for animal feed and malting. Barley is more adaptable to harsh environmental conditions than its close relative wheat, and maintains an important role in human nutrition in harsh climatic regions that include the Ethiopian and Tibetan highlands2. As in other crops, genomics has been a major driver of progress in barley genetics and breeding in the past decade3. The first draft reference genome for barley4, and its subsequent revisions5,6, have formed the basis for gene isolation7, compiling a single-nucleotide polymorphism (SNP) variation atlas for wild and domesticated germplasm8, and activating plant genetic resources9. At the same time, reduced-representation surveys of structural variation10 and map-based cloning11 have implicated variation in gene content and copy number in the control of agronomic traits. The concept of the pan-genome refers to a species-wide catalogue of genic presence/absence variation (PAV)12, or more generally, structural variation that affects (potentially non-coding) sequences of 50 or more base pairs (bp) in size. Although several methods of pan-genomic analysis that use short-read sequence data in the context of a single reference genome have been devised13, large and complex genomes require multiple high-quality sequence assemblies to capture and contextualize sequences that are absent in—or highly diverged from—a single reference genotype14. Progress in sequencing and genome mapping technologies has only recently made possible the fast and cost-effective assembly of tens of genotypes of large-genome plant species, such as barley (haploid genome size of 5 Gb)15.
Twenty barley reference genomes
The starting point for pan-genomics in barley was the comprehensive survey of species-wide diversity on the basis of the genome-wide genotyping of more than 22,000 barley accessions, mainly from the German national gene bank9. To achieve a good representation of major barley gene pools, we selected accessions that were located in the branches of the first six principal components from the previously published principal component analysis (PCA)9 (Fig. 1a, Extended Data Fig. 1), reflecting the key determinants of population structure: geographical origin, row type and annual growth habit. In addition to these gene pool representatives, our panel included the reference cultivar Morex5, two current or former elite malting varieties (RGT Planet and Hockett), two founder lines of Chinese barley breeding (ZDM01467 and ZDM02064), Golden Promise and Igri (two genotypes with high transformation efficiency16,17), Barke (a successful German variety and the parent of several mutant and mapping populations18,19) and one wild barley (H. vulgare subsp. spontaneum (K. Koch) Thell.) genotype from Israel (B1K-04-12, a desert ecotype collected at Ein Prat)20.
We constructed chromosome-scale sequence assemblies for 20 accessions (Extended Data Table 1). In brief, paired-end and mate-pair Illumina short reads were assembled into scaffolds of megabase (Mb)-scale contiguity (Extended Data Table 1). Scaffold assembly was done with Minia21 and SOAPDenovo22 following the TRITEX method6 (n = 16), DeNovoMagic from NRGene (n = 3) or W2rap23 (n = 1). We used 10X Genomics Chromium linked-reads and chromosome conformation capture (Hi-C) data to arrange scaffolds into chromosomal pseudomolecules using the TRITEX pipeline6 (Extended Data Table 1). A comparison of the short-read assembly of the Morex cultivar to a long-read assembly of this genotype generated from PacBio long reads showed high co-linearity at chromosomal scale, good concordance in gene space representation and similar power to detect PAV (Extended Data Fig. 2), indicating that short-read assemblies are amenable to pan-genomic analyses in barley. Although the assemblies of the 20 diverse accessions differed in contiguity and the extent of gap sequence in the intergenic space, they had a similar representation of reference gene models (Morex V2) and were highly co-linear to each other at the whole-chromosome scale (Fig. 1b, Extended Data Fig. 3). A similar proportion (about 80%) of the assembled sequence of each genotype was composed of transposable elements, with an average of 113,200 intact full-length long-terminal repeat retro-elements (LTRs) per assembly (Supplementary Table 1). However, we found pronounced differences in the number of shared intact full-length LTR locations: only 17 to 25% of full-length LTR locations present in the wild barley B1K-04-12 were shared at 98% sequence identity and 98% alignment coverage with any domesticated genotype (Extended Data Fig. 4). By contrast, more closely related domesticated genotypes shared between 53% and 67% of their full-length LTRs, consistent with previous reports of rapid sequence turn-over in the non-coding space in large-genome plant species24,25.
Extended Data Table 1.
#Chromosomes 1H to 7H.
§Non-transposable element models or transposable-element filtered.
De novo gene annotation using Illumina RNA sequencing and PacBio Iso-Seq data (Supplementary Table 2) was performed for three genotypes: Morex (which has previously been reported6), Barke and the Ethiopian landrace HOR 10350 (Extended Data Fig. 5). Gene models defined on the basis of these three assemblies were consolidated and projected onto the remaining 17 assemblies (Extended Data Fig. 5). Between 35,859 and 40,044 gene models were annotated by projection in each assembly (Extended Data Table 1) with an average of 37,515 (s.d. = 896). The number of gene models was about 20% higher in the projections than in de novo annotations (Extended Data Fig. 5e), which indicates that some of the models lack transcript support: possible explanations for the discrepancy are highly tissue-specific expression or pseudogenization. The clustering of orthologous gene models yielded 40,176 orthologous groups. Of these, 21,992 occurred as a single copy in all 20 assemblies; 3,236 occurred in multiple copies in at least one of the 20 assemblies; 13,188 were absent from at least one assembly; and 1,760 were present in only one assembly. On average, 14.7% of gene models annotated in each assembly occurred in tandem arrays that comprised two or more adjacent copies. These results point to abundant genic copy-number variation between barley genotypes. Future transcriptomic studies will ascertain the effect of structural variants on gene expression.
Pan-genome as a tool for genetics and breeding
High-quality genome assemblies are a resource for ascertaining and providing context to structural variants, which can then be genotyped in a wider set of germplasm using low-coverage or reduced-representation sequence data. We used two complementary approaches to detect structural variation: assembly comparison and clustering of single-copy sequences to derive markers that can be scored in short-read data. We used the Assemblytics26 software to discover PAV by pair-wise comparison of 19 chromosome-scale assemblies to the Morex reference. We identified 1,586,262 PAVs, ranging in size from 50 to 999,568 bp, and observed an enrichment for low-frequency variants (Extended Data Fig. 6a, b). PAV density was higher in distal, gene-rich regions (Extended Data Fig. 6c), which are characterized by higher nucleotide diversity and recombination rates8. A total of 5,446 out of 5,602 deletions longer than 5 kilobases (kb) found in Barke relative to Morex were mapped genetically in the 90 recombinant inbred lines of the Morex × Barke population19 with highly concordant positions (Spearman correlation = 0.99) (Extended Data Fig. 6d), which provides support for the accuracy of the detected polymorphisms. At least one member of 18,562 (46%) groups of orthologous genes overlapped with structural variants discovered in the 20 sequence assemblies. As observed in other plant species27, resistance-gene homologues containing NB-ARC and protein kinase domains were frequently found among PAV genes (Supplementary Table 3).
Structural variants cover non-genic regions composed of repetitive sequence, making it hard to establish orthologous relationships or the presence of specific alleles from short-read data only. To derive quantitative estimates of the extent of pan-genomic variation and as a tool for genetic analysis such as association scans, we focused on single-copy regions extracted from each of the 20 assemblies and clustered into a non-redundant set of sequences (hereafter referred to as the ‘single-copy pan-genome’) (Extended Data Fig. 7a). The average cumulative size of single-copy sequence in each accession was 478 Mb (that is, 9.5% of the assembly genome). The total size of non-redundant single-copy sequence was 638.6 Mb, represented by 1,472,508 clusters with an N50 of 1,087 bp (Extended Data Fig. 7b). The single-copy sequence shared among all 20 genotypes amounted to 402.5 Mb, whereas 235.9 Mb were variable (that is, absent or present in higher copy number in at least one assembly) (Fig. 2a). On average, each of the 20 genotypes contained 2.9 Mb of single-copy sequence not present in any other assembly. As observed for transposable element divergence, the wild barley B1K-04-12 had the highest amount of unique single-copy sequence (Extended Data Table 1).
To test the suitability of the single-copy pan-genome for genetic analysis in a wider diversity panel without high-quality genome sequences, we collected whole-genome shotgun data (threefold coverage) for 200 domesticated and 100 wild varieties of barley (Supplementary Table 4). The abundance of 160,716 single-copy clusters that overlap structural variants was estimated by counting cluster-constituent k-mers (k = 31) in sequence reads of the diversity panel. In addition, we analysed genotyping-by-sequencing data of 19,778 gene bank accessions of domesticated barley9 using the same approach. Abundance estimates based on k-mers (hereafter referred to as ‘pan-genome markers’) showed that loci detected as single-copy sequence in one genome assembly can vary in copy number from zero to many in diverse germplasm (Extended Data Fig. 7c). A PCA of pan-genome markers genotyped in whole-genome shotgun and genotyping-by-sequencing data highlighted the same drivers of global population structure as SNPs (Extended Data Fig. 7d–g). In genome-wide association scans for morphological traits, pan-genome markers revealed—with a good signal-to-noise ratio—peaks that are consistent with previous reports9 (Fig. 2b, Extended Data Fig. 8). Notably, the pan-genome marker that was most highly associated with lemma adherence covered the NUDUM (NUD) gene11 (Fig. 2c). All varieties of naked barley—in which lemmas can be easily separated from grains—are thought to trace back to a single mutational event, deleting the entire NUD sequence11. Another putative knockout allele of NUD (nud1.g) that contains a likely disruptive SNP variant was recently found in Tibetan barley28. All 36 naked accessions in our panel contained the known deletion (Fig. 2d), indicating that broader sampling of barley diversity—with a particular focus on centres of (morphological) diversity—is needed to discover novel rare alleles by genomic analyses.
Compared to reference-free approaches for k-mer-based genome-wide association scans such as AgRenSeq29, trait-associated pan-genome markers are assigned with high precision to genomic positions, and aligning sequence assemblies in their vicinity provides immediate information about differences between haplotypes (Fig. 2c). Furthermore, the reduction of marker number by implicit clustering of k-mers into single-copy loci allows the use of standard mixed linear models30,31 to correct for genomic relatedness.
A map of polymorphic inversions
Chromosome-scale sequence assemblies can reveal large-scale rearrangements that are challenging to detect with other methods. Large inversions (more than 5 Mb in size) were prominent in the genome alignments of our 20 assemblies (Fig. 1b, Extended Data Fig. 3a, c). Previous reports on segregating inversions in barley are anecdotal and have focused on induced mutants32,33. To discover inversions in a broader set of germplasm, we mined patterns of contact frequencies in Hi-C data of a diversity panel mapped to a single reference genome34. Among 69 barley genotypes (67 domesticated and 2 wild accessions) (Supplementary Table 5), Hi-C-based inversion scans revealed a total of 42 events that ranged from 4 to 141 Mb in size (mean size of 23.9 Mb) (Extended Data Fig. 9a). Most of these events occurred in the low-recombining pericentromeric regions of the barley chromosomes and segregated at low frequency: 25 events were observed only once (Extended Data Fig. 9b, c, Supplementary Table 6). We focus here on two notable examples: a frequent event on chromosome 2H and an inversion in the distal region of the long arm of chromosome 7H.
The inversion in chromosome 7H detected in the RGT Planet cultivar was the largest event that segregated in our panel (141 Mb) (Fig. 3a). In a biparental mapping population derived from a cross between RGT Planet and the non-carrier cultivar Hindmarsh (Fig. 3b), this event repressed recombination in an interval that spanned 49 cM in the genetic map of the Morex × Barke population19, which is isogenic for absence of the inversion (Fig. 3c, Supplementary Table 7). We also observed a moderately distorted segregation (57% allele frequency, χ2 = 4.88, P < 0.05) in favour of the Hindmarsh allele in this interval. Recombination frequencies were increased in the flanking regions of the inversion in the RGT Planet × Hindmarsh population relative to Morex × Barke, which suggests a compensatory mechanism to maintain an average number of one-to-two crossovers per chromosome in the presence of large tracts of suppressed recombination35.
By focusing on the inversion breakpoints in the RGT Planet sequence assembly, we designed a diagnostic PCR assay (Supplementary Fig. 2a, b, d) to rapidly genotype the presence of the inversion in 1,406 accessions (Supplementary Table 8). The inverted haplotype occurred at low frequency (1.3%) in the whole panel, but was found in many lines in the RGT Planet pedigree (Supplementary Fig. 3)—including commercially successful barley cultivars of past decades, such as Triumph, Quench and Sebastian. The earliest cultivar that carried the inversion was Diamant. As one of the donors of the semi-dwarf growth habit, Diamant was a highly influential founder line of modern barley breeding and traces back to a mutant induced by gamma irradiation of the Czech cultivar Valticky36. We genotyped several gene bank accessions and germplasm samples of both Valticky and Diamant. Notably, none of the Valticky samples carried the inversion, whereas it segregated in the Diamant samples (Fig. 3d). Quantitative trait loci mapping for yield-related traits in the RGT Planet × Hindmarsh population did not show signals on chromosome 7H (Supplementary Fig. 2e, Supplementary Table 9), consistent with selective neutrality of the inversion. This strongly suggests that mutation breeding in the 1960s has given rise to a cryptic large inversion, which—unbeknownst to breeders—segregates in elite varieties of barley.
The second inversion we focused on spanned 10 Mb in the interstitial region of chromosome 2H (Fig. 1b) and was present in 26 out of 69 Hi-C samples (Supplementary Table 8). Local PCA and haplotype analysis in our panel of 200 domesticated and 100 wild varieties of barley indicated a single origin of the inverted haplotype (Fig. 4a, b, Supplementary Fig. 2c). The inversion occurred only among domesticated barley of Western geographical origin9, indicating that it arose or has risen to high frequency after domestication. The inverted region contains 46 high-confidence genes in the Morex cultivar. The closest gene to the inversion breakpoint—at 448 kb distance from the distal breakpoint in the non-carrier Morex—was HvCENTRORADIALIS (HvCEN)37 (Fig. 4c). Although induced mutants of HvCEN flower very early, natural variation in HvCEN has previously been implicated in environmental adaptation to northern European climates37. All of the inversion carriers we analysed had HvCEN haplotype III, which is associated with later flowering in spring barley varieties from northern Europe37,38. Further research is required to determine whether the inversion close to HvCEN has direct functional consequences (for instance, by modulating HvCEN expression) or whether it hitchhiked along with a tightly linked causal variant.
Discussion
The digital representation of the pan-genome can expand the repertoire of natural or induced sequence variation that is accessible to genetic analyses and breeding. Our comparison of 20 chromosome-scale sequence assemblies has revealed pervasive variation in genes and non-coding regions. Focusing on single-copy sequences, we translated this variation into scorable markers that are amenable to population genetic analysis and association scans. A notable finding was the prevalence of large (more than 5 Mb in size) inversion polymorphisms in current elite germplasm. It is likely that the suppression of genetic recombination in inversion heterozygotes has manifested itself in hard-to-explain patterns of long-range linkage and segregation distortion between elite lines in breeding programmes. Our map of inversion polymorphisms will provide breeders with a point of reference to avoid—or interpret correctly—crosses between carriers and non-carriers. We found abundant structural variation in 20 representative barley genotypes, but individual events occurred at low frequency (Extended Data Figs. 6, 9). This observation, combined with the slow saturation of the single-copy pan-genome (Fig. 2a), motivate the genomic analysis of more genotypes to expand the barley pan-genome. The next phase of barley pan-genomics will focus on an augmented panel of domesticated and wild germplasm, working towards the long-term goal of high-quality genome sequences of all barley plant genetic resources as part of a biodigital resource centre39,40.
Methods
No statistical methods were used to predetermine sample size. The experiments were not randomized, and investigators were not blinded to allocation during experiments and outcome assessment.
Library preparation, sequencing data generation and genome assembly of 20 diverse varieties of barley
High-molecular-weight DNA was extracted from one-week-old seedlings of 20 diverse barley accessions given in Supplementary Table 10, using a previously described large-scale DNA extraction protocol41. For the NRGene DeNovoMAGIC3.0 assemblies, 450-bp paired-end (PE450) libraries of Morex, Barke, HOR 10350 and B1K-04-12 were prepared at the Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben. The 450-bp paired-end libraries for other accessions, 800-bp paired-end libraries and mate-pair libraries of three sizes were prepared and sequenced at the University of Illinois Roy J. Carver Biotechnology Center. The 10X Genomics Chromium libraries were prepared at the University of Saskatchewan Wheat Molecular Breeding Laboratoryand sequenced by Genome Quebec or prepared and sequenced at the Roy J. Carver Biotechnology Center, using the manufacturers’ recommendations. Published tethered chromosome conformation data for Morex, Barke, HOR 10350 and B1K-04-12 (ref. 42) was used for scaffolding the respective genome. For the other accessions, in situ Hi-C libraries were prepared using a previously described method43. Sequencing data generated from each of the libraries are given in Supplementary Table 10. NRGene DeNovoMAGIC3.0 scaffold assemblies were provided for Barke, HOR 10350 and B1K-04-12. The 10X Chromium, population sequencing (POPSEQ) and Hi-C data were then used to prepare chromosome-scale assemblies using the TRITEX pipeline6 (commit: 7041ff2). For the other assemblies, the TRITEX pipeline was also used for contig assembly and scaffolding with mate-pair and 10X data (Extended Data Table 1). High-confidence gene models annotated on the Morex V2 reference6 and full-length cDNA sequences44 were aligned to the assemblies to assess gene-space completeness with the parameters of ≥90% query coverage and ≥97% (≥90% for full-length cDNA) identity.
Tissue collection and RNA extraction
Plant material for the collection of tissues for RNA sequencing (RNA-seq) and Iso-Seq was grown in the greenhouse at IPK Gatersleben with day–night temperatures of 21 °C–18 °C. Embryonic tissue, leaves, roots, internode, inflorescence (5 mm) and developing seeds (5 and 15 days after pollination) were collected as previously described4, snap-frozen in liquid nitrogen and stored at −80 °C until RNA extractions were performed. RNA was extracted from the collected tissues using a Trizol extraction protocol4 and purified using Qiagen RNeasy miniprep columns as per the manufacturer’s instructions. RNA quality was checked on Agilent RNA HS screen tape and RNA with RIN value greater than 8 was used for RNA-seq and Iso-Seq library construction.
RNA-seq library preparation and data generation
RNA-seq libraries were prepared from purified RNA using the TruSeq RNA sample preparation kit (Illumina) as per the manufacturer’s recommendation at IPK Gatersleben. Libraries were pooled at equimolar concentrations, quantified by qPCR and paired-end-sequenced on an Illumina HiSeq 2500 for 200 cycles. The data generated for each tissue are given in Supplementary Table 2.
Iso-Seq data generation and analysis
Two libraries for each embryonic tissue RNA and pooled RNA from seven tissues (described in ‘Tissue collection and RNA extraction’) were prepared for Barke and HOR 10350 using the PacBio Iso-Seq protocol. In brief, double-stranded RNA was synthesized using SMARTer PCR cDNA synthesis kit (Clontech; cat. no. 634925). Two fractions of cDNA with different size profiles were prepared by using differing ratios of DNA to Ampure XP beads (Beckman Coulter, cat. no. A63882). Equimolar concentration of each fraction were pooled, and a minimum of one microgram of double-stranded cDNA was used for Iso-Seq library construction as per the PacBio library construction protocol. Two additional libraries from pooled RNA tissues were prepared using cDNA prepared from TeloPrime v.1.0 kit (Lexogen) following the manufacturer’s instructions. Libraries were quantified and sequenced on a PacBio Sequel device at IPK Gatersleben. Data were analysed using SMRTLink v.5.0 Isoseq v.1.0 pipeline or Isoseq3 pipeline (https://github.com/PacificBiosciences/IsoSeq_SA3nUP/wiki/Tutorial:-Installing-and-Running-Iso-Seq-3-using-Conda). The steps involved in Iso-Seq data analysis were the generation of circular consensus sequences, and then the classification of circular consensus sequence reads into full-length non-chimeric reads and non-full length reads on the basis of the presence of primer sequences and polyA sequences. Full-length non-chimeric reads were then clustered on the basis of sequence similarity to yield high- and low-quality isoforms. The data generated and method of library preparation are given in Supplementary Table 2.
Gene projections and repeat annotation
Gene models for Morex, Barke and HOR 10350 were predicted using transcriptome data (Supplementary Table 2) and protein homology evidence, and derived by a previously described annotation pipeline5. High-confidence gene models from these accessions were aligned to pseudo-chromosomes of each accession separately using blat45. For each genomic region identified by blat, additional alignments were performed by exonerate46 in its genomic neighbourhood ranging between 20 kb upstream and 20 kb downstream of the match position. A series of quality criteria was applied to select high-confidence gene models in each accession. The functional annotation for genes of 20 accessions was carried out using the AHRD pipeline v.3.3.3 (https://github.com/groupschoof/AHRD). Orthologous gene groups between the twenty accessions were predicted using OrthoFinder47 v.2.3.1 with default parameters.
Repeat annotation
To obtain a consistent transposon annotation across all lines for transposons and tandem repeats, the same methods were applied to all 20 barley lines. Transposons were detected and classified by homology search against the REdat_9.7_Poaceae section of the PGSB transposon library48. The program vmatch (http://www.vmatch.de, version 2.3.0) was used for that purpose as a fast and efficient matching tool that is well-suited for such large and highly repetitive genomes. Vmatch was run with the following parameters: identity ≥ 70%, minimal hit length 75 bp, seed length 12 bp (exact command line: -d -p -l 75 -identity 70 -seedlength 12 -exdrop 5). To remove overlapping annotations, the vmatch output was filtered for redundant hits via a priority-based approach. Higher scoring matches were assigned first and lower scoring hits at overlapping positions were either shortened or removed if they were contained to ≥90% in the overlap or <50 bp of rest length remained. The resulting transposon annotations are overlap-free, but disrupted elements from nested insertions have not been defragmented into one element. Still-intact full-length LTR retrotransposons were identified with LTRharvest49, a program that scans the genome for LTR retrotransposon specific structural hallmarks, such as long terminal repeats, RNA cognate primer binding sites and target site duplications. LTRharvest (included in genometools 1.5.9) was run with the following parameter settings: ‘overlaps best -seed 30 -minlenltr 100 -maxlenltr 2000 -mindistltr 3000 -maxdistltr 25000 -similar 85 -mintsd 4 -maxtsd 20 -motif tgca -motifmis 1 -vic 60 -xdrop 5 -mat 2 -mis -2 -ins -3 -del -3’. All candidates were annotated for PfamA domains using hmmer3 (http://hmmer.org, version 3.1b2) and filtered to remove false positives. The inner domain order served as a criterion for the LTR-retrotransposon superfamily classification into either Gypsy or Copia. In the cases of insufficient domain information, the elements were assigned as still undetermined.
Most of the transposons insert at random locations leading to novel and usually unique sequence stretches at both borders around the inserted element and the neighbouring original sequence. The de novo detected full-length LTR set provides defined element borders, a prerequisite for the exact positioning of transposable element junctions. We used 100-bp single transposable element junctions with 50 bp outside and 50 bp inside the element from both sides of the element and merged them to 200 bp joined junctions per element. Junctions from the reverse strand were reverse-complemented. The 200-bp joined junctions from all 20 lines were clustered with vmatch dbcluster (http://www.vmatch.de, version 2.3.0) at 98% identity and 98% mutual length coverage (command-line parameters: 98 98 -e 2 -l 98 -d). About 97% of the clusters belonged to the 1:1 type with a maximum of 1 member per line and were used for the downstream analyses. Using the above-described 200-bp joined junctions instead of full sequences reduces the amount of data for clustering to 2%, from about 10 kb to 200 bp per full-length LTR element, thus allowing a sequence clustering of 2.2 million elements in the first place. By including sequence information outside of the element, the repetitiveness of high-copy transposable element families is removed and at the same time the syntenic context is provided even for elements located on chrUn (that is, not assigned to chromosomal pseudomolecules).
PAV detection and validation
Owing to higher sensitivity in detecting deletions over insertions, a paired genome alignment strategy was used in which each assembly was aligned to reference genome Morex reciprocally by treating Morex as a query and reference using Minimap2 (v.2.17)50. From these two alignments, insertion and deletions were called using Assemblytics (v.1.2.1)26. Then, only deletions were selected in both alignments and converted into PAVs with regard to Morex. In addition, a hard filter was used to discard PAVs containing more than 5% gaps (Ns) and nested PAVs. We used a previously described method51 to map deletions longer than 5 kb in Barke relative to Morex using whole-genome shotgun data for 90 Morex × Barke recombinant inbred lines19. Mosdepth (v.0.2.9)52 was used for determining read depth in genomic intervals.
k-mer-based genome-wide association
PAVs overlapping with single copy regions were identified by BedTools (v.2.28.0)53. k-mer sequences with step size of 2 bp were retrieved from single-copy regions residing within PAVs. The abundances of the extracted k-mer sequences were counted in sequence reads using BBDuk (BBMap_37.93) (https://sourceforge.net/projects/bbmap/). k-mer counts were obtained for whole-genome shotgun data of 300 diverse varieties of barley generated in the present study and previously published genotyping-by-sequencing data9. k-mer counts were imported into R (v.3.5.1)54 and normalized for differences in read depth between samples. The normalized k-mer counts were then used for genome-wide association scans using GAPIT330 and PCA using standard R functions.
Construction of single-copy pan-genome
To identify single-copy regions in each genome, genomic regions covered by 31-mers occurring more than once were masked using BBDuk (BBMap_37.93)55. Based on masking, single-copy regions in each assembly were obtained in .bed format and subsequently related sequences were retrieved using BEDTools (v2.28.0)53. Single-copy sequences from all the assemblies were combined to perform an all-against-all blast search. The blast results were filtered (>90% identity and minimum 80% alignment length) and then clustered using the igraph package56. A representative from each cluster (the largest contained sequence) was selected and used for estimating pan-genome size. Clusters shared by all the 20 accessions are referred to as the core genome, and clusters with sequences originating from 1 to 19 genotypes are considered as the variable genome.
Hi-C library preparation, sequencing and inversion calling
In situ Hi-C libraries were prepared from one-week-old seedlings of barley IPK core50 collection9 (Supplementary Table 5) based on a previously described protocol43 Sequencing, Hi-C raw data processing and inversion calling were performed as previously described34 using the MorexV2 reference genome sequence assembly6. The breakpoint regions were identified by pairwise genome alignment using Minimap2 (v.2.17)50 and PipMaker (http://pipmaker.bx.psu.edu/cgi-bin/pipmaker?basic)57.
Resequencing, SNP calling and PCA
Raw reads (Supplementary Table 4) were trimmed with cutadapt (v.1.15) and aligned to the MorexV2 genome assembly6 using Minimap2 (v.2.17)50. The alignments were sorted using Novosort (V3.06.05) (http://www.novocraft.com). BCFtools (v.1.8)58 was used to call SNPs and short insertions and deletions (indels). The resulting VCF file was converted into Genomic Data Structure (GDS) format using SeqArray package59 in R to obtain a SNP matrix. Finally, hard filtering was applied to remove SNPs having more than 10% missing data and heterozygosity. Previously generated genotyping-by-sequencing data9 were aligned to the MorexV2 reference and identified SNPs using a previously described variant calling pipeline9. PCAs were performed using snpgdsPCA() function of the package SNPrelate60.
RGT Planet × Hindmarsh mapping population
A cross was made between RGT Planet (maternal plant) and Hindmarsh (pollen donor). In total, 38 F2 plants from the direct cross and 233 individual heads from F3 seeds were progressed to the F6 generation by single seed descent method. The F6 recombinant inbred lines (RIL) (224 in total) were used for construction of a genetic linkage map. Genomic DNA was extracted from the leaves of a single plant per RIL using the cetyl-trimethyl-ammonium bromide method. DNA quality was assessed on 1% agarose gels and quantified using a NanoDrop spectrophotometer (Thermo Scientific NanoDrop Products). DNA was diluted into 50 ng/μl and placed in a 96-well plate for PCR. DArT-seq genotyping-by-sequencing was performed using the DArT-seq platform (DArT PL) according to the manufacturer’s protocol (https://www.diversityarrays.com/). In brief, 100 μl of 50 ng μl−1 DNA was sent to DArT PL, and genotyping-by-sequencing was performed using complexity reduction followed by sequencing on a HiSeq Illumina platform as previously described61 (Supplementary Table 9). Sequences flanking polymorphisms detected by DArT-seq were aligned against the MorexV2 genome assembly to determine their physical positions (Supplementary Table 7).
Field experiments and phenotypic data
Field experiments were conducted at six sites: Gibson, Western Australia (WA, −33.612176, 121.798438); Williams, Western Australia (−33.577668, 116.734934); Wongan Hills, Western Australia (−30.848953, 116.756461); Merredin, Western Australia (−31.487009, 118.229668); South Perth, Western Australia (−31.991186, 115.887944); and Shepperton, Victoria (−36.487551, 145.388470). The distance between South Perth and Shepperton is over 3,300 km. The Merredin site is located inland and receives little rainfall, whereas the Gibson site receives a high amount of rainfall: the other sites are in between. The experimental design for field trial sites was performed as previously described62. In brief, all regional field trials (partially replicated design) were planted in a randomized complete block design using plots of 1 by 5 m2, laid out in a row–column format and the middle 3 m was harvested for grain yield. Field trials in South Perth and Shepperton were conducted using double rows with a 40-cm distance within and between rows, owing to space constraints. Seven control varieties were used for spatial adjustment of the experimental data. Measurements were taken at each plot of each field experiment in the study to determine flowering time (days to Zadoks stage (ZS)49), plant height and grain yield. In brief, heading date was recorded as the number of days from sowing to 50% awn emergence above the flag leaf (ZS49), as a proxy for flowering time. Plant height was determined by estimating the average height from the base to the tip of the head of all plants in each plot. Grain yield (kg ha−1) was determined by destructively harvesting all plant material from each plot to separate the grain, and then determining grain mass. Grain yield data of the field experiments, as well as plant height and heading data, were analysed using linear mixed models in ASReml-R (https://www.vsni.co.uk/software/asreml-r/) to determine best linear unbiased predictions or best linear unbiased estimations for each trait for further analysis. Local best practices for fertilization and disease control were adopted for each trial site.
Quantitative trait loci (QTL) mapping
Software MapQTL6 was used for the QTL analysis63. The genotypic data, phenotypic data and genetic map were formatted and imported to MapQTL6. Interval mapping was conducted for each trait, and then the markers with a logarithm of odds (LOD) value of above 3.0 were selected as cofactors. Multiple QTL model mapping was performed to re-calculate the QTL. If the markers with the highest LOD value were inconsistent with the cofactor markers, then the new markers were selected as cofactors and re-calculated. The QTL results and charts were exported from the software.
Long-read sequence assembly of the Morex cultivar
PacBio libraries were constructed using SMRTbell Template Prep Kit 1.0 and sized on a SAGE Blue Pippin instrument 20–80 kb. Sequencing was performed on Sequell II device at the HudsonAlpha Institute using V.1.0 chemistry and 10-h movie time. Data were generated from a total of five SMRT cells, yielding 604 Gb of raw sequence reads. A total of 520.72 Gb of this set (104.15×) was used for assembly (Supplementary Tables 11, 12). Previously published Illumina short-read data (ERR3183748 and ERR31837496 (Supplementary Table 12)) was used for polishing and error correction. Before use, Illumina fragment reads were screened for phix contamination. Reads composed of >95% simple sequences were removed. Illumina reads shorter than 50 bp after trimming for adaptor and quality (q < 20) were removed. The final read set consists of 605,178,701 reads, representing a total of 43.17× of high-quality Illumina bases. The initial assembly was generated by assembling 32,743,478 PacBio reads (104.15× sequence coverage) using MECAT (v.1.1)64 and subsequently polished using Arrow65. This produced an initial assembly of 1,577 scaffolds (1,577 contigs), with a contig N50 of 10.4 Mb, 987 scaffolds larger than 100 kb and a total genome size of 4,139.8 Mb (Supplementary Table 13).
A first round of breaking chimeric scaffolds was done using the POPSEQ genetic map19 to identify contigs bearing markers from distant genomic regions. A total of 17 misjoins were identified and resolved. Homozygous SNPs and indels were corrected in the release consensus sequence using a subset of about 30× of the Illumina reads described above in this section. Reads were aligned using BWA-MEM66. Homozygous SNPs and indels were discovered with GATK’s UnifiedGenotyper tool67. A total of 59 homozygous SNPs and 15,759 homozygous indels were corrected. After these correction steps, the assembly contains 4,139.7 Mb of sequence, consisting of 1,594 contigs with a contig N50 of 10.2 Mb. A second round of chimaera breaking by inspecting Hi-C contact matrices as described in the TRITEX pipeline6. Published Hi-C data of the Morex cultivar was used5. Corrected contigs were arranged into pseudomolecules using TRITEX.
Comparison of PacBio continuous long read (CLR) and TRITEX assemblies of the Morex cultivar
Full-length cDNA sequences44 were aligned to the assemblies to assess gene space completeness. Only alignments with query coverage ≥90% and identity ≥90% were considered. Whole-genome assemblies were done with Minimap2. Structural variant calling with Assemblytics (v.1.2.1)26 (Morex TRITEX versus Morex CLR; Morex CLR versus Barke) and extraction of single-copy regions were done as described in ‘PAV detection and validation’.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this paper.
Online content
Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41586-020-2947-8.
Supplementary information
Acknowledgements
We thank M. Knauft, I. Walde and S. König for technical assistance; D. Schüler for sequence data management; J. Bauernfeind, T. Münch and H. Miehe for IT administration; D. Arend for help with data submission; M. Bayer for advice on transcriptome analysis; and M. Herz for pedigree information. This research was supported by grants from the German Federal Ministry of Education and Research to N.S., M.M., U.S., M.S. and K.F.X.M. (SHAPE, FKZ 031B0190), to U.S. and K.F.X.M. (de.NBI, FKZ 031A536) and to N.S. (COBRA, FKZ 031A323A); the Australian Grain Research and Development Cooperation (9176507) to C.L., K.C., P.L. and P.W.; JST CREST Japan (no. JPMJCR16O4 to K.M. and T.H.); JST Mirai Program Japan (no. 18076896 to K.S.); the National Key R&D Program of China (2018YFD1000701 and 2018YFD1000700) to D.X. and J.Z.; by funding from the China Agriculture Research System (CARS-05) and the Agricultural Science and Technology Innovation Program to C.W. and G.G. Support for 10X sequencing was provided by a research grant from Genome Canada and Genome Prairie to C.P. and J.E.; and by the Natural Science Foundation of China (31620103912) and the National Key R&D Program of China (2018YFD1000706) to G.Z. We acknowledge support from the European Research Council (ERC Shuffle, project identifier: 66918) to R.W.
Extended data figures and tables
Author contributions
N.S. and M.M. designed the study. N.S. coordinated experiments and sequencing. M.M. supervised sequence assembly. M. Spannagl and K.F.X.M. supervised annotation. U.S. supervised data management and submission. S.P., A.H., J.E., D.X., L.B.B. and J.G. performed sequencing experiments. M.J., C.M., Y.G., C.P., J.J. and J.S. performed sequence assembly. M.J. performed structural variation and genome-wide association scan analysis. A.F. submitted sequence data. G.H., T.L., H.G., V.S.B., N.K. and D.L. annotated and analysed genes and transposable elements. S.P., M.J., X.-Q.Z., T.T.A., G. Zhou, C.T., C.H., P.W., M.M. and C.L. analysed polymorphic inversions. H.B., J.G., J.S., J.Z., C.W., G.G., G. Zhang, K.M., T.H., K.S., K.J.C., P.L., C.J.P., C.L., M. Schreiber, R.W. and N.S. contributed sequence data. M.J., S.P., C.L. and M.M. wrote the paper with input from all co-authors.
Data availability
All raw sequence data collected in this study and sequence assemblies have been deposited at the European Nucleotide Archive (ENA). Accession codes for raw data and assemblies are listed in Supplementary Tables: Supplementary Table 14 (assemblies), Supplementary Table 10 (assembly raw data), Supplementary Table 4 (whole-genome shotgun sequencing), Supplementary Table 5 (Hi-C) and Supplementary Table 9 (DArT-seq). Assemblies, annotations and analysis results were deposited under a DOI in the PGP repository68 using the e!DAL submission system69 and are accessible under the URL 10.5447/ipk/2020/24. Assemblies and gene annotations can also be downloaded from https://barley-pangenome.ipk-gatersleben.de. The Barley Pedigree Catalogue is available at http://genbank.vurv.cz/barley/pedigree/.
Code availability
Source code is released in a public Bitbucket repository, at https://bitbucket.org/ipk_dg_public/barley_pangenome/.
Competing interests
The authors declare no competing interests.
Footnotes
Peer review information Nature thanks Victor Albert, Scott Jackson and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Murukarthick Jayakodi, Sudharsan Padmarasu
Contributor Information
Chengdao Li, Email: C.Li@murdoch.edu.au.
Martin Mascher, Email: mascher@ipk-gatersleben.de.
Nils Stein, Email: stein@ipk-gatersleben.de.
Extended data
is available for this paper at 10.1038/s41586-020-2947-8.
Supplementary information
is available for this paper at 10.1038/s41586-020-2947-8.
References
- 1.Bayer PE, Golicz AA, Scheben A, Batley J, Edwards D. Plant pan-genomes are the new reference. Nat. Plants. 2020;6:914–920. doi: 10.1038/s41477-020-0733-0. [DOI] [PubMed] [Google Scholar]
- 2.Dawson IK, et al. Barley: a translational model for adaptation to climate change. New Phytol. 2015;206:913–931. doi: 10.1111/nph.13266. [DOI] [PubMed] [Google Scholar]
- 3.Stein, N. & Muehlbauer, G. J. The Barley Genome (Springer, 2018).
- 4.International Barley Genome Sequencing Consortium A physical, genetic and functional sequence assembly of the barley genome. Nature. 2012;491:711–716. doi: 10.1038/nature11543. [DOI] [PubMed] [Google Scholar]
- 5.Mascher M, et al. A chromosome conformation capture ordered sequence of the barley genome. Nature. 2017;544:427–433. doi: 10.1038/nature22043. [DOI] [PubMed] [Google Scholar]
- 6.Monat C, et al. TRITEX: chromosome-scale sequence assembly of Triticeae genomes with open-source tools. Genome Biol. 2019;20:284. doi: 10.1186/s13059-019-1899-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Mascher M, et al. Mapping-by-sequencing accelerates forward genetics in barley. Genome Biol. 2014;15:R78. doi: 10.1186/gb-2014-15-6-r78. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Russell J, et al. Exome sequencing of geographically diverse barley landraces and wild relatives gives insights into environmental adaptation. Nat. Genet. 2016;48:1024–1030. doi: 10.1038/ng.3612. [DOI] [PubMed] [Google Scholar]
- 9.Milner SG, et al. Genebank genomics highlights the diversity of a global barley collection. Nat. Genet. 2019;51:319–326. doi: 10.1038/s41588-018-0266-x. [DOI] [PubMed] [Google Scholar]
- 10.Muñoz-Amatriaín M, et al. Distribution, functional impact, and origin mechanisms of copy number variation in the barley genome. Genome Biol. 2013;14:R58. doi: 10.1186/gb-2013-14-6-r58. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Taketa S, et al. Barley grain with adhering hulls is controlled by an ERF family transcription factor gene regulating a lipid biosynthesis pathway. Proc. Natl Acad. Sci. USA. 2008;105:4062–4067. doi: 10.1073/pnas.0711034105. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Tettelin H, et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome”. Proc. Natl Acad. Sci. USA. 2005;102:13950–13955. doi: 10.1073/pnas.0506758102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Ho SS, Urban AE, Mills RE. Structural variation in the sequencing era. Nat. Rev. Genet. 2019;21:171–189. doi: 10.1038/s41576-019-0180-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Danilevicz MF, Tay Fernandez CG, Marsh JI, Bayer PE, Edwards D. Plant pangenomics: approaches, applications and advancements. Curr. Opin. Plant Biol. 2020;54:18–25. doi: 10.1016/j.pbi.2019.12.005. [DOI] [PubMed] [Google Scholar]
- 15.Monat C, Schreiber M, Stein N, Mascher M. Prospects of pan-genomics in barley. Theor. Appl. Genet. 2019;132:785–796. doi: 10.1007/s00122-018-3234-z. [DOI] [PubMed] [Google Scholar]
- 16.Coronado M-J, Hensel G, Broeders S, Otto I, Kumlehn J. Immature pollen-derived doubled haploid formation in barley cv. Golden Promise as a tool for transgene recombination. Acta Physiol. Plant. 2005;27:591–599. [Google Scholar]
- 17.Schreiber M, et al. A genome assembly of the barley ‘transformation reference’ cultivar Golden Promise. G3. 2020;10:1823–1827. doi: 10.1534/g3.119.401010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Gottwald S, Bauer P, Komatsuda T, Lundqvist U, Stein N. TILLING in the two-rowed barley cultivar ‘Barke’ reveals preferred sites of functional diversity in the gene HvHox1. BMC Res. Notes. 2009;2:258. doi: 10.1186/1756-0500-2-258. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Mascher M, et al. Anchoring and ordering NGS contig assemblies by population sequencing (POPSEQ) Plant J. 2013;76:718–727. doi: 10.1111/tpj.12319. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Hübner S, et al. Strong correlation of wild barley (Hordeum spontaneum) population structure with temperature and precipitation variation. Mol. Ecol. 2009;18:1523–1536. doi: 10.1111/j.1365-294X.2009.04106.x. [DOI] [PubMed] [Google Scholar]
- 21.Chikhi R, Limasset A, Medvedev P. Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics. 2016;32:i201–i208. doi: 10.1093/bioinformatics/btw279. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Luo R, et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience. 2012;1:18. doi: 10.1186/2047-217X-1-18. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Clavijo BJ, et al. An improved assembly and annotation of the allohexaploid wheat genome identifies complete families of agronomic genes and provides genomic evidence for chromosomal translocations. Genome Res. 2017;27:885–896. doi: 10.1101/gr.217117.116. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Anderson SN, et al. Transposable elements contribute to dynamic genome content in maize. Plant J. 2019;100:1052–1065. doi: 10.1111/tpj.14489. [DOI] [PubMed] [Google Scholar]
- 25.Brunner S, Fengler K, Morgante M, Tingey S, Rafalski A. Evolution of DNA sequence nonhomologies among maize inbreds. Plant Cell. 2005;17:343–360. doi: 10.1105/tpc.104.025627. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Nattestad M, Schatz MC. Assemblytics: a web analytics tool for the detection of variants from an assembly. Bioinformatics. 2016;32:3021–3023. doi: 10.1093/bioinformatics/btw369. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Gordon SP, et al. Extensive gene content variation in the Brachypodium distachyon pan-genome correlates with population structure. Nat. Commun. 2017;8:2184. doi: 10.1038/s41467-017-02292-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Yu S, et al. A single nucleotide polymorphism of Nud converts the caryopsis type of barley (Hordeum vulgare L.) Plant Mol. Biol. Report. 2016;34:242–248. [Google Scholar]
- 29.Arora S, et al. Resistance gene cloning from a wild crop relative by sequence capture and association genetics. Nat. Biotechnol. 2019;37:139–143. doi: 10.1038/s41587-018-0007-9. [DOI] [PubMed] [Google Scholar]
- 30.Lipka AE, et al. GAPIT: genome association and prediction integrated tool. Bioinformatics. 2012;28:2397–2399. doi: 10.1093/bioinformatics/bts444. [DOI] [PubMed] [Google Scholar]
- 31.Yu J, et al. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat. Genet. 2006;38:203–208. doi: 10.1038/ng1702. [DOI] [PubMed] [Google Scholar]
- 32.Ekberg I. Cytogenetic studies of three paracentric inversions in barley. Hereditas. 1974;76:1–30. doi: 10.1111/j.1601-5223.1974.tb01172.x. [DOI] [PubMed] [Google Scholar]
- 33.Ramage R, Suneson C. Translocation-gene linkages on barley chromosome 7. Crop Sci. 1961;1:319–320. [Google Scholar]
- 34.Himmelbach A, et al. Discovery of multi-megabase polymorphic inversions by chromosome conformation capture sequencing in large-genome plant species. Plant J. 2018;96:1309–1316. doi: 10.1111/tpj.14109. [DOI] [PubMed] [Google Scholar]
- 35.Ederveen A, Lai Y, van Driel MA, Gerats T, Peters JL. Modulating crossover positioning by introducing large structural changes in chromosomes. BMC Genomics. 2015;16:89. doi: 10.1186/s12864-015-1276-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Bouma, J. & Ohnoutka, Z. Importance and Application of the Mutant ‘Diamant’ in Spring Barley Breeding (IAEA, 1991).
- 37.Comadran J, et al. Natural variation in a homolog of Antirrhinum CENTRORADIALIS contributed to spring growth habit and environmental adaptation in cultivated barley. Nat. Genet. 2012;44:1388–1392. doi: 10.1038/ng.2447. [DOI] [PubMed] [Google Scholar]
- 38.Bustos-Korts D, et al. Exome sequences and multi-environment field trials elucidate the genetic basis of adaptation in barley. Plant J. 2019;99:1172–1191. doi: 10.1111/tpj.14414. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Mascher M, et al. Genebank genomics bridges the gap between the conservation of crop diversity and plant breeding. Nat. Genet. 2019;51:1076–1081. doi: 10.1038/s41588-019-0443-6. [DOI] [PubMed] [Google Scholar]
- 40.Khan AW, et al. Super-pangenome by integrating the wild side of a species for accelerated crop improvement. Trends Plant Sci. 2020;25:148–158. doi: 10.1016/j.tplants.2019.10.012. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Dvorak J, McGuire PE, Cassidy B. Apparent sources of the A genomes of wheats inferred from polymorphism in abundance and restriction fragment length of repeated nucleotide sequences. Genome. 1988;30:680–689. [Google Scholar]
- 42.Himmelbach A, Walde I, Mascher M, Stein N. Tethered chromosome conformation capture sequencing in Triticeae: a valuable tool for genome assembly. Bio Protoc. 2018;8:e2955. doi: 10.21769/BioProtoc.2955. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Padmarasu, S., Himmelbach, A., Mascher, M. & Stein, N. in Plant Long Non-Coding RNAs (eds Chekanova, J. & Wang, H.-L.) 441–472 (Springer, 2019).
- 44.Matsumoto T, et al. Comprehensive sequence analysis of 24,783 barley full-length cDNAs derived from 12 clone libraries. Plant Physiol. 2011;156:20–28. doi: 10.1104/pp.110.171579. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Kent WJ. BLAT—the BLAST-like alignment tool. Genome Res. 2002;12:656–664. doi: 10.1101/gr.229202. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Slater GSC, Birney E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics. 2005;6:31. doi: 10.1186/1471-2105-6-31. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Emms DM, Kelly S. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol. 2015;16:157. doi: 10.1186/s13059-015-0721-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Spannagl M, et al. PGSB PlantsDB: updates to the database framework for comparative plant genome research. Nucleic Acids Res. 2016;44:D1141–D1147. doi: 10.1093/nar/gkv1130. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Ellinghaus D, Kurtz S, Willhoeft U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics. 2008;9:18. doi: 10.1186/1471-2105-9-18. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34:3094–3100. doi: 10.1093/bioinformatics/bty191. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Gutierrez-Gonzalez JJ, Mascher M, Poland J, Muehlbauer GJ. Dense genotyping-by-sequencing linkage maps of two synthetic W7984×Opata reference populations provide insights into wheat structural diversity. Sci. Rep. 2019;9:1793. doi: 10.1038/s41598-018-38111-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Pedersen BS, Quinlan AR. Mosdepth: quick coverage calculation for genomes and exomes. Bioinformatics. 2018;34:867–868. doi: 10.1093/bioinformatics/btx699. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–842. doi: 10.1093/bioinformatics/btq033. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.R Core Team. R: A Language and Environment for Statistical Computing http://www.R-project.org (R Foundation for Statistical Computing, 2013).
- 55.Bushnell, B. BBMap: A Fast, Accurate, Splice-aware Aligner (Lawrence Berkeley National Laboratory, 2014).
- 56.Csardi G, Nepusz T. The igraph software package for complex network research. InterJournal Complex Syst. 2006;1695:1–9. [Google Scholar]
- 57.Schwartz S, et al. PipMaker—a web server for aligning two genomic DNA sequences. Genome Res. 2000;10:577–586. doi: 10.1101/gr.10.4.577. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58.Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011;27:2987–2993. doi: 10.1093/bioinformatics/btr509. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59.Zheng, X. & Gogarten, S. SeqArray: big data management of genome-wide sequence variants. R package version 1.10.6 https://github.com/zhengxwen/SeqArray (accessed January 2017).
- 60.Zheng X, et al. A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics. 2012;28:3326–3328. doi: 10.1093/bioinformatics/bts606. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61.Akbari M, et al. Diversity arrays technology (DArT) for high-throughput profiling of the hexaploid wheat genome. Theor. Appl. Genet. 2006;113:1409–1420. doi: 10.1007/s00122-006-0365-4. [DOI] [PubMed] [Google Scholar]
- 62.Hill CB, et al. Hybridisation-based target enrichment of phenology genes to dissect the genetic basis of yield and adaptation in barley. Plant Biotechnol. J. 2019;17:932–944. doi: 10.1111/pbi.13029. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63.Van Ooijen, J. MapQTL 5, Software for the Mapping of Quantitative Trait Loci in Experimental Populations (Kyazma, 2004).
- 64.Xiao CL, et al. MECAT: fast mapping, error correction, and de novo assembly for single-molecule sequencing reads. Nat. Methods. 2017;14:1072–1074. doi: 10.1038/nmeth.4432. [DOI] [PubMed] [Google Scholar]
- 65.Chin CS, et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat. Methods. 2013;10:563–569. doi: 10.1038/nmeth.2474. [DOI] [PubMed] [Google Scholar]
- 66.Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://arxiv.org/abs/1303.3997 (2013).
- 67.McKenna A, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–1303. doi: 10.1101/gr.107524.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 68.Arend D, et al. PGP repository: a plant phenomics and genomics data publication infrastructure. Database (Oxford) 2016;2016:baw033. doi: 10.1093/database/baw033. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 69.Arend D, et al. e!DAL—a framework to store, share and publish research data. BMC Bioinformatics. 2014;15:214. doi: 10.1186/1471-2105-15-214. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
All raw sequence data collected in this study and sequence assemblies have been deposited at the European Nucleotide Archive (ENA). Accession codes for raw data and assemblies are listed in Supplementary Tables: Supplementary Table 14 (assemblies), Supplementary Table 10 (assembly raw data), Supplementary Table 4 (whole-genome shotgun sequencing), Supplementary Table 5 (Hi-C) and Supplementary Table 9 (DArT-seq). Assemblies, annotations and analysis results were deposited under a DOI in the PGP repository68 using the e!DAL submission system69 and are accessible under the URL 10.5447/ipk/2020/24. Assemblies and gene annotations can also be downloaded from https://barley-pangenome.ipk-gatersleben.de. The Barley Pedigree Catalogue is available at http://genbank.vurv.cz/barley/pedigree/.
Source code is released in a public Bitbucket repository, at https://bitbucket.org/ipk_dg_public/barley_pangenome/.