Abstract
A pangenome captures the genomic diversity for a species, derived from a collection of genetic sequences of diverse populations. Advances in sequencing technologies have given rise to three primary methods for pangenome construction and analysis: de novo assembly and comparison, reference genome-based iterative assembly, and graph-based pangenome construction. Each method presents advantages and challenges in processing varying amounts and structures of DNA sequencing data. With the emergence of high-quality genome assemblies and advanced bioinformatic tools, the graph-based pangenome is emerging as an advanced reference for exploring the biological and functional implications of genetic variations.
Keywords: pangenome, plant, genome variation, bioinformatic pipeline, data analysis and visualization
Significance
At present, there are three primary methods for constructing and analyzing plant pangenomes: de novo assembly and comparison, reference genome-based iterative assembly, and graph-based pangenome construction, each applied with some variation. A thorough understanding of the potential and limitations of these methods is important for choosing the appropriate method and associated parameters.
Introduction
By comparing multiple genome assemblies for the same species, researchers realized that one individual's genome cannot capture the genetic diversity of a species due to significant sequence presence/absence variation (PAV) within populations (Golicz, Batley, and Edwards 2016; Bayer et al. 2020; Jia et al. 2023). As a result, there is growing interest in pangenome studies that analyze a collection of genomic sequences from a species or clade (Tettelin et al. 2005), encompassing core genes present in all individuals alongside dispensable genes absent from at least one. However, the distinction between “core” and “dispensable” genes is sensitive to the quality of genome assembly and annotation, and errors can result in misclassification. A practical gene nomenclature therefore mitigates the impact of such errors by dividing dispensable genes into three groups: “soft-core” genes present in the majority but not all genomes, “shell” genes that are less widespread than soft-core genes but found in more than just a handful of genomes, and “cloud” genes limited to a few individuals. This nomenclature has been applied in many studies, such as the Brachypodium distachyon pangenome (Gordon et al. 2017), which identified soft-core genes in 95% to 98% of the accessions, shell genes in 5% to 94% of the accessions, and cloud genes in 2% to 5% of the accessions.
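The occupancy-based nomenclature described above can be illustrated with a minimal Python sketch. The cut-offs used here (core = present in every accession, soft-core at 95% or more, shell at 5% or more, cloud otherwise) are a simplification chosen for illustration and are loosely modeled on the ranges quoted above; the function name, parameter names, and toy PAV matrix are hypothetical, and individual studies set their own thresholds.

```python
from typing import Dict, List


def classify_genes(
    pav: Dict[str, List[int]],
    soft_core_min: float = 0.95,  # assumed cut-off, not a fixed standard
    shell_min: float = 0.05,      # assumed cut-off, not a fixed standard
) -> Dict[str, str]:
    """Label each gene by the fraction of accessions that carry it.

    `pav` maps a gene ID to a list of 0/1 calls, one per accession.
    """
    categories: Dict[str, str] = {}
    for gene, calls in pav.items():
        present = sum(calls)
        occupancy = present / len(calls)
        if present == len(calls):
            categories[gene] = "core"
        elif occupancy >= soft_core_min:
            categories[gene] = "soft-core"
        elif occupancy >= shell_min:
            categories[gene] = "shell"
        else:
            categories[gene] = "cloud"
    return categories


# Toy presence/absence matrix for 40 hypothetical accessions.
toy_pav = {
    "geneA": [1] * 40,             # present in all accessions
    "geneB": [1] * 39 + [0],       # present in 97.5% of accessions
    "geneC": [1] * 20 + [0] * 20,  # present in 50% of accessions
    "geneD": [1] + [0] * 39,       # present in 2.5% of accessions
}
result = classify_genes(toy_pav)
for gene, label in result.items():
    print(gene, label)
```

Keeping the thresholds as parameters reflects the point made above: the boundaries between categories are a practical convention that depends on assembly and annotation quality, not a biological constant.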
In the study of plant genomes, which are notably rich in transposable elements, extensive structural variations have been identified. These variations encompass PAVs, copy number variations, and chromosomal rearrangements, including inversions and translocations. Therefore, the pangenome concept extends beyond the gene-centric view to encompass the full expanse of the genomic landscape, including intergenic spaces and repetitive sequences, contributing to the genomic architecture by influencing gene expression and genetic diversity (Fu and Dooner 2002). Research into pangenomes has underscored the critical role of these structural variations in influencing plant species’ environmental adaptability, evolution, and breeding potential by affecting genes associated with abiotic and biotic stress resistance, domestication, and agronomic traits (Golicz, Bayer, et al. 2016; Bayer, Valliyodan, et al. 2022; Hu et al. 2022; Derbyshire et al. 2023).
Recent advances in sequencing methods and bioinformatics pipelines have revolutionized pangenome studies, allowing for the cost-effective construction of pangenomes and in-depth analysis of genetic variation in organisms. This progress has positioned the graph-based pangenome as an increasingly viable alternative to traditional reference genomes. In addition, advances in artificial intelligence, such as machine learning methods that can discern intricate patterns and relationships within data, have improved the accuracy of structural variation identification and trait association analysis.
The Advancement of Sequencing Technologies Driving the Evolution of Plant Pangenome Research
The methods for pangenome analysis and construction have evolved alongside advances in DNA sequencing technologies and the reduction in sequencing costs. In the early 21st century, due to the high cost of DNA sequencing and the requirement for extensive computational resources, pangenome studies were initially limited to a small number of individuals within a species. The first milestone in pangenome research was achieved with the bacterial species Streptococcus agalactiae in 2005 (Tettelin et al. 2005), by de novo assembling eight distinct strains. This was followed by the first plant pangenome study on wild soybean (Glycine soja) in 2014, demonstrating gene PAVs associated with important agronomic traits including disease resistance, flowering time, and seed composition (Li et al. 2014). In the same year, a pangenome study of three cultivated rice accessions was reported, highlighting that the presence or absence of LRK genes correlated with enhanced yield and that the deletion of Pup1 was linked to phosphorus deficiency (Schatz et al. 2014). From the mid to late 2010s, the decline in sequencing costs facilitated broader sequencing of populations, enabling deeper exploration of genetic diversity at the population level, including the ability to predict the scale of gene PAV in a species. Several of these studies, encompassing many accessions, employed the reference genome-based method, either the iterative mapping and assembly approach or the “map-to-pan” method. For example, Golicz, Bayer, et al. (2016) developed and applied the iterative mapping and assembly approach to construct the pangenome of Brassica oleracea. Their analyses, along with the first plant pangenome browser, revealed that nearly 20% of genes, predominantly associated with disease resistance and flowering time, were influenced by PAVs. A pangenome study involving thousands of rice accessions used the “map-to-pan” method (Wang et al. 2018) to identify 268 Mb of new sequences and 12,465 new genes in Asian cultivated rice, capturing a broad spectrum of genetic diversity within the rice gene pool.
In addition, the pangenome study of B. distachyon, which included the development of the BrachyPan browser to explore the diversity among 54 accessions, revealed a pangenome with nearly twice the genes of any single genome, with core genes associated with essential functions and dispensable genes linked to conditionally beneficial traits and rapid evolution (Gordon et al. 2017). Complementing this study, a pangenomic study on the allotetraploid Brachypodium hybridum illustrates its derivation from two distinct polyploidization events among its diploid ancestors (Gordon et al. 2020), which underscores that the majority of gene variation predates the formation of the polyploid and suggests a foundational role of existing genetic diversity in polyploid evolution. The recent advances in long-read DNA sequencing, such as PacBio HiFi and Oxford Nanopore ultralong sequencing, have enabled the generation of haplotype-phased high-quality genome assemblies and even gapless telomere-to-telomere genome assemblies. These technological advances in DNA sequencing have paved the way for the construction of graph-based pangenomes that represent genetic variants as nodes and edges, and their application in trait association studies has demonstrated that they capture heritability that is missing when using a single genome reference (Edwards and Batley 2022). While still in early developmental stages, these graph-based pangenome methods have been applied to draft the human pangenome reference (Liao et al. 2023) and have been adopted for several crops, such as tomato (Zhou et al. 2022) and wheat (Bayer, Petereit, et al. 2022).
Different Plant Pangenome Construction Pipelines and Strategies
At present, there are three primary methods for the construction of plant pangenomes: de novo assembly and comparison, reference genome-based iterative assembly, and graph-based pangenome approach (Fig. 1). The de novo assembly method involves assembling individual genomes from scratch and then comparing these to identify shared and novel genomic regions. For plant species characterized by large, highly repetitive, heterozygous, and polyploid genomes, where accurate whole-genome alignment is challenging, an alternative approach is to focus on the coding potential or the pangene set of a species by comparing gene annotations of different individual genomes. The reference genome-based iterative assembly approach can be divided into the iterative mapping and assembly approach (Golicz, Bayer, et al. 2016; Hu et al. 2020), as well as the map-to-pan method (Hu et al. 2017), distinguished by the order in which assembly and mapping are initially undertaken. The iterative mapping and assembly method starts by first aligning reads to an existing reference genome. Reads that do not align are subsequently assembled, and the resulting annotated contigs are integrated into a linear format pangenome reference. Conversely, the map-to-pan method begins with the de novo assembly of individual genomes and constructs the linear format of pangenome by mapping these assembled sequences to an available reference genome. More recently, the graph-based pangenome approach has emerged as an alternative to traditional linear format plant pangenomes. This approach combines pangenome sequences with genetic variants, ensuring sequence continuity and displaying the topology of structural variations among individuals (Eizenga et al. 2020). Since each plant pangenome construction and analysis method has its own distinct features, each method exhibits advantages and drawbacks when handling different types and amounts of data (Table 1). 
Furthermore, to enhance the efficiency and usability of these methods, a range of bioinformatic tools have been developed. These tools are designed not only to streamline the pangenome construction but also to provide robust downstream analysis, enabling researchers to handle complex genomic data with greater precision.
Fig. 1.
A timeline illustrating the evolution of pangenome construction and analysis pipelines with advances in sequencing technologies, alongside key pangenomic studies (Tettelin et al. 2005; Li et al. 2014; Golicz, Bayer, et al. 2016; Gordon et al. 2017; Wang et al. 2018; Bayer, Petereit, et al. 2022; Yates et al. 2022; Zhou et al. 2022).
Table 1.
Summary and comparison of the three primary methods for pangenome construction and analysis when handling different volumes and structures of sequencing data
| Methods | Subcategory methods | Feature | Representative bioinformatic tools | Advantages | Limitations | Representative studies applying this method |
|---|---|---|---|---|---|---|
| De novo assembly and comparison | Whole-genome alignment | De novo assembly of individual genomes followed by whole-genome sequence comparisons | MUMmer/Minimap2 + SyRI/Assemblytics + BLAST | Enables precise and efficient whole-genome comparisons using high-quality genome assemblies, accurately identifying structural variations | Assembly of population-level accessions using long-read sequencing remains computationally intensive and costly | Sesame and barley (Yu et al. 2019; Jayakodi et al. 2020) |
| De novo assembly and comparison | Gene annotation comparison | De novo assembly of individual genomes with an emphasis on comparing gene annotations across genomes | GET_HOMOLOGUES-EST/GET_PANGENES | Effective in analyzing gene PAVs in plant species with large, highly repetitive, heterozygous, and polyploid genomes, where conducting whole-genome alignment poses significant challenges | May overlook important structural variations such as tandem duplications or inversions in noncoding sequences | Maize, Brachypodium distachyon (Gordon et al. 2017; Hufford et al. 2021) |
| Reference genome-based iterative assembly | Mapping and assembly | Mapping reads, then assembling the unaligned reads and merging them into the pangenome | Bowtie2 + MaSuRCA + SGSGeneLoss | Requires short-read sequencing data with a lower sequencing coverage and thus is cost-effective for a greater number of individuals | The assembly generated by short-read sequencing struggles with complex repeat regions and fails to retain positional information for genetic variants in newly assembled contigs | Soybean, bread wheat, Brassica oleracea, Brassica napus, pigeon pea, lupin, banana, sorghum, Amborella (Golicz, Bayer, et al. 2016; Montenegro et al. 2017; Zhao et al. 2020; Ruperao et al. 2021; Bayer, Valliyodan, et al. 2022; Garg et al. 2022; Hu et al. 2022; Rijzaani et al. 2022) |
| Reference genome-based iterative assembly | Assembly and mapping | De novo assembly, then mapping contigs to a reference genome to build the pangenome | EUPAN and PSVCP | Cost-effective for short-read sequencing data and effectively captures and localizes insertion positions within a linear layout using PSVCP, making it readily adaptable for single-reference-based applications | Fails to detect complex structural variations such as inversions and tandem repeat contractions and expansions and struggles to represent these diverse changes within a linear format | Rice (Wang et al. 2018; Wang et al. 2023) |
| Graph pangenomes | Genetic variation graph | Integrating genetic variations identified through whole-genome comparison into a reference genome | Vg construct | The simplest and most efficient approach to constructing a pangenome graph; can be readily adapted for linear reference–based downstream analyses | Has reference bias and can fail to capture genetic variations that exist only in nonreference genomes | Pearl millet and tomato (Zhou et al. 2022; Yan et al. 2023) |
| Graph pangenomes | Reference-based pangenome graph | A graph data model that represents multiple genomes while keeping the coordinates of the linear reference genome | Minigraph and Minigraph-Cactus | An efficient approach to compactly encode structural variants absent from the reference genome; can be readily adapted for linear reference–based downstream analyses | The selection of a different reference genome can lead to a completely different pangenome graph | Wheat and melon (Bayer, Petereit, et al. 2022; Vaughn et al. 2022) |
| Graph pangenomes | Reference-unbiased pangenome graph | A reference-free pangenome graph model, allowing any genome in the dataset to act as a reference | PGGB | Free from reference bias and allows any genome in the dataset to act as a reference | Resource-intensive and time-consuming for large plant genomes and requires multiple computational trials to fine-tune the optimal parameters | Human (Liao et al. 2023) |
The de novo assembly and comparison method represents the most commonly used approach for constructing and analyzing plant pangenomes. It also acts as an essential procedure in the construction of graph-based pangenomes. The development of accurate long-read DNA sequencing technologies, coupled with advances in bioinformatics, led to a revolutionary transformation in this field (Hu et al. 2018). Combined with a fast haplotype-resolved assembler (hifiasm) (Cheng et al. 2021), these technological breakthroughs have produced high-quality plant genome assemblies with greatly improved completeness and contiguity. A strategy using PacBio HiFi reads or Nanopore ultralong sequencing allows the assembly of telomere-to-telomere gapless plant genomes (Li et al. 2022), enabling pangenome-wide discovery of transposable elements and structural variations with tools such as EDTA (Ou et al. 2019) and SyRI (Goel et al. 2019). However, the de novo assembly of population-level accessions using long-read sequencing remains a resource-intensive task, demanding substantial high-performance computing capabilities and expensive sequencing. Conversely, focusing on gene-centric comparisons provides a more feasible alternative for pangenome analysis, proving to be especially effective at the genus level or for broad-species studies. The pangene set can be obtained through two pipelines: GET_HOMOLOGUES-EST (Contreras-Moreira et al. 2017), which requires 95% nucleotide sequence similarity for core gene sets, and GET_PANGENES (Contreras-Moreira et al. 2023), which performs pairwise whole-genome alignments. This is followed by minimizing gene annotation errors through the comparison of gene model overlaps across individuals. Furthermore, the development of the PANGENE software (Li et al. 2024) has enabled the generation and visualization of pangenes in a graph-based format by comparing protein sets across multiple genomes.
The gene annotation comparison effectively simplifies the study of PAV by reducing genome complexity, although it may overlook important structural variations such as tandem duplications or inversions in noncoding sequences.
Compared with the de novo assembly and comparison method, the reference genome-based iterative assembly method relies on an available reference genome and short-read sequencing data with a relatively low sequencing coverage (>10×) or a stratified approach using deep, medium, and low sequencing coverage, and is thus cost-effective for pangenome analysis of breeding populations (Golicz, Bayer, et al. 2016; Wang et al. 2021; Bayer, Valliyodan, et al. 2022). However, short-read sequencing has limitations in accurately resolving large structural variations such as inversions and tandem duplications, because short reads lack the contextual information needed to map unambiguously, differentiate similar sequences, and identify breakpoints precisely. The iterative mapping and assembly approach involves mapping reads with the Bowtie2 alignment tool (Langmead and Salzberg 2012), configured with the parameters --end-to-end --sensitive -I 0 -X 1000, followed by pooling of unmapped reads and assembly using a metagenome-aware assembler such as MaSuRCA (Zimin et al. 2013). A metagenome-aware assembler is required because gene presence or absence varies among the pooled individuals. After annotation of the newly assembled contigs, all reads are remapped to the pangenome and PAVs are called using the SGSGeneLoss tool (Golicz et al. 2015). The “map-to-pan” method is an alternative reference genome-based approach for constructing pangenomes, prominently using the Eukaryotic Pangenome Analysis Toolkit (EUPAN) (Hu et al. 2017). Employing an “assembly first, then map” strategy, this method typically yields a pangenome comparable in quality and detail to that constructed by the iterative mapping and assembly approach. It is important to note that both methods construct the pangenome by incorporating the unaligned PAV sequences into the reference genome, and it is not always possible to identify the specific locations of these PAVs within the genome.
To overcome this limitation, a novel pangenome construction and downstream analysis pipeline (PSVCP) was developed (Wang et al. 2023). PSVCP captures the position information of genetic variants while maintaining a linearized pangenome layout, making it suitable for reference-based downstream applications, including genome-wide association studies and visualization with the genome browser GBrowse (Donlin 2007).
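The final step of the iterative mapping and assembly approach, calling gene PAVs after reads are remapped to the pangenome, can be illustrated with a simple coverage-based rule. The Python sketch below is not the SGSGeneLoss algorithm; it only shows the general idea of declaring a gene absent when too little of its exon sequence is covered by reads. The function names, the per-base depth input, and both thresholds are assumptions made for illustration.

```python
from typing import Dict, List


def horizontal_coverage(depths: List[int], min_depth: int = 2) -> float:
    """Fraction of exon positions covered by at least `min_depth` reads."""
    if not depths:
        return 0.0
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)


def call_pav(
    gene_depths: Dict[str, List[int]],
    min_depth: int = 2,          # assumed threshold
    absent_below: float = 0.05,  # assumed threshold
) -> Dict[str, bool]:
    """Return True (present) or False (absent) for each gene in one accession.

    `gene_depths` maps a gene ID to per-base read depths over its exons.
    """
    return {
        gene: horizontal_coverage(depths, min_depth) >= absent_below
        for gene, depths in gene_depths.items()
    }


# Toy per-base depths for two hypothetical genes in one accession.
toy_depths = {
    "gene_present": [8, 10, 12, 9] * 25,  # well covered along its length
    "gene_absent": [0] * 98 + [1, 3],     # only stray reads map here
}
calls = call_pav(toy_depths)
print(calls)
```

Because short reads can mismap into repeats, a rule based on the fraction of covered positions is less sensitive to a few stray reads than a rule based on mean depth alone; the appropriate thresholds depend on sequencing depth and should be tuned per dataset.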
While the PSVCP is robust at identifying PAVs and generates a linear format pangenome, it falls short in detecting and illustrating more complex structural variations, such as inversions and translocations that are often associated with plant traits (Hu et al. 2024). More advanced graph-based pangenome pipelines are capable of storing and displaying all genetic variants in a graph format, facilitating the depiction of duplications as repeated nodes or paths for comprehensive alignments, with their efficacy in variant detection depending on graph complexity and the alignment algorithms utilized. Various read alignment tools and algorithms have been specifically devised for handling graph pangenomes in the Graphical Fragment Assembly format. For instance, Minigraph (Li et al. 2020) is robust in mapping sequences directly onto the pangenome graph, while Vg Giraffe (Sirén et al. 2021) is noted for its effectiveness in aligning short-read sequencing data to the pangenome graph. Additionally, tools like GraphChainer (Ma et al. 2023) and GraphAligner (Rautiainen and Marschall 2020) offer advantages in aligning long-read data, demonstrating the tailored functionalities of these algorithms in addressing different aspects of pangenome analysis. Currently, three approaches are available for generating different types of pangenome graphs, including the genetic variation graph, reference-based pangenome graph, and reference-unbiased pangenome graph. Among these, the genetic variation graph represents the simplest approach for constructing a pangenome graph. It employs the Vg construct toolkit (Garrison et al. 2018) that integrates genetic variations identified through whole-genome comparison into an existing reference genome. Employing the Minigraph (Li et al. 2020) or Minigraph-Cactus toolkits (Hickey et al. 2023) allows for the efficient construction of a reference-based pangenome graph by compactly encoding structural variants absent from the chosen reference genome. 
Owing to their ability to maintain the coordinates of the linear reference genome, both the genetic variation graph and the reference-based pangenome graph can be readily adapted for linear reference–based downstream analyses. However, the reference genome-based methods can introduce bias, potentially overlooking genetic variations exclusive to nonreference genomes. Furthermore, the selection of a different reference genome can result in the production of a different pangenome graph. The reference-unbiased pangenome graph, generated using the PGGB tool (Garrison et al. 2023), is based on all-to-all whole-genome alignments, culminating in a reference-free pangenome graph model. This approach treats all genomes equally and so ensures that the pangenome graph captures genomic diversity free from reference bias. However, the construction of a graph pangenome using PGGB can be resource-intensive and time-consuming, particularly for large genomes with extensive repeat regions. For example, the graph pangenome construction time for over 20 barley genomes can extend up to several months. Additionally, the optimal parameters for PGGB vary across different species and often require multiple computational trials to fine-tune.
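To make the graph model concrete, the following Python sketch stores a toy variation graph as nodes (sequence segments), edges (allowed adjacencies), and paths (the walk each genome takes through the graph). It is a deliberately simplified illustration: strand orientation is ignored, all names and sequences are invented, and none of the tools discussed above store graphs in this exact form.

```python
from typing import Dict, List, Set, Tuple

# Nodes hold sequence, edges connect node IDs, and paths record how each
# genome traverses the graph. All values here are toy data.
nodes: Dict[str, str] = {
    "n1": "ACGT",
    "n2": "GG",    # insertion carried only by the nonreference accession
    "n3": "TTCA",
}
edges: Set[Tuple[str, str]] = {("n1", "n2"), ("n2", "n3"), ("n1", "n3")}
paths: Dict[str, List[str]] = {
    "reference": ["n1", "n3"],
    "accession_2": ["n1", "n2", "n3"],
}


def spell_path(path: List[str], nodes: Dict[str, str]) -> str:
    """Concatenate node sequences along a path to recover a genome sequence."""
    return "".join(nodes[node_id] for node_id in path)


def path_is_valid(path: List[str], edges: Set[Tuple[str, str]]) -> bool:
    """Check that every consecutive pair of nodes is joined by an edge."""
    return all((a, b) in edges for a, b in zip(path, path[1:]))


def non_reference_nodes(
    nodes: Dict[str, str], paths: Dict[str, List[str]], ref_name: str
) -> List[str]:
    """List nodes that the chosen reference path never visits."""
    return sorted(set(nodes) - set(paths[ref_name]))


for name, path in paths.items():
    print(name, path_is_valid(path, edges), spell_path(path, nodes))
print(non_reference_nodes(nodes, paths, "reference"))
```

In this toy graph the insertion appears as a "bubble": two alternative routes between n1 and n3. The nodes that fall off the reference path are the sequence a linear reference would miss, and which nodes these are depends on the genome chosen as the reference.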
The Future Direction of Plant Pangenome Analysis
As sequencing costs decrease and third-generation sequencing technologies advance, complemented by the emergence of graph genome bioinformatic tools, graph-based pangenomes are increasingly becoming the standard reference. Compared with linear format pangenomes, graph-based pangenomes allow for more comprehensive detection and depiction of genomic sequences and variation, leading to improved read alignment rates with the reference and enhancing the ability to capture missing heritability within populations. However, the complexity and size of graph-based pangenomes pose challenges for genome graph analysis; there currently exist no universally accepted benchmarks to evaluate the quality and reliability of graph-based pangenomes. Nevertheless, the recent introduction of the bioinformatics tool gretl (Vorbrugg et al. 2024) marks a significant advancement. This tool is designed to assess graphs generated by various graph construction pipelines under differing parameters, offering a range of statistics for evaluating graph-based pangenomes. These measures include the number of nodes and edges, the average or median length of the top 5% of nodes, and a compression ratio calculated as the input genome size in base pairs divided by the graph size in base pairs.
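The simplest of these statistics can be computed directly from a graph file. The Python sketch below is not gretl itself; it parses a toy graph written in GFA version 1 (S lines for segments, L lines for links, P lines for paths) and reports the node count, edge count, and the compression ratio as defined above. Taking the input genome size as the summed length of all paths is an assumption made for this example, as are the function name and the toy data.

```python
from typing import Dict, List


def graph_stats(gfa_text: str) -> Dict[str, float]:
    """Compute basic statistics from GFA v1 text (S, L, and P lines only)."""
    node_len: Dict[str, int] = {}
    n_edges = 0
    path_steps: List[List[str]] = []
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":      # segment: ID and sequence
            node_len[fields[1]] = len(fields[2])
        elif fields[0] == "L":    # link between two segments
            n_edges += 1
        elif fields[0] == "P":    # path: comma-separated steps such as "1+"
            path_steps.append([step[:-1] for step in fields[2].split(",")])
    graph_bp = sum(node_len.values())
    # Assumption: input genome size is the total length of all embedded paths.
    input_bp = sum(node_len[s] for steps in path_steps for s in steps)
    return {
        "nodes": len(node_len),
        "edges": n_edges,
        "graph_bp": graph_bp,
        "input_bp": input_bp,
        "compression_ratio": input_bp / graph_bp,
    }


# Toy graph: two genomes that differ by one 2-bp insertion.
toy_gfa = "\n".join([
    "H\tVN:Z:1.0",
    "S\t1\tACGT",
    "S\t2\tGG",
    "S\t3\tTTCA",
    "L\t1\t+\t2\t+\t0M",
    "L\t2\t+\t3\t+\t0M",
    "L\t1\t+\t3\t+\t0M",
    "P\treference\t1+,3+\t*",
    "P\taccession_2\t1+,2+,3+\t*",
])
stats = graph_stats(toy_gfa)
print(stats)
```

A higher compression ratio indicates that more of the input sequence is shared between genomes and stored only once in the graph, which is one reason the same input genomes can yield quite different values under different construction pipelines or parameters.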
The question of whether to use efficient reference-based methods or apply nonreference-biased approaches remains open. Reference-based methods offer a distinct advantage for pangenome updates as new data become available, because the new data can be aligned directly to the existing pangenome structure. However, the selection of a reference genome can significantly alter the range of genetic variations identified, so the reference genome must be chosen with care. Using nonreference-biased approaches may offer a more comprehensive view of genomic diversity, helping to unravel the complex genetic variations across different genomes, but comes with a significant additional development cost. In addition, while tools such as Bandage (Wick et al. 2015) and ODGI (Guarracino et al. 2022) have facilitated the application of graph-based pangenomes, the majority of genome analysis tools are currently designed for linear sequences, highlighting the need for scalable software and sophisticated data structures specifically designed for graph-based pangenome analysis. Recognizing this gap, recent tools like Panache have been developed to navigate the complexities of graph-based pangenome data through intuitive linear representations, offering promising solutions to the visualization challenges posed by graph-based pangenomes. In the future, the migration of linear genome-based methods to the graph-based pangenome is expected to emerge as a significant area of focus, particularly when combined with large-scale sequencing to characterize the distribution of variants in a population.
The development of artificial intelligence such as machine learning, which is able to identify complex patterns and relationships in data, can significantly enhance the accuracy of downstream pangenome analyses such as gene prediction, structural variation detection, and trait association (Bayer, Petereit, et al. 2021; Bayer, Scheben, et al. 2021; Danilevicz et al. 2022; Upadhyaya et al. 2022). This facilitates the exploration of genome features and the association of genomic variants with plant traits. For example, a rice pangenome study implemented a machine learning–based workflow to identify inversions and produce a pangenome-wide inversion index using high-throughput sequencing data from the 3K-Rice Genome Project (Zhou et al. 2023). A separate study employed soybean pangenome data and multiple machine learning–based models for trait prediction (Gill et al. 2022). However, a challenge faced by machine learning–based methods is the lack of empirical training data. Consequently, training machine learning–based structural variant detection and trait association models on simulated data could be a viable approach (Dierckxsens et al. 2021). Ongoing advances in DNA sequencing technologies and artificial intelligence are paving the way toward the further application of graph-based pangenomes, enhancing our understanding of the biological and functional significance of genetic variations.
Acknowledgments
This work is funded by the Key Project of Guangdong Basic and Applied Basic Research Foundation (2020B1515420003), Guangdong Key Laboratory of New Technology in Rice Breeding (2023B1212060042), the Innovation Team Project of Guangdong Modern Agricultural Industrial System (2023KJ106), the “YouGu” Plan of Rice Research Institute of Guangdong Academy of Agricultural Sciences (2023YG04), Introduction of Young Key Talents of Guangdong Academy of Agricultural Sciences (R2023YJ-QC001), the GuangDong Basic and Applied Basic Research Foundation (2024A1515011981), and the Australia Research Council (projects DP200100762 and DP210100296).
Contributor Information
Haifei Hu, Rice Research Institute, Guangdong Academy of Agricultural Sciences & Key Laboratory of Genetics and Breeding of High Quality Rice in Southern China (Co-construction by Ministry and Province), Ministry of Agriculture and Rural Affairs & Guangdong Key Laboratory of New Technology in Rice Breeding & Guangdong Rice Engineering Laboratory, Guangzhou 510640, China.
Risheng Li, Rice Research Institute, Guangdong Academy of Agricultural Sciences & Key Laboratory of Genetics and Breeding of High Quality Rice in Southern China (Co-construction by Ministry and Province), Ministry of Agriculture and Rural Affairs & Guangdong Key Laboratory of New Technology in Rice Breeding & Guangdong Rice Engineering Laboratory, Guangzhou 510640, China; College of Agriculture, South China Agricultural University, Guangzhou, Guangdong 510642, China.
Junliang Zhao, Rice Research Institute, Guangdong Academy of Agricultural Sciences & Key Laboratory of Genetics and Breeding of High Quality Rice in Southern China (Co-construction by Ministry and Province), Ministry of Agriculture and Rural Affairs & Guangdong Key Laboratory of New Technology in Rice Breeding & Guangdong Rice Engineering Laboratory, Guangzhou 510640, China.
Jacqueline Batley, School of Biological Sciences, University of Western Australia, Perth, WA, Australia.
David Edwards, School of Biological Sciences, University of Western Australia, Perth, WA, Australia; Centre for Applied Bioinformatics, University of Western Australia, Perth, WA 6009, Australia.
Data Availability
No new data were generated as part of this review.
Literature Cited
- Bayer PE, Golicz AA, Scheben A, Batley J, Edwards D. Plant pan-genomes are the new reference. Nat Plants. 2020:6(8):914–920. 10.1038/s41477-020-0733-0. [DOI] [PubMed] [Google Scholar]
- Bayer PE, Petereit J, Danilevicz MF, Anderson R, Batley J, Edwards D. The application of pangenomics and machine learning in genomic selection in plants. Plant Genome. 2021:14(3):e20112. 10.1002/tpg2.20112. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bayer PE, Petereit J, Durant É, Monat C, Rouard M, Hu H, Chapman B, Li C, Cheng S, Batley J, et al. Wheat Panache: a pangenome graph database representing presence–absence variation across sixteen bread wheat genomes. Plant Genome. 2022:15(3):e20221. 10.1002/tpg2.20221. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bayer PE, Scheben A, Golicz AA, Yuan Y, Faure S, Lee H, Chawla HS, Anderson R, Bancroft I, Raman H, et al. Modelling of gene loss propensity in the pangenomes of three Brassica species suggests different mechanisms between polyploids and diploids. Plant Biotechnol J. 2021:19(12):2488–2500. doi: doi 10.1111/pbi.13674. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bayer PE, Valliyodan B, Hu H, Marsh JI, Yuan Y, Vuong TD, Patil G, Song Q, Batley J, Varshney RK, et al. Sequencing the USDA core soybean collection reveals gene loss during domestication and breeding. Plant Genome. 2022:15(1):e20109. 10.1002/tpg2.20109. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cheng H, Concepcion GT, Feng X, Zhang H, Li H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods. 2021:18(2):170–175. 10.1038/s41592-020-01056-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Contreras-Moreira B, Cantalapiedra CP, García-Pereira MJ, Gordon SP, Vogel JP, Igartua E, Casas AM, Vinuesa P. Analysis of plant pan-genomes and transcriptomes with GET_HOMOLOGUES-EST, a clustering solution for sequences of the same species. Front. Plant Sci. 2017:8:184. 10.3389/fpls.2017.00184. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Contreras-Moreira B, Saraf S, Naamati G, Casas AM, Amberkar SS, Flicek P, Jones AR, Dyer S. GET_PANGENES: calling pangenes from plant genome alignments confirms presence–absence variation. Genome Biol. 2023:24(1):223. 10.1186/s13059-023-03071-z.
- Danilevicz MF, Gill M, Anderson R, Batley J, Bennamoun M, Bayer PE, Edwards D. Plant genotype to phenotype prediction using machine learning. Front Genet. 2022:13:822173. 10.3389/fgene.2022.822173.
- Derbyshire MC, Marsh J, Tirnaz S, Nguyen HT, Batley J, Bayer PE, Edwards D. Diversity of fatty acid biosynthesis genes across the soybean pangenome. Plant Genome. 2023:16(2):e20334. 10.1002/tpg2.20334.
- Dierckxsens N, Li T, Vermeesch JR, Xie Z. A benchmark of structural variation detection by long reads through a realistic simulated model. Genome Biol. 2021:22(1):1–16. 10.1186/s13059-021-02551-4.
- Donlin MJ. Using the generic genome browser (GBrowse). Curr Protoc Bioinformatics. 2007:17(1):Chapter 9, Unit 9.9. 10.1002/0471250953.bi0909s17.
- Edwards D, Batley J. Graph pangenomes find missing heritability. Nat Genet. 2022:54(7):919–920. 10.1038/s41588-022-01099-8.
- Eizenga JM, Novak AM, Sibbesen JA, Heumos S, Ghaffaari A, Hickey G, Chang X, Seaman JD, Rounthwaite R, Ebler J, et al. Pangenome graphs. Annu Rev Genomics Hum Genet. 2020:21(1):139–162. 10.1146/annurev-genom-120219-080406.
- Fu H, Dooner HK. Intraspecific violation of genetic colinearity and its implications in maize. Proc Natl Acad Sci U S A. 2002:99(14):9573–9578. 10.1073/pnas.132259199.
- Garg G, Kamphuis LG, Bayer PE, Kaur P, Dudchenko O, Taylor CM, Frick KM, Foley RC, Gao LL, Aiden EL, et al. A pan-genome and chromosome-length reference genome of narrow-leafed lupin (Lupinus angustifolius) reveals genomic diversity and insights into key industry and biological traits. Plant J. 2022:111(5):1252–1266. 10.1111/tpj.15885.
- Garrison E, Guarracino A, Heumos S, Villani F, Bao Z, Tattini L, Hagmann J, Vorbrugg S, Marco-Sola S, Kubica C, et al. Building pangenome graphs. bioRxiv 535718. 10.1101/2023.04.05.535718, 6 April 2023, preprint: not peer reviewed.
- Garrison E, Sirén J, Novak AM, Hickey G, Eizenga JM, Dawson ET, Jones W, Garg S, Markello C, Lin MF, et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat Biotechnol. 2018:36(9):875–879. 10.1038/nbt.4227.
- Gill M, Anderson R, Hu H, Bennamoun M, Petereit J, Valliyodan B, Nguyen HT, Batley J, Bayer PE, et al. Machine learning models outperform deep learning models, provide interpretation and facilitate feature selection for soybean trait prediction. BMC Plant Biol. 2022:22(1):180. 10.1186/s12870-022-03559-z.
- Goel M, Sun H, Jiao WB, Schneeberger K. SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies. Genome Biol. 2019:20(1):1–13. 10.1186/s13059-019-1911-0.
- Golicz AA, Batley J, Edwards D. Towards plant pangenomics. Plant Biotechnol J. 2016:14(4):1099–1105. 10.1111/pbi.12499.
- Golicz AA, Bayer PE, Barker GC, Edger PP, Kim H, Martinez PA, Chan CK, Severn-Ellis A, McCombie WR, Parkin IA, et al. The pangenome of an agronomically important crop plant Brassica oleracea. Nat Commun. 2016:7(1):13390. 10.1038/ncomms13390.
- Golicz AA, Martinez PA, Zander M, Patel DA, Van De Wouw AP, Visendi P, Fitzgerald TL, Edwards D, Batley J. Gene loss in the fungal canola pathogen Leptosphaeria maculans. Funct Integr Genomics. 2015:15(2):189–196. 10.1007/s10142-014-0412-1.
- Gordon SP, Contreras-Moreira B, Woods DP, Des Marais DL, Burgess D, Shu S, Stritt C, Roulin AC, Schackwitz W, Tyler L, et al. Extensive gene content variation in the Brachypodium distachyon pan-genome correlates with population structure. Nat Commun. 2017:8(1):2184. 10.1038/s41467-017-02292-8.
- Gordon SP, Contreras-Moreira B, Levy JJ, Djamei A, Czedik-Eysenberg A, Tartaglio VS, Session A, Martin J, Cartwright A, Katz A, et al. Gradual polyploid genome evolution revealed by pan-genomic analysis of Brachypodium hybridum and its diploid progenitors. Nat Commun. 2020:11(1):3670. 10.1038/s41467-020-17302-5.
- Guarracino A, Heumos S, Nahnsen S, Prins P, Garrison E. ODGI: understanding pangenome graphs. Bioinformatics. 2022:38(13):3319–3326. 10.1093/bioinformatics/btac308.
- Hickey G, Monlong J, Ebler J, Novak AM, Eizenga JM, Gao Y, Human Pangenome Reference Consortium, Marschall T, Li H, Paten B. Pangenome graph construction from genome alignments with Minigraph-Cactus. Nat Biotechnol. 2023:42(4):663–673. 10.1038/s41587-023-01793-w.
- Hu H, Scheben A, Edwards D. Advances in integrating genomics and bioinformatics in the plant breeding pipeline. Agriculture. 2018:8(6):75. 10.3390/agriculture8060075.
- Hu H, Scheben A, Verpaalen B, Tirnaz S, Bayer PE, Hodel RGJ, Batley J, Soltis DE, Soltis PS, Edwards D. Amborella gene presence/absence variation is associated with abiotic stress responses that may contribute to environmental adaptation. New Phytol. 2022:233(4):1548–1555. 10.1111/nph.17658.
- Hu H, Scheben A, Wang J, Li F, Li C, Edwards D, Zhao J. Unraveling inversions: technological advances, challenges, and potential impact on crop breeding. Plant Biotechnol J. 2024:22(3):544–554. 10.1111/pbi.14224.
- Hu H, Yuan Y, Bayer PE, Fernandez CT, Scheben A, Golicz AA, Edwards D. Legume pangenome construction using an iterative mapping and assembly approach. Methods Mol Biol. 2020:2107:35–47. 10.1007/978-1-0716-0235-5_3.
- Hu Z, Sun C, Lu KC, Chu X, Zhao Y, Lu J, Shi J, Wei C. EUPAN enables pan-genome studies of a large number of eukaryotic genomes. Bioinformatics. 2017:33(15):2408–2409. 10.1093/bioinformatics/btx170.
- Hufford MB, Seetharam AS, Woodhouse MR, Chougule KM, Ou S, Liu J, Ricci WA, Guo T, Olson A, Qiu Y, et al. De novo assembly, annotation, and comparative analysis of 26 diverse maize genomes. Science. 2021:373(6555):655–662. 10.1126/science.abg5289.
- Jayakodi M, Padmarasu S, Haberer G, Bonthala VS, Gundlach H, Monat C, Lux T, Kamal N, Lang D, Himmelbach A, et al. The barley pan-genome reveals the hidden legacy of mutation breeding. Nature. 2020:588(7837):284–289. 10.1038/s41586-020-2947-8.
- Jia Y, Xu M, Hu H, Chapman B, Watt C, Buerte B, Han N, Zhu M, Bian H, Li C, et al. Comparative gene retention analysis in barley, wild emmer, and bread wheat pangenome lines reveals factors affecting gene retention following gene duplication. BMC Biol. 2023:21(1):25. 10.1186/s12915-022-01503-z.
- Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012:9(4):357–359. 10.1038/nmeth.1923.
- Li H, Feng X, Chu C. The design and construction of reference pangenome graphs with minigraph. Genome Biol. 2020:21(1):265. 10.1186/s13059-020-02168-z.
- Li H, Marin M, Farhat MR. Exploring gene content with pangenome gene graphs. arXiv: 2402.16185. 10.48550/arXiv.2402.16185, 25 Feb 2024, preprint: not peer reviewed.
- Li W, Liu J, Zhang H, Liu Z, Wang Y, Xing L, He Q, Du H. Plant pan-genomics: recent advances, new challenges, and roads ahead. J Genet Genomics. 2022:49(9):833–846. 10.1016/j.jgg.2022.06.004.
- Li YH, Zhou G, Ma J, Jiang W, Jin LG, Zhang Z, Guo Y, Zhang J, Sui Y, Zheng L, et al. De novo assembly of soybean wild relatives for pan-genome analysis of diversity and agronomic traits. Nat Biotechnol. 2014:32(10):1045–1052. 10.1038/nbt.2979.
- Liao WW, Asri M, Ebler J, Doerr D, Haukness M, Hickey G, Lu S, Lucas JK, Monlong J, Abel HJ, et al. A draft human pangenome reference. Nature. 2023:617(7960):312–324. 10.1038/s41586-023-05896-x.
- Ma J, Cáceres M, Salmela L, Mäkinen V, Tomescu AI. Chaining for accurate alignment of erroneous long reads to acyclic variation graphs. Bioinformatics. 2023:39(8):btad460. 10.1093/bioinformatics/btad460.
- Montenegro JD, Golicz AA, Bayer PE, Hurgobin B, Lee H, Chan CK, Visendi P, Lai K, Doležel J, Batley J, et al. The pangenome of hexaploid bread wheat. Plant J. 2017:90(5):1007–1013. 10.1111/tpj.13515.
- Ou S, Su W, Liao Y, Chougule K, Agda JRA, Hellinga AJ, Lugo CSB, Elliott TA, Ware D, Peterson T, et al. Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline. Genome Biol. 2019:20(1):1–18. 10.1186/s13059-019-1905-y.
- Rautiainen M, Marschall T. GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biol. 2020:21(1):253. 10.1186/s13059-020-02157-2.
- Rijzaani H, Bayer PE, Rouard M, Doležel J, Batley J, Edwards D. The pangenome of banana highlights differences between genera and genomes. Plant Genome. 2022:15(1):e20100. 10.1002/tpg2.20100.
- Ruperao P, Thirunavukkarasu N, Gandham P, Selvanayagam S, Govindaraj M, Nebie B, Manyasa E, Gupta R, Das RR, Odeny DA, et al. Sorghum pan-genome explores the functional utility for genomic-assisted breeding to accelerate the genetic gain. Front Plant Sci. 2021:12:963. 10.3389/fpls.2021.666342.
- Sirén J, Monlong J, Chang X, Novak AM, Eizenga JM, Markello C, Sibbesen JA, Hickey G, Chang PC, Carroll A, et al. Pangenomics enables genotyping of known structural variants in 5202 diverse genomes. Science. 2021:374(6574):abg8871. 10.1126/science.abg8871.
- Schatz MC, Maron LG, Stein JC, Hernandez Wences A, Gurtowski J, Biggers E, Lee H, Kramer M, Antoniou E, Ghiban E, et al. Whole genome de novo assemblies of three divergent strains of rice, Oryza sativa, document novel gene space of aus and indica. Genome Biol. 2014:15(11):506. 10.1186/PREACCEPT-2784872521277375.
- Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL, Angiuoli SV, Crabtree J, Jones AL, Durkin AS, et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome”. Proc Natl Acad Sci U S A. 2005:102(39):13950–13955. 10.1073/pnas.0506758102.
- Upadhyaya SR, Bayer PE, Tay Fernandez CG, Petereit J, Batley J, Bennamoun M, Boussaid F, Edwards D. Evaluating plant gene models using machine learning. Plants (Basel). 2022:11(12):1619. 10.3390/plants11121619.
- Vaughn JN, Branham SE, Abernathy B, Hulse-Kemp AM, Rivers AR, Levi A, Wechter WP. Graph-based pangenomics maximizes genotyping density and reveals structural impacts on fungal resistance in melon. Nat Commun. 2022:13(1):7897. 10.1038/s41467-022-35621-7.
- Vorbrugg S, Bezrukov I, Bao Z, Weigel D. Gretl-variation GRaph evaluation TooLkit. bioRxiv 580974. 10.1101/2024.03.04.580974, 5 March 2024, preprint: not peer reviewed.
- Wang J, Yang W, Zhang S, Hu H, Yuan Y, Dong J, Chen L, Ma Y, Yang T, Zhou L, et al. A pangenome analysis pipeline provides insights into functional gene identification in rice. Genome Biol. 2023:24(1):19. 10.1186/s13059-023-02861-9.
- Wang K, Hu H, Tian Y, Li J, Scheben A, Zhang C, Li Y, Wu J, Yang L, Fan X, et al. The chicken pan-genome reveals gene content variation and a promoter region deletion in IGF2BP1 affecting body size. Mol Biol Evol. 2021:38(11):5066–5081. 10.1093/molbev/msab231.
- Wang W, Mauleon R, Hu Z, Chebotarov D, Tai S, Wu Z, Li M, Zheng T, Fuentes RR, Zhang F, et al. Genomic variation in 3,010 diverse accessions of Asian cultivated rice. Nature. 2018:557(7703):43–49. 10.1038/s41586-018-0063-9.
- Wick RR, Schultz MB, Zobel J, Holt KE. Bandage: interactive visualization of de novo genome assemblies. Bioinformatics. 2015:31(20):3350–3352. 10.1093/bioinformatics/btv383.
- Yates AD, Allen J, Amode RM, Azov AG, Barba M, Becerra A, Bhai J, Campbell LI, Carbajo Martinez M, Chakiachvili M, et al. Ensembl genomes 2022: an expanding genome resource for non-vertebrates. Nucleic Acids Res. 2022:50(D1):D996–D1003. 10.1093/nar/gkab1007.
- Yan H, Sun M, Zhang Z, Jin Y, Zhang A, Lin C, Wu B, He M, Xu B, Wang J, et al. Pangenomic analysis identifies structural variation associated with heat tolerance in pearl millet. Nat Genet. 2023:55(3):507–518. 10.1038/s41588-023-01302-4.
- Yu J, Golicz AA, Lu K, Dossa K, Zhang Y, Chen J, Wang L, You J, Fan D, Edwards D, et al. Insight into the evolution and functional characteristics of the pan-genome assembly from sesame landraces and modern cultivars. Plant Biotechnol J. 2019:17(5):881–892. 10.1111/pbi.13022.
- Zhao J, Bayer PE, Ruperao P, Saxena RK, Khan AW, Golicz AA, Nguyen HT, Batley J, Edwards D, Varshney RK. Trait associations in the pangenome of pigeon pea (Cajanus cajan). Plant Biotechnol J. 2020:18(9):1946–1954. 10.1111/pbi.13354.
- Zhou Y, Yu Z, Chebotarov D, Chougule K, Lu Z, Rivera LF, Kathiresan N, Al-Bader N, Mohammed N, Alsantely A, et al. Pan-genome inversion index reveals evolutionary insights into the subpopulation structure of Asian rice. Nat Commun. 2023:14(1):1567. 10.1038/s41467-023-37004-y.
- Zhou Y, Zhang Z, Bao Z, Li H, Lyu Y, Zan Y, Wu Y, Cheng L, Fang Y, Wu K, et al. Graph pangenome captures missing heritability and empowers tomato breeding. Nature. 2022:606(7914):527–534. 10.1038/s41586-022-04808-9.
- Zimin AV, Marçais G, Puiu D, Roberts M, Salzberg SL, Yorke JA. The MaSuRCA genome assembler. Bioinformatics. 2013:29(21):2669–2677. 10.1093/bioinformatics/btt476.
Data Availability Statement
No new data were generated as part of this review.

