Abstract
Cultivated pear consists of several Pyrus species with Pyrus communis (European pear) representing a large fraction of worldwide production. As a relatively recently domesticated crop and perennial tree, pear can benefit from genome-assisted breeding. Additionally, comparative genomics within Rosaceae promises greater understanding of evolution within this economically important family. Here, we generate a fully phased chromosome-scale genome assembly of P. communis ‘d’Anjou.’ Using PacBio HiFi and Dovetail Omni-C reads, the genome is resolved into the expected 17 chromosomes, with each haplotype totaling nearly 540 Megabases and a contig N50 of nearly 14 Mb. Both haplotypes are highly syntenic to each other and to the Malus domestica ‘Honeycrisp’ apple genome. Nearly 45,000 genes were annotated in each haplotype, over 90% of which have direct RNA-seq expression evidence. We detect signatures of the known whole-genome duplication shared between apple and pear, and we estimate 57% of d’Anjou genes are retained in duplicate derived from this event. This genome highlights the value of generating phased diploid assemblies for recovering the full allelic complement in highly heterozygous crop species.
Keywords: genome assembly, comparative genomics, PacBio HiFi, haplotype phased, whole-genome duplication
Yocca et al. present a fully phased chromosome-scale genome assembly of the European pear P. communis ‘d’Anjou.’ This assembly will allow for genome-assisted breeding and will thus increase the pace of improvement for this relatively recently cultivated, economically important tree species. Additionally, comparative genomic studies within the rose family (Rosaceae) may provide insight into the evolutionary history of the European pear. The genome assembly, annotation, and analyses were produced in a semester-long undergraduate and graduate course as part of the ACTG: American Campus Tree Genomes Initiative.
Introduction
Pyrus L. is a genus in the family Rosaceae (subfamily Maloideae) comprising cultivated and wild pears. Pyrus is divided into 2 broad categories, the European and Asian pears, with their divergence estimated around 3–6 MYA (Wu et al. 2018). At least 26 species of Pyrus and 10 naturally occurring interspecific crosses are now found in Western and Eastern Asia, Europe, North Africa, and the Middle East (Bell and Itai, 2011). In 2021, the pear's value of utilized production in the United States reached $353 million (United States Department of Agriculture National Agricultural Statistics Service 2023). This makes pear one of the most cultivated pome fruits worldwide. One of the most important North American varieties of pear, the Anjou, also known as the Beurré d'Anjou (Pyrus communis ‘d'Anjou’), is thought to have originated in Belgium and is named for the Anjou region of France.
Over the last decade, several pear genomes have been sequenced and assembled using a variety of technologies. The first Pyrus genome sequenced in 2012 was the most commercially important Asian pear Pyrus bretschneideri Rehd. ‘Dangshansuli,’ using a combination of BAC-by-BAC sequencing and mate pair Illumina sequencing (Wu et al. 2013). Following that, European pear (P. communis ‘Bartlett’) was sequenced using Roche 454 (Chagné et al. 2014). In 2019, the P. communis genome was updated by sequencing the doubled-haploid ‘Bartlett’ cultivar using PacBio long reads and high-throughput chromosome conformation capture (Hi-C) technology (Linsmith et al. 2019). This assembly helped uncover duplicated gene models in previous assemblies that overassembled heterozygous regions. However, being a doubled-haploid, it still lacked an entire parental complement. A draft assembly and annotation for P. communis ‘d’Anjou’ was generated recently (Zhang et al. 2022), which was carefully annotated and revealed systematic differences in gene annotations across Rosaceae genomes. However, this assembly was also not phased, lacking information on allelic variants. Genomes are currently available for 5 of 26 Pyrus species in the Genome Database for Rosaceae (GDR; https://www.rosaceae.org/organism/26137) and for only a few of the thousands of recognized cultivars (Li et al. 2022).
Here, we sequenced and assembled a chromosome-scale reference genome for P. communis ‘d’Anjou’ using PacBio HiFi and Dovetail Omni-C sequencing. This genome was assembled as part of a semester-long undergraduate and graduate genomics course under the American Campus Tree Genomes (ACTG) initiative, in which undergraduate and graduate students assemble, annotate, and publish genomes of culturally and economically valuable tree species. We present a haplotype-resolved, chromosome-scale assembly and annotation of Anjou pear, place it in a phylogenetic context with other Rosaceae species, and show evidence of an ancient whole-genome duplication (WGD) event shared by cultivated apple and pear.
Methods
Genome sequencing
Tissue was acquired from Van Well Nursery as described previously (Zhang et al. 2022). The source material was labeled as the cultivar ‘d’Anjou.’ We consider ‘Anjou’ and ‘Beurré d’Anjou’ to be synonymous cultivar names. DNA was isolated from young leaf tissue using a standard CTAB approach (Doyle and Doyle 1987). Illumina TruSeq DNA PCR-free libraries were constructed from 1 μg of input DNA and sequenced on an Illumina NovaSeq6000 at HudsonAlpha Institute for Biotechnology. These short reads were generated for plastid genome assembly as well as genome size estimation and postassembly assessment. Raw reads were assessed for quality using FASTQC v0.11.9 (Andrews et al. 2010). Low-quality reads were then filtered out of the raw data using fastp v0.12.4, and a summary report was generated with MultiQC 1.13.dev0 (Ewels et al. 2016). Nuclear genome size and ploidy were estimated by counting k-mers with jellyfish v2.2.10 (Marçais and Kingsford 2011) and visualizing the k-mer spectrum in GenomeScope2.0 (Ranallo-Benavidez et al. 2020).
For PacBio HiFi sequencing, ∼20 g of young leaf tissue from a ‘d’Anjou’ pear clone was collected and flash-frozen in liquid nitrogen. High molecular weight DNA was isolated from the young leaf tissue using a Circulomics Nanobind Plant Nuclei Big DNA kit (Baltimore, MD), with 4 g of input tissue and a 2 h lysis. DNA was tested for purity via spectrophotometry, quantified by Qubit dsDNA Broad Range, and size-selected on an Agilent Femto Pulse. DNA was sheared with a Diagenode Megaruptor and size-selected to roughly 25 kb on a BluePippin. A PacBio sequencing library was produced using the SMRTbell Express Template Prep Kit 2.0, and circular consensus sequencing (CCS; HiFi) reads were produced on two 8M flow cells. PacBio HiFi read quality vs read length distribution (Supplementary Fig. 1) was assessed using Pauvre v0.2.3 (Schultz et al. 2019).
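The k-mer-based size estimate described above can be illustrated with a minimal sketch. This is not the jellyfish or GenomeScope implementation: GenomeScope fits a mixture model that accounts for heterozygosity and repeats, whereas the sketch below simply divides the total number of non-error k-mers by the modal k-mer depth. The function names and the `min_depth` error cutoff are assumptions made for illustration.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k):
    """Count canonical k-mers across reads, skipping k-mers with non-ACGT characters."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def kmer_spectrum(counts):
    """Map each multiplicity to the number of distinct k-mers seen that many times."""
    return Counter(counts.values())


def estimate_genome_size(spectrum, min_depth=2):
    """Naive estimate: total k-mers at or above min_depth divided by the modal depth.

    min_depth is a hypothetical cutoff that drops low-multiplicity k-mers,
    which are dominated by sequencing errors.
    """
    kept = {depth: n for depth, n in spectrum.items() if depth >= min_depth}
    peak_depth = max(kept, key=kept.get)
    total_kmers = sum(depth * n for depth, n in kept.items())
    return total_kmers / peak_depth
```

In a heterozygous diploid such as pear, the spectrum typically shows separate heterozygous and homozygous peaks, so choosing the modal depth naively can misestimate the size; this is one reason a model-based tool is preferable in practice.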
Plastid genome assembly and annotation
The plastid genomes from 5 Pyrus individuals (Supplementary Table 3) were assembled using NOVOPlasty v4.3.1 (Dierckxsens et al. 2016), setting the expected plastid genome size to 130–170 kb and using the seed file provided (https://github.com/ndierckx/NOVOPlasty). The assembled plastid genomes were annotated using GeSeq v2.0.3 (Tillich et al. 2017) and visualized using OGDRAW v1.3.1 (Greiner et al. 2019).
Genome assembly and scaffolding
Raw HiFi reads were assembled into contigs using hifiasm v0.16.0 (Cheng et al. 2021). To scaffold the ‘d’Anjou’ genome, 1 g of young leaf tissue was used as input for a Dovetail Omni-C library per manufacturer instructions (Dovetail Genomics, Inc.). The Omni-C library was sequenced on an Illumina NovaSeq6000 using paired-end 150 base-pair reads. To map the Omni-C data to our preliminary genome assembly, the Arima genomics pipeline was followed (https://github.com/ArimaGenomics/mapping_pipeline). Scaffolding was then performed using yet another Hi-C scaffolding tool (YaHS) with default parameters (Zhou et al. 2023). Omni-C contact maps were visualized using Juicebox version 1.11.08 (Durand et al. 2016). Several examples of likely misassembled regions were manually rearranged in Juicebox and documented in the Supplementary Methods. Genome completeness was assessed using compleasm v0.2.2 with the lineage “embryophyta_odb10” (Huang and Li 2023).
Annotating repeats and transposable elements
Transposable elements (TEs) were predicted and annotated from the pear genome assembly using the Extensive de novo TE Annotator (EDTA) pipeline (v1.9.3) (Xu and Wang 2007; Ellinghaus et al. 2008; Xiong et al. 2014; Ou and Jiang 2019, 2018; Ou et al. 2019; Su et al. 2019; Shi and Liang 2019). EDTA parameters were set to the following: “--species others --step all --sensitive 1 --anno 1 --evaluate 1 --threads 4.” The coverage of genes and repeats in 1 Mb windows with a 100 Kb step was calculated using BEDTools version 2.30.0 (Quinlan and Hall 2010) and plotted onto the chromosomes using karyoploteR version 1.18.0 (Gel and Serra 2017).
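The sliding-window calculation described above (1 Mb windows, 100 kb step) can be sketched in a few lines. This is an illustrative stand-in, not the BEDTools code path used in the study; the function names are hypothetical, intervals are assumed to be half-open (start, end) coordinates on a single chromosome, and windows that run past the chromosome end are assumed to be truncated.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(start, end) for start, end in merged]


def window_coverage(chrom_length, features, window=1_000_000, step=100_000):
    """Return (win_start, win_end, fraction_covered) for each sliding window."""
    merged = merge_intervals(features)
    results = []
    for win_start in range(0, chrom_length, step):
        # Truncate the final windows at the chromosome end.
        win_end = min(win_start + window, chrom_length)
        covered = sum(
            max(0, min(end, win_end) - max(start, win_start))
            for start, end in merged
        )
        results.append((win_start, win_end, covered / (win_end - win_start)))
    return results
```

Merging features first ensures that overlapping annotations (for example, nested TEs) are not counted twice when computing the covered fraction.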
Structural variant analysis
First, assemblies were aligned using MUMmer (Marçais et al. 2018). Next, structural variants were characterized between genome assemblies using Assemblytics (Nattestad and Schatz 2016). More details are provided in the Supplementary Methods.
Gene annotation
Protein-coding genes were annotated using MAKER2 (Holt and Yandell 2011). Arabidopsis Araport 11 proteins and 7 P. communis ‘d’Anjou’ RNA-seq libraries were used as evidence (Cheng et al. 2017). RNA-seq libraries are available on the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) under accession PRJNA791346. One round of evidence-based annotation was performed and used to iteratively train ab initio prediction models through both SNAP and Augustus. More details are provided in the Supplementary Methods.
RNA-seq analyses
RNA-seq reads were retrieved from the NCBI SRA under accession PRJNA791346. Reads were adapter trimmed using the BBMap “bbduk.sh” script (https://sourceforge.net/projects/bbmap/). Gene expression was quantified using Kallisto (Bray et al. 2016). Clustering was performed using the “heatmap()” function in R (R Core Team 2022). More details are provided in the Supplementary Methods.
Comparative genomic analyses
Putative synteny-constrained orthologs between P. communis ‘d’Anjou,’ Malus domestica ‘Honeycrisp,’ and Prunus cerasus ‘Montmorency’ were identified using the JCVI utilities library compara catalog ortholog function (Tang et al. 2008). Genome assemblies and annotations were retrieved from the Genome Database for Rosaceae. Synonymous substitution rates were calculated using a custom Ka/Ks pipeline (https://github.com/Aeyocca/ka_ks_pipe). Briefly, orthologs were aligned using MUSCLE v3.8.31 (Edgar 2004), PAL2NAL v14 was used to convert the peptide alignment to a nucleotide alignment (Suyama et al. 2006), and Ks values were computed between gene pairs using codeml from PAML v4.9 (Yang 1997) with parameters specified in the control file found in the GitHub repository listed above.
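The peptide-to-nucleotide conversion step can be illustrated with a simplified sketch of codon threading. This is not PAL2NAL itself and does not reproduce the linked pipeline: it assumes each coding sequence is in frame, contains no internal stop codons or frameshifts, and matches its aligned protein residue for residue. The function names are hypothetical.

```python
def thread_codons(aligned_protein, cds):
    """Map an aligned protein (with '-' gaps) back onto its coding sequence.

    Each residue consumes the next codon of the CDS; each gap becomes '---'.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(cds) < 3 * n_residues:
        raise ValueError("CDS is too short for the aligned protein")
    codons = []
    position = 0
    for aa in aligned_protein:
        if aa == "-":
            codons.append("---")
        else:
            codons.append(cds[position:position + 3])
            position += 3
    return "".join(codons)


def ungapped_codon_pairs(codon_aln_a, codon_aln_b):
    """Return codon pairs from alignment columns where neither sequence is gapped."""
    pairs = []
    for i in range(0, min(len(codon_aln_a), len(codon_aln_b)), 3):
        codon_a = codon_aln_a[i:i + 3]
        codon_b = codon_aln_b[i:i + 3]
        if "-" not in codon_a and "-" not in codon_b:
            pairs.append((codon_a, codon_b))
    return pairs
```

Aligning at the peptide level and then threading codons keeps gaps in multiples of three, so the reading frame is preserved for the downstream Ks estimate.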
Results
Nuclear genome assembly
We generated several types of sequencing data to assemble and annotate the Anjou genome (Fig. 1). Given an estimated genome size of ∼550 Mb (Niu et al. 2020), we generated 113× coverage of Illumina shotgun data, 66× of PacBio HiFi data, and 190× of Omni-C data per haplotype. GenomeScope estimated a k-mer-based genome size of ∼495 Mb, 46.79% repeated sequences, and 1.79% heterozygosity (Supplementary Fig. 1). We assessed the quality of our HiFi reads using Pauvre, which indicated high-quality libraries and a read length distribution centered around 15 kb (Supplementary Fig. 2). Our mean and median read lengths were 15,555 and 14,758 bp, respectively, and the longest read was 49,417 bp.
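The coverage and read-length figures reported above follow from simple arithmetic, sketched below under the assumption that fold coverage is total sequenced bases divided by the assumed genome size. The function names are hypothetical, and the toy read lengths in the test are not data from this study.

```python
from statistics import mean, median


def read_length_summary(lengths):
    """Summarize a list of read lengths (bp)."""
    return {
        "n_reads": len(lengths),
        "total_bases": sum(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "max": max(lengths),
    }


def fold_coverage(total_bases, genome_size):
    """Fold coverage: sequenced bases divided by the assumed genome size (bp)."""
    return total_bases / genome_size
```

Because coverage scales inversely with the assumed genome size, the same data set yields a higher figure against the ∼495 Mb k-mer estimate than against the ∼550 Mb literature estimate.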
Fig. 1.

Pear fruit photographs. Photographs of green Anjou fruit (a) and Red Anjou fruit (b). Photos were provided by USA Pears.
The final assembly is haplotype-resolved with 17 chromosomes per haplotype. Chromosomes were oriented according to the M. domestica ‘Honeycrisp’ assembly (Khan et al. 2022). The final assembly consisted of nearly 540 Mb per haplotype with >93% of the raw contig assemblies contained in the 17 chromosomes (Supplementary Fig. 3). The contig N50s for haplotypes 1 and 2 were 14.7 and 13.4 Mb, respectively, while the scaffold N50s were 29.6 Mb. We found >99% complete BUSCOs in each haplotype with over 30% of them present in duplicate, reflecting the WGD experienced by the Maleae lineage ∼45 MYA (Xiang et al. 2017). Over 99% of our Illumina reads were properly mapped back to our assembly. A k-mer-based comparison of the Illumina reads and the final assembly yielded high quality values (QV 36.16) and low error rates (0.0002423) for both haplotypes.
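Two of the summary statistics above have compact definitions that can be sketched directly. N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly, and a Phred-scaled quality value relates to the per-base error rate as QV = −10·log10(error). Under that relation the reported values of 36.16 and 0.0002423 agree with each other. The function names below are hypothetical.

```python
import math


def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Convert a per-base error probability to a Phred-scaled quality value."""
    return -10 * math.log10(error_rate)
```

The same `n50` function applies to contig or scaffold lengths; only the input list differs.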
Chloroplast assembly
We also assembled the chloroplast of P. communis ‘d’Anjou’ along with 4 other Pyrus species or accessions (Supplementary Table 3 and Supplementary Fig. 4; Fig. 2). The chloroplast genomes were similar in size, ranging from 159 to 161 kb, and consisted of a large single-copy region, small single-copy region, and 2 inverted repeats for each species. Pyrus as a genus consists of 2 major genetic groups: European and Asian (Zheng et al. 2014). Pyrus hopeiensis, Pyrus pyrifolia, and Pyrus bretschneideri are all considered Asian species. We estimated phylogenetic relationships between our chloroplast assemblies and found that both representatives of P. communis were sister to each other, consistent with expectations.
Fig. 2.
Chloroplast assemblies and phylogeny. Chloroplast genomes of assorted pear cultivars—assemblies and annotations. Plastid assemblies were carried out using NOVOPlasty v4.4.1 and annotated using GeSeq v2.0.3. Phylogenetic relationships were estimated using maximum likelihood under the generalized time reversible model.
TEs are important components of plant genomes, contributing to genome size variation, gene family evolution, and transcriptional novelty (Lu et al. 2019; Quadrana 2020). Repetitive elements were annotated using EDTA (Ou et al. 2019) (Table 1). A total of 39–42% of each haplotype consisted of repetitive elements. The majority of these elements by length were long terminal repeat (LTR) retrotransposons accounting for ∼32% of each haplotype. These elements are most abundant around the putative centromeres but are also ubiquitous in gene-rich regions (Fig. 3). Terminal inverted repeats (TIRs) were also abundant and dominated by mutator elements (∼3.4% of each haplotype).
Table 1.
Summary of repeat elements annotated by EDTA.
| Repeat type | Hap | Count | bp masked | % Masked | Repeat type | Hap | Count | bp masked | % Masked |
|---|---|---|---|---|---|---|---|---|---|
| LTR Ty1 | 1 | 31,417 | 29,651,485 | 5.6 | LTR Ty1 | 2 | 30,811 | 29,080,309 | 5.73 |
| LTR Ty3 | 1 | 52,870 | 65,248,004 | 12.32 | LTR Ty3 | 2 | 51,619 | 65,330,713 | 12.88 |
| LTR unknown | 1 | 52,617 | 44,783,539 | 8.46 | LTR unknown | 2 | 60,287 | 50,732,038 | 10 |
| TIR CACTA | 1 | 20,714 | 7,389,362 | 1.4 | TIR CACTA | 2 | 19,593 | 7,081,084 | 1.4 |
| TIR mutator | 1 | 75,530 | 18,368,328 | 3.47 | TIR mutator | 2 | 71,859 | 17,304,544 | 3.41 |
| TIR PIF harbinger | 1 | 26,889 | 9,561,615 | 1.81 | TIR PIF harbinger | 2 | 25,649 | 9,164,523 | 1.81 |
| TIR Tc1 mariner | 1 | 1,950 | 713,551 | 0.13 | TIR Tc1 mariner | 2 | 1,857 | 567,099 | 0.11 |
| TIR hAT | 1 | 14,789 | 4,479,323 | 0.85 | TIR hAT | 2 | 13,724 | 4,267,786 | 0.84 |
| LINE | 1 | 1,494 | 720,397 | 0.14 | LINE | 2 | 1,409 | 710,461 | 0.14 |
| Non-LTR unknown | 1 | 242 | 304,682 | 0.06 | Non-LTR unknown | 2 | 215 | 279,820 | 0.06 |
| Helitron | 1 | 25,911 | 8,267,980 | 1.56 | Helitron | 2 | 29,480 | 9,716,313 | 1.92 |
| Other repeat regions | 1 | 83,566 | 21,068,202 | 3.98 | Other repeat regions | 2 | 87,157 | 21,406,735 | 4.22 |
| Total | 1 | 387,989 | 210,556,468 | 39.78 | Total | 2 | 393,660 | 215,641,425 | 42.52 |
LTR, long terminal repeat; TIR, terminal inverted repeat; PIF, P instability factor; LINE, long interspersed nuclear element; Hap, haplotype; Bp, base pairs.
Fig. 3.
Distributions of genomic elements. Density of genomic elements across our assembly. Feature densities are calculated in 1 Mb windows with a 100 kb step size. Features on haplotype 1 are listed in a), and those on haplotype 2 are listed in b). Feature distributions are stacked in the order: Genes (bottom distribution), Ty3 TEs (second lowest distribution), Copia TEs (second highest distribution), and other repeat elements annotated by EDTA (highest distribution). Numbers along the x-axis correspond to position along the chromosome (Mb).
Each haplotype was independently annotated with expression evidence, Arabidopsis protein evidence, and ab initio gene prediction using the MAKER pipeline (Supplementary Methods and Supplementary Table 4). We annotated a total of 44,839 genes in haplotype 1 and 44,561 genes in haplotype 2, which is similar to the number of genes annotated in M. domestica ‘Honeycrisp’ (50,105). Gene density was highest on chromosome arms and was inversely related to the density of TEs (Fig. 3).
There were several structural variants between our 2 haplotypes (Table 2). We characterized 13,421 variants within 50–10,000 base pairs between the haplotypes, totaling almost 32 Mb of sequence. Repeat expansion and contractions were the largest classes of structural variant. Insertions and deletions also affected nearly 6 Mb of sequence between haplotypes. Between P. communis ‘d’Anjou’ and P. communis ‘Bartlett,’ 14,946 variants affected 26 Mb of sequence. The total amount of sequence affected is lower than that observed between ‘d’Anjou’ haplotypes. This may simply be due to a more complete assembly for both Anjou haplotypes relative to the ‘Bartlett’ assembly.
Table 2.
Structural variants between 50 and 10,000 bp identified by Assemblytics.
| Reference | Query | Variant type | # Variants | # Bases affected |
|---|---|---|---|---|
| ‘d’Anjou’ Hap1 | ‘d’Anjou’ Hap2 | Indel | 4,297 | 6,000,228 |
| ‘d’Anjou’ Hap1 | ‘d’Anjou’ Hap2 | Repeat | 8,711 | 24,943,411 |
| ‘d’Anjou’ Hap1 | ‘Bartlett’ | Indel | 5,739 | 4,439,368 |
| ‘d’Anjou’ Hap1 | ‘Bartlett’ | Repeat | 8,910 | 11,571,098 |
Indel is short for “insertion/deletion.”
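The size-filtered tallies in Table 2 can be illustrated with a short sketch. The record layout of (variant type, size in bp) tuples is hypothetical and is not the Assemblytics output format, and treating the 50 and 10,000 bp bounds as inclusive is an assumption.

```python
from collections import defaultdict


def summarize_variants(variants, min_size=50, max_size=10_000):
    """Tally count and total bases per variant type within a size range.

    variants: iterable of (variant_type, size_bp) tuples (hypothetical layout).
    Returns {variant_type: (n_variants, bases_affected)}.
    """
    tally = defaultdict(lambda: [0, 0])
    for variant_type, size in variants:
        if min_size <= size <= max_size:
            tally[variant_type][0] += 1
            tally[variant_type][1] += size
    return {variant_type: tuple(values) for variant_type, values in tally.items()}
```

Variants outside the size window are dropped before tallying, so the totals describe only the 50–10,000 bp class and not larger rearrangements.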
Comparative genomics and polyploidy
Rosaceae as a plant family contains several important crops such as pear, apple, peach, cherry, and blackberry. Comparative genomics between these crops may allow functional genomics in 1 species to be translated to others. Therefore, we compared the genomes of 3 of these important crops: P. communis ‘d’Anjou’ (pear), M. domestica ‘Honeycrisp’ (apple; Khan et al. 2022), and P. cerasus ‘Montmorency’ (cherry; Goeckeritz et al. 2023). Both our assembled haplotypes were highly collinear with each other and with apple. We identified 40,567 orthologs between pear haplotypes, 30,340 orthologs between pear haplotype 1 and apple, and 20,526 orthologs between pear and P. cerasus ‘Montmorency,’ consistent with pear's divergence from apple postdating its divergence from cherry.
Apple and pear share a WGD occurring after their divergence with cherry (Xiang et al. 2017). Our results show they both demonstrate a high percentage (>⅓) of duplicated BUSCO genes as well as 17 chromosomes, almost double the Amygdaloideae base chromosome count of 9 (Hodel et al. 2021). Therefore, we infer apple and pear retain much of their genome in duplicate. Across all genes within P. communis ‘d’Anjou,’ ∼57% are classified as having a syntenic paralog retained from this WGD event (Supplementary Table 5).
‘Montmorency’ is a tetraploid formed from a hybridization between different Prunus species after their divergence from the common ancestor of apple and pear. Therefore, we only compared the “A” subgenome to our assemblies. As expected, each cherry “A” subgenome scaffold was syntenic with ∼2 pear and apple scaffolds (Fig. 4a). Additionally, there were blocks in pear syntenic with 2 regions of apple that are likely regions retained from the last WGD event. There were likely further karyotype changes since the divergence of Maleae and cherry, as the syntenic blocks are not entirely retained nor perfectly paired in 1:2 ratios. However, there remains high collinearity among these genomes, suggesting that functional genomics can be translated across species in the future.
Fig. 4.
Ribbon plot and Ks distributions. a) A phylogenetic tree with known relationships between 4 assemblies. To the right is a ribbon plot based on gene synteny created with GENESPACE (Lovell et al. 2022). b) A density plot showing the distribution of synonymous substitution rates (Ks) between genome-wide gene pairs. The shared WGD event is denoted by a green star. All comparisons are to P. communis ‘d’Anjou’ haplotype 1 except for the “M. domestica self” comparison. Abbreviations are as follows: Pyrus Hap1, P. communis ‘d’Anjou’ haplotype 1; Pyrus Hap2, P. communis ‘d’Anjou’ haplotype 2.
The distribution of synonymous substitution rates (Ks) across gene pairs indicates the divergence between them, as gene pairs accumulate synonymous substitutions over time (Yang and Nielsen 2000; Senchina et al. 2003). Orthologs between haplotypes 1 and 2 in our assembly have a Ks distribution centered near 0, as expected for allelic copies of genes that are still segregating within the species. Comparing haplotype 1 to itself identifies gene pairs that are retained from the most recent WGD event. This distribution is higher than that of gene pairs between Pyrus and Malus, suggesting this WGD event occurred before the divergence of these species. Additionally, comparing M. domestica with itself shows a distribution similar to that of the “Pyrus self” comparison, as expected for a shared WGD event or, at the very least, different WGD events occurring around the same time (Fig. 4b; star). Both self-comparison distributions are lower than that of the comparison with P. cerasus, as this WGD event postdates the divergence of the cherry and apple/pear lineages.
Gene expression
We quantified gene expression across 7 tissues (Table 3). We found expression evidence for ∼33,000–35,000 gene models per tissue. The most gene models were expressed in fruitlet stage 1, and the fewest were expressed in fruitlet stage 2, suggesting dynamic gene expression across fruit development. There was evidence of gene expression in at least a single tissue for 40,734 gene models, while 2,152 genes were expressed in only a single tissue (average of 307 genes per tissue). Our expression data were generated to assist genome annotation and are only single replicates. We therefore cannot perform differential expression analyses. We instead performed hierarchical clustering of gene expression (Fig. 5). We see stable clustering across haplotypes and find similar tissues cluster together. For example, our 2 fruit libraries clustered with each other. We generated an UpSet plot showing the 15 largest intersects of genes expressed >1 transcript per million (TPM; Fig. 5). The largest intersect was genes expressed >1 TPM in every tissue queried. The top 15 intersects, however, included each of the 7 tissue-specific categories. Open buds had the most tissue-specific genes (445), while budding leaves had the fewest (171).
Table 3.
Expression characteristics of P. communis ‘d’Anjou.’
| Tissue | Hap | Genes expressed | Median TPM | Tissue | Hap | Genes expressed | Median TPM |
|---|---|---|---|---|---|---|---|
| Budding leaves | 1 | 33,594 | 84.97 | Budding leaves | 2 | 33,470 | 88.00 |
| Expanding leaves | 1 | 34,469 | 119.7 | Expanding leaves | 2 | 34,380 | 122.0 |
| Flower buds | 1 | 34,138 | 71.34 | Flower buds | 2 | 34,082 | 73.3 |
| Fruitlet stage 1 | 1 | 34,923 | 193 | Fruitlet stage 1 | 2 | 34,797 | 200 |
| Fruitlet stage 2 | 1 | 33,227 | 96.4 | Fruitlet stage 2 | 2 | 33,107 | 100.0 |
| Open buds | 1 | 34,463 | 72.0 | Open buds | 2 | 34,372 | 74.02 |
| ¼” buds | 1 | 34,718 | 108.3 | ¼” buds | 2 | 34,513 | 111.00 |
Hap, haplotype; TPM, transcripts per million reads.
Fig. 5.
Gene expression characterization. Heatmaps and UpSet plot of gene expression. Cladograms represent the relationships between libraries through hierarchical clustering. A total of 1000 genes are displayed that show expression in each tissue and have the highest expression variance. a) Haplotype 1. b) Haplotype 2. c) UpSet plot of expression across tissues for haplotype 1. Genes were considered expressed if they had a TPM value above 1. Note the break in the y-axis.
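The expression summaries in this section rest on set operations over a gene-by-tissue TPM table, which can be sketched as follows. The nested-dictionary input layout, the function names, and the gene and tissue labels in any usage are hypothetical; only the TPM > 1 threshold comes from the text. Genes are grouped by the exact combination of tissues in which they are expressed, which corresponds to the exclusive intersections an UpSet plot displays.

```python
def expressed_sets(tpm, threshold=1.0):
    """Return {tissue: set of genes with TPM above threshold}.

    tpm: {gene: {tissue: TPM value}} (hypothetical input layout).
    """
    sets = {}
    for gene, by_tissue in tpm.items():
        for tissue, value in by_tissue.items():
            sets.setdefault(tissue, set())
            if value > threshold:
                sets[tissue].add(gene)
    return sets


def exclusive_intersections(tpm, threshold=1.0):
    """Group genes by the exact set of tissues in which they are expressed."""
    groups = {}
    for gene, by_tissue in tpm.items():
        combo = frozenset(t for t, v in by_tissue.items() if v > threshold)
        if combo:  # genes expressed nowhere are left out
            groups.setdefault(combo, set()).add(gene)
    return groups


def tissue_specific_counts(groups):
    """Count genes whose expression is confined to exactly one tissue."""
    return {
        next(iter(combo)): len(genes)
        for combo, genes in groups.items()
        if len(combo) == 1
    }
```

Sorting the groups returned by `exclusive_intersections` by size and keeping the top entries would give the kind of ranked intersections summarized in the UpSet plot.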
Conclusion
We generated a chromosome-scale phased genome assembly for cultivated European pear as part of the ACTG: American Campus Tree Genomes initiative, in which students assemble, annotate, and publish iconic tree genomes in semester courses. PacBio HiFi reads coupled with Dovetail Omni-C resulted in a high-quality assembly, displaying high k-mer completeness, quality scores, synteny with available assemblies, and recovery of universal single-copy orthologs. This assembly revealed thousands of structural variants between haplotypes, which are of great importance to future pear breeding efforts as structural variants disrupt recombination. Comparative analyses with other members of the Rosaceae family demonstrated deeply conserved synteny and recovered evidence for a 45 million-year-old WGD event. Gene expression across several tissue types was largely conserved, but thousands of genes were restricted to a single tissue. Further characterization of pear germplasm will accelerate breeding gains not only within pear but potentially across multiple Rosaceous crops. Lastly, we highlight the utility of generating such genomes as part of semester courses and the training opportunities that it provides.
Supplementary Material
Contributor Information
Alan Yocca, HudsonAlpha Institute for Biotechnology, Huntsville, AL 35806, USA.
Mary Akinyuwa, Department of Entomology and Plant Pathology, Auburn University, Auburn, AL 36849, USA.
Nick Bailey, Department of Biological Sciences, Auburn University, Auburn, AL 36849, USA.
Brannan Cliver, Department of Biological Sciences, Auburn University, Auburn, AL 36849, USA.
Harrison Estes, Department of Entomology and Plant Pathology, Auburn University, Auburn, AL 36849, USA.
Abigail Guillemette, Department of Biological Sciences, Auburn University, Auburn, AL 36849, USA.
Omar Hasannin, Department of Biological Sciences, Auburn University, Auburn, AL 36849, USA.
Jennifer Hutchison, Department of Biological Sciences, Auburn University, Auburn, AL 36849, USA.
Wren Jenkins, Department of Biological Sciences, Auburn University, Auburn, AL 36849, USA.
Ishveen Kaur, Department of Biological Sciences, Auburn University, Auburn, AL 36849, USA.
Risheek Rahul Khanna, Department of Biological Sciences, Auburn University, Auburn, AL 36849, USA.
Madelene Loftin, Department of Biological Sciences, Auburn University, Auburn, AL 36849, USA.
Lauren Lopes, Department of Biological Sciences, Auburn University, Auburn, AL 36849, USA.
Erika Moore-Pollard, Department of Biological Sciences, University of Memphis, Memphis, TN 38152-3530, USA.
Oluwakemisola Olofintila, Department of Biological Sciences, Auburn University, Auburn, AL 36849, USA.
Gideon Oluwaseye Oyebode, Department of Biological Sciences, Auburn University, Auburn, AL 36849, USA.
Jinesh Patel, Department of Biological Sciences, Auburn University, Auburn, AL 36849, USA.
Parbati Thapa, Department of Entomology and Plant Pathology, Auburn University, Auburn, AL 36849, USA.
Martin Waldinger, Department of Biological Sciences, Auburn University, Auburn, AL 36849, USA.
Jie Zhang, Department of Biological Sciences, Auburn University, Auburn, AL 36849, USA.
Qiong Zhang, Department of Biological Sciences, Auburn University, Auburn, AL 36849, USA.
Leslie Goertzen, Department of Biological Sciences, Auburn University, Auburn, AL 36849, USA.
Sarah B Carey, HudsonAlpha Institute for Biotechnology, Huntsville, AL 35806, USA.
Heidi Hargarten, Physiology and Pathology of Tree Fruits Research Laboratory, USDA ARS, Wenatchee, WA 98801, USA.
James Mattheis, Physiology and Pathology of Tree Fruits Research Laboratory, USDA ARS, Wenatchee, WA 98801, USA.
Huiting Zhang, Physiology and Pathology of Tree Fruits Research Laboratory, USDA ARS, Wenatchee, WA 98801, USA; Department of Horticulture, Washington State University, Pullman, WA 99164-6414, USA.
Teresa Jones, HudsonAlpha Institute for Biotechnology, Huntsville, AL 35806, USA; HudsonAlpha Genome Sequencing Center, HudsonAlpha Institute for Biotechnology, Huntsville, AL 35806, USA.
LoriBeth Boston, HudsonAlpha Institute for Biotechnology, Huntsville, AL 35806, USA; HudsonAlpha Genome Sequencing Center, HudsonAlpha Institute for Biotechnology, Huntsville, AL 35806, USA.
Jane Grimwood, HudsonAlpha Institute for Biotechnology, Huntsville, AL 35806, USA; HudsonAlpha Genome Sequencing Center, HudsonAlpha Institute for Biotechnology, Huntsville, AL 35806, USA.
Stephen Ficklin, Department of Horticulture, Washington State University, Pullman, WA 99164-6414, USA.
Loren Honaas, Physiology and Pathology of Tree Fruits Research Laboratory, USDA ARS, Wenatchee, WA 98801, USA.
Alex Harkess, HudsonAlpha Institute for Biotechnology, Huntsville, AL 35806, USA.
Data availability
Data used to generate this assembly are deposited in the NCBI SRA under BioProject PRJNA992953. Gene expression data are available separately under BioProject PRJNA791346. Custom scripts used throughout are available on GitHub (https://github.com/Aeyocca/dAnjou_genome_MS). Genome assembly and annotation files are available on GDR (https://www.rosaceae.org/Analysis/17650423) and on the NCBI SRA under accession numbers PRJNA1047602 and PRJNA1047603.
Supplemental material available at G3 online.
Funding
This work was supported through funding from National Science Foundation IOS-PGRP CAREER award #2239530 to A.H.; NSF IOS-EDGE award #2128196 to A.H.; Washington Tree Fruit Research Commission award AP-19-103 to L.H., S.F., and A.H.; the USDA ARS; and the Auburn University Department of Crop, Soil, and Environmental Sciences who supported student compute costs for the CSES7120 Plant Genomics course.
Literature cited
- Andrews S, Krueger F, Segonds-Pichon A, Biggins L, Krueger C, Wingett S. 2010. FastQC: a quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc.
- Bell RL, Itai A. 2011. Pyrus. In: Kole C, editor. Wild Crop Relatives: Genomic and Breeding Resources. Berlin (Germany): Springer-Verlag. p. 147–177.
- Bray NL, Pimentel H, Melsted P, Pachter L. 2016. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol. 34(5):525–527. doi: 10.1038/nbt.3519.
- Chagné D, Crowhurst RN, Pindo M, Thrimawithana A, Deng C, Ireland H, Fiers M, Dzierzon H, Cestaro A, Fontana P, et al. 2014. The draft genome sequence of European pear (Pyrus communis L. ‘Bartlett’). PLoS One. 9(4):e92644. doi: 10.1371/journal.pone.0092644.
- Cheng H, Concepcion GT, Feng X, Zhang H, Li H. 2021. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods. 18(2):170–175. doi: 10.1038/s41592-020-01056-5.
- Cheng C-Y, Krishnakumar V, Chan AP, Thibaud-Nissen F, Schobel S, Town CD. 2017. Araport11: a complete reannotation of the Arabidopsis thaliana reference genome. Plant J. 89(4):789–804. doi: 10.1111/tpj.13415.
- Dierckxsens N, Mardulyn P, Smits G. 2016. NOVOPlasty: de novo assembly of organelle genomes from whole genome data. Nucleic Acids Res. 45(4):e18. doi: 10.1093/nar/gkw955.
- Doyle JJ, Doyle JL. 1987. A rapid DNA isolation procedure for small quantities of fresh leaf tissue. RESEARCH. https://worldveg.tind.io/record/33886/.
- Durand NC, Robinson JT, Shamim MS, Machol I, Mesirov JP, Lander ES, Aiden EL. 2016. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell Syst. 3(1):99–101. doi: 10.1016/j.cels.2015.07.012.
- Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5):1792–1797. doi: 10.1093/nar/gkh340.
- Ellinghaus D, Kurtz S, Willhoeft U. 2008. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics. 9(1):18. doi: 10.1186/1471-2105-9-18.
- Ewels P, Magnusson M, Lundin S, Käller M. 2016. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 32(19):3047–3048. doi: 10.1093/bioinformatics/btw354.
- Gel B, Serra E. 2017. karyoploteR: an R/Bioconductor package to plot customizable genomes displaying arbitrary data. Bioinformatics. 33(19):3088–3090. doi: 10.1093/bioinformatics/btx346.
- Goeckeritz CZ, Rhoades KE, Childs KL, Iezzoni AF, VanBuren R, Hollender CA. 2023. Genome of tetraploid sour cherry (Prunus cerasus L.) ‘Montmorency’ identifies three distinct ancestral Prunus genomes. Hortic Res. 10(7):7. doi: 10.1093/hr/uhad097.
- Greiner S, Lehwark P, Bock R. 2019. OrganellarGenomeDRAW (OGDRAW) version 1.3.1: expanded toolkit for the graphical visualization of organellar genomes. Nucleic Acids Res. 47(W1):W59–W64. doi: 10.1093/nar/gkz238.
- Hodel RGJ, Zimmer EA, Liu B-B, Wen J. 2021. Synthesis of nuclear and chloroplast data combined with network analyses supports the polyploid origin of the apple tribe and the hybrid origin of the Maleae-Gillenieae clade. Front Plant Sci. 12:820997. doi: 10.3389/fpls.2021.820997.
- Holt C, Yandell M. 2011. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics. 12(1):491. doi: 10.1186/1471-2105-12-491.
- Huang N, Li H. 2023. miniBUSCO: a faster and more accurate reimplementation of BUSCO. bioRxiv 543588. 10.1101/2023.06.03.543588, preprint: not peer reviewed. [DOI] [PMC free article] [PubMed]
- Khan A, Carey SB, Serrano A, Zhang H, Hargarten H, Hale H, Harkess A, Honaas L. 2022. A phased, chromosome-scale genome of ‘Honeycrisp’ apple (Malus domestica). GigaByte. 2022:gigabyte69. doi: 10.46471/gigabyte.69. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li J, Zhang M, Li X, Khan A, Kumar S, Allan AC, Lin-Wang K, Espley RV, Wang C, Wang R, et al. 2022. Pear genetics: recent advances, new prospects, and a roadmap for the future. Hortic Res. 9:uhab040. doi: 10.1093/hr/uhab040. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Linsmith G, Rombauts S, Montanari S, Deng CH, Celton J-M, Guérif P, Liu C, Lohaus R, Zurn JD, Cestaro A, et al. 2019. Pseudo-chromosome–length genome assembly of a double haploid ‘Bartlett’ pear (Pyrus communis L.). GigaScience. 8(12):giz138. doi: 10.1093/gigascience/giz138. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lovell JT, Sreedasyam A, Eric Schranz M, Wilson M, Carlson JW, Harkess A, Emms D, Goodstein DM, Schmutz J. 2022. GENESPACE tracks regions of interest and gene copy number variation across multiple genomes. eLife. 11:e78526. doi: 10.7554/eLife.78526. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lu Z, Marand AP, Ricci WA, Ethridge CL, Zhang X, Schmitz RJ. 2019. The prevalence, evolution and chromatin signatures of plant regulatory elements. Nat Plants. 5(12):1250–1259. doi: 10.1038/s41477-019-0548-z. [DOI] [PubMed] [Google Scholar]
- Marçais G, Delcher AL, Phillippy AM, Coston R, Salzberg SL, Zimin A. 2018. MUMmer4: a fast and versatile genome alignment system. PLoS Comput Biol. 14(1):e1005944. doi: 10.1371/journal.pcbi.1005944. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Marçais G, Kingsford C. 2011. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 27(6):764–770. doi: 10.1093/bioinformatics/btr011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nattestad M, Schatz MC. 2016. Assemblytics: a web analytics tool for the detection of variants from an assembly. Bioinformatics. 32(19):3021–3023. doi: 10.1093/bioinformatics/btw369. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Niu Y, Zhou W, Chen X, Fan G, Zhang S, Liao K. 2020. Genome size and chromosome ploidy identification in pear germplasm represented by Asian pears—local pear varieties. Sci Hortic. 265:109202. doi: 10.1016/j.scienta.2020.109202. [DOI] [Google Scholar]
- Ou S, Jiang N. 2018. LTR_retriever: a highly accurate and sensitive program for identification of long terminal repeat retrotransposons. Plant Physiol. 176(2):1410–1422. doi: 10.1104/pp.17.01310. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ou S, Jiang N. 2019. LTR_FINDER_parallel: parallelization of LTR_FINDER enabling rapid identification of long terminal repeat retrotransposons. Mob DNA. 10(1):48. doi: 10.1186/s13100-019-0193-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ou S, Su W, Liao Y, Chougule K, Agda JRA, Hellinga AJ, Lugo CSB, Elliott TA, Ware D, Peterson T, et al. 2019. Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline. Genome Biol. 20(1):275. doi: 10.1186/s13059-019-1905-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Quadrana L. 2020. The contribution of transposable elements to transcriptional novelty in plants: the affair. Transcription. 11(3–4):192–198. doi: 10.1080/21541264.2020.1803031. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Quinlan AR, Hall IM. 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 26(6):841–842. doi: 10.1093/bioinformatics/btq033. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ranallo-Benavidez TR, Jaron KS, Schatz MC. 2020. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nat Commun. 11(1):1432. doi: 10.1038/s41467-020-14998-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- R Core Team . 2022. R: A language and environment for statistical computing. Vienna (Austria): R Foundation for Statistical Computing. https://www.R-project.org/. [Google Scholar]
- Schultz D, Ebbert M, De Coster W. 2019. Pauvre. May 16, 2019. https://github.com/conchoecia/pauvre
- Senchina DS, Alvarez I, Cronn RC, Liu B, Rong J, Noyes RD, Paterson AH, Wing RA, Wilkins TA, Wendel JF. 2003. Rate variation among nuclear genes and the age of polyploidy in Gossypium. Mol Biol Evol. 20(4):633–643. doi: 10.1093/molbev/msg065. [DOI] [PubMed] [Google Scholar]
- Shi J, Liang C. 2019. Generic Repeat Finder: a high-sensitivity tool for genome-wide de novo repeat detection. Plant Physiol. 180(4):1803–1815. doi: 10.1104/pp.19.00386. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Su W, Gu X, Peterson T. 2019. TIR-learner, a new ensemble method for TIR transposable element annotation, provides evidence for abundant new transposable elements in the maize genome. Mol Plant. 12(3):447–460. doi: 10.1016/j.molp.2019.02.008. [DOI] [PubMed] [Google Scholar]
- Suyama M, Torrents D, Bork P. 2006. PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res. 34(Web Server issue):W609–W612. doi: 10.1093/nar/gkl315. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tang H, Bowers JE, Wang X, Ming R, Alam M, Paterson AH. 2008. Synteny and collinearity in plant genomes. Science. 320(5875):486–488. doi: 10.1126/science.1153917. [DOI] [PubMed] [Google Scholar]
- Tillich M, Lehwark P, Pellizzer T, Ulbricht-Jones ES, Fischer A, Bock R, Greiner S. 2017. GeSeq—versatile and accurate annotation of organelle genomes. Nucleic Acids Res. 45(W1):W6–11. doi: 10.1093/nar/gkx391. [DOI] [PMC free article] [PubMed] [Google Scholar]
- United States Department of Agriculture National Agricultural Statistics Service . 2023. Noncitrus fruits and nuts 2022 summary, May. https://downloads.usda.library.cornell.edu/usda-esmis/files/zs25×846c/zk51wx21m/k356bk214/ncit0523.pdf.
- Wu J, Wang Z, Shi Z, Zhang S, Ming R, Zhu S, Khan MA, Tao S, Korban SS, Wang H, et al. 2013. The genome of the pear (Pyrus bretschneideri Rehd.). Genome Res. 23(2):396–408. doi: 10.1101/gr.144311.112. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wu J, Wang Y, Xu J, Korban SS, Fei Z, Tao S, Ming R, Tai S, Khan AM, Postman JD, et al. 2018. Diversification and independent domestication of Asian and European pears. Genome Biol. 19(1):77. doi: 10.1186/s13059-018-1452-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Xiang Y, Huang C-H, Hu Y, Wen J, Li S, Yi T, Chen H, Xiang J, Ma H. 2017. Evolution of Rosaceae fruit types based on nuclear phylogeny in the context of geological times and genome duplication. Mol Biol Evol. 34(2):262–281. doi: 10.1093/molbev/msw242. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Xiong W, He L, Lai J, Dooner HK, Du C. 2014. HelitronScanner uncovers a large overlooked cache of Helitron transposons in many plant genomes. Proc Natl Acad Sci U S A. 111(28):10263–10268. doi: 10.1073/pnas.1410068111. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Xu Z, Wang H. 2007. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 35(Web Server issue):W265–W268. doi: 10.1093/nar/gkm286. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yang Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci. 13(5):555–556. doi: 10.1093/bioinformatics/13.5.555. [DOI] [PubMed] [Google Scholar]
- Yang Z, Nielsen R. 2000. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol Biol Evol. 17(1):32–43. doi: 10.1093/oxfordjournals.molbev.a026236. [DOI] [PubMed] [Google Scholar]
- Zhang H, Wafula EK, Eilers J, Harkess AE, Ralph PE, Timilsena PR, dePamphilis CW, Waite JM, Honaas LA. 2022. Building a foundation for gene family analysis in Rosaceae genomes with a novel workflow: a case study in Pyrus architecture genes. Front Plant Sci. 13:975942. doi: 10.3389/fpls.2022.975942. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zheng X, Cai D, Potter D, Postman J, Liu J, Teng Y. 2014. Phylogeny and evolutionary histories of Pyrus L. Revealed by phylogenetic trees and networks based on data from multiple DNA sequences. Mol Phylogenet Evol. 80:54–65. doi: 10.1016/j.ympev.2014.07.009. [DOI] [PubMed] [Google Scholar]
- Zhou C, McCarthy SA, Durbin R. 2023. YaHS: yet another Hi-C scaffolding tool. Bioinformatics. 39(1):btac808. doi: 10.1093/bioinformatics/btac808. [DOI] [PMC free article] [PubMed] [Google Scholar]
Data Availability Statement
Data used to generate this assembly are deposited in the NCBI SRA under BioProject PRJNA992953. Gene expression data are available separately under BioProject PRJNA791346. Custom scripts used throughout are available on GitHub (https://github.com/Aeyocca/dAnjou_genome_MS). Genome assembly and annotation files are available on GDR (https://www.rosaceae.org/Analysis/17650423) and on the NCBI SRA under accession numbers PRJNA1047602 and PRJNA1047603.
Supplemental material available at G3 online.




