Large structural variations in the haplotype‐resolved African cassava genome

Ben N Mansfeld; Adam Boyher; Jeffrey C Berry; Mark Wilson; Shujun Ou; Seth Polydore; Todd P Michael; Noah Fahlgren; Rebecca S Bart

doi:10.1111/tpj.15543

. 2021 Nov 10;108(6):1830–1848. doi: 10.1111/tpj.15543

Large structural variations in the haplotype‐resolved African cassava genome

Ben N Mansfeld ¹, Adam Boyher ¹, Jeffrey C Berry ¹, Mark Wilson ¹, Shujun Ou ², Seth Polydore ¹, Todd P Michael ³, Noah Fahlgren ¹, Rebecca S Bart ^1,^✉

PMCID: PMC9299708 PMID: 34661327

Summary

Cassava (Manihot esculenta Crantz, 2n = 36) is a global food security crop. It has a highly heterozygous genome, high genetic load, and genotype‐dependent asynchronous flowering. It is typically propagated by stem cuttings and any genetic variation between haplotypes, including large structural variations, is preserved by such clonal propagation. Traditional genome assembly approaches generate a collapsed haplotype representation of the genome. In highly heterozygous plants, this results in artifacts and an oversimplification of heterozygous regions. We used a combination of Pacific Biosciences (PacBio), Illumina, and Hi‐C to resolve each haplotype of the genome of a farmer‐preferred cassava line, TME7 (Oko‐iyawo). PacBio reads were assembled using the FALCON suite. Phase switch errors were corrected using FALCON‐Phase and Hi‐C read data. The ultralong‐range information from Hi‐C sequencing was also used for scaffolding. Comparison of the two phases revealed >5000 large haplotype‐specific structural variants affecting over 8 Mb, including insertions and deletions spanning thousands of base pairs. The potential of these variants to affect allele‐specific expression was further explored. RNA‐sequencing data from 11 different tissue types were mapped against the scaffolded haploid assembly and gene expression data are incorporated into our existing easy‐to‐use web‐based interface to facilitate use by the broader plant science community. These two assemblies provide an excellent means to study the effects of heterozygosity, haplotype‐specific structural variation, gene hemizygosity, and allele‐specific gene expression contributing to important agricultural traits and further our understanding of the genetics and domestication of cassava.

Keywords: cassava, genome assembly, high heterozygosity, haplotype phasing, structural variants

Significance Statement

The cassava varieties grown by subsistence farmers in Africa largely differ from the partially inbred reference genome due to their highly heterozygous nature. We used multiple sequencing technologies to assemble and resolve both haplotypes in TME7, a farmer‐preferred cassava line, enabling us to study the considerable haplotypic structural variation in this line.

INTRODUCTION

Cassava (Manihot esculenta Crantz 2n = 2x = 36) is a globally important crop and is particularly critical for subsistence farmers in the developing world (Ceballos et al., 2004). As an outcrossing plant, cassava is considerably heterozygous with a high genetic load and thus suffers from inbreeding depression (Rojas et al., 2009). This has hindered genetic improvement via breeding in cassava, and many agriculturally favorable lines are commonly clonally propagated, which maintains any heterozygosity in the germplasm (Aye, 2012; Ramu et al., 2017). Moreover, the heterozygous nature of the cassava genome and limitations in sequencing technologies have hindered the ability to sequence and assemble the genome accurately (Chin et al., 2016). Because of this, a partially inbred cassava accession, AM560‐2, was selected as the cassava reference genome (Prochnik et al., 2012). AM560‐2 is the product of three generations of selfing of the Colombian cassava line MCol1505, and is 94% homozygous (Bredeson et al., 2016). The reference genome has been an asset to the cassava community for more than 10 years, but due to the homozygous nature of the genome it does not accurately represent lines grown in farmer’s fields.

The development of long‐read and long‐range sequencing technologies and recent advancement in assembly algorithms have strong implications for genome assembly of heterozygous plant and animal species. Such haplotype‐resolved genome assemblies can be crucial to our comprehension of genetics in crops with strong inbreeding depression where generation of inbred lines is very difficult and not representative of the agriculturally grown plants. However, even with these advances, assembling fully haplotype‐phased genomes is difficult, particularly when rates of heterozygosity are high (Michael and VanBuren, 2020). New genome assembly strategies now exist for separate assembly of homologous and homeologous chromosome copies, allowing for accurate phasing of haplotypes and polyploid genomes (Chin et al., 2016; Koren et al., 2018; Kronenberg et al., 2018). One such strategy uses sequence data from parental lines to discern the haplotype specificity of offspring sequence reads before their assembly (Koren et al., 2018). However, this strategy requires access to the parental genotypes, which are unknown in many clonally propagated farmer‐preferred cassava lines. Another novel approach utilizes single cell sequencing of gamete cells to gain insight into phasing information and haplotype assembly (Campoy et al., 2020). This “Gamete binning” approach was showcased in the heterozygous tree crop apricot (Prunus armeniaca), and while potentially a viable option for field grown cassava lines, it requires extraction of pollen nuclei and other technical skills that are potentially limiting factors to its immediate adoption (Campoy et al., 2020). An alternate computational approach, implemented in the FALCON‐Phase algorithm, uses mapping information from long‐range chromatin conformation capture (Hi‐C) sequencing to phase haplotype assembled sequences correctly (Kronenberg et al., 2018). This de novo approach can be used to correct assembly phase switch errors, and accurately represent the chromosome from telomere to telomere (Kronenberg et al., 2018).

Recent attempts at assembling heterozygous farmer‐preferred cassava lines have produced contiguous large assemblies (Kuon et al., 2019). These assemblies, however, are limited due to the lack of haplotypic separation; the primary assemblies include both haplotypes and thus contain many duplicated sequences (Kuon et al., 2019; Lyons et al., 2021). This has implications on the assembly size and scaffolding, which can be severely impacted by these duplications (Guan et al., 2020). Sequence duplication can also cause problems for downstream analyses such as read mapping and gene annotation. Assessing the deduplication, completeness, and quality of heterozygous genomes thus plays a critical role in each assembly step, to ensure truly resolved haplotypic sequences (Rhie et al., 2020).

Here, we assemble a phased diploid assembly of the Nigerian cassava landrace (Tropical‐ Manihot‐ esculenta) TME7, also known as "Oko‐iyawo," a farmer‐preferred line resistant to the cassava mosaic disease virus (Rabbi et al., 2014). By assembling and phasing the moderately sized (approximately 700 Mb) diploid cassava genome we have a unique opportunity to study haplotype‐specific structural polymorphisms maintained for generations by clonal propagation. Elucidation of haplotype‐specific structural variations (SVs) in cassava will have direct implications for our understanding of these types of variations in other clonally propagated, heterozygous crops with larger genomes, including many tree fruit crops and other horticulturally important species. The two haplotype assemblies will also provide an excellent means to study the haplotype‐specific SVs, synteny, and allele‐specific gene expression that contribute to important agricultural traits, furthering our understanding of the genetics and domestication of cassava. As breeding is difficult in a crop such as cassava, a better understanding of the haplotype‐specific genetics will allow for more accurate, appropriate, and targeted gene editing to improve lines for agricultural purposes.

RESULTS AND DISCUSSION

Genome size and heterozygosity

Owing to the significant differences between TME7 (a clonally propagated, heterozygous, farmer‐preferred line grown in Africa) and AM560‐2 (an inbred South American line) we opted to re‐estimate the genome size of TME7 before assembly. Both flow cytometry and a k‐mer based approach (GenomeScope; Vurture et al., 2017), estimated the genome size to be within the range of 670–711 Mb (Figure 1). We settled on approximately 700 Mb as a target haploid size for this assembly. This estimate is moderately lower than that estimated for the reference genome line AM560‐2 (approximately 750 Mb, Bredeson et al., 2016 ). Based on the k‐mer analysis, the repeat content was estimated at roughly 61% of the estimated genome size and the two very distinct k‐mer frequency peaks suggested a high level of heterozygosity (Figure 1b; Figure S1). The GenomeScope model further estimated the heterozygosity of this cassava line to be approximately 1.4%, or roughly one polymorphism every approximately 70 bp (Figure 1b). This is slightly lower than other outcrossing clonally propagated crops such as pear (1.6%; Vurture et al., 2017), grape (1.6–1.7%; Guan et al., 2020; Patel et al., 2018), as well as the closely related rubber tree (1.6%; Liu et al., 2020). Nonetheless, this level of estimated heterozygosity suggested that haplotype‐resolved assembly approaches would be appropriate for assembly of the cassava genome.

Estimates of TME7 genome parameters using flow cytometry and short reads.

(a) Three biological samples of TME7, each with four technical replicates, were analyzed using flow cytometry. A mean genome size was estimated at 690 Mb.

(b) Estimation of genome size, heterozygosity, and repetitiveness using GenomeScope Profile. K‐mer size was set to 21, and k‐mer coverage cutoff was set at 1e6 to include repeat regions in genome size estimates. The haploid genome size was estimated to be 704 Mb consisting of 61% repetitive sequence and a heterozygosity of 1.41%. dup, mean read duplication rate; err, percentage sequencing error rate; het, percentage estimated heterozygosity; kcov, average k‐mer coverage for heterozygous bases; len, haploid length; uniq, percentage non‐repetitive sequence.

Maximizing the diploid assembly

With that goal in mind, we sequenced the TME7 cassava genome using long‐read PacBio single‐molecule real‐time sequencing cells yielding roughly 90× coverage. We generated 64.2 Gb of data in 8 018 064 raw PacBio subreads (Figure S2) that had an N50 of 11 099 bp; 4 970 318 of the reads were longer than 5000 bp, which was used as a seed read size. After multiple assembly attempts with different software (i.e., Canu) and other parameters (Table S1), we generated a PacBio‐only assembly with FALCON and FALCON‐Unzip (Chin et al., 2016), which yielded the best compromise between longest and most contiguous results. FALCON‐Unzip assembled a total of 874 Mb in primary contigs, as well as an additional 157 Mb in haplotigs. FALCON‐Unzip is limited in its ability to identify sequences with greater than 4–5% variation as haplotypic sequences, and these are often retained as primary contigs (Chin et al., 2016, also e.g., Padgitt‐Cobb et al. (2021)). The total sequence assembled was approximately 1 Gb, and while not yet well partitioned into haplotypes, included about 300 Mb in potentially haplotypic sequences. This represented the potential for an approximately 50% “unzipped” genome assembly. Assembly statistics for each stage of assembly and phasing are reported in Table 1.

Table 1.

Assembly contiguity, completeness, and quality assessment

	Falcon		Falcon‐Unzip		Pilon		Add SRC	purge_dups		FALCON‐Phase Unzip		FALCON‐Phase Pseudohaplotype		Scaffolded
	Primary	Alternate	Primary	Alternate	Primary	Alternate	Diploid	Primary	Alternate	Primary	Alternate	Primary	Alternate	Primary
No. contigs	6910	917	5114	5254	5114	5254	75 291	9925	17 415	9925	12 805	9925	9925	4936
Total length (Mb)	901	50.3	874	157	875	157	115	677	341	702	313	720	720	720
N50 (kb)	253.3	76.1	263.3.	51.6	263.4	51.6	192.4	283.1	83.0	305.2	80.3	318.9	322.7	31.2 Mb
Completeness	83.3	10.2	86.1	28.6	87.5	29.0	‐	81.2	56.5	82.0	54.2	82.8	82.8	82.8
Both	84.6		90.05		91.5		96.2	93.9		93.65		93.66
QV	27.5	28.5	29.7	29.5	33.3	32.7	‐	34.3	33.5	34.1	34.4	34.0	34.0	34.0
Both	27.56		29.68		33.18		33.66	34.03		34.2		34.0

Open in a new tab

QV, phred scaled quality score; SRC, short‐read contigs.

To estimate the success of haplotypic separation and assembly quality we performed k‐mer based analyses using Merqury (Rhie et al., 2020). Using raw, highly accurate short read sequencing representing data from both haploid sequences, k‐mers, which exist in one or two copies arise from heterozygous and homozygous regions, respectively. The k‐mer distributions are then represented by the number of times each k‐mer appears in the assembly allowing for the comparison of observed and expected coverage, estimation of reference‐free completeness, and overall phasing success.

We first observed that even after polishing INDELs with pilon (Walker et al., 2014), a peak of heterozygous (1‐copy) k‐mers are missing from either the primary or alternate assemblies (Figure S3). As our goal was to assemble a full heterozygous diploid phased assembly, we sought to maximize the amount of haplotypic sequence assembled. To this end, we supplemented the long‐read assembly with short‐read contigs (SRC) containing additional heterozygous sequence. We identified k‐mers that contained the short reads pertaining to the missing heterozygous sequence and assembled them using SPAdes (Bankevich et al., 2012). Some of these extra SRC contained duplicates of already assembled sequences, but importantly many included the missing heterozygous sequence. Adding these SRCs to the full assembly brought the total assembled sequence to 1.15 Gb, or nearly the anticipated diploid size of approximately 1.4 Gb. The number of missing k‐mers was brought down from 23.7 to 9.9 m using this approach and the “Completeness” score was brought up to 96.2% when including the SRCs (Table 1). Based on the missing k‐mers, after adding the SRC an estimated 9.8 Mb of missing heterozygous sequence remained un‐assembled.

Haplotypic purging and deduplication

To complement the graph‐based assembly approach used in FALCON‐Unzip (Chin et al., 2016), other orthogonal tools have since been developed to extract haplotypic sequences from primary assemblies (e.g., Guan et al., 2020; Huang et al., 2017; Roach et al., 2018). These typically use read mapping coverage and sequence homology to identify potential haplotigs and “purge” them from the primary assembly (Roach et al., 2018). After maximizing our diploid assembly size to include as much haplotypic sequence as possible, our goal was to purge the primary assembly of haplotypic contigs, overlaps and sequence duplication, including those from our SRCs (Figure 2; Figure S3). To this end we used purge_dups (Guan et al., 2020), which improves on the previous state‐of‐the‐art, purge_haplotigs (Roach et al., 2018), by identifying and purging haplotypic overlaps. The final set of primary contigs included approximately 677 Mb assembled in 9925 contigs with an N50 of 283.1 kb. The resulting alternate assembly contained over 341 Mb assembled in haplotigs, representing an approximately 50% “unzipped” genome.

K‐mer copy number and assembly analyses for the final phased TME7 assemblies.

(a) K‐mer count spectra for the alternate (haplotigs) and primary assemblies after phasing.

(b) Diploid (primary + haplotigs) k‐mer count spectra. In both (a) and (b), short read k‐mer distribution plots are colored by the number of times a k‐mer is present in the assembly. K‐mers denoted in gray are missing from the assembly and represent probable short‐read sequencing errors (k‐mer multiplicity <50) or missing assembled sequence (≥50).

(c) Assembly spectra of the diploid assembly suggest that most homozygous k‐mers (approximately 200× peak) are shared between the assemblies, while most of the heterozygous (approximately 100× peak) k‐mers are phase specific.

K‐mer spectra plots showed that the amount of sequence duplication was drastically reduced after purging, and that most of the heterozygous (1‐copy) k‐mers were now successfully separated into the two assemblies (Figure 2; Figure S3). This was further confirmed by alignment of markers from the cassava 20k linkage map (ICGMC, 2015) (Figure 3b) and deduplication of BUSCO genes (Figure 4). After purging, the haplotig N50 size, which corresponds to the haplotype phase block, was 83 kb (90.5 kb if excluding SRC‐derived haplotigs). This is substantively smaller than the 7‐Mb block described in the Arabidopsis F₁ assembly by Chin et al. (2016), but is more similar to that observed in the Carménère grape (89.5 kb, Minio et al., 2019). Furthermore, it is consistent with relatively short dispersed regions of heterozygosity, and with the high rate of linkage disequilibrium decay described in cassava, an obligate out crosser (Ramu et al., 2017).

Validation of Hi‐C scaffolding order and orientation by contact map and linkage map alignment.

(a) Post‐scaffolding Hi‐C contact heatmap of the 18 largest scaffolds in the Phase0 assembly of TME7 showing the density of Hi‐C interactions between regions of the genome. Color represents the intensity of interactions between regions, reported in log(1 + x).

(b) Strong collinearity between the 22K marker Cassava Linkage Map and the TME7 Phase0 assembly. Markers are colored by their originating linkage group in the map.

Summary of BUSCO analyses and phase‐specific BUSCOs.

The BUSCO scores for each step are reported for the (a) alternate or primary assemblies or (b) full (concatenated) assemblies. (b) After polishing with Pilon, short‐read contigs (SRC) were assembled and added to the full assembly, before haplotypic purging.

(c) Overlap in BUSCO categories of the final FALCON‐Phase Unzip‐emit primary (Phase0) and alternate (Phase1) assemblies shows that most BUSCOs are phased and exist in both assemblies.

Haplotype phasing and scaffolding with Hi‐C sequencing

To get a more accurate representation of the TME7 pseudo‐haplotypes, we phased the primary and haplotig assemblies using Hi‐C data and FALCON‐Phase (Kronenberg et al., 2018). We noticed, however, that during the placement and mincing stages of the algorithm, FALCON‐Phase was discarding over 40 Mb of sequence from both primary and haplotig assemblies. We compared the haplotig truncation lengths with the contig vs. haplotig alignment lengths and identified that the FALCON‐Phase coords2hp.py script truncated both contigs and haplotigs at the ends of alignments. We hypothesized that if large SVs exist between haplotypes, this could affect how FALCON‐Phase aligns and places haplotigs vs. their primary contigs. Owing to these large SVs between the haplotypes, haplotig sequences were truncated to exclude the non‐aligning sequences. Merqury analysis showed that removal of these sequences reduced the number of heterozygous k‐mers in the assembly (Figure S4). We thus modified the coords2hp.py script in FALCON‐Phase to force it to include the entire length of each haplotig, rather than only the length of the sequences that aligned.

The result was one complete set of 9925 contigs comprising approximately 720 Mb for each phase, which included almost all the original heterozygosity assembled. This suggests we were able to assemble almost the entirety of the TME7 genome successfully (720 Mb haploid assembly vs. approximately 700 Mb estimated genome size) with a contig N50 of approximately 320 kb for both assemblies. When emitted in the “unzip” format, the total primary and haplotig contig length was 702 Mb (N50 = 305 kb) and 311 Mb (N50 = 80 kb), respectively. We assessed the success of the phasing step using Merqury (Figure 2). A modest increase in homozygous sequence duplication was observed after phasing, probably due to incorporation of homozygous contig boundaries into the primary assembly (Figure 2a). This minor sequence duplication in the “pseudohaplotype” assembly was also observed with the unmodified version of the coords2hp.py script suggesting it may be an inherent issue with the FALCON‐Phase algorithm (Figure S4). While this additional minor duplication is a limitation with this phase correction approach, the benefits of accurate phasing outweigh this issue.

After phasing, the Hi‐C data were further used to scaffold the assembly into 18 chromosome length scaffolds. Contigs designated as part of Phase0 were scaffolded using the Proximo algorithm (Phase Genomics) and manual scaffolding curation with Juicebox (Durand et al., 2016; Rao et al., 2014). This process resulted in placing approximately 80% of all sequences in a set of 18 chromosome‐scale scaffolds containing 580 Mb of sequence (Figure 3a). We validated the scaffolding order and orientation by aligning 22 403 single nucleotide polymorphism (SNP) markers from the cassava composite map (ICGMC, 2015) to both phases. After filtering for >95% identity and >150 bp length, >19 000 markers aligned uniquely to both phases. We plotted the concordance between the new de novo assembly and the linkage map and observed high collinearity between the two (Figure 3b). Except for a few cases, there was high agreement between the physical and linkage maps (average Spearman’s correlation of 0.96). Approximately 1900 marker sequence tags had duplicate mapping sites on the same scaffold in both phases and were distributed along the chromosome scaffolds (Figure S5). While this may suggest potential sequence duplication or retained heterozygosity, an alternate explanation for some of these duplications are genotype‐specific duplications in TME7 that differ from the inbred reference genome. This represents a significant improvement over the previous attempts at assembly of heterozygous African cassava lines (Kuon et al., 2019), where close to 30% of markers had multiple map hits, indicating a not well deduplicated assembly.

Assessing the quality of the final assembly

When compared with the raw diploid short read data, the final assemblies showed approximately 94% completeness and a phred scaled quality score (QV) of >33 (or >99.9995% accurate) (Table 1). More short‐read polishing could be performed to increase accuracy; however, this might come at a cost of falsely correcting heterozygosity. While some heterozygous sequences are still missing from the assembly, the majority of 1‐copy k‐mers are uniquely assigned to one of the haploid assemblies and not shared between them (Figure 2c). These results show that we have accurately produced one full haplotype assembly of TME7 and a second alternate assembly that contains most of the haplotypic variation in this genotype.

We used BUSCO (Simão et al., 2015) analysis to verify that we successfully resolved the TME7 haplotypes (Figure 4). The primary (Phase0‐scaffolded) assembly had a complete BUSCO score of 96.9%, marginally outperforming the AM560‐2 v6.1 assembly (complete: 95.1%; duplicated: 5.1%) (Bredeson et al., 2016). The majority of complete single BUSCOs (969) are assembled in both phases, yet another 374 are missing from the alternate assembly (Figure 4c). This could be because these BUSCOs are homozygous and thus assembled in the collapsed regions of assembly, and/or due to the missing heterozygosity. Importantly, our deduplicated, TME7 Phase0 assembly only contains 7.9% duplicated BUSCOs, which is comparable with that of AM560‐2 and represents a significant improvement compared with approximately 15% and approximately 19% of the non‐haplotype‐purged assemblies described in Kuon et al. (2019). Interestingly, we identified haplotype‐specific complete BUSCOs (Figure 4c), and together the full diploid assembly (Phase0 scaffolds + Phase1 pseudohaplotype contigs) has a complete BUSCO score of greater than that of each phase separately (complete: 98.2%; 80.7% duplicated). This indicates that some BUSCOs may exist in a hemizygous state in the TME7 genome, and that complementation between the phases preserves the existence of these potentially crucial single copy genes.

Transposable elements and gene annotation

Transposable element and repeat annotation

Assembling the repetitive portion of the cassava genome is challenging as it is predicted to contain about 60% repetitive sequence (Figure 1b; Figure S1). We used the long terminal repeat (LTR) Assembly Index (LAI) to assess the quality and contiguity of the repetitive sequence assembly (Ou et al., 2018). Overall, both haploid assemblies display reference‐quality contiguity in the repetitive portions of the genome, with LAI values of 10.53 and 11.17 for the Phase0 and Phase1 assembly, respectively (Figure S6a). Furthermore, we found that the contiguity of the repetitive space in the assembly was much improved compared with the unplaced scaffolds (Figure S6b). We annotated both structurally intact and fragmented transposable elements (TEs) in the full diploid assembly using EDTA (Ou et al., 2019). As expected, 59% of the TME7 genome are repeats and TEs, which are dominated by LTR retrotransposons that contribute about 50.5% of the genome (Table 2; Figure S7). Terminal inverted repeat and Helitron DNA transposons contributed 2.43% to the total genome size. There were only marginal differences in TE content between the phases.

Table 2.

Summary of TEs in the TME7 genome assembly

Category	Phase0, %	Phase1, %	Average, %
LTR/Copia	6.24	6.25	6.25
LTR/Gypsy	35.36	36.22	35.79
LTR/unknown	8.49	8.46	8.48
TIR/CACTA	0.64	0.63	0.64
TIR/Mutator	0.90	0.88	0.89
TIR/PIF_Harbinger	0.13	0.16	0.15
TIR/Tc1_Mariner	0.01	0.01	0.01
TIR/hAT	0.77	0.67	0.72
LINE/unknown	0.44	0.44	0.44
DNA/Helitron	0.02	0.03	0.03
Repeat/unknown	6.05	5.45	5.75
Total LTR	50.09	50.93	50.51
Total DNA TE	2.47	2.38	2.43
Total TE	59.06	59.19	59.13

Open in a new tab

LTR, long terminal repeat; TE, transposable element; TIR, terminal inverted repeat.

Gene annotation, synteny, and tandem duplication

Gene annotation was performed using the MAKER, AUGUSTUS, and SNAP pipelines, including transcript evidence from RNA‐sequencing (RNA‐seq) from 11 tissue types (Wilson et al., 2017). We annotated 33 653 and 35 684 genes in Phase0 and Phase1 assemblies, respectively (Figure 5). Over 70% of annotated genes had an Annotation Edit Distance of <0.25 suggesting most genes were supported by high evidence levels (Figure S8a). Comparison of our annotations with that of the AM560‐2 ref v6.1 showed that gene synteny between the two cassava genomes was largely conserved; however, several macro‐level rearrangements are identifiable (Figure 6). Furthermore, this comparison revealed a largely 2:2 pattern of syntenic depth between the annotations (Figure S9), consistent with the whole genome duplication described in cassava (Bredeson et al., 2016). About 36% of cassava genes exist in one syntenic block reciprocally in either genome, suggesting that these genes may have lost their extra copy since the paleo‐duplication. Based on our analysis, it thus appears that the percentage of genes, which have retained their duplicate status is closer to 60%, rather than approximately 36% as previously reported (Bredeson et al., 2016). The previous analysis used homologous genes identified in Jatropha curcas as the reference; this likely limited the total numbers of homologs in the analysis, leading to the underestimate of retained duplicated genes. Only 2% of AM560‐2 genes were not shared in syntenic blocks in TME7 suggesting they may be unannotated, lost, or translocated out of their block.

Comparison of the TME7 haplotype phased assemblies.

(a) Dot plot of the best sequence alignments of the two haplotype assemblies. Color represents the alignment percentage identity between the alternate assembly (Phase1) contig (haplotig) and the primary assembly (Phase0).

(b) Chromosomal distribution of annotated genes, transposable elements (TE) and large haplotypic structural variants (SVs) between the two phased assemblies. SVs were identified by sequence alignment of the two phases. P0, Phase0 assembly.

Macro‐synteny between of the TME7 and the AM560‐2 Ref6.1 genome.

Gene synteny dot‐plot comparison between the scaffolded TME7‐Phase0 assembly and the AM560‐2 reference genome shows largely co‐linear genomes with multiple inter‐chromosomal duplications attributable to the paleotetraploidy described in cassava in Bredeson et al. (2016).

We examined and compared the number of tandem duplications (arrays) between the two TME7 phase annotations. We identified 1256 arrays in Phase0 containing 2,865 genes and 1,159 arrays containing 2,608 genes in Phase1. While most arrays were of two tandem genes (approximately 1000 in both phases), array sizes varied to 11 and 8 genes in an array in Phase0 and Phase1, respectively (Figure S8b). As expected, genes in tandem arrays were associated with Gene Ontology (GO) terms of important defense‐related biological processes, for example: “oxidation‐reduction process,” “response to other organism,” and “phenylpropanoid biosynthetic process” (Table S2). In synteny analyses between the annotations of the two phases, we identified that approximately 20% of genes in Phase0 were no longer in syntenic blocks in Phase1 (Figure S9). This could be due to specific gene deletion, or alternatively, may reflect the contig format of Phase1, which disrupts the contiguity of the synteny blocks. This, together with gene duplication arising from the evident whole genome duplication limits the ability to identify accurately which genes are haplotype specific and missing from one annotation; however, we have supplied the MCScanX (Wang et al., 2012) gene anchor file as Supporting Information to help users of the genome identify probable best hits between genes of the two phases (File S1). Further exploration of gene hemizygosity in cassava will be interesting from a genome evolution standpoint and have important implications for breeders.

Haplotype‐specific sequence and SV

Comparison with the inbred AM560‐2 reference

The differences in origin, genome size, and levels of heterozygosity between TME7 and the reference line AM560‐2, prompted us to compare the assemblies. Comparison of the TME7 Phase0 assembly to the AM560‐2 ref v6.1 assembly revealed 2 257 216 SNPs and 1 666 639 bases affected by small INDELs (<50 bp) that differed (Figure S10). We further identified over 10 000 large SVs (50–10 000 bp) affecting more than 15.99 Mb of sequence (Figure 7a; File S2). There is increasing evidence pointing to the importance of large genomic SVs, and their contribution to phenotypic traits (Alonge et al., 2020; Zhou et al., 2019). We examined the potential effects of the large INDELs (>50 bp) on gene function by measuring the distance to the nearest genes (Figure 7b). Of 4,354 large INDELs, 1,217 were predicted to be within gene models and another 882 within 2,000 bp upstream of genes, potentially affecting cis‐regulatory elements.

Large structural variants identified in TME7 vs. the AM560‐2 reference genome.

(a) Size distribution histograms of structural variants identified by comparison of the Phase0 assembly to the AM560‐2 reference genome.

(b) Density of distances (<10 kb away) of large deletions (50–10 000 bp) in TME7 from genes annotated in the AM560‐2 reference.

(c,d) Structural variants interrogated by paired‐end reads. Reads with abnormally large insert sizes (color‐coded horizontal bars, left y‐axis) corroborate deletions identified by alignment of the assemblies. The depth of coverage (gray‐filled background, right y‐axis) aid in determining the zygosity of the deletions. Gene models from the AM560‐2 v6.1 annotation are in blue. (c) TME7 Phase0 assembly contains a homozygous 4.11 kb deletion compared with Chromosome14 of the AM560‐2 Reference genome which overlaps the 3′‐end of *Manes.14G160800*. (d) A 7.21‐kb heterozygous deletion is verified on Chromosome3, potentially overlapping with the upstream regulatory region of *Manes.03G086200*. Other smaller sequence duplications are also observable (marked in red in c and d). The 7.21 kb heterozygous deletion in TME7 is correctly phased and assembled as an insertion in haplotig 001856F_006.

(e) Deletion between 64.9 kb and 72.1 kb on the haplotig, is delineated between the two dashed horizontal lines.

To validate and assess visually the heterozygosity state of several of the largest deletions (>4 kb in length), we aligned short reads from TME7 to the AM560‐2 genome. Both homozygous and heterozygous deletions were identified, and an example of each is in Figure 7c and d, respectively. A homozygous deletion identified on Chromosome14, where paired‐end reads map to either side of the 4.11‐kb deletion and a sharp decline in read coverage is observed, overlaps with the 3′‐end of RNA CLEAVAGE STIMULATION FACTOR (Manes.14G160800) (Figure 7c). A heterozygous deletion on Chromosome03, that has read coverage approximately half that of the surrounding area, overlaps the potential promoter region of Manes.03G086200, annotated to encode Ribosomal protein L6 (Figure 7d). This further supports the importance of assembling both haplotypes and suggests that many large haplotypic SVs might be present with potential impact on gene expression or function.

Large haplotypic SVs in TME7

Recently shown in grape (Zhou et al., 2019) and tomato (Alonge et al., 2020), large genomic SVs may have substantive effects on important agricultural traits. For example, the white berries of Chardonnay grape could be a result of a large inversion and deletion, causing hemizygosity at the MybA locus (Zhou et al., 2019). To examine further the within‐genome, haplotypic variation in TME7 we aligned the alternate assembly to the primary assembly. FALCON‐Phase has two options for emitting phased assemblies. In “unzip” style, short haplotigs containing alternate sequences are emitted alongside the phased primary contigs (as in FALCON‐Unzip). In contrast, in “pseudohap” mode, pseudo‐haplotype contigs are generated by collapsing alternate sequences from the phased haplotigs with homozygous sequence from primary assembly. Thus, the pseudo‐haplotype alternate assembly might contain artificially homozygous sequences that were missing from the original alternate assembly, originating from lack of assembly or true hemizygosity in the alternate assembly. We therefore used the “unzip”‐emit‐style haplotigs for comparison with the primary assembly and calculated the mean haplotype divergence to be 2.09 ± 0.18%. We further identified 1 116 832 SNPs and 300 883 small INDELs (<50 bp) in non‐repetitive regions, collectively representing more than 2.14 Mb of heterozygous sequence between the two assemblies (Figure 5a). This confirms the high rate of heterozygosity predicted using k‐mer based approaches and suggests a well extracted set of haplotigs.

To compare directly the two independently assembled TME7 haplotypes, we aligned the Phase1 contigs to the scaffolded Phase0 assembly and identified large SVs. Overall, we identified more than 5000 variants 50–10 000 bp in size including large insertions, deletions, tandem duplications, and contractions as well as repeat expansions and contractions (Table 3, Figure 5b; File S3). The total sequence space that was affected by these SVs was greater than 8 Mb. Thus, this within‐genotype, haplotypic SV amounts to greater than half of the between‐genotype differences that TME7 has with the AM560‐2 reference line. The Assemblytics pipeline can also identify variants >10 kb; however, the accuracy with which these are distinguished from translocations or assembly errors is limited (Nattestad and Schatz, 2016). Though we primarily focused on a more conservative approach to identify large SVs, potentially larger haplotypic SVs were identified using Assemblytics; including SVs up to 50 kb in size in the analysis, yielded close to 16 Mb of sequence affected by SV (Figure S11). While these larger SVs should be considered with caution, we note that this is comparable with structural heterozygosity reported in other species such as wine‐grape (Minio et al., 2019).

Table 3.

Summary of haplotype‐specific structural variants

	50–500 bp Count	50–500 bp Total bp	500–10 000 bp Count	500–10 000 bp Total bp	Total count	Total bp
Insertions	699	110 936	348	791 434	1047	902 370
Deletions	676	99 453	226	663 975	902	763 428
Repeat expansion	649	139 116	938	2 722 146	1587	2 861 262
Repeat contraction	668	144 659	1136	3 486 466	1804	3 631 125
Tandem expansion	27	5575	31	125 712	58	131 287
Tandem contraction	7	819	3	5070	10	5889
				Total:	5408	8 295 361

Open in a new tab

Effects of haplotypic SV on allele‐specific expression

The identified haplotypic SVs are primarily distributed in the chromosome arms and thus are often near genes (Figure 5b). For example, the 7217 bp heterozygous deletion, upstream of Manes.03G086200 (Figure 7c) is correctly phased in our assemblies, as it was detected as an insertion in the Phase1 contigs by alignment of the Phase1 contigs vs. the Phase0 scaffolds (Figure 7e). We posited that large haplotype‐specific INDELs (>50 bp) in the vicinity of genes, such as this one, could impact their allele‐specific expression (ASE). We examined ASE patterns, and their relation to SVs, in previously published RNA‐seq data from 11 cassava tissues harvested from TME204, another heterozygous farmer preferred line (Wilson et al., 2017).

We observed that on average, in any given tissue, approximately 20% of genes showed significant ASE (BH adjusted P < 0.05, Figure 8; File S4). However, of the 15 371 expressed genes detected, nearly half (7222 genes) showed significant ASE in at least one tissue. For comparison, in the analyses of above‐ground tissue of hybrid rice, over 3000 genes showed patterns of ASE (Shao et al., 2019). Genes with ASE in hybrid rice were associated with multiple biological processes (Song et al., 2013) and were potentially implicated in contributing to the genetic basis of heterosis (Shao et al., 2019). Cassava genes showing ASE were also enriched for a variety of biological process GO terms, ranging from primary metabolism to defense (Table S3). Thus, given the high levels of heterozygosity in farmer‐preferred cassava, the large number of ASE genes involved in multiple processes suggests that much of cassava physiology is under allele‐specific control. This might have important biological and agricultural implications as has been observed in other heterozygous/hybrid crops (Shao et al., 2019; Zhang et al., 2020).

Potential effects of large haplotypic structural variants on allele‐specific expression.

(a) Principal components analyses based on total read count (left) and the log2(fold‐change) between allele‐specific read counts (right).

(b) MA‐plots of allele‐specific expression (ASE) patterns in cassava RNA‐sequencing data (Wilson et al., 2017). Each point represents an expressed gene and its average read count vs. the log2(fold‐change) between allele‐specific read counts. If a gene shows >5 log2(fold‐change) between allelic read counts the gene is characterized as having “complete ASE” (purple). Genes showing significant ASE but allelic log2(fold‐change) <5 are categorized as “partial ASE” (blue). If no significant ASE (adjusted P > 0.05) was observed, genes are denoted in green.

(c) Percentage of genes in each category in each tissue assayed.

(d) Distribution of distances to the nearest large insertion or deletion (InDel) in genes that show “complete ASE” in all tissues vs. the rest of the transcriptome.

As a complementary approach to visualizing ASE, we compared principal components analyses (PCA) of global gene expression to PCA of ASE differences for different tissue types. The most notable observation from this comparison was that while all root‐derived samples cluster together in the total count PCA, samples from storage root cluster with above‐ground organs, when performing the analysis based on allelic log fold‐change (Figure 8a). Future research can investigate if and how these unique ASE differences contribute to tissue identity. In total, 1441 ASE showed patterns of ASE in all tissue types, with another 2,218 genes having ASE uniquely in one tissue (Figure S12a).

While there could be multiple reasons for ASE of genes (Castel et al., 2015; Wood et al., 2015), large haplotypic INDELs in nearby or distal cis‐regulatory regions, such as the one in Figure 7d, could cause expression of one allele to be severely repressed. We thus defined two categories of ASE genes: we defined genes as being under “complete ASE” if significant ASE was detected and the log2(fold‐change) between alternate and reference alleles was >5. We further defined genes as having “partial ASE” if statistically significant ASE was observed, but the allelic log2(fold‐change) did not pass the threshold of 5. We observed that >24% of all genes with ASE show patterns of “complete ASE” (approximately 8% of all genes detected per tissue type) (Figure 8b).

We then compared the distribution of distances to the nearest upstream and downstream large INDEL between ASE and non‐ASE genes. In both directions relative to the genes, “complete ASE” genes had significantly different distance distributions from both “partial ASE” and “no ASE” categories (K‐S test, P < 0.05). Genes with “partial ASE” did not have different distance distributions compared with those with no ASE. Specifically examining the set of 847 genes that show “complete ASE” in all tissue types showed that these had an enrichment of large INDELs approximately 5,000 bp upstream and downstream the gene (Figure 8d). We further observed additional differences in more distant regions (up to 1 Mb away) as well (Figure S12b). While the genes themselves are not in a hemizygous state, there is hemizygosity in their cis‐regulatory regions including upstream promoter regions as well as potentially in more distal enhancer regions that may be several Mb away from the gene (Weber et al., 2016). These might have important impacts on their allelic expression and potentially on downstream phenotypes. Thus, future analyses using new approaches, such as ATAC‐seq (Buenrostro et al., 2015), might be informative to assess how the high heterozygosity and number of large haplotypic SVs impact chromatin accessible regions and how this might implicate enhancer sequences in ASE. Furthermore, examining the relationship between ASE and SVs in other datasets under additional treatments and/or conditions may further yield important cases where gene expression is affected by large haplotypic SVs (Knowles et al., 2017).

Together, the single‐nucleotide and large SVs identified by comparing the two phased TME7 assemblies open a window into the complexity of the heterozygous cassava genome. Work in grapevine and their wild relatives suggests that SVs are primarily deleterious and that they are under strong purifying selection (Zhou et al., 2019). Examining the conservation and diversity of large variants within a wide range of farmer‐preferred cassava lines would shed light on the effect of SVs on cassava genome evolution in this clonally propagated crop. Further, potentially deleterious alleles such as these large haplotypic SVs, as well as SNPs previously characterized (Ramu et al., 2017), warrant further research as these may contribute to limits in inbreeding of cassava.

Tissue‐specific gene expression Cassava Atlas

We previously published gene expression patterns for 11 different cassava tissue types based on the AM560‐2 reference genome (Wilson et al., 2017). With our newly assembled phased genome, we updated this existing resource. All 11 RNA‐seq datasets were mapped to the Phase0 scaffolded and annotated TME7 assembly, and differentially expressed genes were identified as previously described. These results can be further explored at: shiny.danforthcenter.org/cassava_atlas.

CONCLUSION

While recently released assemblies of farmer‐preferred cassava lines contain information from both haplotypes in the assembly, the limitation of these assemblies is in the lack of haplotypic purging and sequence deduplication (Kuon et al., 2019). Thus, these assemblies do not fully represent either of the haplotypes. Our assembly was successfully deduplicated of most haplotypic sequences, as evidenced by k‐mer, BUSCO, and linkage map‐based analyses. We further successfully used Hi‐C sequencing data to phase and create pseudo‐haplotype assemblies. The phased assembly described herein, is thus currently the most accurate assembly of a cassava genotype representative of those grown by millions of subsistence farmers around the world. The differences in genome size compared with the published reference (approximately 700 Mb vs. the estimated approximately 750 Mb for AM560‐2), alongside the large SVs identified between the genotypes, showcases how diversity in cassava goes beyond small nucleotide level variation between accessions. We further show that not only does TME7 have large SVs compared with AM560‐2, but that within the genome there are thousands of haplotypic SVs, potentially perpetuated through clonal variation. Many of these SVs are in close proximity to annotated genes and ASE of these genes was observed. Further research will help inform how these variants interact and affect gene hemizygosity, copy number, and expression as well as the impact on agronomically important traits. We believe this assembly will be an invaluable resource to the cassava research and breeding community, and will further aid in developing tools to ensure food security to those who rely on cassava.

EXPERIMENTAL PROCEDURES

Plant material and nucleic acid extraction

Cassava line TME7 (Oko‐iyawo) were obtained from Peter Kulakow at IITA in Ibadan, Nigeria. Plantlets were maintained in tissue culture by Nigel Taylor’s lab at the Donald Danforth Plant Science Center. Fresh young leaves were collected for extraction of high molecular weight DNA using a CTAB extraction method (Clarke, 2009).

Library preparation and sequencing

Illumina

Data from Illumina (San Diego, CA, USA) short sequencing DNA libraries of TME7 were provided by Wilhelm Gruissem’s lab at ETH Zurich. After adapter trimming by the sequencing facility, reads were de novo de‐duped using Nubeam‐dedup (Dai and Guan, 2020) before further use.

PacBio

Initial PacBio (Menlo Park, CA, USA) sequencing was contributed by Todd Michael in 2016 and did not include size selection before sequencing. The PacBio libraries were sequenced on a PacBio RSII system with P6C4 chemistry. A second set of PacBio libraries were constructed using the manufacturer’s protocol and were size selected for 20‐kb fragments on the BluePippen system (Sage Science, Beverly, MA, USA) followed by subsequent purification using AMPure XP beads (Beckman Coulter, Brea, CA, USA). Sequencing was performed by the University of Delaware DNA Sequencing & Genotyping Center.

Chromatin Conformation Capture sequencing (Hi‐C)

Fresh, young cassava leaf material was sent to Dovetail Genomics (Scotts Valley, CA, USA) for DNA extraction, digestion with DpnII, library preparation, and sequencing.

Genome size and heterozygosity estimation

Flow cytometry protocol was performed at the Benaroya Research Institute at Virginia Mason in Seattle, Washington following their standard methods.

Genome size and heterozygosity were also estimated by means of k‐mer counting. We used jellyfish (Marçais and Kingsford, 2011) to count k‐mers of size 21 and plot their depth distributions from the approximately 100× paired‐end adapter‐trimmed and deduped Illumina sequencing reads of TME7. The maximum k‐mer depth was set to 1e6, which allows inclusion of repetitive regions of the genome. We then used the GenomeScope v1 web application (Vurture et al., 2017) to model the genome size and heterozygosity for each one of these histograms, and used the model fit to select the best k‐mer size for analysis.

De novo genome assembly and scaffolding

Maximizing the diploid assembly

We first assembled the PacBio reads de novo using the FALCON and FALCON‐Unzip (Chin et al., 2016) suite of tools (v1.5.2), which included one round of consensus polishing with quiver. The config files for all FALCON tools are supplied as File S5. We further polished only INDELS with one round of Pilon (Walker et al., 2014). We identified missing heterozygous sequences using Merqury count spectra plots (Rhie et al., 2020). The k‐mers unique to the short reads and missing from the assembly were then extracted using the Meryl tool set (Miller et al., 2008; Rhie et al., 2020) and finally extracted the reads containing those k‐mers using the function meryl lookup. The short reads were first down sampled and normalized to approximately 100× coverage using BBnorm from the BBTools suite (https://sourceforge.net/projects/bbmap/) then assembled using SPAdes (Bankevich et al., 2012) and the resulting contigs were filtered for a minimum coverage depth of 10× and length of 500 bp.

Assembly deduplication

The complete set of assembled sequences was concatenated and processed through the purge_dups (Guan et al., 2020) pipeline. Alignment coverage histograms inform assembly purging software, such as purge_dups or purge_haplotigs (Roach et al., 2018), as to what sequences are potential haplotigs or duplication. While these software packages were developed for use with long reads, we found that short reads allow for higher resolution when plotting coverage histograms, which in turn results in more accurate sequence purging. Thus, we aligned approximately 100× deduped PE short reads to the entire diploid assembly for purging. First, duplicates, caused by retained haplotigs, haplotypic overlaps, and junk contigs, were purged from the primary assembly using manual depth cutoff settings of 5, 76, 126, 151, 252, and 453. A second round of purging on the “haplotig” output of purge_dups was useful to remove duplicates and artifact contigs created by purge_dups during purging of overlaps, again using automatic depth cutoffs (5, 70, 136, 137, 219, and 534). We then renamed all contigs and haplotigs in the FALCON‐Unzip naming convention for further processing using scripts in R and python (File S6). Briefly, haplotigs that had associated primary contigs in the dups.bed file were renamed to match their respective primary contigs. Those that did not have matches (i.e., contigs with low coverage in round 1 of purging etc.) were aligned to the primary assembly using nucmer (Marçais et al., 2018) and BLAST. The primary contig with the longest set of alignments was selected as the associated primary contig.

Haplotype phasing

The resulting pseudo‐haplotype primary contigs and haplotigs alongside the Hi‐C data were passed to FALCON‐Phase for phase switch correction, creating one complete set of contigs for each phase (Kronenberg et al., 2018). However, due to the large number of SVs between the TME7 haplotypes, we modified the coords2hp.py script in FALCON‐Phase to always include the entire length of the haplotig in placement (File S6). This reduced the length of haplotig sequence discarded by FALCON‐Phase during phasing. We output the results in both “pseudohap” and “unzip” formats.

Scaffolding

The Proximo Hi‐C genome scaffolding platform from Phase Genomics (Seattle, WA, USA) was used to create chromosome‐scale scaffolds from the FALCON‐Phase Phase0 assembly, following the same single‐phase scaffolding procedure described in Bickhart et al. (2017). As in the LACHESIS method (Burton et al., 2013), this process computes a contact frequency matrix from the aligned Hi‐C read pairs, normalized by the number of Sau3AI restriction sites (GATC) on each contig, and constructs scaffolds in such a way as to optimize expected contact frequency and other statistical patterns in Hi‐C data. Juicebox (Durand et al., 2016; Rao et al., 2014) was then used to correct scaffolding errors. The Hi‐C contact map was created by separately aligning the Hi‐C read pairs to the scaffolded genome then generating a Hi‐C contact matrix using the command line version of HiCExplorer (Wolff et al., 2020). A 10‐kb matrix was first created, then bins were merged to get a 500‐kb resolution for ease of plotting. Bin interaction data was then exported to table separated format (tsv) and imported to R for plotting.

Assembly quality assessment

Linkage map alignment

To confirm the order and contiguity of the assembly further we aligned the 22k marker composite linkage map (ICGMC, 2015) from cassava base (cassavabase.org). In this map, each SNP marker is aligned to the cassava v4.1 draft genome assembly and a scaffold and physical position is reported alongside the genetic position. Using the marker_seqs.py python script (File S6) we extracted sequences from 100 nt on both sides of each SNP in the v4.1 assembly. If the SNP marker was closer than 100 nt from the end of a scaffold, then the sequence with the maximum length possible around that SNP was extracted. These approximately 200 nt sequence tags were then aligned via BLAST to each phase of the current assembly. The numbers of uniquely mapping markers with alignment length >150 nt and >95% identity were used to assess levels of sequence duplication.

K‐mer based evaluation

Merqury (Rhie et al., 2020) and the built‐in Meryl implementation were used to enumerate the k‐mer distribution in the Illumina PE reads and compare it to the diploid and haploid assemblies. Using the provided script in Merqury, a k = 21 was selected to best represent a genome size of approximately 700 Mb. Copy number spectra and assembly spectra were plotted using the hist files provided and ggplot2. When k‐mer distributions were used to estimate genome sequence length (i.e., to measure missing sequence space), the sum of counts of k‐mers under the respective distribution was divided by the mean k‐mer multiplicity of the distribution:

\begin{array}{l} (s u m (k m e r c o u n t * k m e r m u l t i p l i c i t y)) \\ / m e a n (k m e r m u l t i p l i c i t y) \end{array}

Assessment of genic and repetitive sequence space

The completeness and duplication of the genic regions in the assembly was performed by using BUSCO v4.1.2 (Simão et al., 2015) benchmark software (http://busco.ezlab.org/) and the “eudicotyledons_odb10” ortholog dataset with default settings.

To evaluate the contiguity of the repetitive sequence assembly, LAI was evaluated using LAI beta3.2 (Ou et al., 2018) with input files generated by EDTA. The initial LAI estimation was done using the “‐q” parameter, then average LTR identity and total LTR content were obtained and further provided to the standardization of LAI, with parameters “‐iden 95.63 ‐totLTR 53.” Regional LAI was calculated in 3 Mb windows with 300 kb overlapping steps.

Haplotype‐specific annotation

TE annotation and repeat masking

TEs of each assembly were independently annotated using EDTA v1.9.7 (Ou et al., 2019) with parameters “‐‐sensitive 1 ‐‐anno 1 ‐t 18” and “‐‐cds” providing the coding sequences of the M. esculenta v6.1 assembly. Library sequences from the de novo TE library generated by EDTA were filtered and those present more than three full‐length copies in the respective haploid assembly were retained. The remaining sequences from the two TE libraries were combined using the “make_panTElib.pl” script in the EDTA package, generating a high‐quality TE library. The final TE library was then used to annotate the two haploid genomes using RepeatMasker v4.1.1 (www.repeatmasker.org) with parameters “‐q ‐no_is ‐norna ‐nolow ‐div 40 ‐cutoff 225” that allow for up to 40% of sequence divergence. This step helped to annotate fragmented TEs. To annotate consistently the intact TEs in the two haploid genomes, the final TE library and the final homology‐based TE annotation were provided to EDTA with parameters “‐‐evaluate 1 ‐‐anno 1 ‐t 18 ‐‐step final.” In depth commands for TE annotation and LAI calculation are supplied in File S6.

Gene annotation

Transcriptome data of 11 tissue types (Wilson et al., 2017) was used to generate transcript evidence for annotation. Reads were trimmed with Trimmomatic (Bolger et al., 2014) and aligned to the soft masked diploid reference (Phase0 scaffolds + Phase1 pseudohaplotype contigs concatenated) using Hisat2 v2.1.0 (Kim et al., 2019). Stringtie v1.3.5 was used to assemble transcripts from each alignment file and all files were merged with “stringtie merge” (Pertea et al., 2015). A fasta containing coding sequence for all transcripts was produced using gffread tool from the cufflinks (Trapnell et al., 2010) package. These transcripts, together with AM560‐2 v6.1 coding sequences and protein sequence from Araport11 (Cheng et al., 2017), were used for a first round of MAKER v2.31.8 (Cantarel et al., 2008) gene annotation. Gene prediction was further performed by training SNAP (library 2013‐02‐16) (Korf, 2004) and AUGUSTUS v3.3 (Stanke and Morgenstern, 2005) as suggested in Bowman et al. (2017) and the output of the first round of MAKER annotation. After gene prediction the genes in the gff file were renamed and the file was split to produce one gff for each phase.

Gene synteny and tandem duplication analyses

Comparison of gene synteny between the TME7 Phase0 assembly and the AM560‐2 ref6 assembly was performed with the Python MCScanX pipeline v1.1.12 (Tang et al., 2008; Wang et al., 2012). Briefly, annotation gff files were converted to bed format keeping one isoform per gene using “jcvi.formats.gff ‐‐primary_only”. A pairwise synteny search was performed and the high‐quality synteny block (anchors) was used in syntenic depth comparisons and plotting of karyotypes and dot plots. Gene synteny depths between the Phase0 and Phase1 annotations were calculated in the same way. For each phase, tandem gene duplication arrays were identified similar to the study Van Buren et al. (2018). Briefly, arrays were detected by MCScan (Tang et al., 2008) using results from an all‐vs‐all BLASTP (e‐value < 1e‐10) and a maximum gene distance of 10. Arrays were extracted from the “.tandem” file and enumerated using a short script in R (File S7).

SV and polymorphisms

SVs between TME7 and the AM560‐2 reference genome were identified by aligning the Phase0 contigs vs. the reference genome. The authors of the assemblytics (Nattestad and Schatz, 2016) software recommend analysis using contigs and not scaffolds, to minimize bias from different gap sizes in the assembly. Thus, initially the reference assembly was split at gaps of >10 Ns using the python script split_scaffolds.py (File S6). After alignment with nucmer (Marçais et al., 2018) with settings: “‐‐maxmatch ‐l 100 ‐c 500” the delta file was gziped and uploaded to the Assemblytics web interface (www.assemblytics.com) for analysis using a maximum variant size of 10 000 or 50 000 bp. The results were exported as a bed file and imported into R for plotting. Dot plots of the alignments were produced using scripts modified from https://jmonlong.github.io/Hippocamplus/2017/09/19/mummerplots‐with‐ggplot2/ (File S7).

The locations of the five largest deletions identified were then examined for evidence of SV using short‐read mapping. Deduplicated Illumina reads from TME7 were aligned to the AM560‐2 v6.1 reference using bwa mem (Li and Durbin, 2009). The sorted bam file was then loaded into samplot (Belyeu et al., 2021) to plot the read coverage and identification of discordant mapping. SNPs and INDELs were identified by using dnadiff and show‐snps programs in the MUMmer4 package (Marçais et al., 2018).

SV between the phases was then assessed by aligning Phase1 unzip contigs vs. Phase0 scaffolds (split at >10 Ns) and using Assemblytics as above. Haplotype divergence was calculated by aligning the FALCON‐Phase “Unzip”‐emit‐style haplotigs to the primary, Phase0, assembly using nucmer with these settings: “‐‐maxmatch ‐l 100 ‐c 500”. Alignments were filtered with delta‐filter ‐g and coordinates were output using show‐coords. Finally, divergence from the primary assembly was calculated using scripts from https://github.com/skingan/FC_Unzip_HaplotypeDivergence. SNPs and INDELs between the phases were identified as above.

ASE

The above genomic read alignment bam files for TME7 vs. the Phase0 assembly were then deduplicated with Picard tools and SNPs were called using GATK v4.1.4.1 (Van der Auwera et al., 2013) with minimal quality filtering (QD < 2.0, FS > 60.0, MQ < 40.0, MQRankSum < −12.5, ReadPosRankSum < −8.0). We then aligned all RNA‐seq data from Wilson et al. (2017), to the TME7 Phase0 assembly using STAR v2.7.8 (Dobin et al., 2013), and the bams and SNP VCF file were then imported to phASER (Castel et al., 2016) to phase the variants accurately within each gene model. To decrease running time we ran PhASER on each bam separately and included only genes on the first 18 chromosomes. PhASER settings were “‐‐paired_end 1 ‐‐mapq 255 ‐‐baseq 10”. The haplotypic read counts from all samples were then concatenated, and subsequently the per gene read counts were then exported using the “phASER Gene AE” tool and read into R for statistical analysis.

We used the suggested DESeq2 pipeline for ASE (http://rstudio‐pubs‐static.s3.amazonaws.com/275642_e9d578fe1f7a404aad0553f52236c0a4.html) for analysis. Using this approach, statistical contrasts were evaluated for each tissue type between the counts of REF and ALT alleles, similarly, to how one might identify treatment effects in a regular RNA‐seq analysis. Significance P‐values were adjust using the Benjamini–Hochberg method (Benjamini and Hochberg, 1995) and genes were marked as ASE if the adjusted P‐value was <0.05. PCAs were performed once on the variance‐stabilized read counts and once on the log2(ALT count/REF count) values calculated by PhASER. A constant of 0.01 was added to the counts to limit log2 of zero. We further categorized ASE genes as having “Complete ASE” or “Partial ASE” if log2 allele ratios were ≥5, respectively. Distances to nearest non‐overlapping INDEL were measured using bedtools closest command using the ‐D a ‐io and ‐id or ‐iu settings for upstream and downstream INDELs, respectively (Quinlan and Hall, 2010). The distributions of distances of genes in different ASE categories were compared using the Kolmogorov–Smirnov test in R.

GO annotation and enrichment analyses

A GO term database was established as in Mansfeld et al. (2017); all TME7 proteins were compared with Araport11 (Cheng et al., 2017) by BLASTP and Protein families were identified using InterProScan5 (Jones et al., 2014). GO terms were extracted from these results, and merged and deduplicated to make the final database. The GO term enrichment analysis was performed using TopGO (Alexa et al., 2006) using all expressed genes as a background set and a minimum node size of 100. The GO enrichment of genes in tandem arrays was performed as above but using all annotated genes as a background set.

SHINY app update

Reads from the RNA‐seq dataset for 11 tissue types were aligned to the TME7 Phase0 assembly using HISAT2 (Kim et al., 2019) and abundance was quantified with Stringtie (Pertea et al., 2015). Read counts were transformed into robust‐FPKMs using DESeq2 (Love et al., 2014). Finally, the annotation was matched to the transcript IDs and formatted to be read within the Shiny framework.

Scripts and figures

All scripts described above are supplied in File S6. All R scripts for producing figures and summary results are supplied in File S7.

AUTHOR CONTRIBUTIONS

BNM and RB conceived and designed the assembly strategy with contribution from NF. TPM generated the initial long read data. AB, MW, and BNM performed the assembly steps. AB and SO annotated the gene and TE space. Genome‐wide analyses and assembly assessments were performed by BNM, AB, SO, and SP. ASE analysis was performed by BNM. AB and JCB updated the cassava expression atlas application. BNM and RB wrote the manuscript with comments from all authors.

CONFLICT OF INTERESTS

The authors declare that they have no competing interests.

Supporting information

Table S1. Statistics of other assembly attempts.

Table S2. Gene Ontology enrichment of tandem duplicated genes.

Table S3. Gene Ontology enrichment of genes with allele‐specific expression.

Figure S1. Full GenomeScope k‐mer profile.

Figure S2. PacBio read length distribution.

Figure S3. Diploid k‐mer count spectra for assembly stages.

Figure S4. Reduction of heterozygous sequences by FALCON‐Phase.

Figure S5. Linkage map hit duplication rate.

Figure S6. Distribution of long terminal repeats.

Figure S7. Distributions of transposable elements.

Figure S8. Annotation statistics.

Figure S9. Gene syntenic depth analyses.

Figure S10. Sequence alignment to AM560‐2 reference genome.

Click here for additional data file.^{(1.6MB, docx)}

File S1. Gene comparison between the two phase’s annotation (Anchor file).

Click here for additional data file.^{(2MB, anchors)}

File S2. Structural variants between Phase0 and the AM560‐2 reference genome.

Click here for additional data file.^{(1.4MB, tsv)}

File S3. Structural variants between the two phases.

Click here for additional data file.^{(682.4KB, tsv)}

File S4. Allele specific analysis and relation to structural variants in 11 tissues.

Click here for additional data file.^{(55.2MB, xlsx)}

File S5. Configuration file for FALCON suite.

Click here for additional data file.^{(3.3KB, txt)}

File S6. Additional scripts.

Click here for additional data file.^{(25.1KB, txt)}

File S7. Scripts for figures and assessment.

Click here for additional data file.^{(19.2MB, pdf)}

ACKNOWLEDGEMENTS

We thank Nigel Taylor, Getu Deguma, Narayanan Narayanan, Wilhelm Gruissem, and Yi‐Wen Lim for contributing the TME7 tissue samples and Illumina reads and for helpful discussions throughout this research. We thank Alex Harkess for advice on the assembly strategies. We thank Kerrigan Gilbert for discussion and for the critical reading of the manuscript. We thank Hayley Mangelson and Shawn Sullivan at Phase Genomics for their advice and help scaffolding the assembly. This work was funded by the Donald Danforth Plant Science Center and the Bill and Melinda Gates Foundation (OPP1093529, OPP1194889, and INV‐002958).

DATA AVAILABILITY STATEMENT

Both haplotype genome assemblies and annotations are stored in the Zenodo data repository https://doi.org/10.5281/zenodo.5534689. All short and long reads in assembly have been uploaded under the SRA accession PRJNA765105. Custom scripts used for assembly and analysis are available in Files S5–7 and at https://github.com/bmansfeld/tme7_cassava_genome_figures/.

REFERENCES

Alexa, A. , Rahnenfuhrer, J. & Lengauer, T. (2006) Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics, 22, 1600–1607. [DOI] [PubMed] [Google Scholar]
Alonge, M. , Wang, X. , Benoit, M. , Soyk, S. , Pereira, L. , Zhang, L. et al. (2020) Major impacts of widespread structural variation on gene expression and crop improvement in tomato. Cell, 182, 145–161.e23. [DOI] [PMC free article] [PubMed] [Google Scholar]
Auwera, G.A. , Carneiro, M.O. , Hartl, C. , Poplin, R. , del Angel, G. , Levy‐Moonshine, A. et al. (2013) From fastQ data to high‐confidence variant calls: The genome analysis toolkit best practices pipeline. Current Protocols in Bioinformatics. Hoboken, NJ, USA: John Wiley & Sons Inc, pp. 1–33. [DOI] [PMC free article] [PubMed] [Google Scholar]
Aye, T.M. (2012) Cassava agronomy: Land preparation, time and method of planting and harvest, plant spacing and weed control. In The Cassava Handbook: A Reference Manual Based on the Asian Regional Cassava Training Course held in Thailand. Bangkok, Thailand: CIAT, pp. 588–612. [Google Scholar]
Bankevich, A. , Nurk, S. , Antipov, D. , Gurevich, A.A. , Dvorkin, M. , Kulikov, A.S. et al. (2012) SPAdes: A new genome assembly algorithm and its applications to single‐cell sequencing. Journal of Computational Biology, 19, 455–477. [DOI] [PMC free article] [PubMed] [Google Scholar]
Belyeu, J.R. , Chowdhury, M. , Brown, J. , Pedersen, B.S. , Michael, J. , Cormier, M.J. et al. (2021) Samplot: a platform for structural variant visual validation and automated filtering. Genome Biology, 22, 161. [DOI] [PMC free article] [PubMed] [Google Scholar]
Benjamini, Y. & Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, 57, 289–300. [Google Scholar]
Bickhart, D.M. , Rosen, B.D. , Koren, S. , Sayre, B.L. , Hastie, A.R. , Chan, S. et al. (2017) Single‐molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome. Nature Genetics, 49(4), 643–650. [DOI] [PMC free article] [PubMed] [Google Scholar]
Bolger, A.M. , Lohse, M. & Usadel, B. (2014) Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics, 30, 2114–2120. [DOI] [PMC free article] [PubMed] [Google Scholar]
Bowman, M.J. , Pulman, J.A. , Liu, T.L. & Childs, K.L. (2017) A modified GC‐specific MAKER gene annotation method reveals improved and novel gene predictions of high and low GC content in Oryza sativa. BMC Bioinformatics, 18, 1–15. [DOI] [PMC free article] [PubMed] [Google Scholar]
Bredeson, J.V. , Lyons, J.B. , Prochnik, S.E. et al. (2016) Sequencing wild and cultivated cassava and related species reveals extensive interspecific hybridization and genetic diversity. Nature Biotechnology, 34, 562–570. [DOI] [PubMed] [Google Scholar]
Buenrostro, J.D. , Wu, B. , Chang, H.Y. & Greenleaf, W.J. (2015) ATAC‐seq: A method for assaying chromatin accessibility genome‐wide. Current Protocols in Molecular Biology, 109, 1–10. [DOI] [PMC free article] [PubMed] [Google Scholar]
Buren, R.V. , Wai, C.M. , Pardo, J. , Giarola, V. , Ambrosini, S. , Song, X. et al. (2018) Desiccation tolerance evolved through gene duplication and network rewiring in Lindernia. The Plant Cell, 30, 2943–2958. [DOI] [PMC free article] [PubMed] [Google Scholar]
Burton, J.N. , Adey, A. , Patwardhan, R.P. , Qiu, R. , Kitzman, J.O. & Shendure, J. (2013) Chromosome‐scale scaffolding of de novo genome assemblies based on chromatin interactions. Nature Biotechnology, 31, 1119–1125. [DOI] [PMC free article] [PubMed] [Google Scholar]
Campoy, J.A. , Sun, H. , Goel, M. et al. (2020) Gamete binning: chromosome‐level and haplotype‐resolved genome assembly enabled by high‐throughput single‐cell sequencing of gamete genomes. Genome Biology, 21, 306. [DOI] [PMC free article] [PubMed] [Google Scholar]
Cantarel, B.L. , Korf, I. , Robb, S.M.C. , Parra, G. , Ross, E. , Moore, B. et al. (2008) MAKER: an easy‐to‐use annotation pipeline designed for emerging model organism genomes. Genome Research, 18, 188–196. [DOI] [PMC free article] [PubMed] [Google Scholar]
Castel, S.E. , Levy‐Moonshine, A. , Mohammadi, P. , Banks, E. & Lappalainen, T. (2015) Tools and best practices for data processing in allelic expression analysis. Genome Biology, 16, 1–12. [DOI] [PMC free article] [PubMed] [Google Scholar]
Castel, S.E. , Mohammadi, P. , Chung, W.K. , Shen, Y. & Lappalainen, T. (2016) Rare variant phasing and haplotypic expression from RNA sequencing with phASER. Nature Communications, 7, 8–13. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ceballos, H. , Iglesias, C.A. , Pérez, J.C. & Dixon, A.G.O. (2004) Cassava breeding: opportunities and challenges. Plant Molecular Biology, 56, 503–516. [DOI] [PubMed] [Google Scholar]
Cheng, C.Y. , Krishnakumar, V. , Chan, A.P. , Thibaud‐Nissen, F. , Schobel, S. & Town, C.D. (2017) Araport11: a complete reannotation of the Arabidopsis thaliana reference genome. The Plant Journal, 89, 789–804. [DOI] [PubMed] [Google Scholar]
Chin, C.S. , Peluso, P. , Sedlazeck, F.J. et al. (2016) Phased diploid genome assembly with single‐molecule real‐time sequencing. Nature Methods, 13, 1050–1054. [DOI] [PMC free article] [PubMed] [Google Scholar]
Clarke, J.D. (2009) Cetyltrimethyl ammonium bromide (CTAB) DNA miniprep for plant DNA isolation. Cold Spring Harbor Protocols, 4, 5177–5179. [DOI] [PubMed] [Google Scholar]
Dai, H. & Guan, Y. (2020) Nubeam‐dedup: a fast and RAM‐efficient tool to de‐duplicate sequencing reads without mapping. Bioinformatics, 36(10), 3254–3256. [DOI] [PubMed] [Google Scholar]
Dobin, A. , Davis, C.A. , Schlesinger, F. , Drenkow, J. , Zaleski, C. , Jha, S. et al. (2013) STAR: Ultrafast universal RNA‐seq aligner. Bioinformatics, 29, 15–21. [DOI] [PMC free article] [PubMed] [Google Scholar]
Durand, N.C. , Robinson, J.T. , Shamim, M.S. , Machol, I. , Mesirov, J.P. , Lander, E.S. et al. (2016) Juicebox provides a visualization system for Hi‐C contact maps with unlimited zoom. Cell Systems, 3, 99–101. [DOI] [PMC free article] [PubMed] [Google Scholar]
Guan, D. , McCarthy, S.A. , Wood, J. , Howe, K. , Wang, Y. & Durbin, R. (2020) Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics, 36, 2896–2898. [DOI] [PMC free article] [PubMed] [Google Scholar]
Huang, S. , Kang, M. & Xu, A. (2017) HaploMerger2: Rebuilding both haploid sub‐assemblies from high‐heterozygosity diploid genome assembly. Bioinformatics, 33, 2577–2579. [DOI] [PMC free article] [PubMed] [Google Scholar]
ICGMC (2015) High‐resolution linkage map and chromosome‐scale genome assembly for cassava (Manihot esculenta crantz) from 10 populations. G3, 5, 133. [DOI] [PMC free article] [PubMed] [Google Scholar]
Jones, P. , Binns, D. , Chang, H.‐Y. , Fraser, M. , Li, W. , McAnulla, C. et al. (2014) InterProScan 5: genome‐scale protein function classification. Bioinformatics, 30, 1236–1240. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kim, D. , Paggi, J.M. , Park, C. , Bennett, C. & Salzberg, S.L. (2019) Graph‐based genome alignment and genotyping with HISAT2 and HISAT‐genotype. Nature Biotechnology, 37, 907–915. [DOI] [PMC free article] [PubMed] [Google Scholar]
Knowles, D.A. , Davis, J.R. , Edgington, H. et al. (2017) Allele‐specific expression reveals interactions between genetic variation and environment. Nature Methods, 14, 699–702. [DOI] [PMC free article] [PubMed] [Google Scholar]
Koren, S. , Rhie, A. , Walenz, B.P. et al. (2018) De novo assembly of haplotype‐resolved genomes with trio binning. Nature Biotechnology, 36, 1174–1182. [DOI] [PMC free article] [PubMed] [Google Scholar]
Korf, I. (2004) Gene finding in novel genomes. BMC Bioinformatics, 5, 1–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kronenberg, Z.N. , Rhie, A. , Koren, S. et al. (2018) Extended haplotype phasing of de novo genome assemblies with FALCON‐phase. bioRxiv, 1–27. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kuon, J.‐E. , Qi, W. , Schläpfer, P. et al. (2019) Haplotype‐resolved genomes of geminivirus‐resistant and geminivirus‐susceptible African cassava cultivars. BMC Biology, 17, 75. [DOI] [PMC free article] [PubMed] [Google Scholar]
Li, H. & Durbin, R. (2009) Fast and accurate short read alignment with Burrows‐Wheeler transform. Bioinformatics, 25, 1754–1760. [DOI] [PMC free article] [PubMed] [Google Scholar]
Liu, J. , Shi, C. , Shi, C.‐C. et al. (2020) The chromosome‐based rubber tree genome provides new insights into spurge genome evolution and rubber biosynthesis. Molecular Plant, 13, 336–350. [DOI] [PubMed] [Google Scholar]
Love, M.I. , Huber, W. & Anders, S. (2014) Moderated estimation of fold change and dispersion for RNA‐seq data with DESeq2. Genome Biology, 15, 1–34. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lyons, J.B. , Bredeson, J.V. , Mansfeld, B.N. , Bauchet, G.J. , Berry, J. , Boyher, A. et al. (2021) Current status and impending progress for cassava structural genomics. Plant Molecular Biology. 10.1007/s11103-020-01104-w [DOI] [PMC free article] [PubMed] [Google Scholar]
Mansfeld, B.N. , Colle, M. , Kang, Y. , Jones, A.D. & Grumet, R. (2017) Transcriptomic and metabolomic analyses of cucumber fruit peels reveal a developmental increase in terpenoid glycosides associated with age‐related resistance to Phytophthora capsici . Horticulture Research, 4, 17022. [DOI] [PMC free article] [PubMed] [Google Scholar]
Marçais, G. , Delcher, A.L. , Phillippy, A.M. , Coston, R. , Salzberg, S.L. & Zimin, A. (2018) MUMmer4: A fast and versatile genome alignment system. PLOS Computational Biology, 14(1), e100594. [DOI] [PMC free article] [PubMed] [Google Scholar]
Marçais, G. & Kingsford, C. (2011) A fast, lock‐free approach for efficient parallel counting of occurrences of k‐mers. Bioinformatics, 27, 764–770. [DOI] [PMC free article] [PubMed] [Google Scholar]
Michael, T.P. & VanBuren, R. (2020) Building near‐complete plant genomes. Current Opinion in Plant Biology, 54, 26–33. 10.1016/j.pbi.2019.12.009 [DOI] [PubMed] [Google Scholar]
Miller, J.R. , Delcher, A.L. , Koren, S. , Venter, E. , Walenz, B.P. , Brownley, A. et al. (2008) Aggressive assembly of pyrosequencing reads with mates. Bioinformatics, 24, 2818–2824. [DOI] [PMC free article] [PubMed] [Google Scholar]
Minio, A. , Massonnet, M. , Figueroa‐Balderas, R. , Castro, A. & Cantu, D. (2019) Diploid genome assembly of the wine grape Carménère. G3, 9, 1331–1337. [DOI] [PMC free article] [PubMed] [Google Scholar]
Nattestad, M. & Schatz, M.C. (2016) Assemblytics: a web analytics tool for the detection of variants from an assembly. Bioinformatics, 32, 3021–3023. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ou, S. , Chen, J. & Jiang, N. (2018) Assessing genome assembly quality using the LTR Assembly Index (LAI). Nucleic Acids Research, 46, 1–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ou, S. , Su, W. , Liao, Y. et al. (2019) Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline. Genome Biology, 20, 1–18. [DOI] [PMC free article] [PubMed] [Google Scholar]
Padgitt‐Cobb, L.K. , Kingan, S.B. , Wells, J. et al. (2021) A draft phased assembly of the diploid Cascade hop (Humulus lupulus) genome. Plant Genome, 14, 1–22. [DOI] [PMC free article] [PubMed] [Google Scholar]
Patel, S. , Lu, Z. , Jin, X. , Swaminathan, P. , Zeng, E. & Fennell, A.Y. (2018) Comparison of three assembly strategies for a heterozygous seedless grapevine genome assembly. BMC Genomics, 19, 57. [DOI] [PMC free article] [PubMed] [Google Scholar]
Pertea, M. , Pertea, G.M. , Antonescu, C.M. , Chang, T.C. , Mendell, J.T. & Salzberg, S.L. (2015) StringTie enables improved reconstruction of a transcriptome from RNA‐seq reads. Nature Biotechnology, 33, 290–295. [DOI] [PMC free article] [PubMed] [Google Scholar]
Prochnik, S. , Marri, P.R. , Desany, B. et al. (2012) The cassava genome: Current progress, future directions. Tropical Plant Biology, 5, 88–94. [DOI] [PMC free article] [PubMed] [Google Scholar]
Quinlan, A.R. & Hall, I.M. (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26, 841–842. [DOI] [PMC free article] [PubMed] [Google Scholar]
Rabbi, I. , Hamblin, M. , Gedil, M. , Kulakow, P. , Ferguson, M. , Ikpan, A.S. et al. (2014) Genetic mapping using genotyping‐by‐sequencing in the clonally propagated cassava. Crop Science, 54, 1384–1396. 10.2135/cropsci2013.07.0482 [DOI] [Google Scholar]
Ramu, P. , Esuma, W. , Kawuki, R. et al. (2017) Cassava haplotype map highlights fixation of deleterious mutations during clonal propagation. Nature Genetics, 49, 959–963. [DOI] [PubMed] [Google Scholar]
Rao, S. , Huntley, M. , Durand, N. , Stamenova, E. , Bochkov, I. , Robinson, J. et al. (2014) A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell, 159, 1665–1680. [DOI] [PMC free article] [PubMed] [Google Scholar]
Rhie, A. , Walenz, B.P. , Koren, S. & Phillippy, A.M. (2020) Merqury: reference‐free quality, completeness, and phasing assessment for genome assemblies. Genome Biology, 21, 245. [DOI] [PMC free article] [PubMed] [Google Scholar]
Roach, M.J. , Schmidt, S.A. & Borneman, A.R. (2018) Purge Haplotigs: Allelic contig reassignment for third‐gen diploid genome assemblies. BMC Bioinformatics, 19, 1–10. [DOI] [PMC free article] [PubMed] [Google Scholar]
Rojas, M.C. , Pérez, J.C. , Ceballos, H. , Baena, D. , Morante, N. & Calle, F. (2009) Analysis of inbreeding depression in eight S₁ cassava families. Crop Science, 49, 543–548. [Google Scholar]
Shao, L. , Xing, F. , Xu, C. et al. (2019) Patterns of genome‐wide allele‐specific expression in hybrid rice and the implications on the genetic basis of heterosis. Proceedings of the National Academy of Sciences of the United States of America, 116, 5653–5658. [DOI] [PMC free article] [PubMed] [Google Scholar]
Simão, F.A. , Waterhouse, R.M. , Ioannidis, P. , Kriventseva, E.V. & Zdobnov, E.M. (2015) BUSCO: assessing genome assembly and annotation completeness with single‐copy orthologs. Bioinformatics, 31, 3210–3212. [DOI] [PubMed] [Google Scholar]
Song, G. , Guo, Z. , Liu, Z. , Cheng, Q. , Qu, X. , Chen, R. et al. (2013) Global RNA sequencing reveals that genotype‐dependent allele‐specific expression contributes to differential expression in rice F1 hybrids. BMC Plant Biology, 13, 221. [DOI] [PMC free article] [PubMed] [Google Scholar]
Stanke, M. & Morgenstern, B. (2005) AUGUSTUS: a web server for gene prediction in eukaryotes that allows user‐defined constraints. Nucleic Acids Research, 33, 465–467. [DOI] [PMC free article] [PubMed] [Google Scholar]
Tang, H. , Wang, X. , Bowers, J.E. , Ming, R. , Alam, M. & Paterson, A.H. (2008) Unraveling ancient hexaploidy through multiply‐aligned angiosperm gene maps. Genome Research, 18, 1944–1954. [DOI] [PMC free article] [PubMed] [Google Scholar]
Trapnell, C. , Williams, B.A. , Pertea, G. , Mortazavi, A. , Kwan, G. , van Baren, M.J. et al. (2010) Transcript assembly and quantification by RNA‐Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology, 28, 511–515. [DOI] [PMC free article] [PubMed] [Google Scholar]
Vurture, G.W. , Sedlazeck, F.J. , Nattestad, M. , Underwood, C.J. , Fang, H. , Gurtowski, J. et al. (2017) GenomeScope: fast reference‐free genome profiling from short reads. Bioinformatics, 33, 2202–2204. [DOI] [PMC free article] [PubMed] [Google Scholar]
Walker, B.J. , Abeel, T. , Shea, T. , Priest, M. , Abouelliel, A. , Sakthikumar, S. et al. (2014) Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One, 9, e112963. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wang, Y. , Tang, H. , Debarry, J.D. et al. (2012) MCScanX: A toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Research, 40, 1–14. [DOI] [PMC free article] [PubMed] [Google Scholar]
Weber, B. , Zicola, J. , Oka, R. & Stam, M. (2016) Plant Enhancers: a call for discovery. Trends in Plant Science, 21, 974–987. [DOI] [PubMed] [Google Scholar]
Wilson, M.C. , Mutka, A.M. , Hummel, A.W. et al. (2017) Gene expression atlas for the food security crop cassava. New Phytologist, 213, 1632–1641. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wolff, J. , Rabbani, L. , Gilsbach, R. , Richard, G. , Manke, T. , Backofen, R. et al. (2020) Galaxy HiCExplorer 3: A web server for reproducible Hi‐C, capture Hi‐C and single‐cell Hi‐C data analysis, quality control and visualization. Nucleic Acids Research, 48, W177–W184. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wood, D.L.A. , Nones, K. , Steptoe, A. , Christ, A. , Harliwong, I. , Newell, F. et al. (2015) Recommendations for accurate resolution of Gene and isoform allele‐specific expression in RNA‐seq data. PLoS One, 10, 1–27. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zhang, X. , Wu, R. , Wang, Y. , Yu, J. & Tang, H. (2020) Unzipping haplotypes in diploid and polyploid genomes. Computational and Structural Biotechnology Journal, 18, 66–72. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zhou, Y. , Minio, A. , Massonnet, M. , Solares, E. , Lv, Y. , Beridze, T. et al. (2019) The population genetics of structural variants in grapevine domestication. Nat. Plants, 5, 965–979. [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Table S1. Statistics of other assembly attempts.

Table S2. Gene Ontology enrichment of tandem duplicated genes.

Table S3. Gene Ontology enrichment of genes with allele‐specific expression.

Figure S1. Full GenomeScope k‐mer profile.

Figure S2. PacBio read length distribution.

Figure S3. Diploid k‐mer count spectra for assembly stages.

Figure S4. Reduction of heterozygous sequences by FALCON‐Phase.

Figure S5. Linkage map hit duplication rate.

Figure S6. Distribution of long terminal repeats.

Figure S7. Distributions of transposable elements.

Figure S8. Annotation statistics.

Figure S9. Gene syntenic depth analyses.

Figure S10. Sequence alignment to AM560‐2 reference genome.

Click here for additional data file.^{(1.6MB, docx)}

File S1. Gene comparison between the two phase’s annotation (Anchor file).

Click here for additional data file.^{(2MB, anchors)}

File S2. Structural variants between Phase0 and the AM560‐2 reference genome.

Click here for additional data file.^{(1.4MB, tsv)}

File S3. Structural variants between the two phases.

Click here for additional data file.^{(682.4KB, tsv)}

File S4. Allele specific analysis and relation to structural variants in 11 tissues.

Click here for additional data file.^{(55.2MB, xlsx)}

File S5. Configuration file for FALCON suite.

Click here for additional data file.^{(3.3KB, txt)}

File S6. Additional scripts.

Click here for additional data file.^{(25.1KB, txt)}

File S7. Scripts for figures and assessment.

Click here for additional data file.^{(19.2MB, pdf)}

Data Availability Statement

[tpj15543-bib-0001] Alexa, A. , Rahnenfuhrer, J. & Lengauer, T. (2006) Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics, 22, 1600–1607. [DOI] [PubMed] [Google Scholar]

[tpj15543-bib-0002] Alonge, M. , Wang, X. , Benoit, M. , Soyk, S. , Pereira, L. , Zhang, L. et al. (2020) Major impacts of widespread structural variation on gene expression and crop improvement in tomato. Cell, 182, 145–161.e23. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0003] Auwera, G.A. , Carneiro, M.O. , Hartl, C. , Poplin, R. , del Angel, G. , Levy‐Moonshine, A. et al. (2013) From fastQ data to high‐confidence variant calls: The genome analysis toolkit best practices pipeline. Current Protocols in Bioinformatics. Hoboken, NJ, USA: John Wiley & Sons Inc, pp. 1–33. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0004] Aye, T.M. (2012) Cassava agronomy: Land preparation, time and method of planting and harvest, plant spacing and weed control. In The Cassava Handbook: A Reference Manual Based on the Asian Regional Cassava Training Course held in Thailand. Bangkok, Thailand: CIAT, pp. 588–612. [Google Scholar]

[tpj15543-bib-0005] Bankevich, A. , Nurk, S. , Antipov, D. , Gurevich, A.A. , Dvorkin, M. , Kulikov, A.S. et al. (2012) SPAdes: A new genome assembly algorithm and its applications to single‐cell sequencing. Journal of Computational Biology, 19, 455–477. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0006] Belyeu, J.R. , Chowdhury, M. , Brown, J. , Pedersen, B.S. , Michael, J. , Cormier, M.J. et al. (2021) Samplot: a platform for structural variant visual validation and automated filtering. Genome Biology, 22, 161. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0007] Benjamini, Y. & Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, 57, 289–300. [Google Scholar]

[tpj15543-bib-0008] Bickhart, D.M. , Rosen, B.D. , Koren, S. , Sayre, B.L. , Hastie, A.R. , Chan, S. et al. (2017) Single‐molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome. Nature Genetics, 49(4), 643–650. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0009] Bolger, A.M. , Lohse, M. & Usadel, B. (2014) Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics, 30, 2114–2120. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0010] Bowman, M.J. , Pulman, J.A. , Liu, T.L. & Childs, K.L. (2017) A modified GC‐specific MAKER gene annotation method reveals improved and novel gene predictions of high and low GC content in Oryza sativa. BMC Bioinformatics, 18, 1–15. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0011] Bredeson, J.V. , Lyons, J.B. , Prochnik, S.E. et al. (2016) Sequencing wild and cultivated cassava and related species reveals extensive interspecific hybridization and genetic diversity. Nature Biotechnology, 34, 562–570. [DOI] [PubMed] [Google Scholar]

[tpj15543-bib-0012] Buenrostro, J.D. , Wu, B. , Chang, H.Y. & Greenleaf, W.J. (2015) ATAC‐seq: A method for assaying chromatin accessibility genome‐wide. Current Protocols in Molecular Biology, 109, 1–10. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0013] Buren, R.V. , Wai, C.M. , Pardo, J. , Giarola, V. , Ambrosini, S. , Song, X. et al. (2018) Desiccation tolerance evolved through gene duplication and network rewiring in Lindernia. The Plant Cell, 30, 2943–2958. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0014] Burton, J.N. , Adey, A. , Patwardhan, R.P. , Qiu, R. , Kitzman, J.O. & Shendure, J. (2013) Chromosome‐scale scaffolding of de novo genome assemblies based on chromatin interactions. Nature Biotechnology, 31, 1119–1125. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0015] Campoy, J.A. , Sun, H. , Goel, M. et al. (2020) Gamete binning: chromosome‐level and haplotype‐resolved genome assembly enabled by high‐throughput single‐cell sequencing of gamete genomes. Genome Biology, 21, 306. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0016] Cantarel, B.L. , Korf, I. , Robb, S.M.C. , Parra, G. , Ross, E. , Moore, B. et al. (2008) MAKER: an easy‐to‐use annotation pipeline designed for emerging model organism genomes. Genome Research, 18, 188–196. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0017] Castel, S.E. , Levy‐Moonshine, A. , Mohammadi, P. , Banks, E. & Lappalainen, T. (2015) Tools and best practices for data processing in allelic expression analysis. Genome Biology, 16, 1–12. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0018] Castel, S.E. , Mohammadi, P. , Chung, W.K. , Shen, Y. & Lappalainen, T. (2016) Rare variant phasing and haplotypic expression from RNA sequencing with phASER. Nature Communications, 7, 8–13. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0019] Ceballos, H. , Iglesias, C.A. , Pérez, J.C. & Dixon, A.G.O. (2004) Cassava breeding: opportunities and challenges. Plant Molecular Biology, 56, 503–516. [DOI] [PubMed] [Google Scholar]

[tpj15543-bib-0020] Cheng, C.Y. , Krishnakumar, V. , Chan, A.P. , Thibaud‐Nissen, F. , Schobel, S. & Town, C.D. (2017) Araport11: a complete reannotation of the Arabidopsis thaliana reference genome. The Plant Journal, 89, 789–804. [DOI] [PubMed] [Google Scholar]

[tpj15543-bib-0021] Chin, C.S. , Peluso, P. , Sedlazeck, F.J. et al. (2016) Phased diploid genome assembly with single‐molecule real‐time sequencing. Nature Methods, 13, 1050–1054. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0022] Clarke, J.D. (2009) Cetyltrimethyl ammonium bromide (CTAB) DNA miniprep for plant DNA isolation. Cold Spring Harbor Protocols, 4, 5177–5179. [DOI] [PubMed] [Google Scholar]

[tpj15543-bib-0023] Dai, H. & Guan, Y. (2020) Nubeam‐dedup: a fast and RAM‐efficient tool to de‐duplicate sequencing reads without mapping. Bioinformatics, 36(10), 3254–3256. [DOI] [PubMed] [Google Scholar]

[tpj15543-bib-0024] Dobin, A. , Davis, C.A. , Schlesinger, F. , Drenkow, J. , Zaleski, C. , Jha, S. et al. (2013) STAR: Ultrafast universal RNA‐seq aligner. Bioinformatics, 29, 15–21. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0025] Durand, N.C. , Robinson, J.T. , Shamim, M.S. , Machol, I. , Mesirov, J.P. , Lander, E.S. et al. (2016) Juicebox provides a visualization system for Hi‐C contact maps with unlimited zoom. Cell Systems, 3, 99–101. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0026] Guan, D. , McCarthy, S.A. , Wood, J. , Howe, K. , Wang, Y. & Durbin, R. (2020) Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics, 36, 2896–2898. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0027] Huang, S. , Kang, M. & Xu, A. (2017) HaploMerger2: Rebuilding both haploid sub‐assemblies from high‐heterozygosity diploid genome assembly. Bioinformatics, 33, 2577–2579. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0028] ICGMC (2015) High‐resolution linkage map and chromosome‐scale genome assembly for cassava (Manihot esculenta crantz) from 10 populations. G3, 5, 133. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0029] Jones, P. , Binns, D. , Chang, H.‐Y. , Fraser, M. , Li, W. , McAnulla, C. et al. (2014) InterProScan 5: genome‐scale protein function classification. Bioinformatics, 30, 1236–1240. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0030] Kim, D. , Paggi, J.M. , Park, C. , Bennett, C. & Salzberg, S.L. (2019) Graph‐based genome alignment and genotyping with HISAT2 and HISAT‐genotype. Nature Biotechnology, 37, 907–915. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0031] Knowles, D.A. , Davis, J.R. , Edgington, H. et al. (2017) Allele‐specific expression reveals interactions between genetic variation and environment. Nature Methods, 14, 699–702. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0032] Koren, S. , Rhie, A. , Walenz, B.P. et al. (2018) De novo assembly of haplotype‐resolved genomes with trio binning. Nature Biotechnology, 36, 1174–1182. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0033] Korf, I. (2004) Gene finding in novel genomes. BMC Bioinformatics, 5, 1–9. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0034] Kronenberg, Z.N. , Rhie, A. , Koren, S. et al. (2018) Extended haplotype phasing of de novo genome assemblies with FALCON‐phase. bioRxiv, 1–27. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0035] Kuon, J.‐E. , Qi, W. , Schläpfer, P. et al. (2019) Haplotype‐resolved genomes of geminivirus‐resistant and geminivirus‐susceptible African cassava cultivars. BMC Biology, 17, 75. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0036] Li, H. & Durbin, R. (2009) Fast and accurate short read alignment with Burrows‐Wheeler transform. Bioinformatics, 25, 1754–1760. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0037] Liu, J. , Shi, C. , Shi, C.‐C. et al. (2020) The chromosome‐based rubber tree genome provides new insights into spurge genome evolution and rubber biosynthesis. Molecular Plant, 13, 336–350. [DOI] [PubMed] [Google Scholar]

[tpj15543-bib-0038] Love, M.I. , Huber, W. & Anders, S. (2014) Moderated estimation of fold change and dispersion for RNA‐seq data with DESeq2. Genome Biology, 15, 1–34. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0039] Lyons, J.B. , Bredeson, J.V. , Mansfeld, B.N. , Bauchet, G.J. , Berry, J. , Boyher, A. et al. (2021) Current status and impending progress for cassava structural genomics. Plant Molecular Biology. 10.1007/s11103-020-01104-w [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0040] Mansfeld, B.N. , Colle, M. , Kang, Y. , Jones, A.D. & Grumet, R. (2017) Transcriptomic and metabolomic analyses of cucumber fruit peels reveal a developmental increase in terpenoid glycosides associated with age‐related resistance to Phytophthora capsici . Horticulture Research, 4, 17022. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0041] Marçais, G. , Delcher, A.L. , Phillippy, A.M. , Coston, R. , Salzberg, S.L. & Zimin, A. (2018) MUMmer4: A fast and versatile genome alignment system. PLOS Computational Biology, 14(1), e100594. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0042] Marçais, G. & Kingsford, C. (2011) A fast, lock‐free approach for efficient parallel counting of occurrences of k‐mers. Bioinformatics, 27, 764–770. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0043] Michael, T.P. & VanBuren, R. (2020) Building near‐complete plant genomes. Current Opinion in Plant Biology, 54, 26–33. 10.1016/j.pbi.2019.12.009 [DOI] [PubMed] [Google Scholar]

[tpj15543-bib-0044] Miller, J.R. , Delcher, A.L. , Koren, S. , Venter, E. , Walenz, B.P. , Brownley, A. et al. (2008) Aggressive assembly of pyrosequencing reads with mates. Bioinformatics, 24, 2818–2824. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0045] Minio, A. , Massonnet, M. , Figueroa‐Balderas, R. , Castro, A. & Cantu, D. (2019) Diploid genome assembly of the wine grape Carménère. G3, 9, 1331–1337. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0046] Nattestad, M. & Schatz, M.C. (2016) Assemblytics: a web analytics tool for the detection of variants from an assembly. Bioinformatics, 32, 3021–3023. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0047] Ou, S. , Chen, J. & Jiang, N. (2018) Assessing genome assembly quality using the LTR Assembly Index (LAI). Nucleic Acids Research, 46, 1–11. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0048] Ou, S. , Su, W. , Liao, Y. et al. (2019) Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline. Genome Biology, 20, 1–18. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0049] Padgitt‐Cobb, L.K. , Kingan, S.B. , Wells, J. et al. (2021) A draft phased assembly of the diploid Cascade hop (Humulus lupulus) genome. Plant Genome, 14, 1–22. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0050] Patel, S. , Lu, Z. , Jin, X. , Swaminathan, P. , Zeng, E. & Fennell, A.Y. (2018) Comparison of three assembly strategies for a heterozygous seedless grapevine genome assembly. BMC Genomics, 19, 57. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0051] Pertea, M. , Pertea, G.M. , Antonescu, C.M. , Chang, T.C. , Mendell, J.T. & Salzberg, S.L. (2015) StringTie enables improved reconstruction of a transcriptome from RNA‐seq reads. Nature Biotechnology, 33, 290–295. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0052] Prochnik, S. , Marri, P.R. , Desany, B. et al. (2012) The cassava genome: Current progress, future directions. Tropical Plant Biology, 5, 88–94. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0053] Quinlan, A.R. & Hall, I.M. (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26, 841–842. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0054] Rabbi, I. , Hamblin, M. , Gedil, M. , Kulakow, P. , Ferguson, M. , Ikpan, A.S. et al. (2014) Genetic mapping using genotyping‐by‐sequencing in the clonally propagated cassava. Crop Science, 54, 1384–1396. 10.2135/cropsci2013.07.0482 [DOI] [Google Scholar]

[tpj15543-bib-0055] Ramu, P. , Esuma, W. , Kawuki, R. et al. (2017) Cassava haplotype map highlights fixation of deleterious mutations during clonal propagation. Nature Genetics, 49, 959–963. [DOI] [PubMed] [Google Scholar]

[tpj15543-bib-0056] Rao, S. , Huntley, M. , Durand, N. , Stamenova, E. , Bochkov, I. , Robinson, J. et al. (2014) A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell, 159, 1665–1680. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0057] Rhie, A. , Walenz, B.P. , Koren, S. & Phillippy, A.M. (2020) Merqury: reference‐free quality, completeness, and phasing assessment for genome assemblies. Genome Biology, 21, 245. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0058] Roach, M.J. , Schmidt, S.A. & Borneman, A.R. (2018) Purge Haplotigs: Allelic contig reassignment for third‐gen diploid genome assemblies. BMC Bioinformatics, 19, 1–10. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0059] Rojas, M.C. , Pérez, J.C. , Ceballos, H. , Baena, D. , Morante, N. & Calle, F. (2009) Analysis of inbreeding depression in eight S₁ cassava families. Crop Science, 49, 543–548. [Google Scholar]

[tpj15543-bib-0060] Shao, L. , Xing, F. , Xu, C. et al. (2019) Patterns of genome‐wide allele‐specific expression in hybrid rice and the implications on the genetic basis of heterosis. Proceedings of the National Academy of Sciences of the United States of America, 116, 5653–5658. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0061] Simão, F.A. , Waterhouse, R.M. , Ioannidis, P. , Kriventseva, E.V. & Zdobnov, E.M. (2015) BUSCO: assessing genome assembly and annotation completeness with single‐copy orthologs. Bioinformatics, 31, 3210–3212. [DOI] [PubMed] [Google Scholar]

[tpj15543-bib-0062] Song, G. , Guo, Z. , Liu, Z. , Cheng, Q. , Qu, X. , Chen, R. et al. (2013) Global RNA sequencing reveals that genotype‐dependent allele‐specific expression contributes to differential expression in rice F1 hybrids. BMC Plant Biology, 13, 221. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0063] Stanke, M. & Morgenstern, B. (2005) AUGUSTUS: a web server for gene prediction in eukaryotes that allows user‐defined constraints. Nucleic Acids Research, 33, 465–467. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0064] Tang, H. , Wang, X. , Bowers, J.E. , Ming, R. , Alam, M. & Paterson, A.H. (2008) Unraveling ancient hexaploidy through multiply‐aligned angiosperm gene maps. Genome Research, 18, 1944–1954. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0065] Trapnell, C. , Williams, B.A. , Pertea, G. , Mortazavi, A. , Kwan, G. , van Baren, M.J. et al. (2010) Transcript assembly and quantification by RNA‐Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology, 28, 511–515. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0066] Vurture, G.W. , Sedlazeck, F.J. , Nattestad, M. , Underwood, C.J. , Fang, H. , Gurtowski, J. et al. (2017) GenomeScope: fast reference‐free genome profiling from short reads. Bioinformatics, 33, 2202–2204. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0067] Walker, B.J. , Abeel, T. , Shea, T. , Priest, M. , Abouelliel, A. , Sakthikumar, S. et al. (2014) Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One, 9, e112963. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0068] Wang, Y. , Tang, H. , Debarry, J.D. et al. (2012) MCScanX: A toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Research, 40, 1–14. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0069] Weber, B. , Zicola, J. , Oka, R. & Stam, M. (2016) Plant Enhancers: a call for discovery. Trends in Plant Science, 21, 974–987. [DOI] [PubMed] [Google Scholar]

[tpj15543-bib-0070] Wilson, M.C. , Mutka, A.M. , Hummel, A.W. et al. (2017) Gene expression atlas for the food security crop cassava. New Phytologist, 213, 1632–1641. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0071] Wolff, J. , Rabbani, L. , Gilsbach, R. , Richard, G. , Manke, T. , Backofen, R. et al. (2020) Galaxy HiCExplorer 3: A web server for reproducible Hi‐C, capture Hi‐C and single‐cell Hi‐C data analysis, quality control and visualization. Nucleic Acids Research, 48, W177–W184. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0072] Wood, D.L.A. , Nones, K. , Steptoe, A. , Christ, A. , Harliwong, I. , Newell, F. et al. (2015) Recommendations for accurate resolution of Gene and isoform allele‐specific expression in RNA‐seq data. PLoS One, 10, 1–27. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0073] Zhang, X. , Wu, R. , Wang, Y. , Yu, J. & Tang, H. (2020) Unzipping haplotypes in diploid and polyploid genomes. Computational and Structural Biotechnology Journal, 18, 66–72. [DOI] [PMC free article] [PubMed] [Google Scholar]

[tpj15543-bib-0074] Zhou, Y. , Minio, A. , Massonnet, M. , Solares, E. , Lv, Y. , Beridze, T. et al. (2019) The population genetics of structural variants in grapevine domestication. Nat. Plants, 5, 965–979. [DOI] [PubMed] [Google Scholar]

PERMALINK

Large structural variations in the haplotype‐resolved African cassava genome

Ben N Mansfeld

Adam Boyher

Jeffrey C Berry

Mark Wilson

Shujun Ou

Seth Polydore

Todd P Michael

Noah Fahlgren

Rebecca S Bart

Summary

Significance Statement

INTRODUCTION

RESULTS AND DISCUSSION

Genome size and heterozygosity

Figure 1.

Maximizing the diploid assembly

Table 1.

Haplotypic purging and deduplication

Figure 2.

Figure 3.

Figure 4.

Haplotype phasing and scaffolding with Hi‐C sequencing

Assessing the quality of the final assembly

Transposable elements and gene annotation

Transposable element and repeat annotation

Table 2.

Gene annotation, synteny, and tandem duplication

Figure 5.

Figure 6.

Haplotype‐specific sequence and SV

Comparison with the inbred AM560‐2 reference

Figure 7.

Large haplotypic SVs in TME7

Table 3.

Effects of haplotypic SV on allele‐specific expression

Figure 8.

Tissue‐specific gene expression Cassava Atlas

CONCLUSION

EXPERIMENTAL PROCEDURES

Plant material and nucleic acid extraction

Library preparation and sequencing

Illumina

PacBio

Chromatin Conformation Capture sequencing (Hi‐C)

Genome size and heterozygosity estimation

De novo genome assembly and scaffolding

Maximizing the diploid assembly

Assembly deduplication

Haplotype phasing

Scaffolding

Assembly quality assessment

Linkage map alignment

K‐mer based evaluation

Assessment of genic and repetitive sequence space

Haplotype‐specific annotation

TE annotation and repeat masking

Gene annotation

Gene synteny and tandem duplication analyses

SV and polymorphisms

ASE

GO annotation and enrichment analyses

SHINY app update

Scripts and figures

AUTHOR CONTRIBUTIONS

CONFLICT OF INTERESTS

Supporting information

ACKNOWLEDGEMENTS

DATA AVAILABILITY STATEMENT

REFERENCES

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases