Abstract
In genome sequence analysis, phasing, i.e., determining which genetic variants reside on the same chromosome, is essential for assembling genomic sequences into scaffolds. However, phasing remains challenging for plant species with autopolyploid genomes such as potato (Solanum tuberosum L.) due to the high sequence similarity among homologous and homoeologous chromosomes. A bin marker is a DNA tag for a chromosome segment that helps create genetic maps and find trait-related genes. Traditional bacterial artificial chromosome (BAC)-based methods use end-sequencing of BAC clones to provide information on phasing; however, the need for labor-intensive library construction and selection of single colonies limits these methods. Here, we present BacPhase, an innovative sequence-based approach in which constructed BACs are digested with a restriction enzyme and self-ligated to produce small inserts that can be amplified by PCR and sequenced, thus removing the need for selecting single colonies. Because the restriction enzyme used affects the evenness and spacing of markers, we evaluated 14 restriction enzymes to select optimal enzymes for multiple crop species. Using PacBio HiFi long-read sequencing to span repetitive regions, we generated 39,484 high-confidence, high-resolution bin markers in potato. Unlike Hi-C, which relies on chromatin interactions, BacPhase uses sequence polymorphisms directly, enabling precise haplotype resolution. Indeed, BacPhase anchored 59.58 % of scaffolds to the chromosomes in the autotetraploid potato cultivar C88, substantially improving contiguity without requiring physical maps or Hi-C data. The BacPhase method could facilitate trait mapping, genomic selection, and accelerated breeding in polyploid crops such as potato, sugarcane (Saccharum officinarum), and alfalfa (Medicago sativa).
Keywords: BacPhase, Polyploid, Potato, Bin marker, Genome phasing
1. Introduction
Polyploidy is widespread in prokaryotic and eukaryotic organisms and presents unique challenges to phasing, i.e., determining which genetic variants are located on the same chromosome, and therefore to genome assembly. This is particularly important for agriculturally important crops [1,2], as over 70 % are polyploid [3,4], with species such as potato (Solanum tuberosum), wheat (Triticum aestivum), and sugarcane (Saccharum officinarum) having complex polyploid genomes. High sequence similarity among homologous and homoeologous chromosomes in these crops impedes accurate phasing [5], [6], [7], [8], [9], [10], [11], [12]. Moreover, phasing for autopolyploid genomes such as potato remains particularly challenging due to extensive shared haplotypes [13]. Current methods employing family-based or population inference often yield incomplete variant phasing and require ancillary data [14,15]. Tools such as Hifiasm have advanced capabilities for diploid and allopolyploid phasing but require extensive long-read sequencing, highlighting the need for innovative approaches to resolve complex genomes [16], [17], [18].
Potato (Solanum tuberosum L.) serves as a prime example of these challenges. The potato genome is complex, autotetraploid, and highly repetitive; indeed, the haploid potato genome contains 62 % repetitive sequences [7], making accurate phasing particularly difficult. Moreover, reference genomes exist for haploid [9] and diploid potato [19], [20], but accurate phasing of the cultivated tetraploid potato genome [5,6,8] remains challenging due to the near-identical sequences shared among homologous and homoeologous chromosomes.
We previously developed an improved BAC-end sequencing strategy for constructing high-resolution physical maps and facilitating genome assembly for plants with complex polyploid genomes [21]. In this study, we introduced two key enhancements to this strategy. First, we employed optimized restriction enzyme selection and identified BstBI as an excellent restriction enzyme for increasing marker density and ensuring an even genomic distribution of markers. Second, we integrated PacBio HiFi long-read sequencing to resolve repetitive regions with greater accuracy. Unlike Hi-C, which may generate ambiguous or spurious interaction signals, our method provides sequence-based, high-confidence anchoring information without requiring the selection of single colonies, thereby improving efficiency and reliability. Although BAC library construction is less commonly used today, our streamlined approach is compatible with high-throughput sequencing workflows and has the potential to replace Hi-C for certain applications. This method, known as BacPhase, enables precise haplotype separation, improves bin marker density, and enhances chromosome-scale genome assembly, especially for autopolyploid crops for which traditional methods fall short. BacPhase serves as a valuable tool for genotyping and physical mapping of complex polyploid genomes.
2. Results
2.1. Consistent restriction enzyme cutting across diverse polyploid plant genomes
The effectiveness of BAC end sequencing largely depends on the availability of numerous, evenly distributed restriction sites throughout the genome. Therefore, the selection of an appropriate restriction enzyme is a key factor in achieving this goal. A good restriction enzyme must have high efficiency, specificity, and the ability to cut at numerous, evenly distributed restriction sites. Based on these criteria, we chose 14 enzymes compatible with the CopyControl pCC1 BAC-cloning vector and examined the lengths of their digestion products through electronic enzyme digestion in specific genomes. To evaluate whether these enzymes are broadly applicable to polyploid crops in addition to potato, we selected several polyploid crops, such as hexaploid sweet potato (Ipomoea batatas) [22], hexaploid wheat (Triticum aestivum) [23], triploid banana (Musa acuminata) [11], and tetraploid alfalfa (Medicago sativa) [24].
We detected consistent fragment length distributions for BstBI and ClaI through electronic digestion, as shown by both median and mean fragment lengths (Fig. 1A and B, Table S1). This suggests that BstBI and ClaI exhibit comparable cleavage efficiencies in these diverse genomes, with no significant outliers of specific restriction enzyme site distributions in any particular species. Similar trends were observed for other commonly used enzymes such as NsiI, BfrBI, AvrII, Bsu36I, PmlI, and NheI, which also displayed similar fragment length distributions across the tested genomes, reinforcing their reliability for cross-species genomic studies. For RsrII, PacI, and AatII, we detected slight variations in the median and mean fragment lengths among species, possibly due to differences in enzyme site distributions, yet the overall cleavage patterns remained comparable. MluI, PmeI, and BbvCI generated relatively long restriction fragments among various species, but the differences were not substantial enough to suggest inefficiency. These findings indicate that multiple restriction enzymes, including BstBI, ClaI, and AvrII, function effectively across diverse plant genomes, supporting their broad applicability in genomic research. Although species-specific optimization may still be beneficial in certain cases, these data suggest that these enzymes are generally robust for comparative studies. Further validation using additional species could help refine their use for specialized applications.
Fig. 1.
DNA fragment lengths generated by electronic enzymatic digestion. A Median fragment length produced by restriction enzyme digestion. B Mean fragment lengths produced by restriction enzyme digestion. Note: colors in panels A and B represent the following: yellow-green for banana, coral for C88 (potato), royal blue for DM (potato), gray for Medicago, dark blue for sweet potato, dark sea green for wheat.
For the target species, potato, we used the commercial cultivar Cooperation-88 (C88), which is autotetraploid, high-yielding, multi-purpose, and resistant to late blight. We further evaluated the 14 enzymes in this genome by analyzing fragment size distributions (maximum, minimum, and standard deviation; Table S2), prioritizing enzymes that produced fragments with a more uniform size distribution. We established the following thresholds, which were determined by the sequencing read length: a mean or median fragment size between 3 and 5 kb was considered optimal for BAC insert fragments, and fragments exceeding 10 kb in length were excluded. Excessively small fragments that would lead to inefficiencies in PacBio sequencing were discarded.
We selected BstBI for further analysis of the genome of the potato cultivar C88 and compared it to two other enzymes: ClaI and MluI. We detected a total of 168,248 BstBI cleavage sites, which is >21 % more than the number of sites detected for ClaI (138,535) and approximately 11 times as many as detected for MluI (15,390) (Fig. S1) [21]. The mean interval between adjacent BstBI sites was 4,591 bp, which is shorter than those for ClaI (5,849 bp) and MluI (15,390 bp). Intervals ranged from a minimum of 6 bp to a maximum of 186,808 bp, a narrower range than observed with ClaI (maximum 236,389 bp) and MluI (maximum 1,254,638 bp). The median interval for BstBI was 2,857 bp, which is shorter than those for ClaI (3,178 bp) and MluI (29,680 bp; Fig. S1). Compared to ClaI and MluI, BstBI demonstrated a more even distribution of cleavage sites, making it an ideal candidate complementary enzyme for BAC end sequencing [21]. Therefore, we chose BstBI and ClaI for BAC-end HiFi sequencing.
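The electronic digestion described above can be illustrated with a minimal Python sketch. It assumes the standard recognition sequences of BstBI (TTCGAA), ClaI (ATCGAT), and MluI (ACGCGT) and measures intervals between the start positions of adjacent sites; the function names, the interval definition, and the toy sequence are illustrative assumptions rather than the pipeline used in this study.

```python
import re
from statistics import mean, median

# Standard recognition sequences; all three are palindromic,
# so scanning one strand is sufficient.
RECOGNITION_SITES = {"BstBI": "TTCGAA", "ClaI": "ATCGAT", "MluI": "ACGCGT"}


def site_positions(sequence, site):
    """Return 0-based start positions of every occurrence of `site`."""
    # A zero-width lookahead also reports overlapping occurrences.
    return [m.start() for m in re.finditer(f"(?={site})", sequence.upper())]


def site_intervals(sequence, site):
    """Distances between the start positions of adjacent recognition sites."""
    positions = site_positions(sequence, site)
    return [b - a for a, b in zip(positions, positions[1:])]


def interval_summary(sequence, enzyme):
    """Summarize site spacing for one enzyme; None if fewer than two sites."""
    intervals = site_intervals(sequence, RECOGNITION_SITES[enzyme])
    if not intervals:
        return None
    return {
        "sites": len(intervals) + 1,
        "mean": mean(intervals),
        "median": median(intervals),
        "min": min(intervals),
        "max": max(intervals),
    }


# Toy sequence (not potato data) with three BstBI sites and two ClaI sites.
toy = ("TTCGAA" + "A" * 10 + "ATCGAT" + "C" * 4 + "TTCGAA"
       + "G" * 20 + "ATCGAT" + "TTCGAA")

for enzyme in RECOGNITION_SITES:
    print(enzyme, interval_summary(toy, enzyme))
```

Applied to a whole assembly, the same summary would be computed per chromosome sequence and pooled, which is how statistics of the kind shown in Fig. S1 could be reproduced.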
2.2. Workflow of BacPhase library construction and analysis
To obtain BAC end sequences, we extracted genomic DNA from potato cultivar Cooperation-88 (C88), partially digested it with HindIII, and ligated it to the CopyControl pCC1 BAC-cloning vector to create recombinant vectors containing the inserted sequences (Fig. 2A). We then digested the constructs with BstBI or ClaI and separated the restriction products by electrophoresis to isolate fragments larger than 8 kb. These fragments were allowed to self-ligate and were then amplified by PCR to obtain the desired BAC fragments, which we sequenced on the PacBio platform (Fig. 2B). We split single reads into “paired-end” reads at the enzyme cleavage site and mapped them to the C88 autotetraploid potato genome, a chromosome-level phased assembly that still contains 1,368 unanchored scaffolds [8]. We then identified target reads based on the alignment results. The mapped read pairs were classified into three types: in the same haplotype, on the same chromosome, and on different chromosomes. Given the average insert size of approximately 90 kb in the initial BAC library [21], we used the read pairs that mapped to the same haplotype, in line with our expectations, to calculate the distance between the two ends (Fig. 2C).
Fig. 2.
The workflow of BacPhase. A Construction of a BAC end library and generation of material for sequencing. B HiFi sequencing, sequence mapping onto the C88 genome, and four mapping types. C Defining possible gaps using reads mapped to the same haplotype.
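As a hedged illustration of how a single HiFi read might be split into two pseudo paired-end reads at the cleavage site, the sketch below handles reads with exactly one recognition site. Keeping the full recognition sequence on both halves is an assumption made here so that each half matches the genome at its original site; the function name `split_at_site` is hypothetical.

```python
def split_at_site(read, site="TTCGAA"):
    """Split a read with exactly one recognition site into two pseudo ends.

    Returns (left, right), each retaining the full recognition sequence,
    or None when the read has no site or more than one site.
    """
    read = read.upper()
    first = read.find(site)
    if first == -1:
        return None  # no recognition site in this read
    if read.find(site, first + 1) != -1:
        return None  # several sites: handled separately in the text
    left = read[: first + len(site)]
    right = read[first:]
    return left, right


print(split_at_site("ACGTTTCGAAGGCC"))
```

Reads with more than one site are reported separately throughout the Results, so a fuller implementation would return every segment between adjacent sites rather than None.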
Following separate single digestions with either BstBI or ClaI, the reads contained different landmark sequences, such as the forward sequencing primer, the reverse sequencing primer, HindIII primers, and other restriction enzyme sites. We sorted the reads into four types based on these landmarks: reads containing the restriction sites and the forward primer for HindIII (FH), reads containing the restriction sites and the reverse primer for HindIII (HR), reads containing the restriction sites, FH, and HR (FHR), and reads containing only the restriction sites (RS). For each type except FHR, the location of the restriction sites was classified as inner or end (Table S3); all restriction sites within FHR reads are “inner” because primers are present at both ends. For FH, BstBI and ClaI recognition sites were primarily located in inner regions, with counts of 29,462 and 15,473, respectively, and fewer were found at the ends (37 and 6, respectively). Similarly, for HR, BstBI and ClaI recognition sites were primarily located in inner regions, with counts of 26,033 (BstBI) and 12,534 (ClaI), whereas end region counts were only 44 and 7, respectively. FHR exhibited lower inner region counts of 6,494 (BstBI) and 7,391 (ClaI), which merely reflects the proportion of FHR among all reads. For RS, inner regions also had much higher counts (32,939 for BstBI and 24,928 for ClaI) than ends (59 and 23, respectively) (Table S3). These results highlight the distribution bias of restriction sites. We also included a control in which the digested DNA was ligated directly after enzymatic cleavage, without ligation to a vector. The control sequences were therefore genomic sequences, and even if they contained cleavage sites, there were no gaps between cleavage sites. The control sequences were used to filter out genomic DNA.
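The landmark-based sorting described above can be sketched as follows. The two tag sequences are hypothetical placeholders (the actual primer sequences are not given in the text), reverse-complement matching is omitted for brevity, and treating a site as “end” only when it touches a read boundary is an assumption.

```python
# Hypothetical placeholder landmarks; the real primer sequences are not
# stated in the text and would replace these.
FORWARD_TAG = "GATTACAGATTACA"
REVERSE_TAG = "CCATGGTTAACCGG"


def classify_read(read, site="TTCGAA"):
    """Assign a read to FH, HR, FHR, or RS; None if it lacks the site."""
    read = read.upper()
    if site not in read:
        return None
    has_forward = FORWARD_TAG in read
    has_reverse = REVERSE_TAG in read
    if has_forward and has_reverse:
        return "FHR"
    if has_forward:
        return "FH"
    if has_reverse:
        return "HR"
    return "RS"


def site_location(read, site="TTCGAA", margin=0):
    """Label the first site as 'end' if within `margin` bp of a read boundary."""
    read = read.upper()
    pos = read.find(site)
    if pos == -1:
        return None
    at_end = pos <= margin or pos + len(site) >= len(read) - margin
    return "end" if at_end else "inner"


example = FORWARD_TAG + "ACGT" + "TTCGAA" + "ACGT"
print(classify_read(example), site_location(example))
```

Counting the two labels over all reads of one enzyme would give a table with the same layout as Table S3.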
2.3. More, evenly distributed restriction sites are key to the effectiveness of BacPhase
To validate the in silico results using BstBI and ClaI in the C88 genome, we performed BAC end sequencing using PacBio HiFi, yielding 280,442 reads for the control, 197,442 reads for BstBI, and 240,684 reads for ClaI, with total read lengths of 735 Mb, 353 Mb, and 425 Mb, respectively. The average read lengths were 2,623 bp for the control, 1,788 bp for BstBI, and 1,763 bp for ClaI (Table S4).
We then selected sequences that contained restriction sites. Following digestion with ClaI, only 60,362 reads contained ClaI cleavage sites, accounting for 25.08 % of the raw reads (Table S5). The newly introduced BstBI enzyme performed much better: 95,068 reads, comprising 48.15 % of the raw reads, contained BstBI cleavage sites, i.e., nearly double the proportion observed for ClaI. To account for redundant sequences introduced by PCR before sequencing, we performed self-redundancy filtering on the sequences containing restriction enzyme cleavage sites. This resulted in 64,779 reads for BstBI and 45,123 reads for ClaI, corresponding to 68.15 % and 74.75 % of the cleavage-site-containing reads, respectively (Table S5). After removing the control sequences (those directly derived from the genome rather than from BAC ends), we ultimately obtained 64,144 reads for BstBI and 44,146 reads for ClaI (Table S5). We used these final sequences for subsequent analysis. These results indicate that BstBI generated more usable sequences than ClaI.
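The read-filtering percentages in this paragraph can be recomputed from the reported counts. The short sketch below does only this arithmetic; the dictionary layout and the helper `pct` are illustrative, and the recomputed values agree with the reported ones to within rounding.

```python
# Read counts reported in the text and Tables S4 and S5.
raw_reads = {"BstBI": 197_442, "ClaI": 240_684}
site_reads = {"BstBI": 95_068, "ClaI": 60_362}    # contain a cleavage site
dedup_reads = {"BstBI": 64_779, "ClaI": 45_123}   # after redundancy filtering
final_reads = {"BstBI": 64_144, "ClaI": 44_146}   # after removing controls


def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100 * part / whole, 2)


for enzyme in raw_reads:
    print(
        enzyme,
        "site-containing % of raw:", pct(site_reads[enzyme], raw_reads[enzyme]),
        "retained % after dedup:", pct(dedup_reads[enzyme], site_reads[enzyme]),
        "final % of raw:", pct(final_reads[enzyme], raw_reads[enzyme]),
    )

# Ratio of the site-containing proportions (BstBI relative to ClaI).
proportion_ratio = (site_reads["BstBI"] / raw_reads["BstBI"]) / (
    site_reads["ClaI"] / raw_reads["ClaI"]
)
print("proportion ratio:", round(proportion_ratio, 2))
```

The comparison is made on proportions rather than raw counts because the two libraries were sequenced to different depths.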
Finally, we assessed the completeness of the restriction enzyme digestion during BAC end library construction. BstBI performed better than ClaI, with a higher proportion of reads containing only one cleavage site, i.e., 86.30 % of reads for BstBI compared to 72.59 % for ClaI (Table S6). The proportion of reads with two cleavage sites was 11.63 % for BstBI compared to 21.12 % for ClaI. Furthermore, the maximum number of cleavage sites detected in BstBI reads was six, while ClaI reads contained up to 12 cleavage sites (Table S6).
The selection of BstBI as a restriction enzyme significantly improved the efficiency of BAC end sequencing in terms of both the number of usable reads and the even distribution of cleavage sites. The new enzyme combination (BstBI and ClaI) offers advantages over the original combination (ClaI and MluI) in terms of both uniformity and the number of cleavage sites. These findings provide a solid foundation for the BAC end sequencing of complex genomes, such as that of potato, and hold great promise for future applications in marker development and genome assembly of polyploid crops.
2.4. Mapping results reveal the effectiveness of BacPhase for complex genome assembly
BacPhase can effectively support the assembly of complex genomes, as evidenced by the mapping of the final sequences derived from BstBI and ClaI digestion to the C88 genome (http://spuddb.uga.edu/c88_potato_download.shtml). We split each final sequence at the cleavage site into paired-end sequences and mapped these to the C88 genome. For BstBI, the majority of sequences with one recognition site were grouped into the “multiple map” category (56.02 %, 62,016), followed by the “unique map” category (36.10 %, 39,970), and the fewest were grouped into the “unmapped” category (7.88 %, 8,722) (Table 1). The same trend was observed for sequences with more than one recognition site: the greatest proportion were grouped into the “multiple map” category (54.00 %, 9,493), followed by the “unique map” category (31.43 %, 5,525) and the “unmapped” category (14.57 %, 2,562) (Table 1). For ClaI, the distribution followed a similar trend. The majority of sequences with one recognition site (58.35 %, 37,395) were grouped in the “multiple map” category, as were sequences with more than one recognition site (58.09 %, 14,059), followed by the “unique map” category (32.32 %, 20,712 sequences with one recognition site and 28.58 %, 6,916 sequences with more than one recognition site). The “unmapped” category had the fewest sequences in both cases, with 9.34 % for sequences with one recognition site and 13.33 % for sequences with more than one recognition site (Table 1). Overall, the “unmapped” category consistently contained the fewest sequences for both enzymes, whether the sequences had one or multiple enzyme sites, highlighting the effectiveness of our BAC end sequencing. These mapping distributions closely align with the unique and repetitive content of the potato genome [7], confirming that the BAC end sequences were appropriately targeted and successfully mapped.
Table 1.
Statistics of mapping to C88.
| Map type | BstBI: No. of enzyme site = 1 | BstBI: Percentage (%) | BstBI: No. of enzyme site > 1 | BstBI: Percentage (%) | ClaI: No. of enzyme site = 1 | ClaI: Percentage (%) | ClaI: No. of enzyme site > 1 | ClaI: Percentage (%) |
|---|---|---|---|---|---|---|---|---|
| Unique map | 39,970 | 36.10 | 5,525 | 31.43 | 20,712 | 32.32 | 6,916 | 28.58 |
| Multiple map | 62,016 | 56.02 | 9,493 | 54.00 | 37,395 | 58.35 | 14,059 | 58.09 |
| Unmap | 8,722 | 7.88 | 2,562 | 14.57 | 5,983 | 9.34 | 3,227 | 13.33 |
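The percentages in Table 1 follow directly from the counts in each column. As a minimal check, the sketch below recomputes them; the dictionary keys and the function name are illustrative choices.

```python
# Counts from Table 1, keyed by (enzyme, number of enzyme sites).
table1 = {
    ("BstBI", "1"): {"unique": 39_970, "multiple": 62_016, "unmapped": 8_722},
    ("BstBI", ">1"): {"unique": 5_525, "multiple": 9_493, "unmapped": 2_562},
    ("ClaI", "1"): {"unique": 20_712, "multiple": 37_395, "unmapped": 5_983},
    ("ClaI", ">1"): {"unique": 6_916, "multiple": 14_059, "unmapped": 3_227},
}


def category_percentages(counts):
    """Convert category counts into percentages of the column total."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


for key, counts in table1.items():
    print(key, sum(counts.values()), category_percentages(counts))
```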
The mapping results indicate that BacPhase, when combined with the results of single BstBI and ClaI enzyme digestion, offers an effective approach for the assembly of complex genomes, producing high-quality, well-distributed BAC end sequences that align with the inherent complexity of polyploid genomes.
2.5. BacPhase is useful for genotyping complex genomes
To gain further insight into the mapping results, we analyzed the mapping information based on chromosome and haplotype positions. Because cultivated potato is autotetraploid, here, the term “same haplotype” refers to the same haplotype on the same chromosome, while the term “same chromosome” refers to different haplotypes on the same chromosome. Considering the unique and repetitive sequences in the potato genome, we categorized the types of sequence mapping into three categories: (1) unique mapping (both ends map uniquely), (2) one end unique and the other end with multiple mapping, and (3) multiple mapping (both ends with multiple mapping). For each category, we examined the chromosome and haplotype positions to assess their distribution and potential biases (Table 2).
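The three positional classes can be expressed compactly in code. The sketch below assumes reference names of the form `chr<N>_<haplotype>` (as in chr1_1 or chr9_3, used later in the text); the function names are hypothetical.

```python
def parse_haplotype(name):
    """Split a name such as 'chr9_3' into ('chr9', '3')."""
    chromosome, _, haplotype = name.rpartition("_")
    return chromosome, haplotype


def classify_pair(ref_a, ref_b):
    """Classify where the two ends of one sequence mapped."""
    chrom_a, hap_a = parse_haplotype(ref_a)
    chrom_b, hap_b = parse_haplotype(ref_b)
    if chrom_a != chrom_b:
        return "diff chr"
    # Same chromosome: distinguish identical from different haplotypes.
    return "same hap" if hap_a == hap_b else "same chr"


for pair in [("chr1_1", "chr1_1"), ("chr1_1", "chr1_3"), ("chr1_1", "chr4_1")]:
    print(pair, classify_pair(*pair))
```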
Table 2.
Mapping information of chromosome and haplotype.
| Map type | Position | BstBI: No. of enzyme site = 1 | BstBI: No. of enzyme site > 1 | ClaI: No. of enzyme site = 1 | ClaI: No. of enzyme site > 1 | Total |
|---|---|---|---|---|---|---|
| Unique map | same hap | 8,234 | 622 | 1,629 | 330 | 10,815 |
| Unique map | same chr | 232 | 42 | 203 | 49 | 526 |
| Unique map | diff chr | 2,430 | 541 | 2,292 | 777 | 6,040 |
| One unique & one multiple map | same hap | 6,221 | 468 | 1,605 | 396 | 8,690 |
| One unique & one multiple map | same chr | 530 | 129 | 555 | 220 | 1,434 |
| One unique & one multiple map | diff chr | 8,034 | 1,779 | 8,270 | 3,075 | 21,158 |
| Multiple map | same hap | 4,671 | 546 | 1,307 | 384 | 6,908 |
In the unique mapping category, the greatest number of sequences (10,815 in total) mapped to the same haplotype, which aligns with our expectations, although this number represents only 5.48 % of the raw sequence dataset. A substantial portion of sequences mapped to different chromosomes (6,040 sequences), possibly due to the genome's complexity and the challenges inherent in genome assembly. By contrast, relatively few sequences mapped to the same chromosome, with only 526 occurrences across both enzymes (Table 2). These findings reflect the inherent complexity of the potato genome and suggest that further refinements in the assembly might be necessary. In the one end unique and the other end multiple mapping category, the number of same-haplotype mappings decreased slightly, to 8,690. We hypothesized that this reduction reflects the lower frequency of regions in which one end of the fragment maps to a unique sequence and the other end maps to a repetitive sequence. This inference was supported by k-mer analysis of putative repetitive sequences at read ends (Fig. S2). The plot shows distinct frequency patterns between unique sequences (lighter colors) and repetitive sequences (darker colors). Within the same category, multiply mapped read ends always had higher frequencies than uniquely mapped read ends, suggesting that the multiply mapped ends are repetitive sequences. Numerous sequences in this category (21,158) mapped to different chromosomes, likely because the repetitive end can align to repeat copies on other chromosomes. In the multiple mapping category, we focused on sequences that mapped to the same haplotype, yielding a total of 6,908 sequences (Table 2). This further underscores the power of using both BstBI and ClaI to provide more accurate and consistent mapping in complex genome regions.
Finally, we used the reads that mapped to the same haplotype for the next round of gap distance analysis (Table 2). This step further validated the utility of BacPhase for genotyping complex genomes and highlighted the significant advantages of using both BstBI and ClaI for more accurate and comprehensive genome assembly in polyploid crops such as potato.
Because the orientation of the BAC-end sequencing did not change, we only evaluated the insert sizes of paired BAC-end reads. To assess internal gaps in the same haplotypes for BstBI and ClaI sequences, we analyzed gaps according to the three mapping categories. Since the fragments that were ligated to the vector were primarily selected within the 60–100 kb range, we focused on this gap region for further investigation (Fig. 3, Table S7).
Fig. 3.
Gap distance of BAC end sequences within the same haplotype generated using the restriction enzymes BstBI (A) and ClaI (B). Enlarged views of the 60–100 kb regions for BstBI (C) and ClaI (D). x-axis indicates the gap distance in reads generated by BAC-end sequencing using BstBI; y-axis indicates the number of reads mapped by BAC-end sequencing using BstBI. Each color represents different marker sources, as indicated in the figure.
For both enzymes, the largest share of internal gaps fell within the 60–100 kb range. Specifically, for BstBI sequences with a single enzyme site, 95.21 % of gaps fell in this range in the unique mapping category, 93.90 % in the one end unique and the other end multiple mapping category, and 78.23 % in the both ends multiple mapping category. For sequences with more than one enzyme site, the proportion of 60–100 kb gaps decreased slightly but remained above 80 % in the one end unique and the other end multiple mapping category (Fig. 3A, Table S7). Overall, the proportion of 60–100 kb gaps for BstBI was quite high and met our expectations, indicating that BstBI effectively captured these internal gap regions.
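A hedged sketch of the gap calculation follows. It assumes each mapped end is represented as a (start, end) coordinate pair on the same haplotype and defines the gap as the unaligned distance between the two ends; both the representation and the inclusive 60–100 kb bounds are illustrative assumptions.

```python
def gap_distance(end_a, end_b):
    """Distance between two mapped ends, each given as (start, end)."""
    # Order the two alignments by start coordinate.
    (_, first_end), (second_start, _) = sorted([end_a, end_b])
    # Overlapping alignments are reported as a gap of zero.
    return max(0, second_start - first_end)


def fraction_in_range(gaps, low=60_000, high=100_000):
    """Fraction of gaps lying within [low, high] bp."""
    if not gaps:
        return 0.0
    return sum(low <= gap <= high for gap in gaps) / len(gaps)


# Toy coordinates, not data from this study.
pairs = [
    ((1_000, 2_000), (82_000, 83_000)),
    ((500_000, 501_500), (495_000, 496_000)),
    ((10_000, 11_000), (75_000, 76_500)),
    ((0, 1_000), (150_000, 151_000)),
]
gaps = [gap_distance(a, b) for a, b in pairs]
print(gaps, fraction_in_range(gaps))
```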
For ClaI, 49.39 % of gaps in the unique mapping category fell in the 60–100 kb range, compared with 46.97 % in the one end unique and the other end multiple mapping category and 31.75 % in the both ends multiple mapping category. The proportion of gaps in the 60–100 kb range was consistently lower for sequences with more than one enzyme site than for sequences with a single enzyme site (Fig. 3B, Table S7). Both enzymes showed notable peaks in the 76–80 kb gap range (Fig. 3C and D, Table S7). The frequency of gaps in the 0–10 kb range was higher for ClaI than for BstBI, perhaps because non-specific cleavage during digestion generated unintended short fragments that formed 0–10 kb gaps in subsequent ligation steps. The sequences that mapped within the same haplotype, along with their internal gaps in the 60–100 kb range, could serve as valuable bin markers for genome analysis. We refer to these gap regions as bin markers hereafter; these markers are critical for further genome assembly and marker development in polyploid crops such as potato.
BacPhase demonstrated superior performance in genotyping complex genomes, as evidenced by the high proportion of uniquely mapped sequences, particularly in regions with 60–90 kb gaps. This consistent mapping pattern indicates a high degree of accuracy in the original genome assembly, validating the integrity of the sequencing data. Any discrepancies observed in the mapping results warrant further validation. However, we are confident in the correctness of our approach based on the overall consistency and reliability of the data.
2.6. Construction of a physical map by BacPhase using high-density bin markers
To examine the distribution of each type of marker across the chromosomes and haplotypes of potato, we visualized the data as haplotype distribution maps (Fig. 4). For BstBI, in the one enzyme site category, the majority of unique mapping markers (indicated by dark blue) were distributed across the chromosomes, with larger numbers of markers located on specific haplotypes, such as chr1_1, chr4_1, and chr8_1. These unique markers accounted for the largest proportion, followed by one end unique and the other end multiple mapping markers (indicated by aqua), comprising nearly half of the markers. The smallest proportion of markers was found in the multiple mapping category (indicated by light blue), which were more prominent on chromosomes with fewer markers or smaller haplotypes, such as chr5_3, chr5_4, chr10_2, and chr12_1 (Fig. 4, Table S8). This pattern reflects the inherent complexity of polyploid genomes, where the presence of diplotigs, triplotigs, and tetraplotigs in chromosomes contributes to variations in marker distribution [8]. If a chromosome contains fewer unique markers, the total number of markers on that haplotype will also be smaller, demonstrating that the number of unique markers in the one enzyme site category is dependent on the overall marker count on the haplotype.
Fig. 4.
Number of markers in 60–100 kb gaps. A The number of markers within 60–100 kb gaps for each mapping type on different haplotypes derived from BAC end sequences produced by the enzyme BstBI. B The number of markers within 60–100 kb gaps for each type on different haplotypes produced by the enzyme ClaI. Each color represents a different marker source, as indicated in the figure.
For ClaI, a similar distribution pattern was observed, with unique mapping markers predominantly found across various chromosomes. Chromosomes such as chr1_1, chr1_3, and chr4_2 exhibited high counts of unique markers. The one end unique and the other end multiple and multiple mapping categories contained fewer markers, although some chromosomes (such as chr7_2 and chr10_1) showed higher counts for these categories (Fig. 4, Table S8).
Overall, the total marker count was substantial, with approximately one marker per 81 kb, highlighting the effectiveness of our sequencing approach. BstBI produced 17,692 pairs of markers across all chromosomes, while ClaI produced 2,050 pairs of markers, complementing the BstBI markers in both number and position (Fig. S3A and S3B). Moreover, the markers produced from sequences with more than one ClaI cleavage site complement markers produced from sequences with one BstBI cleavage site, highlighting the complementary nature of the two enzymes in generating a comprehensive and robust set of markers for complex genome analysis.
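The marker totals can be cross-checked with simple arithmetic. The sketch below assumes that each marker pair contributes two bin markers, which is an inference from the reported totals rather than a statement in the text; the implied genome span is likewise derived here only from the reported density of one marker per 81 kb.

```python
# Marker pairs reported for each enzyme.
marker_pairs = {"BstBI": 17_692, "ClaI": 2_050}

total_pairs = sum(marker_pairs.values())
total_markers = 2 * total_pairs          # assumption: two markers per pair

# Span implied by approximately one marker per 81 kb.
implied_span_bp = total_markers * 81_000

# Relative marker yield of the two enzymes.
yield_ratio = round(marker_pairs["BstBI"] / marker_pairs["ClaI"], 2)

print(total_pairs, total_markers, implied_span_bp, yield_ratio)
```

Under this assumption, the total matches the 39,484 bin markers reported in the Abstract, and the implied span is about 3.2 Gb.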
BAC end sequencing is expected to be used for genome assembly and bin marker development, which requires an even mapping distribution of markers across chromosomes and haplotypes. To examine the mapping distribution, we constructed a marker density map across chromosomes and haplotypes. The distribution of the markers was generally uniform, covering every haplotype on each chromosome (Fig. 4, Fig. 5, Table S8). A comparison with the C88 genome sequence revealed that low-density or blank regions correspond to areas populated by diplotigs, triplotigs, and tetraplotigs, confirming the accuracy of our mapping approach. Notably, the number of markers varied across the four haplotypes (Fig. 4, Fig. 5, Table S8). Despite this, the overall distribution remained even across the chromosomes.
Fig. 5.
Density distribution of markers. A The number of bin markers within 1-Mb windows across the C88 genome, represented by different colors to indicate varying densities. B Magnified views of the locations of markers obtained by BstBI and ClaI digestion in three randomly selected regions, chr1_1, chr4_1, and chr5_1, highlighted by red circles for fragments generated by the enzyme BstBI and blue triangles for fragments generated by the enzyme ClaI.
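A density map of this kind can be derived by counting markers in fixed windows. The sketch below is a minimal illustration assuming markers are given as (haplotype, position) tuples; the function name and the toy positions are not from the study.

```python
from collections import Counter


def window_counts(markers, window=1_000_000):
    """Count markers per (haplotype, window index) for fixed-size windows."""
    return Counter((hap, pos // window) for hap, pos in markers)


# Toy marker positions, not data from this study.
markers = [
    ("chr1_1", 10),
    ("chr1_1", 999_999),
    ("chr1_1", 1_000_000),
    ("chr4_1", 2_500_000),
]
counts = window_counts(markers)
for (hap, index), n in sorted(counts.items()):
    print(f"{hap}\t{index}-{index + 1} Mb\t{n}")
```

Windows absent from the counter contain no markers and would correspond to the blank regions of the density map.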
To further assess the relationship between the number and positions of two types of source markers from BstBI and ClaI digestion, we randomly selected three 5-Mb regions, Chr1_1, Chr4_1, and Chr5_1, and generated a genomic visualization plot (Fig. 5B). Each panel shows marker positions along the genomic coordinates, with varying densities observed in different regions. Notably, distinct patterns of marker distribution emerged, with red and blue markers frequently clustering together in some sections while remaining sparsely distributed in others. Overall, the two source markers were complementary in terms of both number and position (Fig. S3A and S3B).
In terms of enzyme performance, ClaI complemented BstBI for sequences with single restriction sites as well as those with multiple restriction sites (Table S8). With respect to the number of restriction enzyme sites, sequences with one enzyme site consistently performed better than sequences with multiple enzyme sites.
2.7. BacPhase anchored the unmounted scaffolds of C88 onto the chromosomes
BacPhase was effectively used to anchor the unmounted scaffolds of the C88 genome onto the corresponding chromosomes. Several sequences could be mapped to scaffolds through either unique mapping or multiple mapping. In the case of unique mapping, paired-end sequences were mapped to both the chromosomes and scaffolds, facilitating the anchoring of these scaffolds to specific chromosomes. A total of three paired-end sequences for BstBI and two for ClaI were mapped to the scaffolds (Table 3, Table S9). For multiple mapping, the presence of four haplotypes on each chromosome and scaffold, which had already been phased, allowed us to resolve scaffold positions more precisely. For instance, when a sequence was mapped to chr9_1, chr9_2, chr9_4, and scaffold2099_3, all with 996 matches, we could confidently infer that scaffold2099_3 was localized to chr9_3 and deduced its approximate position. Leveraging this approach, we successfully anchored 272–662 scaffolds for BstBI and 389–610 scaffolds for ClaI to chromosomes based on different data (Table 3, Table S10). Moreover, for paired-end sequences with multiple mapping, 55–242 scaffolds were anchored to the C88 genome based on different mapping types (Table 3, Table S11). After merging and removing redundant reads, we anchored 815 scaffolds to chromosomes (Table S12), accounting for 59.58 % of the 1,368 scaffolds. This approach provides a robust framework for future studies of plants with complex polyploid genomes.
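The elimination logic in the scaffold2099_3 example can be sketched as follows. This assumes that equally scoring hits are given as reference names of the form `chr<N>_<haplotype>` or `scaffold<ID>_<haplotype>` and that the genome is tetraploid; the function is an illustrative reconstruction, not the script used in this study.

```python
def infer_scaffold_haplotype(hits, ploidy=4):
    """Infer which haplotype a scaffold represents by elimination.

    `hits` lists references hit with equal match scores. Returns
    (scaffold, inferred chromosome haplotype) or None if ambiguous.
    """
    scaffolds = [h for h in hits if h.startswith("scaffold")]
    chrom_hits = [h for h in hits if h.startswith("chr")]
    if len(scaffolds) != 1:
        return None
    chromosomes = {h.rpartition("_")[0] for h in chrom_hits}
    if len(chromosomes) != 1:
        return None  # hits spread over several chromosomes
    seen = {int(h.rpartition("_")[2]) for h in chrom_hits}
    missing = set(range(1, ploidy + 1)) - seen
    if len(missing) != 1:
        return None  # zero or several haplotypes unaccounted for
    return scaffolds[0], f"{chromosomes.pop()}_{missing.pop()}"


print(infer_scaffold_haplotype(["chr9_1", "chr9_2", "chr9_4", "scaffold2099_3"]))

# Share of the 1,368 scaffolds anchored after merging (Table S12).
anchored_pct = round(100 * 815 / 1_368, 2)
print(anchored_pct)
```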
Table 3.
Number of scaffolds anchored to chromosomes.
| Map info | Map type | BstBI (1 site) | BstBI (>1 site) | ClaI (1 site) | ClaI (>1 site) |
|---|---|---|---|---|---|
| Unique map | Paired end | 3 | 0 | 2 | 0 |
| Multiple map | Single end | 662 | 272 | 610 | 389 |
| Multiple map | Paired end | 242 | 71 | 188 | 55 |
3. Discussion
In this study, we constructed a BAC library of the complex, autotetraploid potato genome but omitted the requirement to pick single colonies, simplifying the BAC end workflow [21]. Compared to previous workflows, this process, known as BacPhase, is more straightforward, although it still involves constructing a BAC library. In the future, we aim to further optimize the workflow in two ways: (1) eliminating BAC library construction, as it may become possible to bypass this step entirely; and (2) using restriction enzymes with shorter recognition sequences, such as Sau3A, which has a 4-bp recognition sequence. The choice of restriction enzymes could then be broader and not limited by the absence of restriction sites in the vector. We expect these efforts to make the process simpler and more efficient. There are also two ways to increase the number of effective sequencing reads: (1) library capacity, which could be increased by increasing the number of sequencing reads or by performing multiple rounds of ligation; and (2) enzyme selection, in which additional suitable enzymes could be identified and the distribution of their recognition sites across the genome examined. Alternatively, a mixture of two or more enzymes could be employed for sequence digestion.
There are two major aspects concerning the application scope of BAC end sequencing: bin markers and genome assembly. Bin markers are used to construct an electronic physical map, and here we propose a novel method for constructing such maps efficiently. In this study, we introduce a superior enzyme, BstBI, which is nine times more efficient than the previously used enzyme ClaI (Fig. 4, Table S8). The differential marker densities of BstBI and ClaI arise solely from the distinct distributions of their recognition sequences in the genome (genomic sequence bias), as both enzymes achieved complete digestion. This demonstrates how sequence-specific binding patterns naturally create variation in marker coverage. Additionally, we anchored some scaffolds onto chromosomes. When developing markers, if a marker is located on one of these anchored scaffolds, its approximate position can be determined. When BacPhase is used for marker development, low-depth sequencing is recommended, as high-density, evenly distributed markers can be obtained at low cost. When BacPhase is used for genome assembly or telomere-to-telomere (T2T) genome construction, higher BAC end content and a higher sequencing depth are recommended. BAC end sequences can also be used for genome assembly and to determine the locations of unplaced sequences, for example by anchoring scaffolds to chromosomes. In the C88 assembly, 1,368 scaffolds were not assigned to chromosomes. In this study, 59.58 % of these scaffolds (815 of 1,368) could be positioned onto chromosomes.
Hi-C generates signals based on spatial interactions within the genome, allowing sequences from the same chromosome or haplotype to exhibit significant interaction signals. However, undesirable signals for scaffolding purposes may occasionally occur [25], [26]. By contrast, BAC-end sequences can connect only sequences from the same haplotype, making their haplotype information highly reliable and a potential substitute for Hi-C anchoring signals.
Most of our BAC end sequences ranged from 60 to 100 kb in length, indicating that our technique could serve as a good alternative to ultra-long sequencing using the Nanopore platform due to the similar read length, even though our method only sequences the two ends of an insert fragment. However, the accuracy of Nanopore sequencing is much lower than that of PacBio HiFi, making it less appropriate for plants with complex polyploid genomes such as potato, whose genome contains numerous homologous regions, including diplotigs, triplotigs, and tetraplotigs. These complex regions make genome assembly difficult. To generate the C88 genome assembly, 1,034 progeny were used for haplotype phasing, a highly laborious and expensive process [8]. For the tetraploid commercial potato variety Otava, 717 pollen cells were used for haplotype phasing [27], which is also quite laborious.
BAC-end sequencing using PacBio HiFi achieves exceptionally high accuracy, generally around 99.999 %, which surpasses Nanopore's accuracy of approximately 95.64–99.75 % for DNA and RNA (https://nanoporetech.com/platform/accuracy). Nanopore sequencing errors, in addition to indels and mismatches, are primarily caused by the presence of homopolymer regions and tandem repeats, with homopolymer deletions being particularly frequent [28]. Such errors are often localized to specific genomic sequences or regions, making them difficult to correct even with internal error correction or second-generation sequencing data. This leads to challenges in distinguishing between sequencing errors and true variants, ultimately affecting the accuracy of genome assembly [29], [30]. Furthermore, inverted repeat sequences in the genome can degrade the quality of Nanopore sequencing, thereby affecting the accuracy of the resulting sequence [31]. Consequently, in genomes with high repeat content or polyploid species, the accuracy in repetitive regions may not be high. However, using HiFi sequencing based on BAC-end sequences with a combination of different enzymes and library capacities enables these complex genomic structures to be accurately resolved.
The key advantage of Nanopore sequencing lies in its ultra-long read length, which can range from 20 bp to 4 Mb (https://nanoporetech.com). This makes Nanopore sequencing suitable for assembling large genomes, even at the cost of lower accuracy. However, as the demand for genome assemblies increases, including the need for both improved assembly length and quality, Nanopore sequencing struggles to meet the requirements for complex genomic regions. By contrast, PacBio HiFi reads, while shorter (typically 15–30 kb), offer much higher accuracy. If future improvements in BacPhase workflows can extend read lengths to more than 300 kb, these longer, highly accurate sequences would outperform Nanopore's ultra-long reads in terms of both length and accuracy. BAC-end sequencing, combining long-read capability with high precision, could address both the length and quality challenges for the high-quality assembly of complex genomes and for variant analysis.
4. Conclusion
Our novel strategy, known as BacPhase, which leverages PacBio HiFi sequencing, overcomes the limitations of traditional methods for constructing high-resolution physical maps and performing genome assembly for plants with complex genomes by eliminating the need for library construction followed by single clone selection. BacPhase significantly enhances genome phasing accuracy and marker generation, as demonstrated by our identification of 39,484 bin markers and anchoring of 59.58 % of scaffolds in the complex potato cultivar C88 genome. This approach provides a robust platform for constructing bin markers and assembling complex polyploid genomes and supports advanced breeding applications for polyploid crops.
5. Materials and methods
5.1. Electronic enzyme digestion
We chose 14 enzymes compatible with the CopyControl pCC1 BAC-cloning vector and selected several polyploid crops, namely hexaploid sweet potato (Ipomoea batatas) [22], hexaploid wheat (Triticum aestivum) [23], triploid banana (Musa acuminata) [11], and tetraploid alfalfa (Medicago sativa) [24], to conduct electronic enzyme digestion. Fragment sizes were calculated and the figures were generated using R scripts (https://www.r-project.org/).
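As an illustration of what electronic (in silico) digestion computes, the following minimal sketch cuts a sequence at every occurrence of an enzyme's recognition site and returns the fragment lengths. It is written in Python rather than R and is not the authors' script. The names `ENZYMES` and `digest` are hypothetical; the recognition sequences and cut positions used (BstBI, TT^CGAA; ClaI, AT^CGAT) are the standard ones for these enzymes. Methylation sensitivity and incomplete digestion are ignored.

```python
import re

# Recognition site and cut offset within the site (both sites are palindromic,
# so scanning one strand is sufficient).
ENZYMES = {
    "BstBI": ("TTCGAA", 2),  # TT^CGAA
    "ClaI": ("ATCGAT", 2),   # AT^CGAT
}


def digest(sequence, enzyme):
    """Return the fragment lengths produced by complete digestion of a linear sequence."""
    site, offset = ENZYMES[enzyme]
    seq = sequence.upper()
    # A lookahead pattern also reports overlapping sites (e.g. ATCGATCGAT for ClaI).
    cuts = [m.start() + offset for m in re.finditer(f"(?={site})", seq)]
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


toy = "AAAATTCGAAGGGGGATCGATCC"  # toy sequence with one site for each enzyme
print(digest(toy, "BstBI"))
print(digest(toy, "ClaI"))
```

Applied per chromosome of a reference assembly, the resulting length lists give the fragment size distributions from which the evenness and spacing of candidate markers can be compared across enzymes.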
5.2. Plant material used for BAC end library construction
The commercial potato cultivar Cooperation-88 (C88) was used to construct a BAC end library. This cultivar is autotetraploid, high-yielding, multi-purpose, and resistant to late blight [32].
5.3. BAC end library construction
The BAC end library for C88 tetraploid potato used in this study was constructed using the CopyControl pCC1 BAC-cloning vector as previously described [18]. High-quality DNA was extracted from two-week-old seedlings and digested by the restriction enzyme HindIII to generate 100–300 kb fragments. The fragments were ligated to the pCC1 BAC-cloning vector and digested with BstBI/ClaI (New England Biolabs, Ipswich, MA), producing two types of sequences: BAC end sequences with the pCC1 BAC-cloning vector and insertion fragments without the pCC1 BAC-cloning vector. After autoligation, the digested sequence mix was amplified by PCR and subjected to sequencing. DNA not ligated to the pCC1 BAC-cloning vector was used as a control and directly sequenced.
5.4. PacBio HiFi sequencing and BacPhase analysis
Sequencing was conducted using PacBio HiFi CCS. The C88 genome was used as a reference to analyze the sequencing reads [8]. Seqtk (https://github.com/lh3/seqtk) was used to analyze the raw reads. Sequences with BstBI/ClaI restriction enzyme sites were extracted, and redundant reads produced by PCR were removed using cd-hit with default parameters [33]. Next, control sequences derived from genomic DNA rather than from BAC ends were removed using cd-hit with default parameters [33].
The final sequences were split at their restriction enzyme sites into paired-end reads and mapped to the C88 genome using Winnowmap with default parameters (Fig. 2) [34]. For reads with multiple cleavage sites, alignment was conducted using the front end of the first cleavage site and the rear end of the last cleavage site. Mapping types were classified into four categories: (1) unique mapping (both ends map uniquely), (2) one end unique and the other end multiple mapping, (3) multiple mapping (both ends multiply mapped), and (4) unmapped. In each category, reads were further categorized as having one restriction enzyme site or more than one restriction enzyme site. The mapping results were subjected to statistical analysis for different parameters, including mapping ratio, mapping position, gap distance, and chromosome distribution. The Python script used for statistical analysis is provided on GitHub (https://github.com/Jianyq-1/BacPhase). The figures were constructed using an R script (https://www.r-project.org/).
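The splitting and classification rules in this paragraph can be sketched as follows. This is an illustrative Python sketch, not the script from the authors' repository. The function names `split_at_sites` and `classify_pair` are hypothetical, and two details are assumptions: the recognition site itself is excluded from both ends, and a pair is called unmapped if either end has no alignment.

```python
import re


def split_at_sites(read, site):
    """Split a self-ligated read into its two ends.

    Returns (front_end, rear_end, site_class), where front_end is the sequence
    before the first site, rear_end is the sequence after the last site, and
    site_class is "1" or ">1". Returns None if the read contains no site.
    """
    seq = read.upper()
    starts = [m.start() for m in re.finditer(f"(?={site})", seq)]
    if not starts:
        return None
    front_end = seq[:starts[0]]
    rear_end = seq[starts[-1] + len(site):]
    site_class = "1" if len(starts) == 1 else ">1"
    return front_end, rear_end, site_class


def classify_pair(front_hits, rear_hits):
    """Classify a read pair from the number of alignments found for each end."""
    if front_hits == 0 or rear_hits == 0:
        return "unmapped"  # assumption: either end without a hit
    if front_hits == 1 and rear_hits == 1:
        return "unique"
    if front_hits == 1 or rear_hits == 1:
        return "unique+multiple"
    return "multiple"


print(split_at_sites("aaaaTTCGAAggggTTCGAAcc", "TTCGAA"))
print(classify_pair(1, 3))
```

In practice the hit counts would come from the alignment output for each end; they are passed in as plain integers here to keep the sketch independent of any particular aligner output format.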
CRediT authorship contribution statement
Yinqiao Jian: Writing – review & editing, Writing – original draft, Visualization, Validation, Software, Methodology, Funding acquisition, Formal analysis, Data curation, Conceptualization. Xiao Guo: Validation, Methodology. Yangyang Shang: Validation, Methodology. Yu Yang: Validation, Methodology. Daofeng Dong: Funding acquisition. Xiaohui Yang: Writing – review & editing, Supervision, Methodology, Funding acquisition, Conceptualization. Guangcun Li: Writing – review & editing, Supervision, Resources, Project administration, Methodology, Funding acquisition, Conceptualization.
Declaration of competing interest
The authors declare no competing interests.
Acknowledgments
This work was supported by the National Natural Science Fund of China (W2412002 and 32472189), the China Agricultural Research System (CARS-9-ES12), the Breeding Program of Shandong Province, China (2022LZGC017), Agricultural Science and Technology Innovation Program (CAAS-ZDRW202404), Hebei 14th Five-year Breeding Innovation Team (21326320D).
Footnotes
Supplementary data to this article can be found online at https://doi.org/10.1016/j.abiote.2025.100012.
Contributor Information
Xiaohui Yang, Email: xiaohuiy_0601@163.com.
Guangcun Li, Email: liguangcun@caas.cn.
Data availability
Raw sequencing data of the BAC ends produced by BstBI and ClaI, as well as the control, are available in the National Genomics Data Center (NGDC, https://bigd.big.ac.cn) under project PRJCA033658. A data file including the details of sequencing information and sequenced BioSamples has been deposited under accession CRA021329.
References
- 1. Otto S.P. The evolutionary consequences of polyploidy. Cell. 2007;131(3):452–462. doi: 10.1016/j.cell.2007.10.022.
- 2. Comai L. The advantages and disadvantages of being polyploid. Nat Rev Genet. 2005;6(11):836–846. doi: 10.1038/nrg1711.
- 3. Salman-Minkov A., Sabath N., Mayrose I. Whole-genome duplication as a key factor in crop domestication. Nat Plants. 2016;2(8)
- 4. Feldman M., Levy A.A., Fahima T., Korol A. Genomic asymmetry in allopolyploid plants: wheat as a model. J Exp Bot. 2012;63(14):5045–5059. doi: 10.1093/jxb/ers192.
- 5. Hoopes G., Meng X., Hamilton J.P., Achakkagari S.R., de Alves Freitas Guesdes F., Bolger M.E., et al. Phased, chromosome-scale genome assemblies of tetraploid potato reveal a complex genome, transcriptome, and predicted proteome landscape underpinning genetic diversity. Mol Plant. 2022;15(3):520–536. doi: 10.1016/j.molp.2022.01.003.
- 6. Serra Mari R., Schrinner S., Finkers R., Ziegler F.M.R., Arens P., Schmidt M.H.W., et al. Haplotype-resolved assembly of a tetraploid potato genome using long reads and low-depth offspring data. Genome Biol. 2024;25(1)
- 7. The Potato Genome Sequencing Consortium. Genome sequence and analysis of the tuber crop potato. Nature. 2011;475(7355):189–195.
- 8. Bao Z., Li C., Li G., Wang P., Peng Z., Cheng L., et al. Genome architecture and tetrasomic inheritance of autotetraploid potato. Mol Plant. 2022;15(7):1211–1226. doi: 10.1016/j.molp.2022.06.009.
- 9. Yang X., Zhang L., Guo X., Xu J., Zhang K., Yang Y., et al. The gap-free potato genome assembly reveals large tandem gene clusters of agronomical importance in highly repeated genomic regions. Mol Plant. 2023;16(2):314–317. doi: 10.1016/j.molp.2022.12.010.
- 10. Jiao C., Xie X., Hao C., Chen L., Xie Y., Garg V., et al. Pan-genome bridges wheat structural variations with habitat and breeding. Nature. 2024
- 11. Li X., Yu S., Cheng Z., Chang X., Yun Y., Jiang M., et al. Origin and evolution of the triploid cultivated banana genome. Nat Genet. 2024;56(1):136–142. doi: 10.1038/s41588-023-01589-3.
- 12. Healey A.L., Garsmeur O., Lovell J.T., Shengquiang S., Sreedasyam A., Jenkins J., et al. The complex polyploid genome architecture of sugarcane. Nature. 2024;628(8009):804–810.
- 13. Zhang J., Zhang X., Tang H., Zhang Q., Hua X., Ma X., et al. Allele-defined genome of the autopolyploid sugarcane Saccharum spontaneum L. Nat Genet. 2018;50(11):1565–1573. doi: 10.1038/s41588-018-0237-2.
- 14. Edger P.P., Poorten T.J., VanBuren R., Hardigan M.A., Colle M., McKain M.R., et al. Origin and evolution of the octoploid strawberry genome. Nat Genet. 2019;51(3):541–547. doi: 10.1038/s41588-019-0356-4.
- 15. Mao J., Wang Y., Wang B., Li J., Zhang C., Zhang W., et al. High-quality haplotype-resolved genome assembly of cultivated octoploid strawberry. Hortic Res. 2023;10(1) uhad002.
- 16. Cheng H., Concepcion G.T., Feng X., Zhang H., Li H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods. 2021;18(2):170–175. doi: 10.1038/s41592-020-01056-5.
- 17. Cheng H., Asri M., Lucas J., Koren S., Li H. Scalable telomere-to-telomere assembly for diploid and polyploid genomes with double graph. Nat Methods. 2024;21(6):967–970. doi: 10.1038/s41592-024-02269-8.
- 18. Abou Saada O., Tsouris A., Eberlein C., Friedrich A., Schacherer J. nPhase: an accurate and contiguous phasing method for polyploids. Genome Biol. 2021;22(1):126. doi: 10.1186/s13059-021-02342-x.
- 19. Zhou Q., Tang D., Huang W., Yang Z., Zhang Y., Hamilton J.P., et al. Haplotype-resolved genome analyses of a heterozygous diploid potato. Nat Genet. 2020;52(10):1018–1023. doi: 10.1038/s41588-020-0699-x.
- 20. Feng Y., Zhou J., Li D., Wang Z., Peng C., Zhu G. The haplotype-resolved T2T genome assembly of the wild potato species Solanum commersonii provides molecular insights into its freezing tolerance. Plant Commun. 2024;5(10)
- 21. Yang X., Yang Y., Ling J., Guan J., Guo X., Dong D., et al. A high-throughput BAC end analysis protocol (BAC-anchor) for profiling genome assembly and physical mapping. Plant Biotechnol J. 2019;18(2):364–372. doi: 10.1111/pbi.13203.
- 22. Wu S., Sun H., Hamilton J.P., Mollinari M., Gesteira G.D.S., Kitavi M., et al. Phased chromosome-level genome assembly provides insight into the origin of hexaploid sweetpotato. bioRxiv. 2024
- 23. Liu S., Li K., Dai X., Qin G., Lu D., Gao Z., et al. A telomere-to-telomere genome assembly coupled with multi-omic data provides insights into the evolution of hexaploid bread wheat. Nat Genet. 2025;57(4):1008–1020. doi: 10.1038/s41588-025-02137-x.
- 24. Chen H., Zeng Y., Yang Y., Huang L., Tang B., Zhang H., et al. Allele-aware chromosome-level genome assembly and efficient transgene-free genome editing for the autotetraploid cultivated alfalfa. Nat Commun. 2020;11(1)
- 25. Cameron C.J., Dostie J., Blanchette M. HIFI: estimating DNA-DNA interaction frequency from Hi-C data at restriction-fragment resolution. Genome Biol. 2020;21(1):11. doi: 10.1186/s13059-019-1913-y.
- 26. Pokrovac I., Pezer Z. Recent advances and current challenges in population genomics of structural variation in animals and plants. Front Genet. 2022;13
- 27. Sun H., Jiao W.-B., Krause K., Campoy J.A., Goel M., Folz-Donahue K., et al. Chromosome-scale and haplotype-resolved genome assembly of a tetraploid potato cultivar. Nat Genet. 2022;54(3):342–348. doi: 10.1038/s41588-022-01015-0.
- 28. Wick R.R., Judd L.M., Holt K.E. Performance of neural network basecalling tools for Oxford Nanopore sequencing. Genome Biol. 2019;20(1)
- 29. Hotaling S., Wilcox E.R., Heckenhauer J., Stewart R.J., Frandsen P.B. Highly accurate long reads are crucial for realizing the potential of biodiversity genomics. BMC Genom. 2023;24(1):117.
- 30. Chen Y., Zhang Y., Wang A.Y., Gao M., Chong Z. Accurate long-read de novo assembly evaluation with Inspector. Genome Biol. 2021;22(1):312. doi: 10.1186/s13059-021-02527-4.
- 31. Spealman P., Burrell J., Gresham D. Nanopore sequencing undergoes catastrophic sequence failure at inverted duplicated DNA sequence. bioRxiv. 2019
- 32. Li C., Wang J., Chien D.H., Chujoy E., Song B., VanderZaag P. Cooperation-88: a high yielding, multi-purpose, late blight resistant cultivar growing in Southwest China. Am J Potato Res. 2010;88(2):190.
- 33. Li W., Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658–1659. doi: 10.1093/bioinformatics/btl158.
- 34. Jain C., Rhie A., Zhang H., Chu C., Walenz B.P., Koren S., et al. Weighted minimizer sampling improves long read mapping. Bioinformatics. 2020;36:i111–i118. doi: 10.1093/bioinformatics/btaa435.