Plant Communications. 2024 Apr 10;5(8):100919. doi: 10.1016/j.xplc.2024.100919

Telomere-to-telomere genome assemblies of cultivated and wild soybean provide insights into evolution and domestication under structural variation

Kai-Hua Jia 1,6, Xiaoyan Zhang 1,6, Lei-Lei Li 1,6, Tian-Le Shi 2, Dan Liu 3, Yongyi Yang 1, Yunzhe Cong 1, Runfang Li 1, Yanyan Pu 1, Yongchao Gong 1, Xue Chen 4, Yu-Jun Si 4, Rumei Tian 1, Zhenya Qian 5,∗, Hanfeng Ding 1,∗∗, Nana Li 1,∗∗∗
PMCID: PMC11369727  PMID: 38605518

Dear Editor,

Soybean (Glycine max) is a critical crop worldwide, renowned as a major source of edible oil and protein. It was domesticated approximately 5000–9000 years ago in China, between latitudes 32°N and 40°N, from its wild progenitor, Glycine soja. In recent years, numerous genome assemblies for both cultivated and wild soybeans have been published, with cultivated soybean genomes even achieving telomere-to-telomere (T2T) completeness (Zhang et al., 2023a, 2023b; Wang et al., 2023; Huang et al., 2024). However, considerable genetic variability still exists among different cultivated varieties, as well as between them and their wild relatives.

Here, we present T2T gap-less genome assemblies of cultivated soybean (G. max) and its wild counterpart (G. soja). The G. max assembly integrated 58 Gb of PacBio HiFi reads, 141 Gb of Oxford Nanopore Technologies (ONT) reads, and 148 Gb of high-throughput chromosome conformation capture (Hi-C) data. The G. soja assembly similarly used 70 Gb of PacBio HiFi reads, 125 Gb of ONT reads, and 164 Gb of Hi-C data (Supplemental Table 1). In brief, we used the de novo assembler HiFiasm to process HiFi, ONT, and Hi-C reads to generate contigs and then aligned the Hi-C reads to the primary assembly with Juicer, followed by preliminary Hi-C-assisted chromosome assembly using 3D-DNA. We then used LR_Gapcloser to align ONT reads to the genome, identify bridging reads that spanned gaps, and fill those gaps with the corresponding ONT read sequence. We also reassembled the HiFi reads aligned to the ends of the scaffolds to extend the scaffolds. This produced T2T and gap-less reference genomes of cultivated soybean ‘Yundou1’ and wild soybean ‘Yesheng71,’ each containing 20 chromosomes, with total lengths of 1.02 and 1.03 Gb, respectively (Figure 1A; Supplemental Table 2). In cultivated soybean, the only remaining gap is located in the highly tandemly repeated centromeric region of chromosome 11. By contrast, the wild soybean genome has two gaps: one in the highly tandemly repeated centromeric region of chromosome 13 and the other in the rDNA region of chromosome 19. Our assemblies exhibited high collinearity with the previously published T2T genomes of ‘ZH13,’ ‘Wm82,’ and ‘Jack’ (Figure 1B).
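As a rough sanity check on the data volumes quoted above, sequencing depth can be approximated as total sequenced bases divided by assembled genome size. The sketch below uses only the read totals and assembly lengths given in the text; the function name `depth` and the dictionary layout are illustrative and are not part of the reported pipeline.

```python
# Approximate sequencing depth = sequenced bases / assembled genome size.
# Data volumes (Gb) and assembly lengths (Gb) come from the text above;
# the function and variable names are illustrative only.

def depth(data_gb: float, genome_gb: float) -> float:
    """Return approximate fold coverage for one library."""
    return data_gb / genome_gb

datasets = {
    "G. max (Yundou1)": {"genome_gb": 1.02, "HiFi": 58, "ONT": 141, "Hi-C": 148},
    "G. soja (Yesheng71)": {"genome_gb": 1.03, "HiFi": 70, "ONT": 125, "Hi-C": 164},
}

# Fold coverage per accession and per library, rounded to one decimal place.
coverage = {
    name: {
        lib: round(depth(gb, d["genome_gb"]), 1)
        for lib, gb in d.items()
        if lib != "genome_gb"
    }
    for name, d in datasets.items()
}

for name, cov in coverage.items():
    print(name, cov)
```

Under this approximation, HiFi depth is roughly 57x for G. max and 68x for G. soja.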

Figure 1.


Landscape of genome features in cultivated and wild soybean genomes.

(A) Circular genome feature map for cultivated and wild soybeans. The outermost layer comprises colored blocks representing the 20 homologous chromosomes (Mchr for G. max, Schr for G. soja), marked with thick lines at 10-Mb intervals.

(B) Structural variations (SVs) in cultivated versus wild soybeans.

(C) Distribution of repetitive sequence density (CentGm981 and CentGm384) on chromosome 2.

(D) Comparison of SNPs and SVs in cultivated and wild soybeans based on data from whole-genome resequencing and comparative genomics. The plot includes residuals of SV density on SNP density regression based on 10-Mb windows from both resequencing and comparative genomics, with chromosome markers at the top.

(E and F) Population structure (k = 2), principal-component analysis (PCA), and phylogenetic tree of cultivated and wild soybeans based on SNPs (E) and SVs (F).

(G) Site frequency spectra of private SVs and SNPs in cultivated and wild soybeans.

(H) Expression of genes involving private SVs in cultivated soybeans: genes with high expression (p < 0.05, fold change > 2) that are associated with private SVs in cultivated soybeans are shown.

(I) Gene Ontology enrichment analysis of genes associated with the intersection of the top 5% fixation index (FST) and SVs.

Using EDTA and RepeatMasker, we determined that repetitive sequences account for approximately 57.14% and 57.57% of the cultivated and wild soybean genomes, respectively (Supplemental Table 3). Long terminal repeat elements were the most abundant, accounting for 34.12% and 34.14%. Another major type was terminal inverted repeat elements, accounting for 13.66% and 14.26% in cultivated and wild soybean, respectively. After masking these repetitive sequences, we compiled 176 and 201 second-generation transcriptomic datasets and 14 and 13 full-length transcriptomic datasets for the cultivated and wild varieties, enhancing the robustness of our gene annotation endeavor (Supplemental Table 4). In total, we annotated 53 508 and 53 495 protein-coding genes in cultivated and wild soybean, respectively (Supplemental Table 2).

By searching for the canonical telomeric repeat “TTTAGGG/CCCTAAA,” we identified telomeric sequences at the ends of all chromosomes except for one terminus on chromosomes 8 and 15 in the cultivated soybean genome. By contrast, telomeric sequences were detected at the ends of all chromosomes in the wild soybean genome (Supplemental Figure 1). Within the centromeric regions, 92-bp tandem repeat sequences (CentGm92) were identified, consistent with previous studies (Zhang et al., 2023a; Liu et al., 2023) (Figure 1A; Supplemental Figure 2). A notable discovery was made in the centromeric regions of the Yundou1 and Yesheng71 genomes, where the pronounced presence of 384-bp repeat sequences (CentGm384) and 981-bp repeat sequences (CentGm981) was observed (Figure 1C; Supplemental Figures 3–12; Supplemental Table 5). Such repeats were not previously observed in the studies of Zhang et al. (2023a) and Liu et al. (2023). In light of this observation, we reannotated the repetitive sequences in the genomes of ZH13, Wm82, and Jack using RepeatMasker. These specific repetitive sequences were present in these genomes as well, indicating that such sequence invasions occur broadly in the centromeric regions of soybean genomes.
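The telomere search described above can be illustrated with a minimal sketch that counts occurrences of the canonical repeat unit and its reverse complement near each end of a chromosome sequence. The window size, the function name `count_terminal_repeats`, and the toy sequence are assumptions chosen for illustration; the text does not state the thresholds actually used.

```python
import re

TELO_FWD = "TTTAGGG"  # canonical telomeric repeat named in the text
TELO_REV = "CCCTAAA"  # its reverse complement

def count_terminal_repeats(seq: str, window: int = 10_000) -> dict:
    """Count telomeric repeat units within `window` bp of each sequence end."""
    seq = seq.upper()
    head, tail = seq[:window], seq[-window:]
    pattern = re.compile(f"{TELO_FWD}|{TELO_REV}")
    return {
        "start": len(pattern.findall(head)),  # non-overlapping matches at the 5' end
        "end": len(pattern.findall(tail)),    # non-overlapping matches at the 3' end
    }

# Toy chromosome: 50 repeat units at the start, filler sequence, 40 units at the end.
toy_chrom = "CCCTAAA" * 50 + "ACGT" * 1000 + "TTTAGGG" * 40
counts = count_terminal_repeats(toy_chrom, window=500)
print(counts)
```

A terminus with a count of zero (or below some chosen minimum) would be flagged as lacking a detectable telomere, as reported for one end each of chromosomes 8 and 15 in the cultivated genome.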

The accuracy and completeness of the genome assemblies for cultivated and wild soybean were rigorously assessed using multiple methodologies. The assemblies for cultivated and wild soybean captured 99.8% and 99.6% of the complete benchmarking universal single-copy orthologs, respectively, and the protein-coding gene sets captured 99.6% and 99.3% (Supplemental Tables 2 and 6). The consensus quality values were recorded as 65.47 for cultivated soybean and 63.35 for wild soybean, with k-mer completeness scores of 99.27% and 98.38%, respectively (Supplemental Figure 13). Finally, Hi-C mapping of both genomes revealed no significant structural variations (SVs), confirming the accurate order and orientation of all chromosomes (Supplemental Figure 14).
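To put the consensus quality values in perspective, they can be converted to per-base error rates. The sketch below assumes the usual Phred-scaled definition, QV = -10 log10(error probability); the text does not name the tool used to compute QV, so this is an interpretation rather than a description of the reported method.

```python
# Convert a Phred-scaled consensus quality value (QV) to a per-base error rate.
# Assumption: QV = -10 * log10(P_error), the standard Phred definition.

def qv_to_error_rate(qv: float) -> float:
    """Return the per-base error probability implied by a Phred-scaled QV."""
    return 10 ** (-qv / 10)

reported_qv = {"cultivated": 65.47, "wild": 63.35}  # values from the text

for label, qv in reported_qv.items():
    err = qv_to_error_rate(qv)
    mb_per_error = 1 / err / 1e6
    print(f"{label}: QV {qv} -> {err:.2e} errors per base (about 1 per {mb_per_error:.1f} Mb)")
```

Under this definition, both assemblies correspond to fewer than one consensus error per million bases.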

We combined samples from 77 cultivated and 84 wild soybean resequencing datasets to examine patterns of SVs and single-nucleotide polymorphisms (SNPs) (Supplemental Table 7). After stringent filtering, we identified 7 058 474 SNPs with a minor allele frequency (MAF) of at least 0.05. We also detected 2459 SVs, including deletions, duplications, and inversions, with a minor allele count of 3 or more. The cumulative length of all identified SVs was 1.84 Mb, which constituted only 7.70% of the total SV length of 23.89 Mb, as determined by genome comparison.
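The two filtering thresholds above, and the 7.70% figure, can be made concrete with a short sketch. It assumes biallelic sites, diploid genotype calls, and no missing genotypes across the 161 samples (322 chromosomes); the function names are hypothetical and the actual filtering software is not specified in the text.

```python
# Minor-allele filters for biallelic variants.
# Assumptions: diploid calls, no missing genotypes, 77 + 84 = 161 samples.

N_CHROM = 2 * (77 + 84)  # 322 sampled chromosomes

def minor_allele_stats(alt_count: int, n_chrom: int = N_CHROM):
    """Return (minor allele count, minor allele frequency)."""
    mac = min(alt_count, n_chrom - alt_count)
    return mac, mac / n_chrom

def keep_snp(alt_count: int, n_chrom: int = N_CHROM, min_maf: float = 0.05) -> bool:
    """SNP filter from the text: MAF of at least 0.05."""
    return minor_allele_stats(alt_count, n_chrom)[1] >= min_maf

def keep_sv(alt_count: int, n_chrom: int = N_CHROM, min_mac: int = 3) -> bool:
    """SV filter from the text: minor allele count of 3 or more."""
    return minor_allele_stats(alt_count, n_chrom)[0] >= min_mac

# Share of the comparison-based SV length recovered by resequencing.
sv_fraction_pct = 1.84 / 23.89 * 100
print(f"{sv_fraction_pct:.2f}%")
```

With 322 chromosomes, a SNP needs at least 17 copies of the minor allele to pass the MAF threshold, whereas an SV needs only 3, so the SV set retains much rarer variants than the SNP set.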

The densities of SNPs and SVs in our resequencing data showed a strong correlation (general linear model: t value = 13.6, p < 2e−16) (Figure 1D). This correlation was also observed in the data obtained from genome comparisons (t value = −3.689, p < 2e−4). A significant correlation was also identified between SNP densities obtained from second-generation resequencing and genome-based approaches (t value = 17.32, p < 2e−16), and this trend was also evident for SVs (t value = −4.51, p < 7e−06).
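The relationship between SNP and SV densities, and the residuals plotted in Figure 1D, can be illustrated with an ordinary least-squares fit of per-window SV density on SNP density. This is a minimal sketch using made-up window counts; the function name and toy values are assumptions, and the model fitted in the study may include details not described in the text.

```python
import math

def ols_with_residuals(x, y):
    """Fit y = intercept + slope * x; return slope, intercept, slope t value, residuals."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    rss = sum(r * r for r in residuals)
    # Standard error of the slope with n - 2 residual degrees of freedom.
    se_slope = math.sqrt(rss / (n - 2) / sxx)
    return slope, intercept, slope / se_slope, residuals

# Toy per-window counts (hypothetical values, not data from the study).
snp_density = [1200, 1500, 900, 2000, 1700, 1100]
sv_density = [3, 5, 2, 6, 6, 2]

slope, intercept, t_value, residuals = ols_with_residuals(snp_density, sv_density)
print(f"slope={slope:.4f}, t={t_value:.2f}")
```

Windows with large positive residuals carry more SVs than their SNP density predicts, and windows with large negative residuals carry fewer.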

Population structure, principal-component analysis, and phylogenetic relationships based on SNPs significantly differentiated wild populations from cultivated groups (Figure 1E). However, SVs did not consistently achieve this separation, despite a semblance of similarity in population structure (Figures 1E and 1F). This discrepancy may stem from the limited capture of SVs by short-read sequencing, which only identifies a fraction of the total SVs. The potential for false positives in SV identification cannot be discounted. Therefore, the use of long-read sequencing technologies is imperative for comprehensive identification of all SVs. In addition, the genetic diversity observed in cultivated soybeans was lower than that in their wild counterparts, as evidenced by analyses of SNPs and SVs; SNP diversity was 0.204 for cultivated versus 0.269 for wild soybean, and SV diversity was 0.217 for cultivated versus 0.230 for wild soybean.

The cultivated population had a much lower number of private SNPs (214 823) than the wild population (1 768 364). When comparing SVs, we observed a consistent pattern, with a reduced count in the cultivated population (266 SVs) compared with the wild population (399 SVs). Site frequency spectra for private SVs in both cultivated and wild populations exhibited parallel trends, with the preponderance of private SVs characterized by an MAF below 0.05 (Figure 1G). Notably, there was a reduced incidence of SVs with intermediate frequencies in the cultivated population compared with its wild counterparts. Furthermore, upon evaluating the MAFs for private SNPs, we unexpectedly found that SNPs with lower MAFs (MAF < 0.1) were not the most prevalent; instead, private SNPs with MAFs ranging from 0.1 to 0.15 constituted the majority. In addition, the cultivated soybeans contained fewer low-frequency private alleles.
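A minimal sketch of how private variants and their frequency spectra might be tabulated is given below. It assumes that a variant is "private" when its alternative allele is observed in one population and absent from the other, and that frequencies are computed within the population carrying the variant; the text does not spell out either definition, and all names and toy counts are hypothetical.

```python
from collections import Counter

N_CULT = 2 * 77  # chromosomes in the cultivated panel (diploid assumption)
N_WILD = 2 * 84  # chromosomes in the wild panel (diploid assumption)

def classify(cult_alt: int, wild_alt: int) -> str:
    """Label a variant by which population(s) carry the alternative allele."""
    if cult_alt > 0 and wild_alt == 0:
        return "private_cultivated"
    if wild_alt > 0 and cult_alt == 0:
        return "private_wild"
    return "shared" if cult_alt > 0 else "absent"

def maf_bin(alt: int, n_chrom: int, width: float = 0.05) -> int:
    """Return the index of the MAF bin (bin 0 = [0, width), bin 1 = [width, 2*width), ...)."""
    maf = min(alt, n_chrom - alt) / n_chrom
    return min(int(maf / width), int(0.5 / width) - 1)

# Toy alternative-allele counts as (cultivated, wild); not data from the study.
variants = [(3, 0), (0, 5), (10, 20), (0, 40), (20, 0), (0, 0)]

sfs = {"private_cultivated": Counter(), "private_wild": Counter()}
for cult_alt, wild_alt in variants:
    label = classify(cult_alt, wild_alt)
    if label == "private_cultivated":
        sfs[label][maf_bin(cult_alt, N_CULT)] += 1
    elif label == "private_wild":
        sfs[label][maf_bin(wild_alt, N_WILD)] += 1

print(sfs)
```

Plotting the per-bin counts for each population gives a binned site frequency spectrum of the kind shown in Figure 1G.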

These private SVs affected a total of 50 genes in the wild population and 58 genes in the cultivated varieties. Intriguingly, the genes affected by private SVs in cultivated soybeans were predominantly associated with pathways involved in fatty acid and lipid metabolism, such as “fatty acid catabolic process,” “lipid oxidation,” and “fatty acid beta-oxidation” (Supplemental Figure 15). This suggests an optimization of lipid metabolism in cultivated soybeans, which may be linked to the production and accumulation of soybean oil. Further analysis of the expression levels of these 58 genes revealed that nine genes were significantly upregulated (p < 0.05, fold change > 2) in the cultivated soybeans, and only one gene was significantly downregulated (p < 0.05, fold change > 2) (Figure 1H). Among the upregulated genes, Glmax02G0105700 was found to be implicated in fatty acid metabolic and catabolic pathways.

We also evaluated the degree of differentiation between cultivated and wild populations based on SVs and SNPs. The genome-wide average FST estimates were significantly higher for SNPs (0.314) than for SVs (0.077), reflecting the generally lower population frequencies observed for SVs. By focusing on the top 5% of 20-kb windows with the highest FST for SNPs and examining their intersections with the SV dataset, we identified a set of 79 genes. These genes were enriched in pathways related to “gibberellin metabolic process” and “cell population proliferation,” which are key to rapid plant growth (Figure 1I). In addition, a cross-population composite likelihood ratio test was used to detect species-specific selection pressures between cultivated and wild soybean populations. Similarly, by examining the intersection of the SV dataset with the top 5% of SNPs within 20-kb windows, identified by their highest cross-population composite likelihood ratio values, we pinpointed seven genes (Supplemental Table 8). These genes were associated with nitrogen metabolism and responses to various environmental conditions. For example, Glmax19G0320700, corresponding to Arabidopsis GLB1, is associated with the regulation of nitrogen metabolism (Li et al., 2020). Selection for this gene in soybeans suggests a preference for varieties with efficient nitrogen fertilizer utilization, an important trait for effective nutrient use in agricultural production.
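The window-based intersection described above can be sketched as two steps: rank 20-kb windows by their SNP-based statistic and keep the top 5%, then retain the windows that also overlap at least one SV. The example below uses toy FST values and SV coordinates; the function names, the 0-based half-open coordinate convention, and the data are assumptions for illustration, and the same logic would apply to cross-population composite likelihood ratio scores.

```python
WINDOW = 20_000  # 20-kb windows, as in the text

def top_percent_windows(window_stat: dict, percent: int = 5) -> set:
    """Return the keys of the windows in the top `percent` by statistic value."""
    n_keep = max(1, len(window_stat) * percent // 100)
    ranked = sorted(window_stat, key=window_stat.get, reverse=True)
    return set(ranked[:n_keep])

def windows_hit_by_svs(svs, window: int = WINDOW) -> set:
    """Return every (chrom, window index) overlapped by an SV (0-based, half-open)."""
    hits = set()
    for chrom, start, end in svs:
        for idx in range(start // window, (end - 1) // window + 1):
            hits.add((chrom, idx))
    return hits

# Toy data: 40 windows on one chromosome, two of them strongly differentiated.
window_fst = {("Chr01", i): 0.1 for i in range(40)}
window_fst[("Chr01", 7)] = 0.8
window_fst[("Chr01", 23)] = 0.6

# Toy SV intervals (hypothetical coordinates).
svs = [("Chr01", 150_000, 152_500), ("Chr01", 300_000, 301_000), ("Chr01", 465_000, 470_000)]

candidates = top_percent_windows(window_fst) & windows_hit_by_svs(svs)
print(sorted(candidates))
```

Genes located in the resulting candidate windows would then be passed to enrichment analysis.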

In conclusion, our work on T2T genome assemblies of cultivated and wild soybeans not only unravels genetic intricacies and identifies potential domestication genes but also paves the way for advances in agricultural traits and breeding strategies.

Data and code availability

The whole-genome sequence data reported in this paper have been deposited in the Genome Sequence Archive under accession number CRA013518. The genome assembly and annotation data reported in this paper have been deposited in the Genome Warehouse at the National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences/China National Center for Bioinformation, under accession numbers GWHEQVB00000000 and GWHEQVD00000000. The gene function annotations and repeat annotations are available at https://doi.org/10.6084/m9.figshare.25237489.v1.

Funding

This research was supported by the Natural Science Foundation of Shandong Province for Young Scholars (ZR2023QC153), the Key R&D Program of Shandong Province (2021LZGC025, 2022LZGC022, and 2023LZGC001), the National Natural Science Foundation of China (32201736), and the Agricultural Science and Technology Innovation Project of SAAS (CXGC2023F13 and CXGC2023C02).

Author contributions

K.H.J. and N.L. conceived and designed the study; K.H.J., L.L.L., X.Z., R.L., T.L.S., R.T., Y.P., Y.G., D.L., X.C., Y.J.S., and Z.Q. collected materials and analyzed the data; K.H.J., T.L.S., and L.L.L. prepared figures and tables; K.H.J. wrote the manuscript; N.L., H.D., and K.H.J. revised the manuscript; and all authors approved the final manuscript.

Acknowledgments

No conflict of interest is declared.

Published: April 10, 2024

Footnotes

Published by the Plant Communications Shanghai Editorial Office in association with Cell Press, an imprint of Elsevier Inc., on behalf of CSPB and CEMPS, CAS.

Supplemental information is available at Plant Communications Online.

Contributor Information

Zhenya Qian, Email: qyz1127@163.com.

Hanfeng Ding, Email: dinghf2005@163.com.

Nana Li, Email: q15254785555@163.com.

Supplemental information

Document S1. Supplemental Figures 1–15, Supplemental Tables 1–8, and supplemental methods
mmc1.pdf (5.5MB, pdf)
Document S2. Article plus supplemental information
mmc2.pdf (8.3MB, pdf)

References

  1. Huang Y., Koo D.H., Mao Y., Herman E.M., Zhang J., Schmidt M.A. A complete reference genome for the soybean cv. Plant Commun. 2024;5 doi: 10.1016/j.xplc.2023.100765.
  2. Li H., Liu H., Wang Y., Teng R.M., Liu J., Lin S., Zhuang J. Cytosolic ascorbate peroxidase 1 modulates ascorbic acid metabolism through cooperating with nitrogen regulatory protein P-II in tea plant under nitrogen deficiency stress. Genomics. 2020;112:3497–3503. doi: 10.1016/j.ygeno.2020.06.025.
  3. Liu Y., Yi C., Fan C., Liu Q., Liu S., Shen L., Zhang K., Huang Y., Liu C., Wang Y., et al. Pan-centromere reveals widespread centromere repositioning of soybean genomes. Proc. Natl. Acad. Sci. USA. 2023;120 doi: 10.1073/pnas.2310177120.
  4. Wang L., Zhang M., Li M., Jiang X., Jiao W., Song Q. A telomere-to-telomere gap-free assembly of soybean genome. Mol. Plant. 2023;16:1711–1714. doi: 10.1016/j.molp.2023.08.012.
  5. Zhang A., Kong T., Sun B., Qiu S., Guo J., Ruan S., Guo Y., Guo J., Zhang Z., Liu Y. A telomere-to-telomere genome assembly of Zhonghuang 13, a widely-grown soybean variety from the original center of Glycine max. J. Crop Sci. 2023;10:142–153.
  6. Zhang C., Xie L., Yu H., Wang J., Chen Q., Wang H. The T2T genome assembly of soybean cultivar ZH13 and its epigenetic landscapes. Mol. Plant. 2023;16:1715–1718. doi: 10.1016/j.molp.2023.10.003.



Articles from Plant Communications are provided here courtesy of Elsevier
