Abstract
Effective utilization of wild relatives is key to overcoming challenges in genetic improvement of cultivated tomato, which has a narrow genetic basis; however, current efforts to decipher high-quality genomes for tomato wild species are insufficient. Here, we report chromosome-scale tomato genomes from nine wild species and two cultivated accessions, representative of Solanum section Lycopersicon, the tomato clade. Together with two previously released genomes, we elucidate the phylogeny of Lycopersicon and construct a section-wide gene repertoire. We reveal the landscape of structural variants and provide entry to the genomic diversity among tomato wild relatives, enabling the discovery of a wild tomato gene with the potential to increase yields of modern cultivated tomatoes. Construction of a graph-based genome enables structural-variant-based genome-wide association studies, identifying numerous signals associated with tomato flavor-related traits and fruit metabolites. The tomato super-pangenome resources will expedite biological studies and breeding of this globally important crop.
Subject terms: Plant breeding, Genomics, Genomics
A tomato super-pangenome constructed using chromosome-scale genomes of nine wild species and two cultivated accessions highlights genomic diversity and structural variation across wild and cultivated tomatoes.
Main
Tomato (Solanum lycopersicum L.) is among the most important vegetable crops in terms of global production (http://www.fao.org/faostat/en/#data/QCL), also serving as a classic model system for genetic, developmental and physiological studies of fleshy fruits1. It belongs to the genus Solanum in the nightshade family Solanaceae. Cultivated tomatoes have lost substantial genetic diversity owing to a domestication bottleneck and intensive artificial selection in pursuit of bigger fruits and higher yield2, which has impeded tomato improvement. By contrast, wild tomatoes in Solanum section Lycopersicon, which have adapted to various ecological environments in western South America including the Galapagos islands, from offshore to 3,600 m above sea level3, exhibit broad genetic and phenotypic diversity4–6. These wild species represent a rich source of allelic variation and harbor genes underlying biotic and abiotic stress tolerance, as well as consumer-preferred traits such as high levels of soluble solid content, lycopene and flavor compounds2,7. Hence, effective harnessing of natural diversity from these wild germplasms is essential to facilitate tomato genetic improvement.
The availability of the tomato reference genome (S. lycopersicum var. lycopersicum cv. Heinz 1706)8,9 has enabled comprehensive characterization of genetic diversity in terms of SNPs and small insertion/deletions (indels) by resequencing numerous accessions, revealing the domestication and wild introgression history of tomato10,11. Despite this, increasing numbers of studies have indicated that large structural variants (SVs), such as presence/absence variants and copy number variants (CNVs), also have vital roles in plant adaptive evolution and functional diversity12,13. However, conventional strategies based on a single linear reference genome can only capture a portion of genetic diversity, resulting in strong reference biases, and accurate detection of SVs is still challenging using merely short-read resequencing approaches.
To overcome these limitations, pangenomics, as applied in human and many plant species, has emerged as a promising approach to capture the nearly full spectrum of genetic diversity of crops and their wild relatives14,15. Recent genomic advances in tomato include a pangenome of 725 tomato accessions constructed using short reads13, a pan-SV map built from Oxford Nanopore long reads of 100 diverse tomato lines12 and a graph pangenome integrating variant information from 838 tomato genomes16. These studies suggest that SVs contribute to phenotypic variance and can be powerful when utilized in genome-wide association studies (GWAS). However, most of the accessions sampled in the studies were domesticated tomatoes and their closely related progenitor species Solanum pimpinellifolium. Currently, genome assemblies for five wild tomato species, Solanum habrochaites11,17, Solanum pennellii11,18,19, Solanum galapagense17, S. pimpinellifolium20,21 and Solanum lycopersicoides22, are available, with different assembly approaches and qualities, which impedes the characterization and utilization of genetic variants in tomato wild relatives. Recently, it was highlighted that a super-pangenome that includes genomic information of many diverse species, especially wild relatives within a genus, could expedite crop improvement23. Therefore, it is necessary to assemble additional reference genomes for tomato wild relatives to accelerate biological studies and genetic improvement in tomato. In this study, we construct a section-wide super-pangenome by de novo assembling 11 chromosome-level genomes from ten tomato species, representing major clades of tomato wild relatives and their cultivated counterparts in Lycopersicon. Comparative analyses reveal the panorama of genomic content, evolutionary history and structural variation across tomato species, empowering the discovery of a wild tomato gene that has the potential to increase yield in modern cultivated tomatoes. These results will provide insight for the construction and exploitation of super-pangenomes in other crop species.
Results
Eleven wild and cultivated tomato reference genomes
To represent the diversity of wild and cultivated tomato species, we selected nine wild tomatoes (eight species from Solanum section Lycopersicon: S. habrochaites, Solanum chilense, Solanum peruvianum, Solanum corneliomulleri, Solanum neorickii, Solanum chmielewskii, S. pimpinellifolium and S. galapagense; and one from Solanum section Lycopersicoides: S. lycopersicoides) and two diverse domesticated tomatoes (S. lycopersicum var. cerasiforme and S. lycopersicum var. lycopersicum cv. M82; Table 1). We assembled a high-quality chromosome-scale reference genome of wild tomato S. galapagense ‘LA0436’, using a hybrid assembly approach integrating Pacific Biosciences (PacBio) sequencing, optical genome mapping (Bionano Genomics) and high-throughput chromosome conformation capture (Hi-C; Supplementary Note and Supplementary Tables 1–5). The 802-Mb final assembly had a contig N50 length of 15.5 Mb, and more than 99.5% of sequences in the final assembly were anchored to the 12 chromosomes, higher than the corresponding percentages for the three existing reference genomes ‘LA2093’ (99.0%), ‘Heinz 1706’ (97.5%) and ‘LA716’ (93.6%) (Table 1). The ten other tomato genomes were also assembled at chromosome level using the above-mentioned strategy, except that Bionano data were not generated. These genomes had monoploid assembly lengths ranging between 770.0 Mb (S. chmielewskii) and 1.2 Gb (S. lycopersicoides), close to their predicted genome sizes (Table 1 and Supplementary Tables 6 and 7). More than 99% of Illumina short reads and 95.7% of ESTs could be mapped to the 11 tomato genome assemblies, and 94.0% of embryophyte Benchmarking Universal Single-Copy Orthologs (BUSCO)24 were captured in these assemblies, indicative of their high completeness (Supplementary Tables 8–10).
Table 1.
Accession | Assembly size (Mb) | Percentage of anchoring (%) | Contig N50 (kb) | No. of predicted genes | Repeats (%) | BUSCO (%) |
---|---|---|---|---|---|---|
S. lycopersicoides (LA2951) | 1,200 | 92.23 | 579 | 32,295 | 71.81 | 90.1 |
S. habrochaites (LA1777) | 960 | 86.07 | 546 | 32,386 | 69.05 | 92.0 |
S. pennellii (LA716)a | 990 | 93.62 | 46 | 44,965 | 64.82 | 96.3 |
S. chilense (LA1969) | 917 | 88.11 | 425 | 34,375 | 73.70 | 94.9 |
S. peruvianum (LA0446) | 867 | 91.90 | 678 | 31,877 | 73.83 | 94.9 |
S. corneliomulleri (LA1331) | 877 | 88.60 | 449 | 31,692 | 74.49 | 94.2 |
S. neorickii (LA0247) | 778 | 94.07 | 2,079 | 32,831 | 72.74 | 93.1 |
S. chmielewskii (LA1028) | 770 | 95.44 | 2,002 | 31,613 | 72.26 | 94.4 |
S. pimpinellifolium (LA1547) | 803 | 94.78 | 3,691 | 33,427 | 72.77 | 93.1 |
S. galapagense (LA0436) | 802 | 99.56 | 15,538 | 32,773 | 71.45 | 96.7 |
S. lycopersicum var. cerasiforme (LA1464) | 778 | 94.68 | 2,513 | 32,941 | 73.50 | 96.7 |
S. lycopersicum var. lycopersicum (M82) | 881 | 76.83 | 600 | 31,773 | 64.31 | 93.5 |
S. lycopersicum var. lycopersicum (Heinz 1706)a | 828 | 97.48 | 6,008 | 35,768 | 63.46 | 96.4 |
aGenomes reported in previous studies.
We combined ab initio prediction, homology search and transcriptome mapping approaches for protein-coding gene prediction (Methods), resulting in gene numbers ranging from 31,613 (S. chmielewskii) to 34,375 (S. chilense), similar to that of Heinz 1706 (35,768) but fewer than that of LA716 (44,965) (Table 1). A total of 81.7% to 89.5% of exons of the predicted genes were supported by transcript data, suggesting the high quality of gene predictions. All assembled genome sequences and annotations are publicly accessible through a web-based database (http://caastomato.biocloud.net).
Eukaryotic genomes are rich in transposable elements (TEs), which shape genome evolution through expansions, eliminations and transpositions25. The TE contents of the 11 tomato genomes ranged from 64.3% to 74.5%, with long terminal repeat retrotransposons (LTR-RTs) representing the most abundant class of TE (Fig. 1a). A higher abundance of Gypsy LTR-RTs was found in S. lycopersicoides, which possibly contributed to it having the largest assembled genome size (1.2 Gb) among the tomato species22 (Fig. 1a). To trace the evolutionary history of the expanded TEs in S. lycopersicoides, we estimated insertion times of 162,216 intact LTR-RTs and detected a lineage-specific burst of Gypsy LTR-RTs occurring c. 2 million years ago (Ma) in S. lycopersicoides, after its divergence from potato, probably leading to its large extant genome (Supplementary Fig. 3). Notably, we observed recent amplification of Gypsy and Copia LTR-RTs in four wild tomato species (S. lycopersicoides, S. corneliomulleri, S. peruvianum and S. chilense; Supplementary Fig. 3), implying that these wild species may have increasing degrees of genomic diversity and environmental adaptability compared with cultivated tomatoes. These results provide insight into the role of TEs in genome evolution of the Solanum genus.
Phylogeny of Lycopersicon and neighboring species
Reconstructing the phylogeny of Lycopersicon species has been problematic owing to the conflict between gene trees and morphological trees, especially for the wild tomato clade3. The phylogenetic relationship between S. pennellii and other tomatoes remains unresolved6, owing largely to limited available genomic data, despite S. pennellii being considered to be a unique group based on morphological classification. Using 9,343 single-copy orthologous genes, we inferred the phylogeny of ten wild and three domesticated tomatoes, using potato (Solanum tuberosum) as an outgroup; the results indicated that section Lycopersicoides (including S. lycopersicoides) was sister to section Lycopersicon (Fig. 1b), consistent with previous research3. Based on the phylogeny, we resolved the polytomy issue in Lycopersicon and unambiguously classified Lycopersicon species into four main clades. Clade I encompassed two species, S. pennellii and S. habrochaites, which diverged from the common ancestor of the other wild and cultivated tomatoes (except S. lycopersicoides) c. 1.97 Ma. Clade IV, which comprised domesticated tomatoes and two closely related wild species (S. galapagense and S. pimpinellifolium), divided from the ancestor of clade III (S. neorickii and S. chmielewskii) approximately 1.73 Ma. Similar to a recent study of Oryza genus evolution26, a few conflicts were observed between the phylogeny constructed using genes from one chromosome and that built using whole-genome genes (Fig. 1b and Supplementary Fig. 4). For example, within Lycopersicon, phylogenetic analyses using genes from chromosomes 1, 2, 9 and 11 showed that S. pennellii was sister to other wild and cultivated tomato species, rather than clustering into a monophyletic group with S. habrochaites as inferred from the genome-wide phylogeny (Supplementary Fig. 4), suggesting possible incomplete lineage sorting and/or hybridization events. These results enhance our understanding of the evolutionary history within Solanum section Lycopersicon.
Super-pangenome of tomato
Although pangenomes for cultivated tomato and its close wild relatives have been reported13, the gene pool of Lycopersicon, which contains wild and cultivated tomato species, remains largely inaccessible. Here, we extended the tomato pangenome that integrates genomes from three Solanum species13 to a super-pangenome covering 11 species in the Solanum genus. We defined 40,457 pangene families by clustering protein-coding genes of the 11 chromosome-scale genomes assembled herein and two previously released genomes8,18; this number of gene families was higher than that of the Oryza genus26 but lower than that of soybean27. The number of gene families increased rapidly when including more genomes, suggesting that the 13 genomes are diverse and that a single reference genome cannot capture the full genetic diversity in tomato (Fig. 2a). Only 54.0% of gene families were conserved among the 13 tomato genomes (core gene families), and the number of core genes (23,839) was lower than that of the previously reported pangenome of 519 cultivated tomatoes and 67 closely related wild tomato accessions belonging to S. pimpinellifolium and S. galapagense (29,938)13, owing largely to the higher level of divergence among the wild tomato species used in this study. The dispensable gene families (present in two to 12 accessions) occupied 38.4% of gene families, and 7.6% of pangene families were categorized as accession-specific.
Gene ontology (GO) enrichment analysis showed that core genes were enriched for biological processes including carboxylic acid, lipid or organic substance metabolic process, RNA modification or processing and amide transport, consistent with the results of a previous study13 (Supplementary Tables 11 and 12), whereas the dispensable genes were enriched for terpenoid biosynthesis, telomere maintenance, mitochondrial electron transport and photosynthesis (Supplementary Fig. 6). Expression levels of core genes were significantly higher than those of dispensable genes at different fruit ripening stages (P < 2.2 × 10−16, Wilcoxon rank-sum test; Supplementary Fig. 7). We found that 3,441 out of the 4,874 nonreference genes reported from the previous tomato pangenome13 were captured in our super-pangenome (Supplementary Table 13), and we also identified 9,320 nonredundant genes absent from the reported tomato pangenome13 (Supplementary Note and Supplementary Table 14), indicating the rich diversity of the 13 wild and domesticated tomatoes. This super-pangenome dataset lays a foundation for exploration and exploiting of genes or alleles in wild tomato species.
Extensive variation among wild and cultivated tomatoes
Despite efforts to characterize genetic variants among cultivated tomatoes and their proposed progenitor species S. pimpinellifolium12,13,16, the genetic diversity among distantly related wild tomato species, for example, S. peruvianum, S. habrochaites and S. chilense, remains poorly explored. We identified 2.0–8.1 million SNPs and 0.6–1.5 million small indels (≤50 base pairs (bp) in size) in the 12 tomato genomes, relative to the reference S. galapagense genome. The total number of SNPs and small indels (42.4 M) was much higher than that of each accession (Supplementary Tables 15 and 16), suggesting a diverse nature among the 12 wild and cultivated tomato accessions (Supplementary Note). Leveraging genome alignments, we identified 103,333 insertions, 119,794 deletions, 41,960 CNVs, 23,516 translocations and 1,320 inversions (<1 Mb in length) in the 12 tomato accessions compared with the S. galapagense genome (Supplementary Tables 17 and 18). Species in clade II (S. chilense, S. peruvianum and S. corneliomulleri) contained markedly varied numbers of SVs (Fig. 2b), possibly associated with the recent proliferation of LTR-RTs in those genomes (Supplementary Fig. 3). The majority of insertions, deletions and CNVs were shorter than 2 kb, 2 kb and 8 kb, respectively, and most of the translocations had lengths shorter than 20 kb, whereas some inversions were longer than 300 kb (Supplementary Fig. 17). We found that insertions and deletions were more likely to be found at both ends of the chromosomes, consistent with previous studies12,20, whereas inversions and translocations were randomly distributed along the 12 chromosomes (Fig. 2c). SVs were more likely to occur at repeat regions than nonrepeat genomic regions (Student’s t test, P = 1.03 × 10−4). We further identified 5,186 large indels (>50 bp) fixed either in all wild or all domesticated tomato genomes investigated in this study, some of which led to insertions of protein-coding genes present only in certain wild tomato genomes (Supplementary Table 19 and Fig. 2d). Further functional characterization of these variants may enable a better understanding of the genetic basis of phenotypic divergence between domesticated tomatoes and their wild relatives.
Previous studies have identified several SVs responsible for phenotypic variation, including a 1.4-kb deletion in the CSR gene resulting in increased fruit weight28, a 7.1-kb deletion in the LNK2 locus responsible for a light-conditional clock deceleration29, an 85-bp deletion in the promoter of ENO that regulates floral meristem activity30 and a CNV affecting NSGT associated with biosynthesis of a fruit flavor volatile guaiacol12. These SVs were all accurately detected in this study (Supplementary Figs. 18–21), indicating the broad diversity of our collection. Two different alleles (4,724 bp and 4,151 bp) have been identified at 149 bp upstream of TomLoxC, a gene encoding a 13-lipoxygenase; the 4,151-bp allele was reported to contribute to desirable fruit flavor and is rare in cultivated tomatoes13. We found that S. pennellii, S. habrochaites, S. chilense and S. neorickii carried the 4,151-bp allele upstream of TomLoxC (Supplementary Fig. 22), suggesting that these wild species have the potential to improve fruit flavor in cultivated tomato by backcrossing. The extensive variation among wild and cultivated tomato species presented herein provides access for further harnessing of the genetic diversity of distantly related wild tomato species in genomic-based breeding.
Hidden genetic diversity of tomato wild species
Large inversions have been reported to suppress recombination by reducing crossing-over31,32, resulting in severe linkage drag when conducting backcross breeding. To overcome this, it is necessary to choose donor lines without inverted segments harboring targeted genes. However, a holistic view of genome-wide inversions is not available, owing to the lack of chromosome-scale wild tomato genomes. Based on the 11 high-quality tomato genomes, we identified 12 (S. lycopersicum var. lycopersicum cv. Heinz 1706) to 42 (S. chmielewskii) megabase-scale inversions compared with the S. galapagense genome (Supplementary Table 20). Notably, a 7.1-Mb inversion on chromosome 3, carrying 55 genes, was present in all clade IV tomato accessions compared with other wild species (except S. pennellii) and was supported by clear chromatin interactions around the breakpoints when Hi-C reads of S. neorickii and S. chmielewskii were mapped to the S. galapagense genome (Fig. 2e). This inversion might occur after the divergence between species from clade IV and other clades. Given that S. pennellii does not carry this inversion within this region, this wild tomato species would be an ideal donor parent to introduce possibly favored genes within this 7.1-Mb segments into elite cultivars by backcrossing.
Previous research reported a tomato pan-SV map, which was built by long-read sequencing of 100 cultivated and closely related wild tomato accessions12. Compared with this pan-SV map, 180,314 out of the 224,447 SVs were exclusively identified in this study, of which 4,124 (2.3%) were localized within coding regions (CDS) of 3,515 genes (Supplementary Note and Supplementary Table 21), suggesting that the majority of SVs found in this study were captured owing to the inclusion of distantly related wild tomato species. Integrating our identified SVs with the pan-SV dataset generated 153,873 insertions, 203,364 deletions, 2,952 inversions and 45,987 duplications in 112 tomato accessions (12 in this study and 100 in the pan-SV map), allowing us to investigate the divergence of SVs during tomato evolution. We divided these 112 accessions into four groups: wild (19 non-S. pimpinellifolium wild accessions), SP (22 S. pimpinellifolium accessions), SLC (24 S. lycopersicum var. cerasiforme accessions) and SLL (47 big-fruited S. lycopersicum var. lycopersicum accessions; Supplementary Fig. 24a). The vast majority of SVs displayed relatively low frequencies (<0.25) in all four groups, and accessions from the wild group contained a higher proportion of SVs with presence frequency between 0 and 0.25 (Supplementary Fig. 24b). We observed that 8,094 SVs exhibited significant frequency changes between the wild and cultivated (SLL and SLC) groups (Fisher’s exact test, false discovery rate (FDR) < 0.01; Supplementary Fig. 24c), affecting upstream regions and exons of 2,585 genes. Functional analyses indicated that these genes were mainly enriched for biological processes such as meristem development and ammonium transport (Supplementary Fig. 24d). We further identified 388 highly divergent SVs between wild and cultivated tomatoes, which disrupted CDS of 328 genes by causing frameshift, loss of exons or in-frame insertions (Supplementary Table 22). These results suggest that SVs in these distantly related wild tomatoes have undergone distinct evolutionary trajectories compared with cultivated tomatoes and their progenitors. Our analyses also provide a candidate dataset for further characterizing genes underlying phenotypes with great divergence between wild and cultivated tomatoes.
A wild tomato cytochrome P450 gene that increases yield
A major goal of tomato breeding is to increase yield by developing varieties with larger fruit size and/or more effective shoot branches. Regulation of shoot architecture is thus of great interest to the tomato research community33. Wild tomato species usually display a markedly greater number of lateral fruit-bearing branches than their domesticated counterpart; however, whether we can introduce this trait into cultivated tomatoes, especially modern processing tomato varieties, remains elusive. Among the 388 highly divergent SVs between wild and cultivated tomatoes that greatly affected gene CDS, a 244-bp deletion, showing the second most significant frequency change (FDR = 1.43 × 10−8; Supplementary Fig. 24c), was present in the first exon of Sgal12g015720 (Fig. 3a,b). This gene encodes a protein belonging to the cytochrome P450 superfamily, which has been reported to play important parts in plant growth, development and secondary metabolite biosynthesis34. The 244-bp deletion was found in 22.22% of the 19 wild accessions and 100% of cultivated tomatoes, which represented the derived state, as this deletion was absent from all the nine wild tomato species used in this study (Fig. 3a,b and Supplementary Fig. 25). Sgal12g015720 was expressed at the highest level in stems of the wild tomato S. pennellii (Fig. 3c), whereas its expression in two cultivated tomatoes could barely be detected (Supplementary Fig. 26). These results suggest that the 244-bp deletion event may have occurred during tomato domestication, which might lead to pseudogenization of Sgal12g015720 in cultivated tomato.
To investigate functions of this gene and its potential value for tomato breeding, we generated Sgal12g015720-overexpression (OE) transgenic lines under the background of a tomato cultivar ‘Micro-Tom’ (Supplementary Fig. 27). Compared with the wild-type (WT) plants, the transgenic lines possessed a greater number of lateral branches, resulting in a greater than twofold increase in total fruit number, whereas only slight reductions in single fruit weight, transverse diameter and longitudinal diameter were observed (red fruits; Fig. 3d–i). To further validate the function of Sgal12g015720, we screened previously reported introgression lines (ILs)35 generated using wild tomato S. pennellii (LA716, donor parent) and cultivated accession M82 (recurrent parent). As expected, two ILs, IL12-2 and IL12-3, carrying a homozygous introgressed segment that harbors an Sgal12g015720 ortholog from the wild tomato donor, generated markedly more lateral branches and fruits compared with the recurrent parent M82 (Supplementary Fig. 28). Therefore, this gene represents a promising target for regulation of plant architecture as well as increasing yield or biomass in tomato breeding. These analyses also present an example of how the super-pangenome could facilitate tomato biological studies and breeding.
Graph-based genome enables SV-based GWAS in tomato
Numerous studies have suggested that SVs are causative variants responsible for agronomically important traits12,13,36,37. However, population-scale SV genotyping is challenging in plants, impeding the exploitation of SVs in identifying genotype–phenotype associations. Here, we constructed a tomato graph-based genome by integrating the linear reference genome sequence of S. galapagense and the 360,189 SVs identified from the 12 tomato genomes and the 100 previously reported tomato genomes12. Graph-based genomes are capable of storing both reference and alternative allele sequences while retaining the coordinate systems of the linear reference genome, which facilitates mapping of short reads from SV regions and thus SV genotyping38,39. We then genotyped these SVs in a tomato population comprising 321 accessions2 and performed SV-based GWAS for 32 flavor-related compounds2 and 362 fruit metabolites40. For comparison, we also called SNPs and indels from the 321 accessions and employed SNP-based GWAS.
Significantly associated signals were detected for 17 flavor volatiles and 249 fruit metabolites. Surprisingly, we observed that only 5.2% (161) of peaks (quantitative trait loci) overlapped (800-kb flanking region) between SV-based and SNP-based GWAS results, and 21.3% (658) could only be identified by SVs. The remaining 2,263 (73.4%) were exclusively detected by SNPs (Fig. 4a,b and Supplementary Table 23). Examples included a peak at 65.2 Mb on chromosome 10 that could only be detected using SVs, which was strongly associated with the content of geranylacetone (P = 7.91 × 10−9), one of the important tomato flavor volatiles contributing a leafy flavor to fruits (Fig. 4c and Supplementary Fig. 29). The leading SV was a 347-bp deletion, and the content of geranylacetone in tomato fruits significantly differed between accessions carrying the reference allele and those carrying the alternative allele (Student’s t test, P = 3.7 × 10−8, Fig. 4c). Similarly, we detected significantly associated SVs for the content of additional metabolites (Fig. 4d–f, Supplementary Figs. 30–32 and Supplementary Table 23). Tomato accessions carrying alleles of corresponding leading SVs showed significantly increased content of these metabolites (Fig. 4d–f). This identification of SVs exhibiting significant associations with important tomato fruit flavor compounds and metabolites will pave the way for further fine mapping and isolation of putative candidate genes. Our SV-based GWAS provide an important complement to the conventional SNP-based GWAS, which will be helpful to develop markers for breeding flavor-improved tomato cultivars.
Discussion
Domestication of tomato has led to a substantial loss of genetic diversity in modern varieties due to the bottleneck and successive rounds of artificial selection; therefore, the rich diversity of wild tomato species contains valuable breeding materials. However, the availability of only a few wild tomato genomes has hampered the exploration and utilization of alleles and gene repertoire in those wild species. The chromosome-scale reference genomes for nine wild tomato species presented here offer valuable resources for not only comparative genomics but also biological studies and molecular breeding in tomato. Notwithstanding, our dataset still lacks three wild tomato species in Solanum section Lycopersicon (Solanum cheesmaniae, Solanum huaylasense and Solanum arcanum). S. cheesmaniae is endemic to the Galápagos island with yellow to orange fruits41, whereas S. huaylasense and S. arcanum are wild tomatoes segregated from S. peruvianum42. Development of their genome sequences and annotation will further enrich our understanding of the biodiversity and evolutionary trajectory within Lycopersicon.
Although pangenomes for numerous crops have been reported, most of them incorporated one or a few species43. Here, we constructed a super-pangenome by analyzing 11 distinct tomato species, representative of major wild and cultivated tomato clades. Coupling this with an existing dataset12, we identified a wild tomato gene that could increase fruit yield by an average of 67.1% in OE transgenic lines (Fig. 3d–f). As both OE lines and ILs carrying this gene had higher numbers of fruit-bearing branches (Fig. 3d and Supplementary Fig. 27), we anticipate its use in modern processing tomatoes. According to tomato population resequencing data, this gene was predominantly found in wild tomato accessions (52% of S. pimpinellifolium, 80% of S. cheesmaniae and 100% of S. galapagense), in contrast to a mere 6% and 19% in cultivated tomato forms S. lycopersicum var. lycopersicum and S. lycopersicum var. cerasiforme, respectively (Supplementary Table 24). These results indicate that this gene, although potentially important, has not been widely utilized in tomato breeding programs. Backcrossing would be an ideal approach to introduce this gene into cultivated tomatoes from wild species. However, hybridization between wild and cultivated crops may lead to severe repression of genetic recombination, owing largely to the presence of large-scale genomic divergence, such as large inversions10,31. This may ultimately result in the introduction of exotic genomic fragments carrying unfavorable alleles that are hard to purge40. We did not observe chromosomal rearrangements between the genome of Heinz 1706 and those of eight out of the nine wild species surrounding this gene (Supplementary Table 25), suggesting that introgression of this gene by backcrossing, when the donor parent is properly selected, would be less likely to cause linkage drag.
To facilitate the utilization of genetic diversity from our super-pangenome, we constructed a graph-based genome reference for wild and cultivated tomatoes by integrating SV information for 112 tomatoes from 11 Solanum species into the linear reference sequence, offering a powerful platform for population-level SV genotyping. As previous research has suggested that SVs are more likely to be causal variants in tomato16, further studies could use this graph-based genome to perform SV-based association analyses to identify additional signals responsible for agronomically important traits. However, the current graph-based tomato genome is only capable of storing certain types of SVs: insertions, deletions and inversions. Other SVs of relatively high complexity, such as inverted duplications and translocations, cannot yet be integrated. Furthermore, SVs with multiple alleles are not represented in the graph, as downstream analytic pipelines can only handle biallelic variants. It is possible that an insertion with distinct inserted fragments in various individuals contributes to different phenotypic outcome. We anticipate further implementation of relevant tools and algorithms that could tackle these issues. This research will accelerate comparative genomics and biological studies in tomato and shed light on the utilization of super-pangenomes in crop improvement.
Methods
Plant materials
Briefly, eight wild species from section Lycopersicon (S. galapagense, S. pimpinellifolium, S. chmielewskii, S. neorickii, S. corneliomulleri, S. peruvianum, S. chilense and S. habrochaites), one wild species from section Lycopersicoides (S. lycopersicoides) and two domesticated tomatoes (S. lycopersicum var. lycopersicum cv. M82 and S. lycopersicum var. cerasiforme) were collected. All seedlings were planted in Anningqu field test station of Xinjiang Academy of Agricultural Sciences.
De novo genome assembly
Methods for library construction and sequencing are provided in the Supplementary Note. Contig-level assemblies for the 11 representative accessions were conducted using a pipeline based on Canu (v.1.5)16,44 with the following procedures: longer seed reads were selected with the settings corOutCoverage = 35; raw read overlapping was detected using a highly sensitive overlapper MHAP45 (v.2.1.2, parameter corMhapSensitivity = normal), and error correction was performed using the Falcon46 sense method (option correctedErrorRate = 0.025); error-corrected reads were trimmed of unsupported bases and hairpin adapters to reach their longest supporting range with default parameters, and the draft assemblies were then generated using the top 80% longest trimmed reads. Finally, to ensure base accuracy of assembly results from SMRT molecules, we further polished the consensus genome sequences based on Illumina paired-end reads using Pilon47 (v.1.22) with parameter: -mindepth 10–fix bases.
Scaffolding using Bionano optical maps
For S. galapagense, we constructed Bionano optical maps. Young leaves were collected after two days of dark treatment. High-molecular-weight DNA was isolated and labeled with the restriction endonuclease Nb.BssSI, and labeled DNA was imaged with a Bionano Irys system. Molecules with lengths >150 kb, label SNR ≥3.0 and average molecule intensity <0.6 were retained for scaffolding. These molecules were de novo assembled into genome maps using IrysSolve v.3.5_12162019 (https://bionanogenomics.com/support/software-downloads/). Pairwise comparison was first performed with RefAligner (https://bionanogenomics.com/support/software-downloads/) to identify overlaps among these molecules, and consensus maps were constructed. All molecules were then mapped back to the consensus maps and recursively refined and extended.
The Bionano IrysSolve module ‘HybridScaffold’ was used to perform hybrid assembly between the assembled contigs and genome maps. Assembled contigs were first converted into cmap format and then aligned to the contig cmaps with RefAligner, followed by label rescaling. The rescaled Bionano cmaps were aligned again to the contig cmaps, and sequences were split at the conflict points. Finally, scaffolds were built according to the alignment information. To improve the contiguity of assembly results, PBJelly48 (v.15.8.24) was used to fill gaps using the error-corrected PacBio reads.
Pseudomolecule construction
The Hi-C data were mapped to the assemblies using BWA49 (v.0.7.10-r789) with default parameters. Only uniquely aligned read pairs with mapping quality >20 were retained for further analysis. Duplicate removal, sorting and quality assessment were performed using HiC-Pro50 (v.2.8.1) with default parameters. Only valid interaction pairs of Hi-C reads were fed into LACHESIS (v.1.0)51 for chromosome-scale scaffold construction. Briefly, contigs or scaffolds for each tomato assembly were broken into fragments with a length of 200 kb and then clustered using valid interaction read pairs by LACHESIS with the following parameters: ‘CLUSTER_MIN_RE_SITES = 22, CLUSTER_MAX_LINK_DENSITY = 2, CLUSTER_NONINFORMATIVE_RATIO = 2, ORDER_MIN_N_RES_IN_TRUN = 10, ORDER_MIN_N_RES_IN_SHREDS = 10’. We manually checked the Hi-C interaction heat maps to identify potential genomic regions containing assembled haplotigs due to heterozygosity, which were then excluded from the assembly. The manual curation step was reperformed several times, until the chromatin interaction signals reflecting putative haplotigs were undetectable.
Evaluation of genome assemblies
Completeness of the assembled tomato genomes was first assessed using BUSCO24 (v.5.2.0) based on the embryophyta_odb9 database. We also assessed the mapping proportions of transcripts assembled with Trinity (v.2.8.5)52 to corresponding genome assemblies using BLASTN (v.2.12.0+)53 with minimum alignment length of 300 bp and sequence identity >95%. These assemblies were also evaluated by mapping the Illumina short reads using BWA (default parameters).
Repeat sequence annotation
Both homology-based and de novo strategies were applied to identify repetitive sequences for all the tomato genomes. Four de novo prediction programs were applied: RepeatScout54 (v.1.0.5), LTR-FINDER55 (v.1.05), MITE-hunter (v.1.0)56 and PILER-DF57 (v.1.0). Results from these four programs were integrated into a repetitive sequence database, which was then merged with Repbase58 (v.19.06) and classified into different categories by the PASTEClassifier.py script included in REPET59 (v.2.5). Using this repeat database, repetitive sequences were identified by homolog searching using RepeatMasker60 (v.4.0.5) with default parameters. We computed the genetic distance (K) between both ends of an intact LTR-RT using the distmat (default parameters) program in the EMBOSS package (v.6.6.0)61, and the insertion time (T) of each intact LTR-RT was estimated using the formula T = K/2μ, where μ is the base substitution rate of 1.3 × 10−8 (ref. 62).
Gene prediction and functional annotation
De novo, homology-based and transcriptome-based strategies were used to predict protein-coding genes for all tomato genomes assembled in this study. Predicted proteins from four plant genomes (Arabidopsis thaliana, Oryza sativa, S. lycopersicum and S. tuberosum) were used to perform homology-based prediction with GeMoMa63 (v.1.3.1). Regarding de novo prediction, three different programs were used: GENSCAN (http://hollywood.mit.edu/GENSCAN.html, v.1.0), AUGUSTUS51 (v.2.4) and GlimmerHMM64 (v.3.0.4). We used AUGUSTUS with parameters trained by unigenes, which were assembled from pooled transcriptome data. As for the third approach, transcriptome data generated from pooled tissues of leaves, stems and roots were assembled using HISAT2 (ref. 65) (v.2.0.4) and StringTie66 (v.1.2.3), and the assembled contigs were aligned to the genome assemblies using BLAT (v.36)67 (identity ≥0.95, coverage ≥0.90). The assembled contigs were then filtered using PASA68 (v.2.0.4). We also mapped pooled transcriptome data to the reference genome using TopHat (v.2.0.12)69 and performed reference-guided assemblies with Cufflinks (v.2.2.1)70. Transdecoder71 (v.2.0) was then used to infer the structures of gene models and transcripts assembled by Cufflinks. By giving weights for the three methods, all predicted gene structures were synthesized into consensus gene models using EVidenceModeler72 (v.1.1.1). All gene models were annotated according to their best BLASTP43 (v.2.2.31; E-value <1 × 10−5) hits in protein databases including KEGG73, Swiss-Prot74, TrEMBL74 and nonredundant protein database NR75. Blast2GO (v.4.1.8)76 was used to assign GO terms for each gene.
Phylogenetic tree construction and divergence time estimation
We selected S. tuberosum as the outgroup to infer species phylogeny. Single-copy orthologous genes were identified using quota-alignment (v.1.0)77 with parameters ‘–merge–format=raw–Dm 30–Nm 40’. A total of 9,343 orthologous groups were identified among the 14 genomes. Protein sequences of the 9,343 single-copy orthologous genes were aligned using MUSCLE78 (v.3.8.31) and the alignments were then concatenated. We constructed a phylogenetic tree using phyML (v.3.3.20190909)79 with parameters ‘–model JTT -f e -v 0.576 -a 0.886–nclasses 4–search SPR -t e’. The divergence time was estimated using the MCMCtree program in the PAML package80 (v.4.7b). Three calibration points (S. tuberosum versus S. lycopersicum var. cerasiforme: 7.0–8.0 Ma; S. lycopersicoides versus S. lycopersicum var. cerasiforme: 2.0–2.7 Ma; and S. pimpinellifolium versus S. lycopersicum var. cerasiforme: 1.0–1.5 Ma)81 were used to constrain the divergence time.
Analyses of the super-pangenome
To identify homologous relationships among the genomes of 11 tomatoes assembled in this study, S. lycopersicum var. lycopersicum cv. Heinz 1706 and S. pennellii, the longest transcript of each predicted gene in each genome was chosen as a representative. To handle unannotated genes, a common issue during gene prediction, we aligned coding sequences of all predicted genes to each of the 13 tomato genomes using GMAP (v.2015-06-12)82. If a gene showed more than 80% alignment coverage and identity, and no gene was predicted within the aligned regions, it was considered to be an unannotated gene in the corresponding genome and was not regarded as ‘missing’ in the further analysis. An all-against-all comparison was then performed using BLASTP53 (E-value <1 × 10−5), followed by clustering using OrthoFinder (v.2.5.2)83 with default parameters. Based on the clustering results, we extracted gene families that were shared among all samples; these were defined as core gene families. Genes that were absent from two or more samples were defined as dispensable gene families, whereas those only present in one individual were considered to be specific gene families. Clade-specific gene families were defined as those exclusively present in one of the four clades of wild and cultivated tomatoes. Enrichment analysis with respect to GO terms was performed using the ‘topGO’ R package (https://bioconductor.org/packages/topGO). Details of the methods used for comparison of the super-pangenome and the previously reported tomato pangenome are provided in the Supplementary Note.
Identification of genetic variants
We performed pairwise genome alignments between each of the 12 genomes and the S. galapagense reference genome using the nucmer program in MUMmer84 (v.4.0.0beta2) with default parameters. The resultant alignments were filtered to retain the one-to-one alignment blocks, and SNPs and indels (<50 bp in length) were identified by the show-snps program within MUMmer with parameters ‘-Clr -x 1 -T’. For identification of SVs, two sets of SV calling results were generated using SVMU (v.0.4-alpha)85 and SyRI (v.1.2)86, respectively, both using default parameters. For SV detection using SVMU, intergenomic alignments were performed using the nucmer program in MUMmer84 (v.4.0.0beta2) with default parameters. The results were then parsed in SVMU to produce collinear blocks and insertions, deletions and CNVs. Insertions and deletions larger than 50 bp inside the syntenic alignment regions were also kept for further analysis. For SV calling using SyRI, minimap2 (v.2.21-r1071)87 was used to generate pairwise genome alignments with parameters ‘-ax asm5–eqx’. The alignment results were subsequently passed to SyRI, and SVs consisting of insertions, deletions, inversions (<1 Mb in size) and translocations (>50 bp in length) were kept. SVs encompassing ‘N’ sequences were removed. SVs with ambitious alignment margins and/or poor synteny alignment surrounding the breakpoint were also filtered. We only kept CNVs from SVMU output by applying a filtering of length >50 bp and coverage of reference or coverage of query ≥2 or ≤0.5. Inversions that were >1 Mb were extracted from the results generated from SyRI, followed by a manual check. The identified SVs from each sample were then merged using SURVIVOR (v.1.0.6)88 with the following parameters: ‘50 1 0 0 0 0’.
Analyses of presence frequency of SVs in tomato populations
Details of integrating SVs reported in the previous study are provided in the Supplementary Note. The 12 tomato genomes used in this study and the 100 previously reported tomato accessions12 were divided into four groups: wild (19 accessions from S. galapagense, S. cheesmaniae, S. chmielewskii, S. neorickii, S. corneliomulleri, S. peruvianum, S. chilense, S. habrochaites and S. lycopersicoides), SP (22 S. pimpinellifolium accessions), SLC (24 S. lycopersicum var. cerasiforme accessions) and SLL (47 big-fruited S. lycopersicum var. lycopersicum accessions). We computed presence frequencies of each SV in the four groups and compared those between the wild and cultivated (SLC and SLL) groups using Fisher’s exact test. The resultant P-values were next adjusted using the p.adjust function in R (v.4.03), with the ‘method = ‘fdr’’ parameter. SVs with FDR < 0.01 were regarded as highly divergent between wild and cultivated tomatoes, showing significantly altered presence frequencies between the two groups.
Functional characterization of the candidate gene
To generate an overexpression construct, the full-length ORF sequence of the candidate gene Sgal12g015720 was amplified from S. galapagense using specific primers (Supplementary Table 26) and cloned into the plant expression vector pCAMBIA1300 by seamless cloning. Micro-Tom (S. lycopersicum var. lycopersicum) was transformed with the overexpressing transgene using Agrobacterium tumefaciens (strain GV3101)-mediated cotyledon transformation.
Quantitative real-time PCR
Total RNA was isolated from young fresh materials (roots, stems and leaves) of WT and transgenic tomato lines using the a Plant RNA Kit (catalog number DP432, Tiangen), and cDNA sequences were synthesized using 5X All-In-One Master Mix (with AccuRT Genomic DNA Removal Kit; catalog number G492, Applied Biological Materials Quidel) according to the manufacturer’s instructions. Real-time quantitative PCR (rt-qPCR) was carried out using a LightCycler96 real-time PCR system. Detection of rt-qPCR product was performed by staining with ChamQ SYBR qPCR Master Mix (catalog number Q311-02/03, Vazyme Biotech Co.). Specific primers are listed in Supplementary Table 26. The relative amplification of the tomato actin gene was used for normalization. The amplification was performed using the following conditions: 95 °C for 2 min followed by 40 cycles of 95 °C for 5 s and 60 °C for 30 s. Three samples (biological replicates) of each treatment were duplicated (technical replicates) in the rt-qPCR experiment. The relative expression level of genes was quantified according to the R = 2−ΔΔCt mathematical model. The final value of relative quantification was described as the fold change of gene expression in the test sample compared with the internal control (actin).
Graph-based tomato genome construction and SV genotyping
To integrate the linear reference genome and large-scale genomic variant information, we constructed a graph-based genome of tomato using vg (v.1.38.0)38. Reference sequences of S. galapagense and SVs in terms of insertions and deletions from the 12 tomato genomes (this study) and the 100 tomato genomes reported from a previous study12 were built into a variation graph by the ‘construct’ subcommand in vg without removing any alternate alleles. The preliminary graph was indexed in XG and GBWT by using ‘vg index’ with the ‘-L’ option to retain alternative allele paths. A GBWT index was then built using ‘vg gbwt’ with the parameter ‘-P’. Previously reported Illumina paired-end reads of 321 tomato accessions were subsequently mapped against the indexed graph, and alignments in GAM format were generated by vg giraffe89. We then excluded low-quality alignments with mapping quality less than 5 and base quality less than 5. Finally, a compressed coverage index was calculated using ‘vg pack’, and snarls were generated using ‘vg snarls’, both with default parameters. SV genotyping in the 321 tomato accessions were performed using ‘vg call’ (default parameters) by examining the state (including read pair and split read information) and coverage of mapped reads around the SV breakpoints. Genotyped SVs with fewer than two supporting reads were marked as ‘missing’.
Genome-wide association studies
We selected the 321 tomato accessions that have been resequenced2,10 for GWAS. A total of 43,901,591 SNPs were identified using the GATK (v.4.1.4.1) pipeline90 with the S. galapagense genome as the reference. Population structure was calculated by principal component analysis in PLINK (v.1.9.0b4.6)91 using 437,028 SNPs showing less linkage disequilibrium, which was extracted using PLINK with parameters ‘–indep-pairwise 50 5 0.1 (windows, step, r2)’. The first five principal components were used as cofactors for population structure correction.
A total of 32 tomato flavor-related metabolite traits reported previously2 and contents of 362 annotated metabolites from tomato fruits reported previously40 were analyzed using EMMAX (v.20120210)92 with the default KN kinship, in which the selected principal components were used as cofactors. SNP-based and SV-based GWAS were performed using SNPs or SVs with minor allele frequency >0.01 and missing call rate <0.1. The genome-wide significance thresholds (7.58 × 10−7) were determined using a uniform threshold of 1/n, where n is the effective number of independent SNPs and SVs calculated using the Genetic type 1 Error Calculator (v.0.2)93. Phenotypic variation explained (PVE) was calculated by the formula PVE = [2 × (beta2) × MAF × (1 − MAF)]/[2 × (beta2) × MAF(1-MAF) + ((s.e. × (beta))2) × 2 × N × MAF × (1 − MAF)], where N represents sample size, s.e. is the standard error of the effect number of genetic variants, beta is the effect number of genetic variants and MAF is the minor allele frequency of the target marker.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at 10.1038/s41588-023-01340-y.
Supplementary information
Acknowledgements
We thank J. Zhang (Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences) for critical discussion of the project and revision of the manuscript. We thank J. Liu’s group (Institute of Crop Sciences, Chinese Academy of Agricultural Sciences), E. Xia and W. Tong (Anhui Agricultural University), Y. Fang (Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences) and H. Du (Hebei University) for their helpful technical support with genome assembly and project discussion. We thank the Tomato Genetics Resource Center of UC Davis for providing seeds for wild tomato accessions. We thank M. Liu (Biomarker Technologies) and C. Ji (Chinese Academy of Tropical Agricultural Sciences) for providing sequencing platforms and support with data analysis. This research was supported by grants from the National Natural Science Foundation of China (grant nos. 31860555, 32260763 and 31991180), Key Projects for crop traits formation and cutting-edge technologies in biological breeding (xjnkywdzc-2022001), Key Research and Development Task special project of Xinjiang (2022B02002), Special Incubation Project of Science & Technology Renovation of Xinjiang Academy of Agricultural Sciences (xjkcpy-2021001), China Agriculture Research System of MOF and MARA (CARS-23-G24), Guangdong Major Project of Basic and Applied Basic Research (2021B0301030004), the National Key Research and Development Program of China (2019YFA0906200 and 2021YFF1000100), Shenzhen Science and Technology Program (grant no. KQTD2016113010482651), Special Funds for Science Technology Innovation and Industrial Development of Shenzhen Dapeng New District (grant no. RC201901-05), Shenzhen Outstanding Talents Training Fund and the US National Science Foundation (IOS-1855585).
Source data
Author contributions
Q.Y., H. Li and H.W. conceived and designed the research. Q.Y. and S.H. managed the project. N.L., Q.H. and H. Li performed data analysis. J.Z. J.W., B.W., S.Y.H., Y.T., R.X., C.J., J.H., Z.L., F.J., J.G. and H. Lan collected the samples and performed experiments. Q.H. and N.L. wrote the draft manuscript. H. Li wrote the final manuscript. Q.Y., H. Li, Z.F., S.H., X.Z. and Y.Z. revised the manuscript. T.Y., S.Y., P.A. and K.M. supervised the field experiment.
Peer review
Peer review information
Nature Genetics thanks Chuanyou Li and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Data availability
All assembled genome sequences and annotations are accessible through our database (http://caastomato.biocloud.net). Assemblies for the tomato genomes have also been deposited in the National Center for Biotechnology Information (NCBI) under BioProject accession number PRJNA809001. Raw PacBio, transcriptome and Hi-C sequencing reads have been deposited in the NCBI sequence read archive (https://www.ncbi.nlm.nih.gov/sra/) under BioProject accession number PRJNA756391. Tomato whole-genome sequencing data were downloaded from NCBI (BioProjects: PRJNA259308, PRJNA353161, PRJNA454805 and PRJEB5235). The RepBase database was downloaded from https://www.girinst.org/server/RepBase/index.php. Source data are provided with this paper.
Code availability
Custom scripts and codes used in this study are available at GitHub (https://github.com/HongboDoll/TomatoSuperPanGenome) and Zenodo (10.5281/zenodo.7396707)94.
Competing interests
N.L., Q.H., B.W., J.W., Q.Y., T.Y. and P.A. have filed patent applications on technology related to the processes described in this article (Chinese patent application number CN202111231381.0; ‘Application of the BFNE gene in improvement of plant architecture and biomass in tomato’). The other authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Ning Li and Qiang He.
Contributor Information
Huan Wang, Email: wanghuan@caas.cn.
Hongbo Li, Email: lihongbo_solab@163.com.
Qinghui Yu, Email: yuqinghui@xaas.ac.cn.
Supplementary information
The online version contains supplementary material available at 10.1038/s41588-023-01340-y.
References
- 1.Giovannoni JJ. Genetic regulation of fruit development and ripening. Plant Cell. 2004;16:S170–S180. doi: 10.1105/tpc.019158. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Tieman D, et al. A chemical genetic roadmap to improved tomato flavor. Science. 2017;355:391–394. doi: 10.1126/science.aal1556. [DOI] [PubMed] [Google Scholar]
- 3.Peralta IE, Spooner DM, Knapp S. Taxonomy of wild tomatoes and their relatives (Solanum sect. Lycopersicoides, sect. Juglandifolia, sect. Lycopersicon; Solanaceae) Syst. Bot. Monogr. 2008;84:1–186. [Google Scholar]
- 4.Rick, C. M. Perspectives from plant genetics: the Tomato Genetics Stock Center. In Genetic resources at risk: scientific issues, technologies, and funding policies. Proceedings of a symposium, American Association for the Advancement of Science annual meeting, San Francisco, California, USA, 16 January 1989 (Eds McGuire, P. E. & Qualset, C. O.) 11–19 (Genetic Resources Conservation Program, University of California, 1990).
- 5.Mutschler MA, et al. QTL analysis of pest resistance in the wild tomato Lycopersicon pennellii: QTLs controlling acylsugar level and composition. Theor. Appl. Genet. 1996;92:709–718. doi: 10.1007/BF00226093. [DOI] [PubMed] [Google Scholar]
- 6.Spooner DM, Peralta IE, Knapp S. Comparison of AFLPs with other markers for phylogenetic inference in wild tomatoes [Solanum L. section Lycopersicon (Mill.) Wettst.] TAXON. 2005;54:43–61. [Google Scholar]
- 7.Beckles DM, Hong N, Stamova L, Luengwilai K. Biochemical factors contributing to tomato fruit sugar content: a review. Fruits. 2012;67:49–64. [Google Scholar]
- 8.The Tomato Genome Consortium. The tomato genome sequence provides insights into fleshy fruit evolution. Nature. 2012;485:635–641. doi: 10.1038/nature11119. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Hosmani, P. S. et al. An improved de novo assembly and annotation of the tomato reference genome using single-molecule sequencing, Hi-C proximity ligation and optical maps. Preprint at bioRxiv10.1101/767764 (2019).
- 10.Lin T, et al. Genomic analyses provide insights into the history of tomato breeding. Nat. Genet. 2014;46:1220–1226. doi: 10.1038/ng.3117. [DOI] [PubMed] [Google Scholar]
- 11.Aflitos S, et al. Exploring genetic variation in the tomato (Solanum section Lycopersicon) clade by whole-genome sequencing. Plant J. 2014;80:136–148. doi: 10.1111/tpj.12616. [DOI] [PubMed] [Google Scholar]
- 12.Alonge M, et al. Major impacts of widespread structural variation on gene expression and crop improvement in tomato. Cell. 2020;182:145–161.e23. doi: 10.1016/j.cell.2020.05.021. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Gao L, et al. The tomato pan-genome uncovers new genes and a rare allele regulating fruit flavor. Nat. Genet. 2019;51:1044–1051. doi: 10.1038/s41588-019-0410-2. [DOI] [PubMed] [Google Scholar]
- 14.Sherman RM, Salzberg SL. Pan-genomics in the human genome era. Nat. Rev. Genet. 2020;21:243–254. doi: 10.1038/s41576-020-0210-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Della Coletta R, Qiu Y, Ou S, Hufford MB, Hirsch CN. How the pan-genome is changing crop genomics and improvement. Genome Biol. 2021;22:3. doi: 10.1186/s13059-020-02224-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Zhou Y, et al. Graph pangenome captures missing heritability and empowers tomato breeding. Nature. 2022;606:527–534. doi: 10.1038/s41586-022-04808-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Yu X, et al. Chromosome-scale genome assemblies of wild tomato relatives Solanum habrochaites and Solanum galapagense reveal structural variants associated with stress tolerance and terpene biosynthesis. Hortic. Res. 2022;9:uhac139. doi: 10.1093/hr/uhac139. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Bolger A, et al. The genome of the stress-tolerant wild tomato species Solanum pennellii. Nat. Genet. 2014;46:1034–1038. doi: 10.1038/ng.3046. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Schmidt MH-W, et al. De novo assembly of a new Solanum pennellii accession using nanopore sequencing. Plant Cell. 2017;29:2336–2348. doi: 10.1105/tpc.17.00521. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Wang X, et al. Genome of Solanum pimpinellifolium provides insights into structural variants during tomato breeding. Nat. Commun. 2020;11:1–11. doi: 10.1038/s41467-020-19682-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Takei H, et al. De novo genome assembly of two tomato ancestors, Solanum pimpinellifolium and Solanum lycopersicum var. cerasiforme, by long-read sequencing. DNA Res. 2021;28:dsaa029. doi: 10.1093/dnares/dsaa029. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Powell AF, et al. A Solanum lycopersicoides reference genome facilitates insights into tomato specialized metabolism and immunity. Plant J. 2022;110:1791–1810. doi: 10.1111/tpj.15770. [DOI] [PubMed] [Google Scholar]
- 23.Khan AW, et al. Super-pangenome by integrating the wild side of a species for accelerated crop improvement. Trends Plant Sci. 2020;25:148–158. doi: 10.1016/j.tplants.2019.10.012. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31:3210–3212. doi: 10.1093/bioinformatics/btv351. [DOI] [PubMed] [Google Scholar]
- 25.Chen J, et al. Tracking the origin of two genetic components associated with transposable element bursts in domesticated rice. Nat. Commun. 2019;10:1–10. doi: 10.1038/s41467-019-08451-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Stein JC, et al. Genomes of 13 domesticated and wild rice relatives highlight genetic conservation, turnover and innovation across the genus Oryza. Nat. Genet. 2018;50:285–296. doi: 10.1038/s41588-018-0040-0. [DOI] [PubMed] [Google Scholar]
- 27.Liu Y, et al. Pan-genome of wild and cultivated soybeans. Cell. 2020;182:162–176.e13. doi: 10.1016/j.cell.2020.05.023. [DOI] [PubMed] [Google Scholar]
- 28.Mu QI, et al. Fruit weight is controlled by Cell Size Regulator encoding a novel protein that is expressed in maturing tomato fruits. PLoS Genet. 2017;13:e1006930. doi: 10.1371/journal.pgen.1006930. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Mora-García S, Yanovsky MJ. A large deletion within the clock gene LNK2 contributed to the spread of tomato cultivation from Central America to Europe. Proc. Natl Acad. Sci. USA. 2018;115:6888–6890. doi: 10.1073/pnas.1808194115. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Yuste-Lisbona FJ, et al. ENO regulates tomato fruit size through the floral meristem development network. Proc. Natl Acad. Sci. USA. 2020;117:8187–8195. doi: 10.1073/pnas.1913688117. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Wellenreuther M, Bernatchez L. Eco-evolutionary genomics of chromosomal inversions. Trends Ecol. Evol. 2018;33:427–440. doi: 10.1016/j.tree.2018.04.002. [DOI] [PubMed] [Google Scholar]
- 32.Huang K, Rieseberg LH. Frequency, origins, and evolutionary role of chromosomal inversions in plants. Front. Plant Sci. 2020;11:296. doi: 10.3389/fpls.2020.00296. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Xia X, et al. Brassinosteroid signaling integrates multiple pathways to release apical dominance in tomato. Proc. Natl Acad. Sci. USA. 2021;118:e2004384118. doi: 10.1073/pnas.2004384118. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Vasav AP, Barvkar VT. Phylogenomic analysis of cytochrome P450 multigene family and their differential expression analysis in Solanum lycopersicum L. suggested tissue specific promoters. BMC Genomics. 2019;20:1–13. doi: 10.1186/s12864-019-5483-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Eshed Y, Zamir D. An introgression line population of Lycopersicon pennellii in the cultivated tomato enables the identification and fine mapping of yield-associated QTL. Genetics. 1995;141:1147–1162. doi: 10.1093/genetics/141.3.1147. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Gamuyao R, et al. The protein kinase Pstol1 from traditional rice confers tolerance of phosphorus deficiency. Nature. 2012;488:535–539. doi: 10.1038/nature11346. [DOI] [PubMed] [Google Scholar]
- 37.Zhang Z, et al. Genome-wide mapping of structural variations reveals a copy number variant that determines reproductive morphology in cucumber. Plant Cell. 2015;27:1595–1604. doi: 10.1105/tpc.114.135848. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Garrison E, et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat. Biotechnol. 2018;36:875–879. doi: 10.1038/nbt.4227. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Ameur A. Goodbye reference, hello genome graphs. Nat. Biotechnol. 2019;37:866–868. doi: 10.1038/s41587-019-0199-7. [DOI] [PubMed] [Google Scholar]
- 40.Zhu G, et al. Rewiring of the fruit metabolome in tomato breeding. Cell. 2018;172:249–261.e12. doi: 10.1016/j.cell.2017.12.019. [DOI] [PubMed] [Google Scholar]
- 41.Darwin SC, Knapp S, Peralta IE. Taxonomy of tomatoes in the Galapagos islands: native and introduced species of Solanum section Lycopersicon (Solanaceae) Syst. Biodivers. 2003;1:29–53. [Google Scholar]
- 42.Peralta IE, Knapp S, Spooner DM. New species of wild tomatoes (Solanum section Lycopersicon: Solanaceae) from Northern Peru. Syst. Bot. 2005;30:424–434. [Google Scholar]
- 43.Bayer PE, Golicz AA, Scheben A, Batley J, Edwards D. Plant pan-genomes are the new reference. Nat. Plants. 2020;6:914–920. doi: 10.1038/s41477-020-0733-0. [DOI] [PubMed] [Google Scholar]
- 44.Koren S, et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 2017;27:722–736. doi: 10.1101/gr.215087.116. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Berlin K, et al. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat. Biotechnol. 2015;33:623–630. doi: 10.1038/nbt.3238. [DOI] [PubMed] [Google Scholar]
- 46.Chin C-S, et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods. 2016;13:1050–1054. doi: 10.1038/nmeth.4035. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Walker BJ, et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS ONE. 2014;9:e112963. doi: 10.1371/journal.pone.0112963. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.English AC, et al. Mind the gap: upgrading genomes with Pacific Biosciences RS long-read sequencing technology. PLoS ONE. 2012;7:e47768. doi: 10.1371/journal.pone.0047768. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25:1754–1760. doi: 10.1093/bioinformatics/btp324. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Servant N, et al. HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome Biol. 2015;16:1–11. doi: 10.1186/s13059-015-0831-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Burton JN, et al. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat. Biotechnol. 2013;31:1119–1125. doi: 10.1038/nbt.2727. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Grabherr MG, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol. 2011;29:644–652. doi: 10.1038/nbt.1883. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Camacho C, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10:421. doi: 10.1186/1471-2105-10-421. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Price AL, Jones NC, Pevzner PA. De novo identification of repeat families in large genomes. Bioinformatics. 2005;21:i351–i358. doi: 10.1093/bioinformatics/bti1018. [DOI] [PubMed] [Google Scholar]
- 55.Xu Z, Wang H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 2007;35:W265–W268. doi: 10.1093/nar/gkm286. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Han Y, Wessler SR. MITE-Hunter: a program for discovering miniature inverted-repeat transposable elements from genomic sequences. Nucleic Acids Res. 2010;38:e199–e199. doi: 10.1093/nar/gkq862. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Edgar RC, Myers EW. PILER: identification and classification of genomic repeats. Bioinformatics. 2005;21:i152–i158. doi: 10.1093/bioinformatics/bti1003. [DOI] [PubMed] [Google Scholar]
- 58.Bao W, Kojima KK, Kohany O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob. DNA. 2015;6:1–6. doi: 10.1186/s13100-015-0041-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59.Hoede C, et al. PASTEC: an automatic transposable element classification tool. PLoS ONE. 2014;9:e91929. doi: 10.1371/journal.pone.0091929. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60.Chen N. Using Repeat Masker to identify repetitive elements in genomic sequences. Curr. Protoc. Bioinformatics. 2004;5:4.10.1–4.10.14. doi: 10.1002/0471250953.bi0410s05. [DOI] [PubMed] [Google Scholar]
- 61.Rice P, Longden I, Bleasby A. EMBOSS: The European Molecular Biology Open Software Suite. Trends Genet. 2000;16:276–277. doi: 10.1016/s0168-9525(00)02024-2. [DOI] [PubMed] [Google Scholar]
- 62.Ma J, Bennetzen JL. Rapid recent growth and divergence of rice nuclear genomes. Proc. Natl Acad. Sci. USA. 2004;101:12404–12410. doi: 10.1073/pnas.0403715101. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63.Keilwagen J, Hartung F, Grau J. GeMoMa: homology-based gene prediction utilizing intron position conservation and RNA-seq data. Methods Mol. Biol. 2019;1962:161–177. doi: 10.1007/978-1-4939-9173-0_9. [DOI] [PubMed] [Google Scholar]
- 64.Majoros WH, Pertea M, Salzberg SL. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics. 2004;20:2878–2879. doi: 10.1093/bioinformatics/bth315. [DOI] [PubMed] [Google Scholar]
- 65.Kim D, Paggi JM, Park C, Bennett C, Salzberg SL. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 2019;37:907–915. doi: 10.1038/s41587-019-0201-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 66.Pertea M, et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 2015;33:290–295. doi: 10.1038/nbt.3122. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 67.Kent WJ. BLAT—the BLAST-like alignment tool. Genome Res. 2002;12:656–664. doi: 10.1101/gr.229202. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 68.Jia H, et al. PASA: identifying more credible structural variants of Hedou12. IEEE/ACM Trans. Comput. Biol. Bioinformatics. 2019;17:1493–1503. doi: 10.1109/TCBB.2019.2934463. [DOI] [PubMed] [Google Scholar]
- 69.Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009;25:1105–1111. doi: 10.1093/bioinformatics/btp120. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 70.Trapnell C, et al. Transcript assembly and quantification by RNA-seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 2010;28:511–515. doi: 10.1038/nbt.1621. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 71.Haas BJ, et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat. Protoc. 2013;8:1494–1512. doi: 10.1038/nprot.2013.084. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 72.Haas BJ, et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 2008;9:1–22. doi: 10.1186/gb-2008-9-1-r7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 73.Kanehisa M, Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000;28:27–30. doi: 10.1093/nar/28.1.27. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 74.Bairoch A, Apweiler R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 2000;28:45–48. doi: 10.1093/nar/28.1.45. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 75.Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007;35:D61–D65. doi: 10.1093/nar/gkl842. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 76.Conesa A, et al. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics. 2005;21:3674–3676. doi: 10.1093/bioinformatics/bti610. [DOI] [PubMed] [Google Scholar]
- 77.Tang H, et al. Screening synteny blocks in pairwise genome comparisons through integer programming. BMC Bioinformatics. 2011;12:102. doi: 10.1186/1471-2105-12-102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 78.Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32:1792–1797. doi: 10.1093/nar/gkh340. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 79.Guindon S, et al. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 2010;59:307–321. doi: 10.1093/sysbio/syq010. [DOI] [PubMed] [Google Scholar]
- 80.Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 2007;24:1586–1591. doi: 10.1093/molbev/msm088. [DOI] [PubMed] [Google Scholar]
- 81.Särkinen T, Bohs L, Olmstead RG, Knapp S. A phylogenetic framework for evolutionary study of the nightshades (Solanaceae): a dated 1000-tip tree. BMC Evol. Biol. 2013;13:214. doi: 10.1186/1471-2148-13-214. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 82.Wu TD, Watanabe CK. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics. 2005;21:1859–1875. doi: 10.1093/bioinformatics/bti310. [DOI] [PubMed] [Google Scholar]
- 83.Emms DM, Kelly S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 2019;20:238. doi: 10.1186/s13059-019-1832-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 84.Marçais G, et al. MUMmer4: a fast and versatile genome alignment system. PLoS Comput. Biol. 2018;14:e1005944. doi: 10.1371/journal.pcbi.1005944. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 85.Chakraborty M, Emerson JJ, Macdonald SJ, Long AD. Structural variants exhibit widespread allelic heterogeneity and shape variation in complex traits. Nat. Commun. 2019;10:4872. doi: 10.1038/s41467-019-12884-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 86.Goel M, Sun H, Jiao W-B, Schneeberger K. SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies. Genome Biol. 2019;20:277. doi: 10.1186/s13059-019-1911-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 87.Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34:3094–3100. doi: 10.1093/bioinformatics/bty191. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 88.Jeffares DC, et al. Transient structural variations have strong effects on quantitative traits and reproductive isolation in fission yeast. Nat. Commun. 2017;8:14061. doi: 10.1038/ncomms14061. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 89.Sirén J, et al. Pangenomics enables genotyping of known structural variants in 5202 diverse genomes. Science. 2022;374:abg8871. doi: 10.1126/science.abg8871. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 90.McKenna A, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–1303. doi: 10.1101/gr.107524.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 91.Purcell S, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 2007;81:559–575. doi: 10.1086/519795. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 92.Kang HM, et al. Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 2010;42:348–354. doi: 10.1038/ng.548. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 93.Li M-X, Yeung JMY, Cherny SS, Sham PC. Evaluating the effective numbers of independent tests and significant p-value thresholds in commercial genotyping arrays and public imputation reference datasets. Hum. Genet. 2012;131:747–756. doi: 10.1007/s00439-011-1118-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 94.Li, H. Scripts and codes used in the tomato super-pangenome paper (1.0). Zenodo 10.5281/zenodo.7396707 (2022).
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
All assembled genome sequences and annotations are accessible through our database (http://caastomato.biocloud.net). Assemblies for the tomato genomes have also been deposited in the National Center for Biotechnology Information (NCBI) under BioProject accession number PRJNA809001. Raw PacBio, transcriptome and Hi-C sequencing reads have been deposited in the NCBI sequence read archive (https://www.ncbi.nlm.nih.gov/sra/) under BioProject accession number PRJNA756391. Tomato whole-genome sequencing data were downloaded from NCBI (BioProjects: PRJNA259308, PRJNA353161, PRJNA454805 and PRJEB5235). The RepBase database was downloaded from https://www.girinst.org/server/RepBase/index.php. Source data are provided with this paper.
Custom scripts and codes used in this study are available at GitHub (https://github.com/HongboDoll/TomatoSuperPanGenome) and Zenodo (10.5281/zenodo.7396707)94.