Abstract
Background
Luo-han-guo (Siraitia grosvenorii), also called monk fruit, is a member of the Cucurbitaceae family. Monk fruit has become an important area for research because of the pharmacological and economic potential of its noncaloric, extremely sweet components (mogrosides). It is also commonly used in traditional Chinese medicine for the treatment of lung congestion, sore throat, and constipation. Recently, a single reference genome became available for monk fruit, assembled from 36.9x genome coverage reads via Illumina sequencing platforms. This genome assembly has a relatively short (34.2 kb) contig N50 length and lacks integrated annotations. These drawbacks make it difficult to use as a reference in assembling transcriptomes and discovering novel functional genes.
Findings
Here, we offer a new high-quality draft of the S. grosvenorii genome assembled using 31 Gb (∼73.8x) long single molecule real time sequencing reads and polished with ∼50 Gb Illumina paired-end reads. The final genome assembly is approximately 469.5 Mb, with a contig N50 length of 432,384 bp, representing a 12.6-fold improvement. We further annotated 237.3 Mb of repetitive sequence and 30,565 consensus protein coding genes with combined evidence. Phylogenetic analysis showed that S. grosvenorii diverged from members of the Cucurbitaceae family approximately 40.9 million years ago. With comprehensive transcriptomic analysis and differential expression testing, we identified 4,606 up-regulated genes in the early fruit compared to the leaf, a number of which were linked to metabolic pathways regulating fruit development and ripening.
Conclusions
The availability of this new monk fruit genome assembly, as well as the annotations, will facilitate the discovery of new functional genes and the genetic improvement of monk fruit.
Keywords: Siraitia grosvenorii, monk fruit, PacBio sequencing, ortholog analysis, RNA-Seq, mogrosides biosynthesis
Data Description
Introduction
Siraitia grosvenorii (luo-han-guo or monk fruit, NCBI Taxonomy ID: 190515) is an herbaceous perennial native to southern China and is a famous specialty of Guilin city, Guangxi Province, China (Fig. 1) [1]. In addition to being used as a natural sweetener, S. grosvenorii has been used in China as a folk remedy for the treatment of lung congestion, sore throat, and constipation for hundreds of years [2]. The ripe fruit of S. grosvenorii contains mogrosides, which have become a popular research topic due to their pharmacological characteristics, including putative anti-cancer properties [3]. Additionally, mogrosides are purified and used as a non-caloric, non-sugar sweetener in the United States and Japan, as they are estimated to be approximately 300 times as sweet as sucrose [1, 4]. S. grosvenorii fruit has also been shown to have additional pharmacological effects and to contain different types of secondary metabolites [5, 6]. Monk fruit products have been approved as dietary supplements in Japan, the US, New Zealand, and Australia [2, 7].
The biosynthesis pathway of mogrosides has been extensively studied, and several genes have been identified [8-11]. Squalene is thought to be the initial substrate and precursor for triterpenoid and sterol biosynthesis. Squalene epoxidases (SQE) perform epoxidation, converting squalene to oxidosqualene, and cucurbitadienol synthase (CDS) cyclizes oxidosqualene to form the cucurbitadienol triterpenoid skeleton, a step that is distinct from phytosterol biosynthesis [12]. Epoxide hydrolases (EPH) and cytochrome P450s (CYP450) further oxidize cucurbitadienols to produce mogrol, which is glycosylated by UDP-glycosyl-transferases (UGT) to form mogroside V (Fig. 2).
The genome of S. grosvenorii was first published in 2016. It served to identify the genomic organization of the gene families of interest but was not used as the reference for transcriptome assembly or gene family identification [8]. Although the first draft genome assembly was a useful resource, several improvements remain necessary, including better continuity and completeness, an assessment of assembly quality, annotation of genes and repetitive regions, and analysis of other genomic features. With an average read length now exceeding 10 kb, SMRT sequencing technology from Pacific Biosciences (PacBio) has the potential to significantly improve genome assembly quality [13]. Therefore, we de novo assembled a high-quality genome draft of S. grosvenorii using high-coverage PacBio long reads and applied extensive genomic and transcriptomic analyses. This new assembly, its annotations, and the other genomic features discussed below will serve as valuable resources for investigating the economic and pharmacological characteristics of monk fruit and will also assist in the molecular breeding of monk fruit.
DNA libraries construction and sequencing
A total of 20 μg of genomic DNA was extracted from seedlings of S. grosvenorii (variety Qingpiguo) using a modified CTAB method [14] to construct two libraries with an insert size of 20 kb. The plants were introduced from the Yongfu District (Guangxi Province, China) and planted in Cangxi County (Sichuan Province, China). Sequencing of S. grosvenorii was performed using the PacBio RSII platform (Pacific Biosciences, USA) and generated 31 Gb (∼73.8x) of data from 44 SMRT cells, with an average subread length of 7.7 kb and a read quality of 82% after filtering out low-quality bases and adapters (Table 1).
Table 1:
Statistics | Length (bp) |
---|---|
Total raw data | 31 G |
Mean length of raw reads | 11 k |
N50 of raw reads | 15,754 |
Mean length of subreads | 7.7 k |
N50 of subreads | 11,898 |
Subreads: reads without adapters and low-quality bases.
A total of 300 ng of genomic DNA was extracted as described above, and the library was constructed using DNA sequence fragments of ∼470 bp, with an approximate insert size of 350 bp. Sequencing was performed using a 2 × 150 paired-end (PE) configuration, and base calling was conducted using the HiSeq Control Software + Off-Line Base Caller (OLB) + GAPipeline-1.6 (Illumina; CA, USA) on the HiSeq instrument, which generated a total of 169 M (over 100 x) short reads.
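The depth figures quoted for the two libraries follow from dividing the sequenced bases by the genome size. The sketch below is only an illustration under two stated assumptions: the denominator is the 420 Mb genome size estimated in the previous study [8], and the 169 M Illumina figure counts read pairs (which is consistent with the ∼50 Gb quoted in the abstract). The function and variable names are illustrative.

```python
def depth(total_bases: float, genome_size: float) -> float:
    """Return sequencing depth (fold coverage) as bases / genome size."""
    return total_bases / genome_size

GENOME_SIZE = 420e6                 # assumption: previous estimate [8]
pacbio_bases = 31e9                 # 31 Gb of PacBio data
illumina_bases = 169e6 * 2 * 150    # assumption: 169 M pairs, 2 x 150 bp

pacbio_depth = depth(pacbio_bases, GENOME_SIZE)
illumina_depth = depth(illumina_bases, GENOME_SIZE)
print(f"PacBio: {pacbio_depth:.1f}x, Illumina: {illumina_depth:.1f}x")
```

Under these assumptions the PacBio depth reproduces the ∼73.8x quoted above, and the Illumina depth exceeds 100x.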
RNA isolation and sequencing
Fresh roots, leaves, and early fruit of S. grosvenorii were sampled in our garden in Cangxi County. All samples were frozen immediately in liquid nitrogen and stored at -80°C. Total RNA was isolated from (1) leaves of female plants (FL), (2) leaves of male plants (ML), (3) leaves beside fruits (L), (4) roots (R), (5) fruit at 3 days after anthesis (DAA) (F1), and (6) fruit at 20 DAA (F2) using the Qiagen RNeasy Plant Mini Kit (Qiagen, CA, USA). PE libraries (PE150 with an insert size of 350 bp) were constructed and subsequently sequenced on the Illumina HiSeq X-Ten platform (Illumina, CA, USA).
Genome assembly
Initial correction of long reads was performed using FALCON (Falcon, RRID:SCR_016089) [15] with length_cutoff = 5,000, chosen according to the distribution of read lengths, and with -B15 and -s400 to cut reads into blocks of 400 Mb and align 15 blocks to another block at the same time. The longest corrected reads, amounting to 25x coverage, were extracted with Perl scripts and assembled using the mecat2canu command of MECAT [15] with GenomeSize = 420,000,000, as estimated in the previous study [8]. This led to a new genome assembly of 467 Mb with a contig N50 size of 433,684 bp (Table 2). This genome size was slightly larger than the estimated 420 Mb [8], which was likely due to the high genome heterozygosity. We generated a consensus with Quiver [16] and further polished the assembly with PE reads using Pilon (Pilon, RRID:SCR_014731) [17]. The final assembly comprised 4,128 contigs, 614 of which were >100 kb long, with a contig N50 length of 432,384 bp (Table 2). Compared to the previously published draft of the Siraitia genome, the contiguity was improved approximately 12.6-fold.
Table 2:
Statistics | Contig | Contig (polished) |
---|---|---|
Total number | 4,128 | 4,128 |
Total length (bp) | 467,072,951 | 469,518,713 |
N50 length (bp) | 433,684 | 432,384 |
N90 length (bp) | 36,820 | 36,953 |
Max length (bp) | 7,657,852 | 7,683,850 |
GC content (%) | 33.57 | 33.49 |
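For readers unfamiliar with the N50 statistic reported in Table 2, the following sketch shows how it is conventionally computed from a list of contig lengths. The contig lengths in `toy` are hypothetical, and the function name is illustrative; the fold improvement uses the polished contig N50 and the 34.2 kb contig N50 of the earlier assembly quoted in the abstract.

```python
def n50(lengths):
    """Return the N50: the contig length L such that contigs of length >= L
    together contain at least half of the total assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contig lengths given")

toy = [2, 3, 4, 5, 6]            # hypothetical contig lengths
toy_n50 = n50(toy)               # 6 + 5 = 11 >= 10, so the N50 is 5

# Fold improvement in contig N50 over the earlier assembly (34.2 kb)
fold = 432_384 / 34_200
print(toy_n50, round(fold, 1))
```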
Genome assessment
We estimated the completeness of the assembly using Benchmarking Universal Single-Copy Orthologues (BUSCO v2, RRID:SCR_015008) [18] analysis. Of the 1,440 orthologues identified in plants, 1,284 were found in the genome assembly, including 849 in single copy and 435 in multiple copies (Table 3). In addition, we used RNA-Seq data from different organs to assess the sequence quality. All 15 RNA-Seq libraries were mapped to the assembly using HISAT2 (HISAT2, RRID:SCR_015530) [19], and the overall alignment rate of each library was used as a rough estimate of sequence quality. We also estimated the base error rate of the assembly with both our DNA paired-end reads and published DNA short reads [8]. We used BWA-mem [20] to align both sets of short reads to the genome assembly and filtered out low-quality alignments (mapping quality <30) with SAMtools (SAMtools, RRID:SCR_002105) [21]. Then, we used the Genome Analysis Toolkit (RRID:SCR_001876) HaplotypeCaller [22] to call short variants. The Genome Analysis Toolkit VariantFiltration program was used to filter out low-quality variants with the following expression: QD < 2.0 || ReadPosRankSum < -8.0 || FS > 60.0 || QUAL < 50 || DP < 10. Coverage of each alignment file was scanned using Qualimap 2 [23], and the error rate was calculated as the average number of short variants that appear at both alleles (labeled as 1/1 and 1/2 in Table 5) per base. The overall alignment rates of reads in all samples were over 80% (Table 4), and the average base error rate was estimated as less than 1E-3, which suggests a high-quality assembly (Table 5).
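The variant filter expression quoted above is a logical OR of five threshold tests. The sketch below restates that logic as a plain Python predicate so the thresholds are easy to read; it is not the GATK implementation. The variant records are hypothetical dictionaries, and the choice to let a missing annotation leave its clause unmatched is an assumption made for this illustration.

```python
def fails_filter(variant: dict) -> bool:
    """Return True if any clause of the quoted filter expression matches.

    Assumption: a clause whose annotation is absent does not match.
    """
    def below(key, cutoff):
        return key in variant and variant[key] < cutoff

    def above(key, cutoff):
        return key in variant and variant[key] > cutoff

    return (below("QD", 2.0)
            or below("ReadPosRankSum", -8.0)
            or above("FS", 60.0)
            or below("QUAL", 50)
            or below("DP", 10))

# Hypothetical variant records
good = {"QD": 25.1, "ReadPosRankSum": 0.3, "FS": 1.2, "QUAL": 812.0, "DP": 64}
shallow = dict(good, DP=6)      # fails only the depth clause
biased = dict(good, FS=75.0)    # fails only the strand-bias clause

print([fails_filter(v) for v in (good, shallow, biased)])
```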
Table 3:
BUSCO category | Monk fruit (%) |
---|---|
Complete BUSCOs | 89.2 |
Complete and single-copy | 59.0 |
Complete and duplicated | 30.2 |
Partial | 2.7 |
Missing | 8.1 |
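The complete, single-copy, and duplicated percentages in Table 3 can be recovered from the orthologue counts given in the text (1,284, 849, and 435 out of 1,440). A minimal check, with illustrative variable names:

```python
TOTAL_BUSCOS = 1440  # plant orthologue set size stated in the text

counts = {"complete": 1284, "single_copy": 849, "duplicated": 435}

# Percentage of the orthologue set in each category, to one decimal place
percent = {name: round(100 * n / TOTAL_BUSCOS, 1) for name, n in counts.items()}
print(percent)
```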
Table 5:
Sample | Mean depth | Coverage | Variants (0/1) | Variants (1/1) | Variants (1/2) | Variants (total) | Error rate |
---|---|---|---|---|---|---|---|
Paired-end | 65.3x | 92.99% | 1,342,849 | 37,987 | 14,704 | 1,395,540 | 1.21E-4 |
Published | 80.0x | 90.79% | 2,569,592 | 172,906 | 16,777 | 2,759,276 | 4.45E-4 |
High-quality genome criteria: 1E-4.
0: genotype that is identical to the reference, 1,2: genotype that is different from the reference.
Error rate = (Number of 1/1 + Number of 1/2)/(Genome size * Coverage).
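As a worked example of the error rate formula in the footnote, the sketch below recomputes the two error rates from the variant counts and coverage values in Table 5. It assumes that "Genome size" refers to the polished assembly length from Table 2 (469,518,713 bp); the function name is illustrative.

```python
ASSEMBLY_SIZE = 469_518_713  # assumption: polished assembly length (Table 2)

def error_rate(hom_alt, het_alt, coverage_fraction, genome_size=ASSEMBLY_SIZE):
    """Error rate = (number of 1/1 + number of 1/2) / (genome size * coverage)."""
    return (hom_alt + het_alt) / (genome_size * coverage_fraction)

paired_end = error_rate(37_987, 14_704, 0.9299)
published = error_rate(172_906, 16_777, 0.9079)
print(f"{paired_end:.2e} {published:.2e}")
```

With that assumption, both values agree with the error rates listed in Table 5 to the precision shown there.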
Table 4:
Sample | Overall alignment rate |
---|---|
FL-1 | 89.93% |
FL-2 | 87.75% |
FL-3 | 85.83% |
ML-1 | 89.70% |
ML-2 | 89.73% |
ML-3 | 85.07% |
L-1 | 85.95% |
L-2 | 87.39% |
R-1 | 81.50% |
R-2 | 84.36% |
R-3 | 84.57% |
F1-1 | 84.35% |
F1-2 | 91.58% |
F2-1 | 86.83% |
F2-2 | 87.37% |
FL: female leaf, ML: male leaf, L: leaf, R: root, F1: fruit stage 1, F2: fruit stage 2.
Repeat annotation
We scanned the genome using RepeatMasker (RRID:SCR_012954) [24] with Repbase [25] and a de novo repeat database constructed with RepeatModeler (RRID:SCR_015027) [26]. A total of 240 Mb of sequence (51.14% of the assembled genome) was identified as repetitive elements, a higher proportion than the 42.8% of Momordica charantia [27] and a much higher proportion than the 28.2% of Cucumis sativus [28]. We further classified the repetitive regions and found that the vast majority were interspersed repeats. Among them, the main subtypes were unclassified repeats and long terminal repeats (LTRs), with Copia (27.1 Mb, 5.8% of the genome) and Gypsy (38.6 Mb, 8.2% of the genome) being the most abundant LTRs. Compared to cucumber, the genome enlargement in monk fruit and bitter gourd was likely driven by the expansion of interspersed repeats (Table 6).
Table 6:
Repeat class | Subtype | S. grosvenorii length (bp) | S. grosvenorii content | M. charantia length (bp) | M. charantia content | C. sativus length (bp) | C. sativus content |
---|---|---|---|---|---|---|---|
Interspersed repeats | SINEs | 0 | 0.00% | 0 | 0.00% | 0 | 0.00% |
Interspersed repeats | LINEs | 9,629,949 | 2.05% | 5,183,926 | 1.82% | 2,397,830 | 1.22% |
Interspersed repeats | LTR | 67,499,840 | 14.38% | 34,217,647 | 11.98% | 8,253,090 | 4.18% |
Interspersed repeats | DNA elements | 9,372,444 | 2.00% | 3,460,431 | 1.21% | 2,777,943 | 1.41% |
Interspersed repeats | Unclassified | 147,311,542 | 31.38% | 75,056,338 | 26.28% | 37,539,553 | 19.03% |
Interspersed repeats | Total | 233,813,775 | 49.80% | 117,918,342 | 41.29% | 50,967,966 | 25.84% |
Simple repeats | | 5,401,880 | 1.15% | 3,451,508 | 1.21% | 3,547,474 | 1.80% |
Low complexity | | 1,570,875 | 0.33% | 958,289 | 0.34% | 1,095,406 | 0.56% |
Total | | 240,122,745 | 51.14% | 122,111,538 | 42.75% | 55,540,243 | 28.15% |
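The S. grosvenorii content column in Table 6 is the repeat length divided by the assembly length. The following check assumes the denominator is the polished assembly length from Table 2; the dictionary keys are illustrative labels.

```python
ASSEMBLY_SIZE = 469_518_713  # assumption: polished assembly length (Table 2)

# S. grosvenorii repeat lengths (bp) taken from Table 6
repeat_bp = {
    "LTR": 67_499_840,
    "interspersed_total": 233_813_775,
    "all_repeats": 240_122_745,
}

# Share of the assembly covered by each class, in percent
content = {name: round(100 * bp / ASSEMBLY_SIZE, 2) for name, bp in repeat_bp.items()}
print(content)
```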
Gene annotation
To generate gene models, the S. grosvenorii genome was annotated using three gene prediction approaches: homology-based, de novo, and RNA-Seq data-based prediction. First, we aligned the three cucurbitaceous proteomes downloaded from the cucurbit database ([29] cucumber_ChineseLong_v2, melon_v3, and watermelon_97103_v1) to the genome assembly using TBLASTN with an E-value of 1e-5 and filtered out bad hits (identity <50% and length <50%). The best hit of each retained protein was extracted and further used to predict protein-coding gene structures with GeneWise (RRID:SCR_015054) [30, 31]. Second, we de novo predicted protein-coding genes using AUGUSTUS (RRID:SCR_008417) [32] with the repeat-masked genome. Third, we used StringTie [33] to assemble the 15 RNA-Seq alignment files (described above) generated by HISAT2 using the assembly as the reference, and TransDecoder [34] to generate an annotation file based on the transcripts. Finally, the three respective annotation files were combined using EVidenceModeler (RRID:SCR_014659) [35]. After combining these gene structure predictions, we obtained 30,565 consensus protein-coding genes (Table 7). We annotated the genes using BLASTp searches against the NCBI nonredundant protein database (nr) and found that 78.3% of the predicted genes had at least one significant homologue (E-value < 1E-3), indicating that the gene structures were credible. We found that the majority of homologous proteins belonged to cucurbitaceous plants, such as cucumber and muskmelon (Fig. 3). Protein domain and gene ontology term annotations were performed using InterProScan 5 (RRID:SCR_005829, Table 7) [36]. In addition, genes annotated as SQEs, CDSs, EPHs, CYP450s, and UGTs were compared with those in other Cucurbitaceae genomes, and we found that gene abundance in the five mogroside-related gene families was not significantly different among S. grosvenorii, Cucumis sativus, Cucurbita moschata, and Cucurbita maxima ([29], Table 8).
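The homology-based step filters TBLASTN hits and keeps one best hit per protein. The sketch below shows one possible reading of that step, not the authors' script. It assumes that a hit is discarded if either its identity or its aligned fraction of the query protein is below 50%, and that "best" means the highest bit score; the field names and the hit records are hypothetical.

```python
def best_hits(hits, min_identity=50.0, min_coverage=50.0):
    """Keep the highest-scoring hit per protein among hits passing both cutoffs.

    Assumptions: coverage is alignment length / query length, and the
    best hit is the one with the highest bit score.
    """
    best = {}
    for hit in hits:
        coverage = 100.0 * hit["aln_len"] / hit["query_len"]
        if hit["identity"] < min_identity or coverage < min_coverage:
            continue  # bad hit: too divergent or too short
        current = best.get(hit["protein"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["protein"]] = hit
    return best

# Hypothetical hit records
hits = [
    {"protein": "P1", "identity": 80.0, "aln_len": 90, "query_len": 100, "bitscore": 150.0},
    {"protein": "P1", "identity": 60.0, "aln_len": 95, "query_len": 100, "bitscore": 200.0},
    {"protein": "P2", "identity": 40.0, "aln_len": 99, "query_len": 100, "bitscore": 300.0},
    {"protein": "P3", "identity": 90.0, "aln_len": 30, "query_len": 100, "bitscore": 120.0},
]
kept = best_hits(hits)
print(sorted(kept))
```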
Table 7:
 | RNA-Seq data-based | Ab initio | Homology-based | Integration | Annotation (nr) | Annotation (IPR) | Annotation (GO) |
---|---|---|---|---|---|---|---|
Weight | 10 | 0.1 | 5 | - | - | - | - |
Number of predicted genes | 27,304 | 60,818 | 130,686 | 30,565 | 23,936 | 19,684 | 14,966 |
Tools | HISAT2, StringTie, TransDecoder | RepeatMasker, AUGUSTUS | BLAST, GeneWise | EVM | BLAST | InterProScan | InterProScan |
Table 8:
Gene family | S. grosvenorii | C. sativus | C. moschata | C. maxima |
---|---|---|---|---|
SQE | 5 (5) | 1 | 2 | 1 |
EPH | 30 (8) | 23 | 29 | 22 |
CYP450 | 276 (191) | 213 | 289 | 234 |
UGT | 156 (131) | 124 | 137 | 121 |
CDS | 1 (1) | 1 | 2 | 3 |
The numbers in parentheses are the numbers of genes belonging to each gene family annotated in monk fruit genome version 1.
Ortholog analysis
Gene family clustering analysis was performed using OrthoMCL (RRID:SCR_007839) [37] on protein sequences of S. grosvenorii, C. sativus (cucumber_ChineseLong_v2, [29]) [28], Cucumis melo (CM3.5.1, [29]) [38], Citrullus lanatus (watermelon_97103_v1, [29]) [39], Prunus persica (Prunus_persica.prupe1_0, [40]) [41], Solanum lycopersicum (Solanum_lycopersicum.SL2.50, [40]) [42], Arabidopsis thaliana (Tair10, [43]) [44], and Oryza sativa (Oryza_sativa.IRGSP-1.0, [40]) [45]. A total of 23,246 S. grosvenorii genes were clustered into 26,190 gene families, including 1,471 unique S. grosvenorii gene families (Fig. 4A). Compared to other cucurbitaceous plants, S. grosvenorii shares fewer gene families with related species (Fig. 4B), indicating an earlier divergence time than C. lanatus. A total of 834 single-copy gene families were identified and selected to construct the phylogenetic tree using RAxML (RRID:SCR_006086) [46]. We used Muscle (RRID:SCR_011812) [47, 48] to align the orthologs, and the alignment was trimmed with Gblocks [49] using the parameters -t=p -b5=h -b4=5 -b3=15 -d=y -n=y. The divergence time was estimated by MCMCtree [50]. Phylogenetic analysis showed that S. grosvenorii diverged from the other members of the Cucurbitaceae family approximately 40.95 million years ago (Fig. 4C).
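Selecting single-copy gene families from a clustering result amounts to keeping the families in which every species contributes exactly one gene. The sketch below illustrates that selection on a toy input; the data structure, family identifiers, gene identifiers, and species codes are all hypothetical and do not reflect the OrthoMCL output format.

```python
def single_copy_families(families, species):
    """Return IDs of families with exactly one gene from every listed species.

    families: {family_id: {species_code: [gene_id, ...]}}
    """
    return [
        family_id
        for family_id, members in families.items()
        if all(len(members.get(sp, [])) == 1 for sp in species)
    ]

species = ["sgr", "csa", "ath"]  # hypothetical species codes
families = {
    "FAM1": {"sgr": ["g1"], "csa": ["c1"], "ath": ["a1"]},        # single copy
    "FAM2": {"sgr": ["g2", "g3"], "csa": ["c2"], "ath": ["a2"]},  # duplicated
    "FAM3": {"sgr": ["g4"], "csa": ["c3"]},                       # missing in ath
}
selected = single_copy_families(families, species)
print(selected)
```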
Transcriptomic analysis
Mogrosides are produced during fruit development in S. grosvenorii and are not found in vegetative tissues [8]. Thus, we performed an extensive transcriptomic analysis of early fruit at two stages (stage 1 sampled at 3 days after anthesis and stage 2 sampled at 20 days after anthesis) and of leaves to identify transcripts involved in mogroside synthesis in early fruit. Using the genome-wide annotation, RNA-Seq reads were mapped to the genome assembly, and read count tables were generated using HISAT2 and StringTie [33] for the next step of differential expression analysis. DESeq2 (RRID:SCR_000154) [51] was used to detect differential gene expression among L, F1, and F2 with the criteria of padj < 0.01 and |log2FoldChange| > 1. Genes that were up-regulated with fruit development were merged and used for KEGG pathway enrichment analysis with KOBAS (RRID:SCR_006350) [52]. Thirteen pathways were significantly enriched (corrected P < 0.01), and the most enriched pathways were related to metabolic pathways. In particular, the sesquiterpenoid and triterpenoid biosynthesis pathways were significantly enriched, indicating that genes involved in the biosynthesis of secondary metabolites, including mogrosides, perform their functions in the very early fruit (Fig. 5). Genes possibly related to mogrosides biosynthesis in early fruit according to the gene annotation were assigned to the mogrosides synthesis pathway (Fig. 2).
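The differential expression criteria above (padj < 0.01 and |log2FoldChange| > 1) can be applied to a results table as a simple filter. The sketch below is illustrative only: the gene records are hypothetical, it assumes the fold change is expressed as fruit relative to leaf so that positive values mean up-regulation in fruit, and it treats a missing adjusted P value (which DESeq2 reports as NA for some genes) as not significant.

```python
def is_significant(rec, padj_cutoff=0.01, lfc_cutoff=1.0):
    """Apply padj < cutoff and |log2FoldChange| > cutoff to one gene record."""
    padj = rec["padj"]
    if padj is None:  # assumption: missing padj counts as not significant
        return False
    return padj < padj_cutoff and abs(rec["log2FoldChange"]) > lfc_cutoff

def is_up(rec):
    """Significant and higher in fruit (assumes fruit vs. leaf contrast)."""
    return is_significant(rec) and rec["log2FoldChange"] > 0

# Hypothetical gene records
results = [
    {"gene": "geneA", "log2FoldChange": 3.2, "padj": 1e-8},
    {"gene": "geneB", "log2FoldChange": -2.5, "padj": 1e-4},
    {"gene": "geneC", "log2FoldChange": 0.6, "padj": 1e-6},
    {"gene": "geneD", "log2FoldChange": 4.0, "padj": None},
]
up_genes = [r["gene"] for r in results if is_up(r)]
print(up_genes)
```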
Discussion
S. grosvenorii is an important herbal crop with considerable economic and pharmacological value. Mogrosides, the main effective components of S. grosvenorii fruit, are becoming partial substitutes for sucrose because of their extremely sweet and noncaloric characteristics as more progress is made on molecular breeding and purification processes. Additionally, monk fruit could serve as a contrast to other cucurbitaceous plants because it diverged from the common ancestor earlier than some other well-studied cucurbits (cucumber, muskmelon), and it may be a new system for the investigation of plant sex determination. In the present study, we sequenced and assembled the second version of the monk fruit genome. With a great improvement in completeness and accuracy, the genome and its annotations will provide valuable resources and reference information for transcriptome assembly and novel gene discovery. These resources and further transcriptomic analysis of ripe fruit and young fruit will facilitate studies of the secondary metabolite synthesis pathways and monk fruit breeding.
Availability of supporting data
The genomic and transcriptomic sequencing reads were deposited in the Genome Sequence Archive under accession number CRA000522 and ENA (European Nucleotide Archive) under accession numbers PRJEB23465, PRJEB23466, and PRJEB25737. Supporting data are also available in the GigaScience database, GigaDB [53].
Abbreviations
CDS: cucurbitadienol synthase; CYP450: cytochrome P450; DAA: days after anthesis; EPH: epoxide hydrolase; F1: fruit of 3 DAA; F2: fruit of 20 DAA; FL: leaves of female plants; L: leaves beside fruits; ML: leaves of male plants; PacBio: Pacific Biosciences; PE: paired-end; R: root; SMRT: single molecule real time sequencing; SQE: squalene epoxidase; UGT: UDP-glycosyl-transferase.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
X.W.D., B.C., H.H., and M.X. planned and coordinated the project. M.X. collected and grew the plant material. R.Y. and G.Z. collected the samples and performed experiments. Genome assembly, annotation, phylogenetic analysis, and manuscript writing were completed by X.H., M.X., H.H., and X.W.D.
ACKNOWLEDGEMENTS
This research was supported by the National Key R&D Program of China (2017YFA0503800) to X.W.D. and in part by the National Demonstration Area of Modern Agriculture in Cangxi, Sichuan Province, China.
References
- 1. Zhang JS, Dai LH, Yang JG et al. Oxidation of cucurbitadienol catalyzed by CYP87D18 in the biosynthesis of mogrosides from Siraitia grosvenorii. Plant Cell Physiol. 2016;57:1000–7.
- 2. Li C, Lin LM, Sui F et al. Chemistry and pharmacology of Siraitia grosvenorii: A review. Chinese J Nat Med. 2014;12:89–102.
- 3. Liu C, Dai LH, Dou DQ et al. A natural food sweetener with anti-pancreatic cancer properties. Oncogenesis. 2016;5:e217. [Retracted]
- 4. Nie RL. The decadal progress of triterpene saponins from Cucurbitaceae (1980–1992). Acta Bot Yunnan. 1994;16:201–8.
- 5. Wang Q, Qin HH, Wang W et al. The pharmacological research progress of Siraitia grosvenorii. J Guangxi Tradit Chin Med Univ. 2010;13:75–76.
- 6. Zhang H, Li XH. Research progress on chemical compositions of Fructus Momordicae. J Anhui Agri Sci. 2011;39:4555–56, 4559.
- 7. Pawar RS, Krynitsky AJ, Rader JI. Sweeteners from plants–with emphasis on Stevia rebaudiana (Bertoni) and Siraitia grosvenorii (Swingle). Anal Bioanal Chem. 2013;405:4397–407.
- 8. Itkin M, Davidovich-Rikanati R, Cohen S et al. The biosynthetic pathway of the nonsugar, high-intensity sweetener mogroside V from Siraitia grosvenorii. Proc Natl Acad Sci USA. 2016;113:E7619–28.
- 9. Dai LH, Liu C, Zhu YM et al. Functional characterization of cucurbitadienol synthase and triterpene glycosyltransferase involved in biosynthesis of mogrosides from Siraitia grosvenorii. Plant Cell Physiol. 2015;56:1172–82.
- 10. Zhang JS, Dai LH, Yang JG et al. Oxidation of cucurbitadienol catalyzed by CYP87D18 in the biosynthesis of mogrosides from Siraitia grosvenorii. Plant Cell Physiol. 2016;57:1000–7.
- 11. Tang Q, Ma XJ, Mo CM et al. An efficient approach to finding Siraitia grosvenorii triterpene biosynthetic genes by RNA-seq and digital gene expression analysis. BMC Genomics. 2011;12:343.
- 12. Shibuya M, Adachi S, Ebizuka Y. Cucurbitadienol synthase, the first committed enzyme for cucurbitacin biosynthesis, is a distinct enzyme from cycloartenol synthase for phytosterol biosynthesis. Tetrahedron. 2004;60:6995–7003.
- 13. Zimin AV, Stevens KA, Crepeau MW et al. An improved assembly of the loblolly pine mega-genome using long-read single-molecule sequencing. Gigascience. 2017;6:1–4.
- 14. Porebski S, Bailey LG, Baum BR. Modification of a CTAB DNA extraction protocol for plants containing high polysaccharide and polyphenol components. Plant Mol Biol Rep. 1997;15:8–15.
- 15. Xiao CL, Chen Y, Xie SQ et al. MECAT: fast mapping, error correction, and de novo assembly for single-molecule sequencing reads. Nat Methods. 2017;14:1072–4.
- 16. Chin CS, Alexander DH, Marks P et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat Methods. 2013;10:563–9.
- 17. Walker BJ, Abeel T, Shea T et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One. 2014;9:e112963.
- 18. Simão FA, Waterhouse RM, Ioannidis P et al. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31:3210–2.
- 19. Kim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nat Methods. 2015;12:357–60.
- 20. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv. 2013;arXiv:1303.3997.
- 21. Li H, Handsaker B, Wysoker A et al. The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics. 2009;25:2078–79.
- 22. McKenna A, Hanna M, Banks E et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–303.
- 23. Okonechnikov K, Conesa A, García-Alcalde F. Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics. 2016;32:292–94.
- 24. Tarailo-Graovac M, Chen N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr Protoc Bioinformatics. 2009;3:4–14.
- 25. Visser M, Van der Walt AP, Maree HJ et al. Extending the sRNAome of apple by next-generation sequencing. PLoS One. 2014;9:e95782.
- 26. Smit A, Hubley R. RepeatModeler Open-1.0.8, 2008; http://www.repeatmasker.org/RepeatModeler.html.
- 27. Urasaki N, Takagi H, Natsume S et al. Draft genome sequence of bitter gourd (Momordica charantia), a vegetable and medicinal plant in tropical and subtropical regions. DNA Res. 2016;24:51–58.
- 28. Huang S, Li R, Zhang Z et al. The genome of the cucumber, Cucumis sativus L. Nat Genet. 2009;41:1275–81.
- 29. Cucurbit Genomics Database: http://cucurbitgenomics.org, Accessed 12 Jun 2018.
- 30. Wise2: https://www.ebi.ac.uk/~birney/wise2/, Accessed 12 Jun 2018.
- 31. Gupta V, Estrada AD, Blakley I et al. RNA-Seq analysis and annotation of a draft blueberry genome assembly identifies candidate genes involved in fruit ripening, biosynthesis of bioactive compounds, and stage-specific alternative splicing. Gigascience. 2015;4:5.
- 32. Stanke M, Tzvetkova A, Morgenstern B. AUGUSTUS at EGASP: using EST, protein and genomic alignments for improved gene prediction in the human genome. Genome Biol. 2006;7:S11.1–8.
- 33. Pertea M, Pertea GM, Antonescu CM et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol. 2015;33:290–5.
- 34. TransDecoder GitHub: https://github.com/TransDecoder/TransDecoder, Accessed 12 Jun 2018.
- 35. Haas BJ, Salzberg SL, Zhu W et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 2008;9:R7.
- 36. Quevillon E, Silventoinen V, Pillai S et al. InterProScan: protein domains identifier. Nucleic Acids Res. 2005;33:W116–20.
- 37. Li L, Stoeckert CJ Jr, Roos DS. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003;13:2178–89.
- 38. Garcia-Mas J, Benjak A, Sanseverino W et al. The genome of melon (Cucumis melo L.). Proc Natl Acad Sci USA. 2012;109:11872–77.
- 39. Guo S, Zhang J, Sun H et al. The draft genome of watermelon (Citrullus lanatus) and resequencing of 20 diverse accessions. Nat Genet. 2013;45:51–8.
- 40. EnsemblPlants: https://plants.ensembl.org/, Accessed 12 Jun 2018.
- 41. International Peach Genome Initiative, Verde I, Abbott AG et al. The high-quality draft genome of peach (Prunus persica) identifies unique patterns of genetic diversity, domestication and genome evolution. Nat Genet. 2013;45:487–94.
- 42. Tomato Genome Consortium. The tomato genome sequence provides insights into fleshy fruit evolution. Nature. 2012;485:635–41.
- 43. TAIR: http://Arabidopsis.org/, Accessed Jun 2018.
- 44. Lamesch P, Berardini TZ, Li D et al. The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Res. 2012;40:D1202–10.
- 45. International Rice Genome Sequencing Project. The map-based sequence of the rice genome. Nature. 2005;436:793–800.
- 46. Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014;30:1312–13.
- 47. MUSCLE: https://www.ebi.ac.uk/Tools/msa/muscle/, Accessed 12 Jun 2018.
- 48. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32:1792–97.
- 49. Talavera G, Castresana J. Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst Biol. 2007;56:564–77.
- 50. Battistuzzi FU, Billing-Ross P, Paliwal A et al. Fast and slow implementations of relaxed-clock methods show similar patterns of accuracy in estimating divergence times. Mol Biol Evol. 2011;28:2439–42.
- 51. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15:550.
- 52. Xie C, Mao X, Huang J et al. KOBAS 2.0: a web server for annotation and identification of enriched pathways and diseases. Nucleic Acids Res. 2011;39:W316–22.
- 53. Xia M, Han X, He H et al. Supporting data for “Improved de novo genome assembly and analysis of the Chinese cucurbit Siraitia grosvenorii, also known as monk fruit or luo-han-guo”. GigaScience Database. 2018. http://dx.doi.org/10.5524/100452.