GigaByte. 2021 Jun 10;2021:gigabyte24. doi: 10.46471/gigabyte.24

Improvements in the sequencing and assembly of plant genomes

Priyanka Sharma 1, Othman Al-Dossary 1,2, Bader Alsubaie 1,2, Ibrahim Al-Mssallem 2, Onkar Nath 1, Neena Mitter 1, Gabriel Rodrigues Alves Margarido 1,4, Bruce Topp 1, Valentine Murigneux 3, Ardashir Kharabian Masouleh 1, Agnelo Furtado 1, Robert J Henry 1,5,*
PMCID: PMC9631998  PMID: 36824328

Abstract

Advances in DNA sequencing have made it easier to sequence and assemble plant genomes. Here, we extend an earlier study, and compare recent methods for long read sequencing and assembly. Updated Oxford Nanopore Technology software improved assemblies. Using more accurate sequences produced by repeated sequencing of the same molecule (Pacific Biosciences HiFi) resulted in less fragmented assembly of sequencing reads. Using data for increased genome coverage resulted in longer contigs, but reduced total assembly length and improved genome completeness. The original model species, Macadamia jansenii, was also compared with three other Macadamia species, as well as avocado (Persea americana) and jojoba (Simmondsia chinensis). In these angiosperms, increasing sequence data volumes caused a linear increase in contig size, decreased assembly length and further improved already high completeness. Differences in genome size and sequence complexity influenced the success of assembly. Advances in long read sequencing technology continue to improve plant genome sequencing and assembly. However, results were improved by greater genome coverage, with the amount needed to achieve a particular level of assembly being species dependent.

Data description

This article is an update of the previously published article “Comparison of long-read methods for sequencing and assembly of a plant genome” [1].

Recent advances in DNA sequencing technology have facilitated the sequencing and assembly of plant genomes. There has been rapid growth in the number of reports of high-quality chromosome level assemblies [2]. A basal eudicot, Macadamia jansenii, was used to compare the range of long read sequencing and assembly technologies available in 2019 [1]. The Pacific Biosciences (PacBio) Sequel, Oxford Nanopore Technology (ONT) PromethION and Beijing Genomics Institute (BGI) single-tube Long Fragment Read platforms were used to analyse the same sample. Assembly tools were evaluated for these data sets, and the contribution of short reads to improving assemblies was assessed [1]. Technology improvements had delivered continuing increases in the length and quality of sequence reads delivered by these platforms.

Context

Since the original study, notable further advances have been made, with the use of repeated sequencing of the same molecule to greatly increase sequence accuracy for long reads. This allows the generation of long reads (10–25 kilobase pairs [kb]) with greater than 99.5% accuracy [3]. Comparison of long read technologies demonstrates the advantages and disadvantages of different platforms in relation to contiguity, accuracy of sequence and data analysis time [4]. We now update the earlier study to demonstrate the impact of these improvements on genome assemblies. Factors such as the volume of data (base pairs, bp) used in the assembly were explored for the Macadamia genome, related species and other diverse species with similar sized genomes.

Methods

DNA extraction

All local, national and international guidelines and legislation were observed in obtaining samples for this study. Macadamia jansenii (NCBI:txid83725) DNA was prepared as described earlier [5]. DNA from three other Macadamia species (M. tetraphylla [NCBI:txid512563], M. ternifolia [NCBI:txid4330] and M. integrifolia [NCBI:txid60698]) and from jojoba (Simmondsia chinensis [NCBI:txid3999]) was also extracted using this method, with minor modifications (phenol was excluded from the extraction method) [6]. Avocado (Persea americana [NCBI:txid3435]) DNA was isolated by a modified cetyl-trimethyl ammonium bromide (CTAB) DNA extraction protocol [7, 8]. Leaf tissue (0.2 g) was ground and added to 15 ml of 2% CTAB buffer, pH 8.0, followed by 15 min incubation at 65 °C. After centrifugation at 10g for 15 min, the supernatant was treated with RNAse A (10 ng/μl) and incubated at 37 °C for 30 min. Chloroform:isoamyl alcohol (24:1) washes were performed, followed by precipitation with isopropanol and 70% ethanol washes. The DNA was resuspended in ultrapure DNAse and RNAse-free water for sequencing.

DNA sequencing and assembly

Long read sequencing was conducted as previously described [5]. Continuous long reads (CLR) were assembled using Falcon (RRID:SCR_016089) [1] for M. jansenii and Canu (RRID:SCR_015880, v 2.0) for the other genomes. HiFi gDNA libraries were prepared using sheared genomic DNA (∼15–20 Kb). gDNA was sequenced on a PacBio Sequel II (software/chemistry v9.0.0) following diffusion loading. Sequence data were processed to generate circular consensus sequencing (CCS) reads using the default settings of the CCS application (RRID:SCR_017990, v4.2.0) in SMRT Link (RRID:SCR_002942, v9.0.0): minimum number of passes (3), minimum accuracy (0.99), minimum CCS read length (10) and maximum CCS read length (50,000). CCS reads were assembled using the Improved Phased Assembly (IPA) method (PacBio).
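The CCS thresholds listed above can be read as a simple per-read filter. The sketch below is illustrative only and is not the SMRT Link implementation: the `CcsRead` record, its field names and the example reads are hypothetical, and only the four threshold values are taken from the text.

```python
from dataclasses import dataclass

# Threshold values as stated in the Methods text.
MIN_PASSES = 3
MIN_ACCURACY = 0.99
MIN_LENGTH = 10
MAX_LENGTH = 50_000


@dataclass
class CcsRead:
    """Hypothetical record for one consensus read (not a SMRT Link type)."""
    name: str
    length: int      # consensus read length in bp
    passes: int      # number of full passes over the same molecule
    accuracy: float  # predicted consensus accuracy, between 0 and 1


def passes_ccs_thresholds(read: CcsRead) -> bool:
    """Return True if the read satisfies all four thresholds."""
    return (
        read.passes >= MIN_PASSES
        and read.accuracy >= MIN_ACCURACY
        and MIN_LENGTH <= read.length <= MAX_LENGTH
    )


# Invented example reads, for illustration only.
reads = [
    CcsRead("r1", 15_200, 8, 0.999),
    CcsRead("r2", 14_800, 2, 0.995),  # too few passes
    CcsRead("r3", 61_000, 5, 0.992),  # longer than the maximum length
    CcsRead("r4", 12_500, 4, 0.985),  # below the accuracy threshold
]
kept = [r.name for r in reads if passes_ccs_thresholds(r)]
print(kept)
```

In this toy input only `r1` meets every threshold; each of the other reads fails exactly one criterion.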

The IPA assembly method is available in protocols.io (Figure 1) [9].

Figure 1.

Protocol for IPA assembly for PacBio Hifi reads [9]. https://www.protocols.io/widgets/doi?uri=dx.doi.org/10.17504/protocols.io.buxvnxn6

The IPA assembly tool [10] uses the HiFi sequencing reads (high-quality consensus reads) to generate a phased assembly. The output comprises a primary contig set, containing the main assembly, and an associated contig set, containing haplotigs and duplications. For all assemblies, 24 central processing unit (CPU) cores and 120 GB of memory were used.

Assessment of completeness

The completeness of genome assemblies was evaluated using benchmarking universal single-copy orthologues (BUSCO) analysis (RRID:SCR_015008, v4.1.2 and v5.beta) [11, 12], using genome mode and lineage Eukaryota_odb10 dataset.
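BUSCO completeness values such as those reported in the tables are usually read off a one-line summary of the run. The sketch below parses such a line, assuming the compact `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` layout; that layout is an assumption about the tool's output rather than something stated in this article, the function name is hypothetical, and the example line is invented.

```python
import re

# Assumed layout of the one-line BUSCO summary:
# C = complete, S = single-copy, D = duplicated, F = fragmented,
# M = missing, n = number of BUSCO groups searched.
_BUSCO_LINE = re.compile(
    r"C:(?P<complete>[\d.]+)%"
    r"\[S:(?P<single>[\d.]+)%,D:(?P<duplicated>[\d.]+)%\],"
    r"F:(?P<fragmented>[\d.]+)%,"
    r"M:(?P<missing>[\d.]+)%,"
    r"n:(?P<total>\d+)"
)


def parse_busco_summary(line: str) -> dict:
    """Extract percentages and the group count from a summary line."""
    match = _BUSCO_LINE.search(line)
    if match is None:
        raise ValueError("line does not look like a BUSCO summary")
    result = {key: float(value) for key, value in match.groupdict().items()}
    result["total"] = int(result["total"])
    return result


# Invented example values, for illustration only.
example_line = "C:97.2%[S:90.6%,D:6.6%],F:1.2%,M:1.6%,n:255"
summary = parse_busco_summary(example_line)
print(summary["complete"], summary["total"])
```

Keeping the single-copy and duplicated fractions separate is useful for phased assemblies, where a high duplicated fraction can indicate retained haplotigs.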

Data validation and quality control

Long read versus HiFi assemblies

Comparison of assemblies based upon long reads [13] and circular consensus sequence (CCS) reads from HiFi [3] showed that greater accuracy of the CCS reads resulted in greatly improved assemblies for the M. jansenii genome (Table 1).

Table 1.

Improvement in long read sequencing (PacBio) for Macadamia jansenii when using higher accuracy sequencing.

Parameter | Long reads [1] | HiFi
Total data (Gb) | 65.2 | 28
Contig N50 (Mb) | 1.57 | 4.49
Assembly length (Mb) | 758 | 738
Number of contigs (n) | 762 | 284
BUSCO (%) | 97 | 98

Long reads: phased Falcon assembly [1]. BUSCO: Benchmarking Universal Single-Copy Orthologs; Gb: gigabase pairs; Mb: megabase pairs.

The assembly with the high quality HiFi reads was less fragmented, with slightly reduced total genome length and improved completeness (Benchmarking Universal Single-Copy Orthologs, BUSCO). The use of around 20 gigabase pairs (Gb) of high-quality (HiFi) data gave N50 values of 4 megabase pairs (Mb) and resulted in assemblies with fewer than 300 contigs required to cover the genome. This represents a considerable advance over the assemblies that were possible when this sample was previously used to compare different long read platforms and assembly tools, many of which required long computing times to assemble contigs [1]. The high-quality Improved Phased Assembler (IPA) assembly had a run time of 20 h with 120 gigabytes (GB) peak memory on the FlashLite computer cluster. This analysis requirement compares favorably with the results for a large number of earlier assembly tools reported for the same sample [1], but provides a much higher quality assembly. Assembly of the HiFi data with other recent tools was also compared. Flye (RRID:SCR_017016, v2.8.3) resulted in a highly fragmented genome of 993 Mb with an N50 of 459 Kb, while Hifiasm (RRID:SCR_021069, v0.15) produced a genome of 827 Mb comprising 779 contigs, but with an N50 of 46.1 Mb and an L75 of 14.
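The contiguity statistics quoted above (N50, L75) follow from the sorted contig lengths. As a minimal sketch, the hypothetical helper `nx_lx` below returns the length of the contig at which the cumulative sum of lengths, taken from longest to shortest, first reaches a given fraction of the assembly (Nx), together with the number of contigs needed to get there (Lx). The contig lengths are toy values, not data from this study.

```python
def nx_lx(contig_lengths, fraction):
    """Return (Nx, Lx) for a fraction such as 0.5 (N50/L50) or 0.75 (N75/L75)."""
    if not contig_lengths:
        raise ValueError("no contigs given")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ordered = sorted(contig_lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return length, count


# Toy contig lengths (arbitrary units), total = 150.
toy_contigs = [10, 50, 30, 20, 40]
n50, l50 = nx_lx(toy_contigs, 0.5)
n75, l75 = nx_lx(toy_contigs, 0.75)
print(n50, l50, n75, l75)
```

For the toy input, the two longest contigs (50 and 40) already cover half of the total, so N50 is 40 and L50 is 2; a third contig is needed to reach 75%, giving N75 of 30 and L75 of 3.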

Results for other Macadamia species

M. jansenii is an endangered species. It is one of four species in the Macadamia genus. Sequences of all four species were obtained using the same HiFi techniques and all gave similar, high-quality outcomes when assembled (Table 2).

Table 2.

Comparison of assemblies of Macadamia species.

Parameter | M. jansenii | M. integrifolia | M. tetraphylla | M. ternifolia
Contig N50 (Mb) | 4.5 | 5.3 | 10.0 | 6.4
Longest contig (Mb) | 16.6 | 26.4 | 32.1 | 21.2
Assembly length (Mb) | 738 | 742 | 707 | 716
Number of contigs (n) | 284 | 249 | 153 | 211
BUSCO (%) | 97.2 | 98 | 97 | 98

Primary assemblies shown. For details of associated assemblies, see Table 3.

Table 3.

Data for associated contigs in IPA assemblies.

Parameter | M. jansenii | M. integrifolia | M. ternifolia | M. tetraphylla | Jojoba | Avocado
Contig N50 (Mb) | 0.45 | 1.23 | 0.77 | 1.83 | 1.69 | 1.53
Longest contig (Mb) | 5.23 | 10.22 | 5.68 | 14.97 | 8.25 | 10.0
Assembly length (Mb) | 527 | 671 | 590 | 655 | 738 | 788
Number of contigs (n) | 3966 | 3226 | 3006 | 2103 | 1999 | 3196

Results for other plant species

Methods for sequencing plant genomes must be applied to genomes of various sizes and complexities. Macadamia is a basal eudicot. Other flowering plant genomes were sequenced to determine how widely applicable the results of this study would be in plant genome sequencing. Jojoba (Simmondsia chinensis), a core eudicot from the Caryophyllales, and avocado, a magnoliid, were compared. The three diverse genomes were all similar in size (700–1000 Mb). Many important plant genomes are in, or near, this size range [14]. M. jansenii is an endangered species with relatively low heterozygosity, avocado has much greater heterozygosity [15], and jojoba has been reported to be a tetraploid [16]. Heterozygosity and polyploidy both complicate assembly [17, 18]. When HiFi reads were used instead of the earlier continuous long reads, the assemblies were more contiguous (fewer contigs required to cover the genome) or similar (avocado), and required less data in each case (Table 4). The macadamia and jojoba genomes gave larger N50 values with the HiFi (CCS) reads than with the long reads (CLR). However, the N50 for the slightly larger genome of avocado was greater with the long reads than with the HiFi reads. This suggests that the larger genome may have longer repeat regions that limit contig assembly in some parts of the genome with HiFi reads.

Table 4.

Long read versus HiFi sequencing of other diverse species.

Parameter | Jojoba (long reads) | Avocado (long reads) | Jojoba (HiFi) | Avocado (HiFi)
Total data (Gb) | 152 | 159 | 41.4 | 44
Contig N50 (Mb) | 4.73 | 6.7 | 4.89 | 4.3
Assembly length (Mb) | 1260 | 787 | 780 | 749
Number of contigs (n) | 762 | 308 | 284 | 298
BUSCO (%) | 99 | 99 | 98 | 98

Impact of sequencing coverage on the assemblies

The length of the contigs assembled (Figure 2) was directly related to the volume of sequence data used. Analysis of four related Macadamia species gave a similar linear relationship between data volume and contig N50 for input of 10–40× genome coverage. The size of the contigs assembled showed a similar dependence on the amount of sequence data (genome coverage) across species, with the slope of the relationship varying for different species (Figure 3). The Macadamia genomes could be assembled with lower coverage. This may be a function of genome size, with their smaller genomes requiring less coverage to achieve a given N50. The larger genomes may contain a higher proportion of repetitive sequences that are difficult to assemble.
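The linear relationship described here, with a species-specific slope, can be quantified with an ordinary least-squares fit of contig N50 against genome coverage. The sketch below is one way to do this and is not the authors' analysis; `fit_line` is a hypothetical helper and the coverage and N50 values are placeholders chosen for easy checking, not measurements from this study.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Placeholder values only (NOT data from the paper):
coverage_x = [10, 20, 30, 40]      # genome coverage (fold)
n50_mb = [1.0, 2.0, 3.0, 4.0]      # contig N50 (Mb)
slope, intercept = fit_line(coverage_x, n50_mb)
print(round(slope, 3), round(intercept, 3))
```

Fitting each species separately and comparing slopes gives a simple way to express how much additional coverage a given genome needs to reach a target N50.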

Figure 2.

Influence of data volume on assembly for Macadamia species. N50 of contigs is plotted against the genome coverage. Genome sizes used to calculate coverage were: M. integrifolia, 895 Mb [20]; M. jansenii, 780 Mb [1]; M. tetraphylla, 758 Mb [27]; and M. ternifolia, 758 Mb (not known but assumed to be the same as M. tetraphylla owing to similar assembly size).

Figure 3.

Influence of data volume on assembly for diverse species. N50 of contigs is plotted against the genome coverage. Genome sizes used to calculate coverage were: jojoba, 1003 Mb [22]; avocado, 920 Mb [28]; and as in Figure 2 for the Macadamia species.
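Genome coverage in these figures is the volume of sequence data divided by the genome size. As a small worked example, the sketch below applies that arithmetic to the total HiFi data volumes reported in Tables 1 and 4 and the genome sizes given in the figure captions. It is only an illustration of the calculation; the article does not state that these totals correspond to particular points in the figures, and the function and dictionary names are hypothetical.

```python
# Genome sizes (Mb) as given in the figure captions.
GENOME_SIZE_MB = {"M. jansenii": 780, "jojoba": 1003, "avocado": 920}

# Total HiFi data (Gb) as given in Tables 1 and 4.
HIFI_DATA_GB = {"M. jansenii": 28, "jojoba": 41.4, "avocado": 44}


def fold_coverage(data_gb: float, genome_mb: float) -> float:
    """Coverage = sequenced bases / genome size (1 Gb = 1000 Mb)."""
    if genome_mb <= 0:
        raise ValueError("genome size must be positive")
    return data_gb * 1000 / genome_mb


coverage_by_species = {
    species: fold_coverage(HIFI_DATA_GB[species], GENOME_SIZE_MB[species])
    for species in GENOME_SIZE_MB
}
for species, cov in coverage_by_species.items():
    print(f"{species}: {cov:.1f}x")
```

With these inputs the totals correspond to roughly 36× for M. jansenii, 41× for jojoba and 48× for avocado.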

Assemblies based upon more data were slightly shorter in total length (Figure 4). This reduction was probably associated with removal of duplicated end sequences as contigs were joined. The high quality of these assemblies was confirmed by BUSCO values of more than 95%. Genome completeness was high in all cases but increased slightly when more data was used in the assembly (Figure 5).

Figure 4.

Decrease in length of total assembly as more genome coverage is used in the assembly.

Figure 5.

Improvement in genome completeness (BUSCO%) with genome coverage.

These results were confirmed when these methods were applied to the other phylogenetically diverse plant species, which have slightly larger and more complex genomes. In each case, N50 and completeness increased with data volume, while total assembly length declined.

Impact of the read length on the assemblies

The length of sequence reads was also expected to influence the assembly. Examination of size distribution of the six species showed that the length of the sequences varied slightly within the expected range – around 15 Kb for HiFi data. The minor differences in mean read length and numbers of longer reads did not explain the differences in the size of the contigs assembled (Supplementary Figure S1). This suggests that the different amounts of sequence data required to drive assembly to a particular level may be associated more with the complexity of the sequence. The close relationship between sequence volume and N50 for the four Macadamia species may reflect the similar sequence complexity of the species in this group. The jojoba and avocado genomes required more sequence data to reach the same level of assembly. The slightly larger genome size of these two species may be sufficient to explain this difference, owing to the likely higher proportion of repetitive sequence in the somewhat larger genomes.

Oxford Nanopore Technologies updates

ONT regularly releases updated basecalling software to convert raw electrical signal into sequence data. We repeated the basecalling of the ONT raw data of M. jansenii using different versions of the Guppy basecaller released in March 2019 (v2.3.7), April 2019 (v3.0.3) and June 2020 (v4.0.11). The assembly quality improved, as shown by an increase in the assembly contiguity and in the number of complete BUSCOs before any polishing (Table 5). The assembly size decreased from 817 Mb to 798 Mb. Two versions of the Flye assembler were also applied to the same basecalled sequence data set; the newer version gave a marked increase in genome contiguity and completeness, as well as a reduced genome assembly size.

Table 5.

ONT genome assembly statistics of M. jansenii using the Flye assembler, the pass reads and different Guppy basecaller versions.

Basecaller | Guppy v2.3.7 | Guppy v3.0.3 | Guppy v3.0.3 | Guppy v4.0.11
Assembler | Flye v2.5 | Flye v2.4.2 | Flye v2.5 | Flye v2.5
Number of reads | 1,597,353 | 1,592,919 | 1,592,919 | 1,594,802
Contig N50 (Mb) | 1.44 | 0.94 | 1.51 | 1.79
Assembly length (Mb) | 817 | 845 | 811 | 798
Number of contigs | 2996 | 4242 | 2855 | 2841
Number of contigs (>10 kb) | 2300 | 3275 | 2088 | 1913
BUSCO complete (%) | 66.8 | 51.4 | 75.1 | 79.1

Reuse potential

These assemblies represent considerable advances over the highly fragmented genomes previously reported for these species [19–22]. Advances in long read sequencing using different platforms provide improving options for plant genome sequencing and assembly [23]. A recent comparison of these methods applied to rice genome sequencing showed strengths and weaknesses of both, with greater sequence accuracy in the PacBio assemblies and more contiguity in the ONT assemblies [4]. The resulting genome sequences can be evaluated for the best combination of sequence and assembly accuracy [24]. The results presented here show that contig size can be increased by adding more sequence reads to achieve a linear increase in N50. These extra data will result in slightly shorter total assembly lengths and improved completeness of the genomes. When combined with higher level assembly tools [25], the improved methods will support routine, rapid and efficient generation of highly accurate chromosome-level genome sequences of plant species [26].

Supplementary Material

Figure S1

Size distribution of reads sequenced

Acknowledgements

The project was supported by the University of Queensland Research Computing Centre (RCC) and the University of Queensland Genome Innovation Hub.

Funding Statement

This research received funding from the Hort Frontiers Advanced Production Systems Fund, Hort Frontiers Strategic Partnership, Hort Innovation, with University of Queensland and the Australian Government as part of National Tree Genomics Program, AS17000 Genomics Resources Toolbox, to R Henry; and from King Faisal University, Jojoba Genomics Project to RJ Henry, A Furtado and A Kharabian Masouleh.

Data availability

Sequence data from PacBio (Sequel) (RRID:SCR_017989), ONT (PromethION) (RRID:SCR_017987) and BGI (single-tube Long Fragment Read) (RRID:SCR_011114) analysis of M. jansenii were described by Murigneux et al. [1]. BGI, PacBio, ONT, and Illumina sequencing data generated in that study were deposited in the NCBI Sequence Read Archive (SRA) under BioProject PRJNA609013 and BioSample SAMN14217788. Accession numbers are as follows: BGI (SRR11191908), PacBio (SRR11191909), ONT PromethION (SRR11191910), ONT MinION (SRR11191911), and Illumina (SRR11191912). Assemblies and other supporting data are available from the GigaScience GigaDB repository [29]. PacBio HiFi reads described in this paper are deposited as CCS reads under the following NCBI BioProject IDs: Macadamia, PRJNA694456; avocado, PRJNA694184; and jojoba, PRJNA694450. Other data further supporting this updated work are openly available in the GigaScience repository, GigaDB [30].

Declarations

List of abbreviations

BGI: Beijing Genomics Institute; bp: base pair(s); BUSCO: Benchmarking Universal Single-Copy Orthologs; CCS: circular consensus sequencing; CLR: continuous long reads; Gb: gigabase pair(s); Kb: kilobase pair(s); Mb: megabase pair(s); ONT: Oxford Nanopore Technology; PacBio: Pacific Biosciences.

Ethical approval

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

Conceptualization: RJH, AF, VM, AM, IA; data curation: PS, OA, BA, ON, VM, AF; formal analysis: PS, OA, BA, ON, GM, VM, AM, AF; funding acquisition: RH, AF, IA, AM; investigation: PS, OA, BA, ON, NM, VM, AM, AF, RH; resources: IA, BT; supervision: IA, BT, NM, RH, AM, AF; writing of the original draft: RH, PS, ON, AM; writing, review and editing: all authors.

References

  • 1. Murigneux V, et al. Comparison of long-read methods for sequencing and assembly of a plant genome. Gigascience, 2020; 9(12): giaa146. doi: 10.1093/gigascience/giaa146.
  • 2. Michael TP, Van Buren R. Building near-complete plant genomes. Curr. Opin. Plant Biol., 2020; 54: 26–33. doi: 10.1016/j.pbi.2019.12.009.
  • 3. Hon T, et al. Highly accurate long-read HiFi sequencing data for five complex genomes. Sci. Data, 2020; 7: 399. doi: 10.1038/s41597-020-00743-4.
  • 4. Lang D, et al. Comparison of the two up-to-date sequencing technologies for genome assembly: HiFi reads of Pacific Biosciences Sequel II system and ultralong reads of Oxford Nanopore. Gigascience, 2020; 9(12): giaa123. doi: 10.1093/gigascience/giaa123.
  • 5. Cheng B, Furtado A, Henry RJ. Long-read sequencing of the coffee bean transcriptome reveals the diversity of full-length transcripts. Gigascience, 2017; 6(11): gix086. doi: 10.1093/gigascience/gix086.
  • 6. Furtado A. DNA extraction from vegetative tissue for next-generation sequencing. Methods Mol. Biol., 2014; 1099: 1–5. doi: 10.1007/978-1-62703-715-0_1.
  • 7. Zou Y, et al. Nucleic acid purification from plants, animals and microbes in under 30 seconds. PLoS Biol., 2017; 15: e2003916. doi: 10.1371/journal.pbio.2003916.
  • 8. Bienvenue JM, Duncalf N, Marchiarullo D, Ferrance JP, Landers JP. Microchip-based cell lysis and DNA extraction from sperm cells for application to forensic analysis. J. Forensic Sci., 2006; 51: 266–273. doi: 10.1111/j.1556-4029.2006.00054.x.
  • 9. Sharma P, et al. IPA assembly for PacBio Hifi reads. protocols.io. 2021; doi: 10.17504/protocols.io.buxvnxn6.
  • 10. Pacific Biosciences. Improved Phased Assembler. 2020; https://github.com/PacificBiosciences/pbbioconda/wiki/Improved-Phased-Assembler.
  • 11. Seppey M, Manni M, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness. Methods Mol. Biol., 2019; 1962: 227–245. doi: 10.1007/978-1-4939-9173-0_14.
  • 12. Jalili V, et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2020 update. Nucleic Acids Res., 2020; 48: 8205–8207. doi: 10.1093/nar/gkaa554.
  • 13. Amarasinghe SL, Su S, Dong XY, Zappia L, Ritchie ME, Gouil Q. Opportunities and challenges in long-read sequencing data analysis. Genome Biol., 2020; 21: 30. doi: 10.1186/s13059-020-1935-5.
  • 14. Wendel JF, Jackson SA, Meyers BC, Wing RA. Evolution of plant genome architecture. Genome Biol., 2016; 17: 37. doi: 10.1186/s13059-016-0908-1.
  • 15. Juma I, et al. Genetic diversity of avocado from the southern highlands of Tanzania as revealed by microsatellite markers. Hereditas, 2020; 157: 40. doi: 10.1186/s41065-020-00150-0.
  • 16. Tobe H, Yasuda S, Oginuma K. Seed coat anatomy, karyomorphology, and relationships of Simmondsia (Simmondsiaceae). Bot. Mag. Tokyo, 1992; 105: 529–538. doi: 10.1007/Bf02489427.
  • 17. Kyriakidou M, Anglin NL, Ellis D, Tai HH, Stromvik MV. Genome assembly of six polyploid potato genomes. Sci. Data, 2020; 7: 88. doi: 10.1038/s41597-020-0428-4.
  • 18. Kyriakidou M, Tai HH, Anglin NL, Ellis D, Stromvik MV. Current strategies of polyploid plant genome sequence assembly. Front. Plant Sci., 2018; 9: 1660. doi: 10.3389/fpls.2018.01660.
  • 19. Nock CJ, Baten A, Barkla BJ, Furtado A, Henry RJ, King GJ. Genome and transcriptome sequencing characterises the gene space of Macadamia integrifolia (Proteaceae). BMC Genomics, 2016; 17: 937. doi: 10.1186/s12864-016-3272-3.
  • 20. Nock CJ, et al. Chromosome-scale assembly and annotation of the Macadamia genome (Macadamia integrifolia HAES 741). G3 (Bethesda), 2020; 10: 3497–3504. doi: 10.1534/g3.120.401326.
  • 21. Rendon-Anaya M, et al. The avocado genome informs deep angiosperm phylogeny, highlights introgressive hybridization, and reveals pathogen-influenced gene space adaptation. Proc. Natl Acad. Sci. USA, 2019; 116: 17081–17089. doi: 10.1073/pnas.1822129116.
  • 22. Sturtevant D, et al. The genome of jojoba (Simmondsia chinensis): A taxonomically isolated species that directs wax ester accumulation in its seeds. Sci. Adv., 2020; 6: eaay3240. doi: 10.1126/sciadv.aay3240.
  • 23. Belser C, et al. Chromosome-scale assemblies of plant genomes using nanopore long reads and optical maps. Nat. Plants, 2018; 4: 879–887. doi: 10.1038/s41477-018-0289-4.
  • 24. Wang WW, et al. The draft nuclear genome assembly of Eucalyptus pauciflora: a pipeline for comparing de novo assemblies. Gigascience, 2020; 9: giz160. doi: 10.1093/gigascience/giz160.
  • 25. Monat C, et al. TRITEX: chromosome-scale sequence assembly of Triticeae genomes with open-source tools. Genome Biol., 2019; 20: 284. doi: 10.1186/s13059-019-1899-5.
  • 26. Li FW, Harkess A. A guide to sequence your favorite plant genomes. Appl. Plant Sci., 2018; 6: e1030. doi: 10.1002/aps3.1030.
  • 27. Niu Y-F, et al. Genome assembly and annotation of Macadamia tetraphylla. bioRxiv. 2020; doi: 10.1101/2020.03.11.987057.
  • 28. Talavera A, Soorni A, Bombarely A, Matas AJ, Hormaza JI. Genome-wide SNP discovery and genomic characterization in avocado (Persea americana Mill.). Sci. Rep., 2019; 9: 20137. doi: 10.1038/s41598-019-56526-4.
  • 29. Murigneux V, et al. Supporting data for “Comparison of long-read methods for sequencing and assembly of a plant genome”. GigaScience Database. 2020; doi: 10.5524/100812.
  • 30. Sharma P, et al. Supporting data for “Improvements in the Sequencing and Assembly of Plant Genomes”. GigaScience Database. 2021; doi: 10.5524/100906.
GigaByte. 2021 Jun 10;2021:gigabyte24.

Article Submission

Robert Henry
GigaByte.

Assign Handling Editor

Editor: Scott Edmunds
GigaByte.

Editor Assess MS

Editor: Nicole Nogoy
GigaByte.

Curator Assess MS

Editor: Christopher Hunter
GigaByte.

Review MS

Editor: Mile Sikic

Reviewer name and names of any other individual's who aided in reviewer Mile Sikic
Do you understand and agree to our policy of having open and named reviews, and having your review included with the published papers. (If no, please inform the editor that you cannot review this manuscript.) Yes
Is the language of sufficient quality? Yes
Please add additional comments on language quality to clarify if needed
Are all data available and do they match the descriptions in the paper? Yes
Additional Comments
Are the data and metadata consistent with relevant minimum information or reporting standards? See GigaDB checklists for examples (http://gigadb.org/site/guide) Yes
Additional Comments
Is the data acquisition clear, complete and methodologically sound? Yes
Additional Comments
Is there sufficient detail in the methods and data-processing steps to allow reproduction? Yes
Additional Comments
Is there sufficient data validation and statistical analyses of data quality? Yes
Additional Comments
Is the validation suitable for this type of data? Yes
Additional Comments
Is there sufficient information for others to reuse this dataset or integrate it with other data? Yes
Additional Comments In their update to the previous study on the comparison of long read technologies for sequencing and assembly of plant genomes, Sharma et al. presented a follow-up analysis using a newer generation of basecallers for nanopore reads and PacBio HiFi reads. I argue that this study is an important update, but it is not suitable for publication in the current form. My major comments are the following:
1. It is not clear which version of the basecaller the authors used in assemblies related to Table 1 and Table 3.
2. For phased assemblies, it is important to provide information about the size of alternative contigs.
3. In Table 1, it would be great to have results for methods that do not phase assembly (i.e. Flye).
4. There is no explanation why authors use IPA instead of other HiFi assemblers, i.e. hifiasm, which from my experience, perform better than IPA.
5. A sentence related to Table 3, “The quality of the assemblies was more contiguous with less data in each of these cases when HiFi reads were used instead of the earlier continuous long reads (Table 3).” is not clear. Following Table 3, assemblies achieved using long reads have similar or longer N50 and higher BUSCO score. Also, it is not clear which assembler was used for long reads.
Any Additional Overall Comments to the Author
Recommendation Major Revision
GigaByte.

Review MS

Editor: Chao Bian

Reviewer name and names of any other individual's who aided in reviewer Chao Bian
Do you understand and agree to our policy of having open and named reviews, and having your review included with the published papers. (If no, please inform the editor that you cannot review this manuscript.) Yes
Is the language of sufficient quality? No
Please add additional comments on language quality to clarify if needed
Are all data available and do they match the descriptions in the paper? Yes
Additional Comments
Are the data and metadata consistent with relevant minimum information or reporting standards? See GigaDB checklists for examples (http://gigadb.org/site/guide) Yes
Additional Comments
Is the data acquisition clear, complete and methodologically sound? Yes
Additional Comments
Is there sufficient detail in the methods and data-processing steps to allow reproduction? Yes
Additional Comments
Is there sufficient data validation and statistical analyses of data quality? Yes
Additional Comments
Is the validation suitable for this type of data? Yes
Additional Comments
Is there sufficient information for others to reuse this dataset or integrate it with other data? Yes
Additional Comments
Any Additional Overall Comments to the Author
Recommendation Major Revision
GigaByte.

Editor Decision

Editor: Nicole Nogoy
GigaByte. 2021 Jun 10;2021:gigabyte24.

Minor Revision

Robert Henry
GigaByte.

Assess Revision

Editor: Nicole Nogoy
GigaByte.

Re-Review MS

Editor: Mile Sikic

Comments on revised manuscript From five of my comments, the authors answered only one (minor one). I cannot support publication of the manuscript in this form.
GigaByte.

Editor Decision

Editor: Nicole Nogoy
GigaByte. 2021 Jun 10;2021:gigabyte24.

Minor Revision

Robert Henry
GigaByte.

Assess Revision

Editor: Nicole Nogoy
GigaByte.

Re-Review MS

Editor: Mile Sikic

Comments on revised manuscript In this revision, the authors answered some of my comments. My primary concern from the first version of the manuscript is whether the authors use the best possible de novo assemblers. They compare different technologies, and their conclusion relies on achieved results. In their previous paper, they showed that Canu consistently produces a much longer final sequence than other solutions. From the manuscript, it is not clear if they use methods for the removal of haplotypic duplications (ie. purge duplication). I deem that they should test at least one another assembler (ie. Flye) for error-prone reads. Flye is more resilient to haplotypic duplications. Similarly, IPA is rarely used for hifi reads, and most of the authors use hifiasm. Even PacBio uses hifiasm in their analysis. The newest version of Flye and hifiasm are fast assemblers, so I deem their usage will not require significant computational resources. From above, I argue that the authors need to provide more results to support their claims.
GigaByte.

Editor Decision

Editor: Nicole Nogoy
GigaByte. 2021 Jun 10;2021:gigabyte24.

Minor Revision

Robert Henry
GigaByte.

Assess Revision

Editor: Nicole Nogoy
GigaByte.

Final Data Preparation

Editor: Chris Armit
GigaByte.

Editor Decision

Editor: Nicole Nogoy
GigaByte.

Accept

Editor: Scott Edmunds

Comments to the Author None
GigaByte.

Export to Production

Editor: Scott Edmunds
