Sequencing smart: De novo sequencing and assembly approaches for a non-model mammal

Graham J Etherington; Darren Heavens; David Baker; Ashleigh Lister; Rose McNelly; Gonzalo Garcia; Bernardo Clavijo; Iain Macaulay; Wilfried Haerty; Federica Di Palma

doi:10.1093/gigascience/giaa045

GigaScience. 2020 May 12;9(5):giaa045. doi: 10.1093/gigascience/giaa045

PMCID: PMC7216774 PMID: 32396200

Abstract

Background

Whilst much sequencing effort has focused on key mammalian model organisms such as mouse and human, little is known about the relationship between genome sequencing techniques for non-model mammals and genome assembly quality. This is especially relevant to non-model mammals, where the samples to be sequenced are often degraded and of low quality. A key aspect when planning a genome project is the choice of sequencing data to generate. This decision is driven by several factors, including the biological questions being asked, the quality of DNA available, and the availability of funds. Cutting-edge sequencing technologies now make it possible to achieve highly contiguous, chromosome-level genome assemblies, but rely on high-quality high molecular weight DNA. However, funding is often insufficient for many independent research groups to use these techniques. Here we use a range of different genomic technologies generated from a roadkill European polecat (Mustela putorius) to assess various assembly techniques on this low-quality sample. We evaluated different approaches for de novo assemblies and discuss their value in relation to biological analyses.

Results

Generally, assemblies containing more data types achieved better scores in our ranking system. However, when accounting for misassemblies, this was not always the case for Bionano and low-coverage 10x Genomics (for scaffolding only). We also find that the extra cost associated with combining multiple data types is not necessarily associated with better genome assemblies.

Conclusions

The high degree of variability between each de novo assembly method (assessed from the 7 key metrics) highlights the importance of carefully devising the sequencing strategy to be able to carry out the desired analysis. Adding more data to genome assemblies does not always result in better assemblies, so it is important to understand the nuances of genomic data integration explained here in order to obtain value for money when sequencing genomes.

Keywords: polecat, vertebrate, non-model organism, Illumina, chromium, Bionano, assembly, sequencing

Introduction

Starting in 1990, the Human Genome Project used low-throughput, high-cost Sanger sequencing platforms to create the first draft human genome at a cost of US$300 million. Fast-forward 20 years and the cost of sequencing a human genome has decreased to roughly US$1,000. Short-read technologies producing high-throughput, low per-base cost next-generation sequencing (NGS) data mean that genomics is no longer restricted to large sequencing consortiums, and the field has opened up to even the smallest of research groups. The recently formed Vertebrate Genomes Project aims to produce near-gapless, chromosome-scale phased genome assemblies for ∼66,000 extant vertebrate species [1]. The assembly pipeline consists of 60× coverage Pacific Biosciences (PacBio) long-read sequencing, followed by 10x Genomics linked reads, Bionano optical mapping, and Arima Genomics' Hi-C profiles. These long-read technologies provide highly contiguous genome assemblies. Similar consortiums and sequencing initiatives have been formed to sequence a range of target organisms such as Bat1K, Bird10K, Oz Mammals Genomics, and Earth BioGenome Project (including Darwin UK Tree of Life, Colombia EBP, and so forth) [2]. Although such efforts make it possible to achieve highly contiguous, chromosome-level genome assemblies, the cost of generating this amount of data and assembling them is considerable and often only within reach of a few of these consortiums. It is important for smaller independent research groups or initiatives to consider value for money against biological questions as a key factor when planning the generation of genomic sequencing data.

Non-model organisms have the potential to provide new knowledge related to phenotypic and genotypic variation. Through comparative genomics, it is possible to identify how different organisms are related to each other, how they adapt to novel environments, or the genetic basis underlying novel phenotypes. These new findings can be applied to further research, such as in the biomedical and food industries through breeding programs with the development of marker-assisted selection and in conservation biology [3–12].

De novo assembly of endangered species, followed by low-coverage population-level sequencing, provides unprecedented information about the amount of genetic diversity within populations, past and ongoing gene flow between different populations, and the level of inbreeding in small populations [13–17].

However, there are a number of difficulties when working with non-model mammals. First, the genome size is not always known, hampering the assessment of the completeness of the “assembled” genome and of the sequencing depth. Additionally, the availability and quality of the samples used for sequencing non-model organisms are often substandard. Tissue and blood samples are often obtained from wild populations and may need to be acquired from remote locations, lengthening the time between collection and DNA extraction. Another common issue relates to samples that may have been stored in collections such as museums, zoos, and tissue collections and subjected to a number of different preservation methods such as freezing, storage in ethanol, or formalin fixation and paraffin embedding. Many current sequencing technologies rely on high molecular weight DNA with varying optimum molecule lengths (e.g., PacBio HiFi reads 15–20 kb, Bionano >150 kb, and 10x Genomics >50 kb). Degraded DNA, as is commonly observed in samples from wild populations, is usually sub-optimal for use in many advanced sequencing methods. It is therefore difficult, or sometimes impossible, to leverage the full application of these technologies.

Many samples from non-model organisms originate from wild populations that are highly heterozygous, leading to numerous challenges during the assembly step. Allelic differences in a diploid genome generate branches and bubbles in the assembly graph [18]. Even though most graph-based assemblers have functions to search for and remove these structures, high density variation can still make assembly of heterozygous organisms challenging. Conversely, high levels of homozygosity, characteristic of endangered (and typically inbred) species, hamper the efforts of creating phased genome assemblies because the ability to phase haplotypes is dependent on linked sequences spanning polymorphisms. Additionally, non-model organisms vary in their ploidy, chromosome number, repeat content, sequence composition, and GC content, adding further confounding factors to genome assembly.

The European polecat (Mustela putorius, NCBI:txid9668) is a medium-sized carnivore found across Europe and the Middle East. It is purported to be the ancestral species of the domestic ferret (M. putorius furo) [19]. Across most of mainland Europe the polecat is in widespread decline [20]. In the United Kingdom, the European polecat has a chequered history. Persecuted to the verge of extinction in the early 1900s, when it was confined to unmanaged forests in central Wales, it has since seen a population increase and is now found throughout Wales and across much of central, south-western, and eastern England [21].

Here, a roadkill sample of European polecat from the Vincent Wildlife Trust collection (VWT 693) was used to assess short-read and long-range de novo sequencing strategies for non-model mammals. Comparisons between combinations of PCR-free Illumina libraries, Nextera long mate pair (LMP) libraries, 10x Genomics Chromium libraries, and Bionano optical maps are made to assess optimum sequencing and assembly strategies.

Sequencing Technologies

Short-read sequencing

The market leader in short-read high-throughput NGS is Illumina [22]. Machines produce read lengths of 100 bp and above, and a single Illumina Novaseq run is currently capable of generating 3,000 Gb of read data. An advantage to Illumina sequencing is the generation of paired-end (PE) reads, in which the sequence from both ends of each DNA molecule is determined. As the input molecules are of an approximate known length, the acquisition of PE data provides a greater amount of information. Additionally, using a PCR-free library preparation removes bias in genomic coverage previously incorporated by a PCR amplification step in older library preparation procedures [23]. Although requiring input DNA of an order of magnitude greater than PCR-amplified libraries, PCR-free libraries are expected to capture unbiased coverage of the genome, usually reflected by an increased size of the assembly and less duplication in single-copy regions of the genome compared with PCR-amplified libraries [24]. They also provide superior coverage in GC-rich regions of the genome, enabling access to regions that were previously difficult to sequence [25]. PCR-free Illumina sequencing requires a minimum of 2 μg of genomic DNA (gDNA) at a minimum concentration of 35 ng/μL in 60 μL.

Long mate pair sequencing

Long DNA fragments up to ∼40 kb can be sequenced to provide PE reads that bridge long repeats, thus producing longer contiguous genome assemblies as well as characterizing structural variants. Under the Nextera LMP protocol [26], a transposase enzyme attaches 19-bp biotinylated adaptors to both ends of each long DNA fragment. The DNA is then circularized, where the biotinylated ends become joined. The circularized DNA is then fragmented and biotin enrichment is used to process the fragments containing the adaptors that mark the junction. During sequencing, reads are produced from both ends of a fragment, resulting in inward-facing reads that read toward and through the adaptors. Twelve libraries covering a wide range of jump sizes can be constructed using this protocol, thus ensuring production of the best LMP libraries from a given DNA sample. For Illumina Nextera LMP sequencing the Nextclip tool can then be used to trim adaptors and de-duplicate reads [27]. Nextera LMP sequencing requires a minimum of 4 μg of gDNA for the 12 libraries, at a minimum concentration of 30 ng/μL in 300 μL.

10x Genomics

The Chromium system from 10x Genomics uses oil emulsion and multiple displacement amplification to ligate short molecular barcodes to reads from each fragment of DNA, followed by PE Illumina sequencing [28]. Each fragment receives its own unique barcode, and hence reads with the same barcodes represent clusters of reads from the same region in the genome. These “linked reads” provide the long-range information missing from standard Illumina sequencing and are then used to assemble phased assemblies de novo. 10x Chromium libraries require a minimum of 1.25 ng of high molecular weight gDNA at a concentration of 1 ng/μL. In order to take full advantage of the technology, gDNA should be >50 kb in length.

Optical mapping (Bionano)

Bionano technology produces optical maps of nicking/restriction enzyme sites across kilobase-long stretches of DNA molecules, providing a high-throughput tool for ordering and orienting contigs of physical maps and validation of genome assemblies [29]. Bionano optical maps can be compared to in silico restriction maps produced from an NGS genome assembly for validation purposes, to improve contiguity by assigning the shorter NGS scaffolds to the longer optical maps, and identifying structural variants. A total of 600 ng of raw gDNA at a concentration of 35–200 ng/μL is typically enough DNA to generate ∼120 μL of labelled molecules—enough to provide adequate coverage for analysis of a human-sized genome (3 Gb).

Genome contiguity has an effect on what analyses can be achieved (Table 1), so it is important to appreciate the power and limitations of each sequencing strategy and technology.

Table 1:

Information regarding the possible resolution for various de novo genome sequencing technologies

Assembly resolution	Paired-end	Paired-end + long mate pair	Bionano	10x Genomics
Gene content	Yes	Yes	No	Yes
Gene order	Yes	Yes	No	Yes
Repeat spanning	No	Yes	Yes	Yes
Structural variants	No	Yes	Yes	Yes
Haplotype resolution (phased genomes)	No	No	No	Yes

When planning a genome assembly project, it is important to understand the strengths and limits of the various sequencing strategies available.

Materials and Methods

Sequencing

Using the same sample of a roadkill European polecat stored in 100% ethanol, 2 lanes of PCR-free Illumina HiSeq2500 250-bp PE reads (77× coverage), 2 Illumina LMP libraries of size 5 kb (27× coverage) and 7 kb (9× coverage), and 4 lanes of 150-bp PE 10x Genomics Chromium (totalling 85× coverage) using an Illumina HiSeq2500 were generated (Illumina HiSeq 2500 System, RRID:SCR_016383).

We extracted DNA from 4 European polecat samples (all from the VWT collection) and analysed the molecule distribution using an Agilent TapeStation (Supplementary Fig. S3). Sample VWT 693 had the highest concentration of the longest molecules where the distribution of molecule lengths peaked at just under 60 kb and was used for all further sequencing. For this sample 50% of the molecules were >51 kb. The mean molecule length of the remaining 50% of molecules (i.e., those <51 kb) was 15 kb. This was not of good enough quality to generate Bionano data (recommended >150 kb). Because the domestic ferret and its polecat ancestor diverged only ∼2,000 years ago and they fully interbreed, we do not expect substantial divergence and structural differences between the 2 species [19, 30–34]. Therefore, the original sample used for the domestic ferret genome assembly [35] was obtained and 1 chip of Bionano Genomics optical genome maps was generated (Saphyr, RRID:SCR_017992). This was used to create Bionano hybrid-scaffold assemblies for the European polecat genomes assembled with the previously mentioned short-read data, using the Bionano Solve software [36]. We generated 664 Gb of Bionano data, with an N50 size of 185 kb and a contig coverage of 261×. Of this, 40% of the molecules aligned back to the Bionano de novo assembly, leaving an effective coverage of 110×. A more detailed description of the library preparation methods can be found in the Supplementary Methods, and the protocols are also available in protocols.io [37].

Assemblies

Ten different genome assemblies were generated as summarized in Fig. 1 (with additional information in Supplementary Table S1), and detailed as follows:

Figure 1: Ten different assembly strategies using a variety of different data types: PCR-free Illumina short-read (“PCR-free”), long mate pair (“LMP”), 10x Genomics Chromium library (“10x”), and Bionano Genomics optical maps (“Bionano”). The blue-boxed assemblies all originate from the same PCR-free w2rap assembly (A1), and the black-boxed assemblies all originate from the same 10x Genomics Supernova assembly (A3). Information in parentheses refers to assembly software pipeline, and assembly numbers are annotated below each assembly.

Assembly A1 (w2rap)

The PCR-free Illumina reads from polecat were assembled using the w2rap-contigger [38]. The w2rap-contigger (w2rap) originated from a fork of the popular DISCOVAR de novo program (Discovar, RRID:SCR_016755) [39], and then a number of improvements were made to reduce memory usage and processing time, enhance parameterization, improve repeat resolution, and increase accuracy and contiguity. It also benefits from requiring fewer computational resources than other popular assemblers such as ALLPATHS-LG (ALLPATHS-LG, RRID:SCR_010742) [40]. w2rap is predominantly a contig assembler: reads are used to construct an assembly graph, which is then traversed to create a contig assembly. A final step involves using the PE information to scaffold contigs not joined during the initial assembly process. Using w2rap, 4 different assemblies were created using a range of k-mers (k = 180, 200, 224, and 240), and simple assembly statistics were calculated to examine contiguity across the assemblies (for all contigs and filtering for contigs > 1 kb). From these statistics, the assembly constructed with k = 224 was selected as the final assembly.

Assembly A2 (w2rap + lmp)

We analysed the distribution and coverage from the 12 Nextera LMP libraries and selected the 5- and 7-kb libraries owing to their tight distribution and higher coverage, when compared to the other 10 libraries. Using SSPACE (SSPACE, RRID:SCR_005056) [41], the 5- and 7-kb Nextera LMPs were used to scaffold the w2rap assembly from assembly A1. For all SSPACE LMP assemblies the reads were used only for scaffolding and not for contig extension.

Assembly A3 (10x)

The 10x Genomics Chromium library was assembled using the 10x Genomics Supernova software [28], using default parameters. Default parameters automatically cap the number of reads to 1,200 M, which, after trimming and filtering, resulted in an effective coverage of 52.18× with a mean molecule length of 38.42 kb. Similar to w2rap, Supernova creates an initial contig assembly but then scaffolds using the molecule-specific barcode information in the reads to join contigs known to be from the same molecule [28]. The output style of the resulting assembly was “pseudohap,” which creates 1 haplotype per scaffold at random.

Assembly A4 (10x + lmp)

SSPACE was used with the 5- and 7-kb Nextera LMPs to scaffold the 10x assembly generated in assembly A3. As in assembly A2, the LMP reads were used only for scaffolding and not for contig extension.

The Bionano data were assembled de novo and then were used to position and orient scaffolds from previous assemblies, creating a Bionano hybrid-scaffold as follows:

Assembly A5 (w2rap + bionano). Bionano hybrid-scaffolding with w2rap assembly (Assembly A1).
Assembly A6 (w2rap + lmp + bionano). Bionano hybrid-scaffolding with the w2rap + lmp assembly (Assembly A2).
Assembly A7 (10x + bionano). Bionano hybrid-scaffolding with the 10x assembly (Assembly A3).
Assembly A8 (10x + lmp + bionano). Bionano hybrid-scaffolding with the 10x + lmp assembly (Assembly A4).

Finally, the 30× coverage of 10x Genomics data (from the same data generated for assembly A3, henceforth referred to as “10x-scaffolding”) was used to scaffold 2 assemblies using the scaff10x program from Phusion2 [42], as follows:

Assembly A9 (w2rap + 10x). The w2rap-only assembly (Assembly A1) with 10x-scaffolding.
Assembly A10 (w2rap + lmp + bionano + 10x). The w2rap + lmp + bionano assembly (Assembly A6), with 10x-scaffolding.

Analyses

Genome contiguity

For each genome assembly, a number of assembly statistics, such as contig N50, scaffold N50, the number of scaffolds greater than given lengths, and scaffolded genome size were calculated. To calculate contig N50, any scaffolded contigs that were joined by ≥25 Ns were broken. The percentage of the genome contained in scaffolds >25 kb (the average length of a vertebrate gene [43]) and the number of scaffolds >39 Mb (the length of the smallest chromosome in a recent chromosome-scale assembly of a closely related mustelid [44]) were also calculated.
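As an illustration only (this is not the code used in the study), the two contiguity calculations described above can be sketched in Python. The function names `n50` and `split_into_contigs` and the toy sequences are assumptions made for the example; the 25-N threshold is the one stated in the text.

```python
import re


def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of all assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def split_into_contigs(scaffold, min_gap=25):
    """Break a scaffold sequence at every run of >= min_gap Ns."""
    pieces = re.split(r"[Nn]{%d,}" % min_gap, scaffold)
    return [piece for piece in pieces if piece]


# Toy scaffolds (hypothetical): the first contains a 30-N gap, so it is
# broken; the second contains a 10-N gap, so it is left intact.
scaffolds = [
    "ACGT" * 50 + "N" * 30 + "ACGT" * 25,
    "ACGT" * 20 + "N" * 10 + "ACGT" * 5,
]
scaffold_n50 = n50([len(s) for s in scaffolds])
contigs = [c for s in scaffolds for c in split_into_contigs(s)]
contig_n50 = n50([len(c) for c in contigs])
print(scaffold_n50, contig_n50)
```

In this toy case only the first scaffold is split, which is why the contig N50 falls below the scaffold N50.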

k-mer analysis

The K-mer Analysis Toolkit (KAT) version 2.3.4 [45] was used to examine k-mers across reads and assemblies. KAT enables users to assess levels of errors, bias, and contamination at various stages of the assembly process. Using the KAT “comp” program with a k-mer size of 31, k-mers in the PCR-free Illumina reads were compared with those in the resulting assemblies (omitting the Bionano assemblies because this technology adds negligible sequence content), and for each assembly, the k-mer spectra were plotted.
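The kind of comparison that underlies these spectra can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not KAT itself: it counts k-mers on the forward strand only, uses a toy k of 4 rather than 31, and the names `count_kmers` and `spectra_matrix` are hypothetical.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every k-mer (forward strand only) across a list of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def spectra_matrix(reads, assembly, k):
    """Map (multiplicity in reads, copy number in assembly) to the number
    of distinct read k-mers falling in that cell."""
    read_counts = count_kmers(reads, k)
    assembly_counts = count_kmers(assembly, k)
    matrix = Counter()
    for kmer, multiplicity in read_counts.items():
        matrix[(multiplicity, assembly_counts.get(kmer, 0))] += 1
    return matrix


# Toy data (hypothetical): two identical reads that are assembled, plus one
# read whose k-mers are absent from the assembly.
reads = ["ACGTAC", "ACGTAC", "TTTTGG"]
assembly = ["ACGTAC"]
matrix = spectra_matrix(reads, assembly, k=4)
print(dict(matrix))
```

Cells with an assembly copy number of 0 correspond to read k-mers missing from the assembly, and cells with a copy number of 1 to read k-mers present once in the assembly.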

Gene content

BUSCO (v3.0.2) was used to search for single-copy orthologs in each assembly (BUSCO, RRID:SCR_015008) [46]. BUSCO reports the number of single-copy orthologs discovered in the input assembly and categorizes them as “complete,” “single-copy,” “multi-copy,” or “fragmented.” Mammalia_odb9 was used for the “lineage” parameter in BUSCO and “human” for the Augustus species parameter.

Repeat content

To examine repeat content and compare how repeats were resolved in each genome assembly, RepeatMasker v 4.0.7 with library dc20170127-rb20170127 (RepeatMasker, RRID:SCR_012954) [47] was used (with default values) to identify repeat families in each assembly, using all Carnivora-specific repeats. As well as identifying repeat sequences, the mean deletion, insertion, and divergence for each family were also calculated, as well as the mean values overall. Mean divergence is calculated as “mismatches/(matches + mismatches)” between queries and matches for all repeats.

Assembly errors and misassemblies

REAPR (REAPR, RRID:SCR_017625) [48] was used to evaluate the accuracy of each genome assembly by separately mapping PCR-free PE and LMP reads back to each assembly. The fragment coverage distribution (FCD) error for each assembly was calculated. FCD is the fragment depth from only the reads that are mapped to a given base of a fragment. The FCD error is the difference between the theoretical and observed FCD and is used to identify assembly errors in the regions containing a run of high FCD errors. Mapping information such as the FCD and insert size distribution is analysed to locate misassemblies as well as more local per-base accuracies. The “smalt map” option in REAPR was used, which uses SMALT (SMALT, RRID:SCR_005498) [49] to align the PCR-free PE and LMP reads back to each assembly, utilizing the option to map PE reads independently. This ensures that read pairs are not artificially forced to map as proper pairs within a given insert size. REAPR was then used to identify perfectly and uniquely mapped reads in the PE PCR-free alignment, to accurately call error-free bases in the assembly, and further used the LMP reads to identify features consistent with misassemblies. Error-free bases have ≥5× perfect and unique coverage of PE reads. REAPR summary scores were calculated for each assembly by multiplying the number of error-free bases by the square of the REAPR broken scaffold N50 length, and then dividing by the original scaffold N50, i.e., “No. error-free bases * (broken N50²/assembly N50).” This test was first used to evaluate genome assemblies in the Assemblathon series [43] and rewards local accuracy, overall contiguity, and correct scaffolding of an assembly. To independently assess the performance of each data type for scaffolding, the numbers of REAPR breaks were compared between the w2rap-only assembly (A1) and that assembly scaffolded with 1 data type, namely, LMP (A2), Bionano (A5), and 10x (A9). The same analyses were also performed using the 10x-only assembly (A3).
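The summary score formula given above can be written as a short Python function. This is a sketch for illustration: the function name `reapr_summary_score` and the input values are hypothetical and are not results from this study.

```python
def reapr_summary_score(error_free_bases, broken_n50, original_n50):
    """No. error-free bases * (broken N50 squared / original N50)."""
    return error_free_bases * (broken_n50 ** 2 / original_n50)


# Hypothetical assembly: 2 Gb of error-free bases and a 2-Mb scaffold N50.
# With no REAPR breaks the broken N50 equals the original N50; if breaking
# halves the N50, the score falls to a quarter.
unbroken = reapr_summary_score(2_000_000_000, 2_000_000, 2_000_000)
halved = reapr_summary_score(2_000_000_000, 1_000_000, 2_000_000)
print(unbroken, halved)
```

Because the broken N50 enters as a square, scaffolds that REAPR has to break are penalized more heavily than a simple N50 comparison would suggest.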

Value for money

Cost is a huge factor in research and ultimately affects decisions made regarding the technologies used. A metric was created to reflect “value for money” by estimating the cost of each assembly and the N50 achieved. This metric is provided as N50/$1,000 and calculated for contig N50, scaffold N50, and the REAPR broken scaffold N50.
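A minimal sketch of this metric in Python follows. The function name `n50_per_1000` and the cost figure are hypothetical and chosen only to show the arithmetic; they are not the costs estimated in this study.

```python
def n50_per_1000(n50_bp, cost):
    """N50 (in bp) obtained per 1,000 currency units spent."""
    return n50_bp / (cost / 1000)


# Hypothetical example: a 2.62-Mb scaffold N50 from an assembly costing 5,000.
value = n50_per_1000(2_620_000, 5_000)
print(value)
```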

Ranking assemblies

Each assembly was given a rank score according to its position in each of the 7 metrics. The top-placed assembly that performed best in a given metric was given a rank score of 10, the second-placed assembly was given a rank score of 9, and so on, down to the bottom-placed assembly, which was given a rank score of 1.

Assemblies were ranked for the following metrics:

Scaffold N50
REAPR broken scaffold N50
Contig N50
Percentage of genome represented by scaffolds >25 kb
Single-copy BUSCO orthologs
REAPR summary score
REAPR broken scaffold N50/$1,000
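The rank scoring can be sketched in Python as follows, using the scaffold N50 values (in Mb) reported in Table 2 as input. The function name `rank_scores` is hypothetical, and tied values are not treated specially here because the text does not state how ties were handled.

```python
def rank_scores(values):
    """Score assemblies from 1 (lowest value) up to len(values) (highest
    value), assuming that a higher value is better for the metric."""
    ordered = sorted(values, key=values.get)
    return {name: position + 1 for position, name in enumerate(ordered)}


# Scaffold N50 (Mb) for each assembly, taken from Table 2.
scaffold_n50_mb = {
    "A1": 0.30, "A2": 2.62, "A3": 5.26, "A4": 10.33, "A5": 0.85,
    "A6": 5.73, "A7": 10.84, "A8": 21.01, "A9": 5.58, "A10": 14.05,
}
ranks = rank_scores(scaffold_n50_mb)
print(ranks)
```

With 10 assemblies, the best-performing assembly for this metric (A8) receives a rank score of 10 and the worst (A1) a rank score of 1.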

z-scores

z-scores were used to combine scores from datasets with different means, ranges, and standard deviations and have the benefit of rewarding/penalizing those assemblies with exceptionally high/low scores in any 1 metric. The influence of each of the 7 metrics was tested by removing each metric in turn and recalculating the z-score for each assembly. These recalculations were then used to produce error bars for the final z-score figure, by providing the minimum and maximum z-score that might have occurred if any combination of 6 metrics was used.
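The leave-one-metric-out procedure can be sketched in Python. Several details are assumptions made for the example because they are not stated in the text: per-metric z-scores are summed to give a combined score, the population standard deviation is used, and the metric values are toy numbers. The function names are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize one metric across assemblies (population SD assumed)."""
    data = list(values.values())
    mu, sigma = mean(data), pstdev(data)
    return {name: (value - mu) / sigma for name, value in values.items()}


def combined_z(metrics):
    """Sum the per-metric z-scores of each assembly (summing is assumed)."""
    totals = {}
    for values in metrics.values():
        for name, z in z_scores(values).items():
            totals[name] = totals.get(name, 0.0) + z
    return totals


def leave_one_out_range(metrics):
    """Minimum and maximum combined z-score when each metric is dropped in turn."""
    alternatives = [
        combined_z({m: v for m, v in metrics.items() if m != dropped})
        for dropped in metrics
    ]
    names = next(iter(metrics.values()))
    return {n: (min(a[n] for a in alternatives), max(a[n] for a in alternatives))
            for n in names}


# Toy metrics (hypothetical) for three assemblies X, Y, and Z.
metrics = {
    "m1": {"X": 1, "Y": 2, "Z": 3},
    "m2": {"X": 10, "Y": 20, "Z": 30},
    "m3": {"X": 3, "Y": 2, "Z": 1},
}
combined = combined_z(metrics)
error_bars = leave_one_out_range(metrics)
print(combined, error_bars)
```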

Results

Assembly contiguity and connectivity

Assembly statistics

After assembling the 10 genomes as described in Fig. 1, a number of metrics were calculated for each assembly to examine contiguity and connectivity, measured by the lengths and distribution of the scaffolds within each assembly (Table 2). The mean assembly size for all genomes was 2.52 Gb, slightly larger than the 2.41-Gb assembly of the domestic ferret [35]. The 10x-based assemblies tended to have smaller genome assembly sizes (2.46–2.50 Gb), with the larger assemblies (2.47–2.66 Gb) being from the PCR-free Illumina-based assemblies.

Table 2:

Genome assembly statistics (for sequences >1 kb) for all assemblies

No.	Assembly	No. scaffolds >100 kb (%)	No. scaffolds >1 Mb (%)	No. scaffolds >39 Mb	% genome ≥25 kb	Longest scaffold (Mb)	Contig N50 (kb)	Scaffold N50 (Mb)	Assembly size (Gb)
A1	w2rap	6,290 (10.7)	176 (0.3)	0	94.9	2.52	182.93	0.30	2.47
A2	w2rap + lmp	1,680 (4.9)	682 (2.0)	0	94.8	15.65	271.16	2.62	2.60
A3	10x	1,023 (3.9)	501 (1.9)	0	93.3	32.15	207.98	5.26	2.46
A4	10x + lmp	669 (4.2)	346 (2.2)	3	94.7	58.16	210.72	10.33	2.50
A5	w2rap + bionano	4,361 (7.7)	626 (1.1)	0	93.8	6.89	182.93	0.85	2.66
A6	w2rap + lmp + bionano	990 (3.0)	468 (1.4)	0	94.8	34.30	271.16	5.73	2.60
A7	10x + bionano	604 (2.3)	336 (1.3)	3	97.5	46.79	207.98	10.84	2.48
A8	10x + lmp + bionano	409 (2.6)	218 (1.4)	9	97.6	104.38	210.72	21.01	2.50
A9	w2rap + 10x	1,097 (2.4)	467 (1.0)	0	97.6	35.44	182.93	5.58	2.47
A10	w2rap + lmp + bionano + 10x	447 (1.4)	235 (0.7)	6	97.5	65.13	271.16	14.05	2.60

Percentage scores refer to percentage of scaffolds greater than a given threshold. The quantity 39 Mb is the size of the smallest chromosome in a recent chromosome-scale assembly of a closely related mustelid and hence an indication of the number of chromosome-sized scaffolds. A more thorough list of genome statistics can be found in Supplementary Table S2.

Contig N50 for the assemblies varied between 183 and 271 kb. Scaffold N50 for the assemblies varied between 300 kb and 21 Mb. The increase from contig N50 to scaffold N50 varied greatly (Fig. 2). The addition of LMP data to an initial short-read assembly had a varying effect. On the relatively fragmented w2rap assembly (A1), the addition of LMP reads led to an almost 9-fold increase of the scaffold N50, but adding LMPs to the more contiguous 10x assembly (A3) resulted in a 2-fold increase. This is not unexpected because the N90 value for the 10x assembly (800 kb) is 20 times greater than that of the w2rap assembly (40 kb); hence, the chance of mate pairs spanning the same contig and not adding to the contiguity of the assembly is much higher in the already contiguous 10x assembly. The addition of Bionano data to assemblies leads to a similar scaffold N50 increase across all assemblies, namely, between a 2.0- and 2.8-fold increase. Finally, 10x-scaffolding data were added to scaffold assembly A1 (w2rap) and assembly A6 (w2rap + lmp + bionano). As might be expected, the effect of 10x scaffolding data on less contiguous genomes was greater than that on more contiguous genomes. There was an 18.6-fold increase in N50 between assembly A1 (w2rap) and assembly A9 (w2rap + 10x), whereas the increase in N50 between assembly A6 (w2rap + lmp + bionano) and assembly A10 (w2rap + lmp + bionano + 10x) was less contrasting at 2.5-fold.
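The fold increases quoted above follow directly from the scaffold N50 values in Table 2, as the short Python check below illustrates. The helper name `fold_increase` is hypothetical; the input values are those reported in the table.

```python
# Scaffold N50 (Mb) for each assembly, taken from Table 2.
scaffold_n50_mb = {
    "A1": 0.30, "A2": 2.62, "A3": 5.26, "A4": 10.33, "A5": 0.85,
    "A6": 5.73, "A7": 10.84, "A8": 21.01, "A9": 5.58, "A10": 14.05,
}


def fold_increase(before, after):
    """Ratio of scaffold N50 after scaffolding to scaffold N50 before."""
    return scaffold_n50_mb[after] / scaffold_n50_mb[before]


# Each (before, after) pair differs only by the addition of Bionano data.
bionano_pairs = [("A1", "A5"), ("A2", "A6"), ("A3", "A7"), ("A4", "A8")]
bionano_folds = [round(fold_increase(b, a), 1) for b, a in bionano_pairs]

lmp_on_w2rap = round(fold_increase("A1", "A2"), 1)
lmp_on_10x = round(fold_increase("A3", "A4"), 1)
tenx_on_w2rap = round(fold_increase("A1", "A9"), 1)
tenx_on_a6 = round(fold_increase("A6", "A10"), 1)
print(bionano_folds, lmp_on_w2rap, lmp_on_10x, tenx_on_w2rap, tenx_on_a6)
```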

Figure 2: Log-scale lengths of contig N50 (blue) and scaffold N50 (red) of all 10 assemblies, sorted (left to right) by scaffold N50.

Generally speaking, assemblies created with 1 or 2 data types, where 1 of the data types was Illumina short reads, showed the smallest increase from contig N50 to scaffold N50 (Fig. 2).

Assembly errors and misassemblies

REAPR was used to assess the accuracy of the polecat genome assemblies by looking at low-quality regions, breakpoints (Table 3), and summary scores (Fig. 3). The percentage of error-free bases for each assembly varied between 76.05 and 85.9%. All the w2rap-based assemblies were on the low end of the scale (76.05–81.09%), whilst 10x-based assemblies were on the high end (84.65–85.9%). Conversely, there was a trend for w2rap-based assemblies to be less affected by misassemblies (excluding those with 10x scaffolding). Their REAPR broken N50 size reduced between 2 and 64%, whilst 10x-based assemblies reduced in N50 size between 68 and 91%. A similar pattern is seen with the number of FCD errors, where all w2rap-based assemblies (bar A10, with 10x scaffolding) have <8,214 FCD errors and all 10x-based assemblies have ≥9,095 errors.

Table 3:

REAPR statistics showing the percentage of error-free bases in the assembly, N50s before and after breaking at breakpoints, the percentage decrease in scaffold N50 after breaking, and the fragment coverage distribution (FCD) errors including errors across gaps

No.	Assembly name	% error-free	Original N50 (Mb)	REAPR broken N50 (Mb)	% reduction	FCD errors
A1	w2rap	80.83	0.3	0.29	2	6,065
A2	w2rap + lmp	79.10	2.61	1.13	57	8,213
A3	10x	85.90	5.26	1.69	68	11,379
A4	10x + lmp	85.35	10.33	1.86	82	9,095
A5	w2rap + bionano	76.05	0.85	0.52	38	4,523
A6	w2rap + lmp + bionano	78.38	5.73	2.06	64	7,392
A7	10x + bionano	84.65	10.84	2.00	82	13,068
A8	10x + lmp + bionano	84.75	21.00	1.86	91	11,531
A9	w2rap + 10x	81.09	5.58	0.57	90	7,601
A10	w2rap + lmp + bionano + 10x	77.80	14.05	1.75	88	9,488

Figure 3: REAPR summary scores for each polecat assembly. REAPR summary scores were calculated for each assembly by multiplying the number of error-free bases by the square of the REAPR broken scaffold N50 length and then dividing by the original scaffold N50.

Finally, the performance of each technology was independently assessed for scaffolding by comparing the number of REAPR breaks between the w2rap assembly (A1) and those scaffolded with only 1 data type (LMP, Bionano, and 10x scaffolding) (Table 4). After accounting for the 2,756 breaks introduced by REAPR in the w2rap-only assembly (A1), it was found that Bionano (assembly A5) clearly performed best, containing only 729 more breaks than the original assembly (A1). Conversely, LMP (6,843 more breaks) and 10x scaffolding (7,353 more breaks) data types had ≥9 times more breaks introduced by REAPR than Bionano. A comparison was made between the number of breaks (5,252) in the 10x assembly (A3) to the 10x + lmp assembly (A4) and the 10x + bionano assembly (A7) (Table 5). A similar pattern as above was found, with the LMP assembly having 2,785 more breaks than the 10x assembly but with the Bionano assembly having only 61 more breaks, again demonstrating the accuracy of Bionano for scaffolding.

Table 4:

Comparison of the number of breaks introduced by REAPR for each of the technologies used to scaffold the w2rap-only assembly (A1)

No.	Assembly name	Assembled sequences	No. sequences after breaking	REAPR breaks
A1	w2rap	929,245	932,001	2,756
A2	w2rap + lmp	887,887	897,486	9,599 (6,843)
A5	w2rap + bionano	927,316	930,801	3,485 (729)
A9	w2rap + 10x	916,014	926,123	10,109 (7,353)

The number of breaks in parentheses represents the number of breaks after accounting for the 2,756 breaks introduced into the comparison assembly (A1).
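The baseline correction used in Table 4 is a simple subtraction, shown here in Python with the values from the table. The variable names are hypothetical.

```python
# REAPR breaks from Table 4; A1 is the unscaffolded w2rap baseline.
baseline_breaks = 2756
breaks = {"LMP (A2)": 9599, "Bionano (A5)": 3485, "10x (A9)": 10109}

# Breaks attributable to each scaffolding data type.
net_breaks = {name: n - baseline_breaks for name, n in breaks.items()}

# Net breaks relative to Bionano, the data type with the fewest.
relative_to_bionano = {
    name: round(n / net_breaks["Bionano (A5)"], 1)
    for name, n in net_breaks.items()
}
print(net_breaks, relative_to_bionano)
```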

Table 5:

Comparison of the number of breaks introduced by REAPR for each of the technologies used to scaffold the 10x assembly (A3)

No.	Assembly name	Assembled sequences	No. sequences after breaking	REAPR breaks
A3	10x	26,253	31,505	5,252
A4	10x + lmp	16,018	24,055	8,037 (2,785)
A7	10x + bionano	25,834	31,147	5,313 (61)

The number of breaks in parentheses represents the number of breaks after accounting for the 5,252 breaks introduced into the comparison assembly (A3).

Assembly completeness

k-mer content

“KAT comp” [45] was used to compare k-mers in the Illumina PCR-free reads with k-mers in the non-Bionano assemblies (A1–A4 and A9). “KAT plot” was then used to visualize the output (Fig. 4 and Supplementary Fig. S1). The plots all show a similar distribution of k-mers. The black distribution at the start of the x-axis represents sequencing errors in reads, and its increased width represents an increased number of errors in the reads. The k-mers in these reads have not been incorporated into the final assembly. The extension of the black line along the x-axis (up to a k-mer multiplicity of 40 on the x-axis) represents collapsed haplotypes, where k-mers from 1 side of a bubble in the assembly graph have been removed to construct a linear path through the graph. Any extension of the black line along the x-axis into the main red distribution (>40 k-mer multiplicity) represents a small number of high-copy k-mers in the reads missing from the assembly. The red area in all graphs represents a normal distribution of k-mers found in the reads and occurring once in the assembly. The absence of any additional colours, representing k-mers appearing once in the reads but multiple times in the assembly, reflects the presence of only unique content throughout the assembly, with k-mers in the reads occurring no more than once in the assembly.

Figure 4: KAT k-mer plots comparing k-mer content of Illumina PCR-free reads with w2rap assembly (A1). The black area of the graph represents the distribution of k-mers present in the reads but not in the assembly, and the red area represents the distribution of k-mers present in the reads and once in the assembly.

Although all of the assemblies were compared to the PCR-free Illumina short reads, virtually the same distribution of k-mers was observed in every comparison, indicating that the different read sets and their resulting assemblies have almost identical k-mer content. The KAT plots involving 10x assemblies (Supplementary Fig. S1C and D) are also characterized by some high-copy read k-mers missing from the assemblies. This suggests that the minimum size of contigs included in the final assembly (1 kb) may be too high. This may also explain the slightly smaller assembly sizes obtained from the 10x-based assemblies when compared to the w2rap-based assemblies (Table 2).