Abstract
With the increasing affordability and accessibility of genome sequencing data, de novo genome assembly is an important first step to a wide variety of downstream studies and analyses. Therefore, bioinformatics tools that enable the generation of high-quality genome assemblies in a computationally efficient manner are essential. Recent developments in long-read sequencing technologies have greatly benefited genome assembly work, including scaffolding, by providing long-range evidence that can aid in resolving the challenging repetitive regions of complex genomes. ntLink is a flexible and resource-efficient genome scaffolding tool that utilizes long-read sequencing data to improve upon draft genome assemblies built from any sequencing technologies, including the same long reads. Instead of using read alignments to identify candidate joins, ntLink utilizes minimizer-based mappings to infer how input sequences should be ordered and oriented into scaffolds. Recent improvements to ntLink have added important features such as overlap detection, gap-filling and in-code scaffolding iterations. Here, we present three basic protocols demonstrating how to use each of these new features to yield highly contiguous genome assemblies, while still maintaining ntLink’s proven computational efficiency. Further, as we illustrate in the alternate protocols, the lightweight minimizer-based mappings that enable ntLink scaffolding can also be utilized for other downstream applications, such as misassembly detection. With its modularity and multiple modes of execution, ntLink has broad benefit to the genomics community, from genome scaffolding and beyond. ntLink is an open-source project and is freely available from https://github.com/bcgsc/ntLink.
Basic Protocol 1: ntLink scaffolding using overlap detection
Basic Protocol 2: ntLink scaffolding with gap-filling
Basic Protocol 3: Running in-code iterations of ntLink scaffolding
Alternate Protocol 1: Generating long-read to contig mappings with ntLink
Alternate Protocol 2: Using ntLink mappings for genome assembly correction with Tigmint-long
Support Protocol 1: Installing ntLink
Keywords: De novo genome assembly, long reads, scaffolding, minimizer mapping
INTRODUCTION
Generating high-quality de novo genome assemblies for both model and non-model organisms opens the door to a plethora of important downstream studies, including annotation, structural variant analysis and population studies, to name a few (Logsdon et al., 2020). In recent years, long-read genome sequencing technology from Oxford Nanopore Technologies PLC (ONT, Oxford, UK) and Pacific Biosciences of California, Inc. (PacBio, Menlo Park, CA) have gained in popularity. Long reads range in length from kilobases to megabases, orders of magnitude greater than the typical 150–300 bp lengths of short reads, albeit with a higher base error rate than typical short read sequencing technologies (e.g. Illumina). This length distribution enables the long reads to span over the numerous repetitive elements present in complex genomes, therefore allowing for repeats to be resolved in a draft assembly. With the continued improvement and accessibility of long-read genome sequencing, there is great opportunity to harness the rich information of this data type for facilitating and improving de novo genome assemblies.
While current state-of-the-art de novo long-read assemblers such as Flye (Kolmogorov et al., 2019) are generating highly contiguous assemblies, we observe that they still do not fully exhaust or necessarily correctly use the long-range evidence inherent in long reads (Coombe et al., 2021). Therefore, there is great value in stand-alone genome assembly scaffolders, such as LINKS (Warren et al., 2015), OPERA-LG (Gao et al., 2016), LRScaf (Qin et al., 2019), and our alignment-free scaffolder ntLink (Coombe et al., 2021). ntLink is a lightweight, minimizer-based long-read scaffolding tool that was previously published as a central step in the correction and scaffolding pipeline LongStitch (Coombe et al., 2021). More recently, ntLink was integrated as a key step in our de novo long-read assembler GoldRush (Wong et al., 2022). ntLink uses long-read evidence to further contiguate draft assemblies from any sequencing technology. Instead of using alignments of long reads to the draft assembly, like many state-of-the-art long-read scaffolding tools, ntLink uses minimizer mappings (leveraging particular subsets or sketches of the sequence k-mers), which we have shown to be effective for assembly scaffolding as it confers considerable computational benefit (Coombe et al., 2021).
ntLink is a flexible toolkit which can be run in various modes depending on the desired user output (Figure 1), with multiple new functionalities introduced since the published version (Coombe et al., 2021). For each basic protocol, the input files provided by the users include the long reads and a draft assembly to be improved, and the main output file is a scaffolded assembly in FASTA format. The core functionality of ntLink uses the long-read evidence and generated minimizers to infer how the input contigs (draft assembly sequences) should be ordered and oriented, and subsequently performs these joins. The basic protocols differ in additional steps that are performed to more accurately join contigs together, fill ambiguous sequences (i.e. gap-filling), and enhance the final contiguity of the assembly. In earlier versions of ntLink, contigs were joined together end-to-end naively, whether they overlapped in the genomic space or not. The overlap detection feature of ntLink identifies cases where adjacent contigs overlap, and trims them to remove the overlapping regions prior to concatenation (Basic Protocol 1, Figure 1A). Furthermore, scaffolding assemblies generally introduces gaps, or ambiguous nucleotide bases (“N”s), between joined contigs. The gap-filling feature of ntLink instead fills gaps with bases from a representative read supporting the join (Basic Protocol 2, Figure 1B). Finally, running additional iterations of ntLink can further improve the contiguity of the final output assembly. To take advantage of these contiguity gains efficiently, ntLink can run in-code scaffolding iterations (also termed rounds hereon) powered by a coordinate liftover module (Basic Protocol 3, Figure 1C).
Although ntLink was developed as a scaffolding tool, the initial minimizer-guided step of mapping long reads to the draft assembly can also benefit other bioinformatics utilities that require approximate mapping information, including the ntEdit+Sealer polishing component of the GoldRush assembler (J. X. Li et al., 2022; Paulino et al., 2015; Warren et al., 2019; Wong et al., 2022) and Tigmint-long (Coombe et al., 2021; Jackman et al., 2018), an assembly correction tool also integrated in the LongStitch pipeline. In Alternate Protocols 1 and 2, we highlight the usage of ntLink as a mapping tool, and, as an example of using ntLink mappings in a different downstream application, showcase how these mappings can be utilized in Tigmint-long.
With its numerous modes and available protocols, ntLink is a flexible and wide-reaching tool for improving de novo genome assemblies and helping researchers better leverage their long-read sequencing data.
STRATEGIC PLANNING
Hardware
ntLink is a command-line tool, which can be run on 64-bit Linux or MacOS operating systems with sufficient available RAM (random-access memory). The amount of RAM and disk space required for running ntLink varies with the draft genome size and coverage of the long-read dataset. See Table 1 for the peak memory and disk space usage for representative ntLink scaffolding runs using four different species with varying genome sizes. The wall-clock time for each ntLink run is also included in Table 1. For the basic and alternate protocols, each step uses 5 threads, but this can be adjusted depending on the specifications of the user’s machine. The parameter t controls the number of threads used (where applicable), and is shown in each corresponding command.
Table 1.
Species | Approximate genome size (Mbp) | Fold coverage of long reads | Peak memory usage (GB) | Peak disk space usage (GB) | Wall-clock time (min) |
---|---|---|---|---|---|
Caenorhabditis elegans | 100 | 93 | 0.9 | 0.9 | 6.5 |
Oryza sativa | 373 | 62 | 3.4 | 2.8 | 21.4 |
Solanum lycopersicum | 824 | 72 | 5.4 | 5.6 | 43.4 |
Homo sapiens | 3,055 | 50 | 20.8 | 25.0 | 136.0 (2h16m) |
Default ntLink parameters were used for each example run, except for Solanum lycopersicum, which used k=64, w=250. One round of ntLink, including the gap-filling option, was run with each dataset.
Software
ntLink is available from the conda package manager for a more straightforward installation. Users can also install ntLink from the provided source code on GitHub. Detailed instructions for installing ntLink are available in the Support Protocol 1.
Files
Each ntLink protocol requires input long sequencing reads (in FASTA or FASTQ format), and an input draft genome assembly to be scaffolded (in FASTA format). Both files can be in single-line or multi-line (standard) FASTA format.
BASIC PROTOCOL 1
ntLink scaffolding using overlap detection
Basic Protocol 1 describes running ntLink to scaffold an input draft assembly using long reads, with overlap detection enabled. The overlap detection functionality identifies when adjacent contigs (draft assembly sequences) overlap in genomic space, and trims the contigs to ensure the sequences are merged without duplicating this overlapping sequence.
ntLink leverages the long-range information inherent in long-read sequencing data to scaffold an input draft assembly. First, ntLink maps the long reads to a draft assembly using a minimizer-based approach. Long-read mappings that span multiple contigs provide evidence that these contigs should be joined together. These mappings are also used to estimate gap sizes between the contigs. After ntLink determines the sequences of oriented contigs to be joined together as scaffolds, the overlap detection feature identifies adjacent contigs that have a putative overlap (indicated by a negative estimated gap size), and resolves the overlaps. Finally, the ordered and oriented scaffolds are output in FASTA format.
Necessary Resources:
Hardware
This protocol requires a 64-bit Linux or MacOS operating system with sufficient RAM and available disk space (See Strategic Planning for more information).
Software
The steps of Basic Protocol 1 describe the installation of all required tools. For easier installation of the dependencies, see Support Protocol 1 to set-up a conda environment called “ntLink_env”.
Files
The input files for ntLink are long genome sequencing reads and a draft genome assembly. The long sequencing reads can be provided in FASTA or FASTQ format, either compressed with gzip or uncompressed. The input draft assembly to be scaffolded should be in FASTA format (multi-line or single-line).
Sample Files
To demonstrate the usage of ntLink in Basic Protocol 1, we will scaffold a C. elegans draft assembly with a corresponding C. elegans Oxford Nanopore long-read dataset. The C. elegans long reads are available from the Sequence Read Archive (SRA) under accession SRR10028109. The draft assembly is a Flye (Kolmogorov et al., 2019) assembly of the same C. elegans long reads, and is available from https://doi.org/10.5281/zenodo.7526395. To assess the genome assemblies generated from ntLink, a C. elegans N2 (Bristol strain) reference genome (accession GCA_000002985.3) will be used. There are detailed steps in the protocol to guide the user in downloading these files.
Protocol steps with step annotations:
-
Install ntLink
See Support Protocol 1 for detailed instructions and options for installing ntLink.
-
Install protocol-specific dependencies curl, QUAST, and SRA toolkit.
Option A: Use conda package manager- If Option A of Support Protocol 1 was used to install ntLink, the protocol-specific dependencies can be installed in the same conda environment.
conda activate ntlink_env conda install -y -c bioconda -c conda-forge curl quast ‘sra-tools>=2.10.2’
Option B: Install from source-
Install curlMany servers will already have curl installed. To check if curl is available:
which curl
If you see the path to a curl installation, curl is already installed and you can continue to part (ii) to install QUAST. Otherwise, follow the next steps.- Go to https://curl.se/download.html, and find the tarball for the latest released version. Version 7.86.0 is used below to illustrate the steps. Change your terminal’s current directory to the location where you would like curl installed, download the tarball, extract the tarball and change your directory into the downloaded curl directory.
mkdir -p ~/bin/curl && cd ~/bin/curl wget https://curl.se/download/curl-7.86.0.tar.gz tar zvxf curl-7.86.0.tar.gz cd curl-7.86.0/
- Compile the source code
mkdir curl_install ./configure --with-ssl --prefix=$(pwd)/curl_install make make install
- Add the curl installation directory to your PATH
export PATH=$(pwd)/curl_install/bin:$PATH
- Install QUAST (Mikheenko et al., 2018)
- Go to https://github.com/ablab/quast/releases, download the latest release and extract the tarball. Version 5.2.0 is shown as an example in the following commands.
mkdir -p ~/bin/quast_install && cd ~/bin/quast_install curl -L --output quast-5.2.0.tar.gz https://github.com/ablab/quast/releases/download/quast_5.2.0/quast-5.2.0.tar.gz tar xvzf quast-5.2.0.tar.gz cd quast-5.2.0/ ./install.sh cd ..
- Add the QUAST installation directory to your PATH
export PATH=$(pwd)/quast-5.2.0:$PATH
- Install the SRA toolkit
- Go to https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit, and find the pre-built binary appropriate for your system. Download the archive and extract it. We use the centOS linux release v3.0.2 as an example below.
mkdir -p ~/bin/sratools && cd ~/bin/sratools curl -L --output sratoolkit.3.0.2-centos_linux64.tar.gz https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.2/sratoolkit.3.0.2-centos_linux64.tar.gz tar xvzf sratoolkit.3.0.2-centos_linux64.tar.gz
- Add the sratoolkit installation directory to your PATH
export PATH=$(pwd)/sratoolkit.3.0.2-centos_linux64/bin:$PATH
-
Navigate to the directory where you want to run the ntLink tests, and download the sample long-read data.
Note that downloading the long reads will take approximately 1 hour.cd ~/ntLink/test fasterq-dump SRR10028109
Once the command has finished, the reads will be available in the file SRR10028109.fastq. These reads are ~93-fold coverage C. elegans Oxford Nanopore long reads.
-
Download the sample draft long-read assembly.
This is a Flye (Kolmogorov et al., 2019) assembly of the long reads downloaded in the previous step.curl -L --output celegans_flye.fa https://zenodo.org/record/7526395/files/celegans_flye.fa
-
Download a reference genome assembly for the C. elegans Bristol N2 strain.
This assembly will be used in a later step when assessing the final assembly scaffolds using QUAST.curl -L --output celegans_reference.fa.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/002/985/GCA_000002985.3_WBcel235/GCA_000002985.3_WBcel235_genomic.fna.gz
-
Run ntLink.
The specified values of k, w and overlap are the default values, but are included in the command to demonstrate how to set these parameters using the ntLink Makefile driver script.ntLink scaffold target=celegans_flye.fa reads=SRR10028109.fastq k=32 w=100 t=5 overlap=True
-
Check the logs and output files to ensure that the run executed successfully.
If ntLink completed successfully, the ntLink terminal output should contain the message “Done ntLink! Final post-ntLink scaffolds can be found in: celegans_flye.fa.k32.w100.z1000.ntLink.scaffolds.fa”. In addition, the final output scaffolds file “celegans_flye.fa.k32.w100.z1000.ntLink.scaffolds.fa” should be in the current working directory.
- Assess the final output scaffolds using abyss-fac (de novo approach) (Jackman et al., 2017) and QUAST (reference-based approach) (Mikheenko et al., 2018). The abyss-fac utility is part of the ABySS suite of tools, which is installed as a dependency of ntLink. See Table 2 for detailed descriptions of the assembly statistics and Table 3 for the expected statistics generated from these steps.
-
Run abyss-fac using the input draft genome assembly and the post-ntLink genome assembly.
abyss-fac -G100e6 --count-ambig celegans_flye.fa celegans_flye.fa.k32.w100.z1000.ntLink.scaffolds.fa
The “-G” option specifies the genome size, which is approximately 100 Mbp for C. elegans. The “--count-ambig” option counts any ambiguous bases (ex. “N”s) in the output statistics.After running this command, you will see that the NG50 length increases after ntLink scaffolding, and the number of sequences decreases. -
Run QUAST to assess the input draft genome assembly and the post-ntLink genome assembly using the previously downloaded C. elegans reference assembly.
quast -t 5 -o quast_ntlink_bp1 -r celegans_reference.fa.gz --fast --large celegans_flye.fa celegans_flye.fa.k32.w100.z1000.ntLink.scaffolds.fa
After running QUAST, all output files will be written to the quast_ntlink_bp1 directory. The main statistics will be written to report.tsv. You will see that certain statistics, such as “# contigs (>= 0 bp)” and “NG50” will be the same as calculated by abyss-fac. Important reference-based statistics to look at include the “NGA50” and “# misassemblies” (Table 2). After scaffolding, we want to minimize the number of contigs and misassemblies, while maximizing the NG50 length and NGA50 length. Note that the QUAST executable will be quast.py if the tool was installed from source as described in Step 2B.
-
Table 2.
Statistic | de novo or reference-based | Description |
---|---|---|
n | de novo | The number of sequences in the assembly. |
NG50 length | de novo | At least half of the genome size is assembled in pieces of the NG50 length and larger. In other words, if you add up the lengths of the contigs that are the NG50 length and larger, it will sum to at least half of the expected genome size. |
NGA50 length | reference-based | Analogous to the NG50 length, but uses alignment blocks instead of contig lengths for the calculation. Therefore, it summarizes both the contiguity and correctness of the assemblies. |
Misassemblies | reference-based | Number of large-scale errors in the assembly as compared to the supplied reference. These QUAST extensive misassemblies can be classified into 3 categories: relocations, inversions and translocations. |
Scaffold NG50/NGA50 length | de novo (NG50) and reference-based (NGA50) | The “Scaffold NG50” and “Scaffold NGA50” lengths (as described above) are computed directly on the full scaffold lengths. |
Contig NG50/NGA50 length | de novo (NG50) and reference-based (NGA50) | Prior to calculating the NG50 or NGA50 lengths, the assembly sequences are broken at ambiguous codes (“N”s). By default, QUAST will break the sequences at regions of >= 10 “N”s when calculating these contig statistics. |
# N’s per 100 kbp | de novo | Number of ambiguous bases (“N”s) in the assembly per 100 kbp of sequence. |
Table 3.
Assembly | Number of sequences >= 3 kbp | NG50 length (Mbp) | NGA50 length (Mbp) | Number of misassemblies |
---|---|---|---|---|
Baseline assembly | 63 | 3.6 | 2.3 | 75 |
Baseline assembly + ntLink | 33 | 6.8 | 3.7 | 66 |
BASIC PROTOCOL 2
ntLink scaffolding with gap-filling
Basic Protocol 2 describes how to run ntLink scaffolding with a gap-filling step. In this protocol, instead of simply introducing ambiguous bases, ntLink fills gaps with sequence from the input long-read sequencing data. The initial steps of ntLink are executed as described in Basic Protocol 1. Then, an additional step is performed which fills-in the ntLink-induced scaffold gaps with bases from a representative read that supports the given join (the read that has the highest average number of minimizers in common with the incident contigs). Following this gap-filling step, the scaffolds are output in FASTA format. Because the gaps are filled with raw long-read sequence, we recommend polishing the output assembly using long-read polishing tools such as ntEdit+Sealer (J. X. Li et al., 2022; Wong et al., 2022), Racon (Vaser et al., 2017) or Medaka (Medaka: Sequence Correction Provided by ONT Research, n.d.) following the ntLink scaffolding and gap-filling.
Necessary Resources:
Hardware
This protocol requires a 64-bit Linux or MacOS operating system with sufficient RAM and available disk space (See Strategic Planning for more information).
Software
The following software must be installed and available in your PATH environment variable:
SRA toolkit (v3.0.0+): (https://github.com/ncbi/sra-tools)
curl: (https://curl.se/)
Python 3.7+: (https://www.python.org/)
ntLink (v1.3.7+): (https://github.com/bcgsc/ntLink)
ABySS (v2.3.0+): (https://github.com/bcgsc/abyss)
QUAST (v5.2.0+): (https://github.com/ablab/quast)
For more information about installing these dependencies, see Support Protocol 1 and Basic Protocol 1, steps 1–2. Instructions for creating a conda environment that can be used for installing protocol-specific dependencies, as described below, are available in Support Protocol 1.
Files
The input files for ntLink are long genome sequencing reads and a draft genome assembly. The long sequencing reads can be provided in FASTA or FASTQ format, either compressed with gzip or uncompressed. The input draft assembly to be scaffolded should be in FASTA format (multi-line or single-line).
Sample Files
The sample files used for this protocol are the same as used in Basic Protocol 1.
Protocol steps with step annotations:
-
Install the required software.
Installing protocol-specific dependencies minimap2 and Racon:
Option A: Using conda package manager- If ntLink was installed using Option A of Support Protocol 1, minimap2 and Racon can be installed in the same environment.
conda activate ntlink_env conda install -y -c bioconda -c conda-forge minimap2 racon
Option B: Installing from source- Install minimap2
- For Linux, minimap2 provides pre-compiled binaries. Go to https://github.com/lh3/minimap2/releases to find the most recent pre-compiled binary. Here, we show downloading the v2.24 binary as an example:
mkdir -p ~/bin/minimap2 && cd ~/bin/minimap2 curl -L https://github.com/lh3/minimap2/releases/download/v2.24/minimap2-2.24_x64-linux.tar.bz2 | tar -jxvf - cd minimap2–2.24_x64-linux
- For MacOS, review the minimap2 dependencies (https://github.com/lh3/minimap2), download the most recent release tarball from https://github.com/lh3/minimap2, extract it and compile the code. We show downloading and compiling the v2.24 release as an example.
mkdir -p ~/bin/minimap2 && cd ~/bin/minimap2 curl -L --output minimap2-2.24.tar.bz2 https://github.com/lh3/minimap2/releases/download/v2.24/minimap2-2.24.tar.bz2 tar -jxvf minimap2-2.24.tar.bz2 cd minimap2-2.24 make
- Append the path to the minimap2 installation to your PATH environment variable.
export PATH=$(pwd):$PATH
- Install Racon (See https://github.com/lbcb-sci/racon for information about dependencies), and add the path to the Racon installation to your PATH environment variable.
git clone https://github.com/lbcb-sci/racon && cd racon mkdir build && cd build cmake -DCMAKE_BUILD_TYPE=Release .. make export PATH=$(pwd)/bin:$PATH
Download the sample data. As the sample data for this protocol is the same as used for Basic Protocol 1, please see steps 3–5 of Basic Protocol 1 for full details about downloading the long reads, draft genome assembly and reference genome.
-
Change to the directory with the downloaded data, and run ntLink with the gap-filling option specified. The ntLink steps are powered by the ntLink Makefile.
cd ~/ntLink/test ntLink scaffold gap_fill target=celegans_flye.fa reads=SRR10028109.fastq k=32 w=100 t=5
Note that the k (k-mer size) and w (window size) values specified are the default values, but are included in the command to illustrate how they can be set when running ntLink. The target gap_fill being specified in the command triggers the gap-filling stage after the initial scaffolding steps of ntLink.
-
Check the logs and output files to ensure that the run executed successfully.
If ntLink completed successfully, this message will be found in the terminal output: “Done ntLink! Final post-ntLink and gap-filled scaffolds can be found in: celegans_flye.fa.k32.w100.z1000.ntLink.scaffolds.fa”. In addition, the final output scaffolds file “celegans_flye.fa.k32.w100.z1000.ntLink.scaffolds.fa” will be in the current working directory. The intermediate scaffold file before gap-filling is “celegans_flye.fa.k32.w100.z1000.stitch.abyss-scaffold.fa”.
- Polish the gap-filled ntLink scaffolds. For illustrative purposes, we demonstrate polishing using Racon, but any long-read polishing tool can be utilized. This is an optional step in the pipeline, and can be bypassed if the integration of raw long reads (with a lower base quality) into the draft assembly is not a concern.
- First, align the long reads to the draft assembly, and output the alignments in SAM format. This step will take approximately 20 minutes.
minimap2 -a -t 5 -x map-ont -o celegans_flye.fa.k32.w100.z1000.ntLink.scaffolds.SRR10028109.sam celegans_flye.fa.k32.w100.z1000.ntLink.scaffolds.fa SRR10028109.fastq
- Next, run Racon, supplying the SAM file generated in step 5a. This step will take approximately 60 minutes.
racon -u -t 5 SRR10028109.fastq celegans_flye.fa.k32.w100.z1000.ntLink.scaffolds.SRR10028109.sam celegans_flye.fa.k32.w100.z1000.ntLink.scaffolds.fa 1> celegans_flye.fa.k32.w100.z1000.ntLink.scaffolds.racon-polished.fa
- Check the Racon output files to ensure that the run executed successfully. The final, polished assembly will be in the file “celegans_flye.fa.k32.w100.z1000.ntLink.scaffolds.racon-polished.fa”, and the final terminal output message in a successful Racon run will include “[racon::Polisher::] total =”, along with the runtime.
- Assess the final, polished output scaffolds using abyss-fac (reference-free) and QUAST (reference-based). See Table 2 for detailed descriptions of the assembly statistics generated, and Table 4 for the expected results.
-
Run abyss-fac using the input draft genome assembly, the ntLink intermediate scaffolds file before gap-filling, and the final output scaffolds file after gap-filling and polishing.
abyss-fac -G100e6 --count-ambig celegans_flye.fa celegans_flye.fa.k32.w100.z1000.stitch.abyss-scaffold.fa celegans_flye.fa.k32.w100.z1000.ntLink.scaffolds.racon-polished.fa
See Basic Protocol 1, Step 8a for detailed information describing the abyss-fac output. -
Run QUAST to assess the input draft genome assembly, the ntLink intermediate scaffolds file before gap-filling, the output scaffolds after gap-filling and the final output scaffolds file after gap-filling and polishing.
quast -t 5 -o quast_ntlink_bp2 -r celegans_reference.fa.gz --fast --large --split-scaffold celegans_flye.fa celegans_flye.fa.k32.w100.z1000.stitch.abyss-scaffold.fa celegans_flye.fa.k32.w100.z1000.ntLink.scaffolds.fa celegans_flye.fa.k32.w100.z1000.ntLink.scaffolds.racon-polished.fa
See Basic Protocol 1, Step 8b for detailed information about the QUAST output. In addition to the QUAST statistics described in the previous protocol, for this protocol we are also interested in distinguishing between “Scaffold NG50/NGA50” and “Contig NG50/NGA50”, which are available from the QUAST output in the quast_ntlink_bp2 directory. See Table 2 for detailed explanations of these statistics. When the --split-scaffold option is specified for QUAST, it will output the statistics for the full input assembly (“Scaffold” statistics), and the assembly after breaking the sequences at regions of >= 10 Ns (“Contig” statistics, “_broken” added to filename). Therefore, the “Contig NG50/NGA50” statistics are a measure of contiguity as well as the number and distribution of gaps in the assembly. Furthermore, the QUAST statistic “# N’s per 100 kbp” gives a direct measure of the number of ambiguous bases in the assembly. With efficient gap-filling, the “Contig” statistics will become closer to the “Scaffold” statistics, and the “# N’s per 100 kbp” will decrease.
-
Table 4.
Assembly | Number of sequences >= 3 kbp | Scaffold NG50 length (Mbp) | Contig NG50 length (Mbp) | Scaffold NGA50 length (Mbp) | Contig NGA50 length (Mbp) | Number of misassemblies | # N’s per 100 kbp |
---|---|---|---|---|---|---|---|
Flye baseline | 63 | 3.6 | 3.6 | 2.3 | 2.3 | 75 | 0.00 |
Flye + ntLink (before gap-filling) | 33 | 6.8 | 4.0 | 3.7 | 2.5 | 66 | 14.36 |
Flye + ntLink | 33 | 6.8 | 6.8 | 3.9 | 3.9 | 64 | 0.02 |
Flye + ntLink + Racon | 33 | 6.9 | 6.9 | 3.0 | 3.0 | 58 | 0.01 |
BASIC PROTOCOL 3
Running in-code iterations of ntLink scaffolding
Basic Protocol 3 describes how to run multiple iterations, or rounds, of ntLink using a liftover-based approach. When scaffolding assemblies, the goal is to achieve the highest possible contiguity without sacrificing the correctness of the assembly. While running a single round of ntLink, as described in Basic Protocols 1 and 2, is very effective in improving upon a draft genome assembly from any technology, further gains are possible with additional rounds of ntLink. Using the in-code round capability of ntLink allows a user to maximize the contiguity of the final assembly without needing to manually run ntLink multiple times. To avoid re-mapping the reads at each round, ntLink lifts over the mapping coordinates from the input draft assembly to the output post-ntLink scaffolds, which can then be used for the next round of ntLink. The same process can be repeated as many times as needed, thus enabling multiple rounds of ntLink to be powered by a single instance of long-read mapping.
Necessary Resources:
Hardware
This protocol requires a 64-bit Linux or MacOS operating system with sufficient RAM and available disk space (See Strategic Planning for more information).
Software
The following software must be installed and available in your PATH environment variable:
SRA toolkit (v3.0.0+): (https://github.com/ncbi/sra-tools)
curl: (https://curl.se/)
Python 3.7+: (https://www.python.org/)
ntLink (v1.3.7+): (https://github.com/bcgsc/ntLink)
ABySS (v2.3.0+): (https://github.com/bcgsc/abyss)
QUAST (v5.2.0+): (https://github.com/ablab/quast)
For more information about installing these dependencies, see Support Protocol 1 and Basic Protocol 1, steps 1–2.
Files
The input files for ntLink are long genome sequencing reads and a draft genome assembly. The long sequencing reads can be provided in FASTA or FASTQ format, either compressed with gzip or uncompressed. The input draft assembly to be scaffolded should be in FASTA format (multi-line or single-line).
Sample Files
The sample files used for this protocol are the same as used in Basic Protocol 1.
Protocol steps with step annotations:
-
Download the sample data. The sample data is the same as for Basic Protocols 1 and 2.
For detailed instructions describing downloading the sample data, see Basic Protocol 1, steps 3–5.
-
Run 3 rounds of ntLink scaffolding.
Change into a new directory, and create soft links so that the input files are accessible in the current working directorycd ~/ntLink/test mkdir -p run_rounds && cd run_rounds ln -s ../celegans_flye.fa && ln -s ../SRR10028109.fastq ln -s ../celegans_reference.fa.gz
Option A: Run rounds of ntLink without gap-fillingntLink_rounds run_rounds target=celegans_flye.fa reads=SRR10028109.fastq k=32 w=100 t=5 rounds=3 dev=True
Option B: Run rounds of ntLink with gap-fillingntLink_rounds run_rounds_gaps target=celegans_flye.fa reads=SRR10028109.fastq k=32 w=100 t=5 rounds=3 dev=True
The dev=True option will retain all intermediate files. Although this is useful to be able to see all the file types generated by ntLink for working through this protocol, this option can be omitted for most runs. When omitted, some intermediate files will be automatically deleted to save disk space.
-
Check the logs and output files to ensure that the ntLink run executed successfully.
After the ntLink command has completed, check the terminal output for this final message, which indicates a successful run: “Done ntLink rounds! Final scaffolds found in celegans_flye.fa.k32.w100.z1000.ntLink.3rounds.fa”. This message also indicates the FASTA file which contains the final, scaffolded assembly sequences.
-
Use abyss-fac (de novo approach) and QUAST (reference-based approach) to assess the genome assembly after each round of ntLink scaffolding, and compare the results to the initial baseline assembly. See Table 2 for detailed descriptions of the assembly statistics generated in these steps, and Figure 2 for a summary of the expected assembly statistics.
-
Reference-free analysis of the ntLink output scaffolds using abyss-fac.If Option A was followed in Step 2:
abyss-fac --count-ambig -G100e6 celegans_flye.fa celegans_flye.fa.k32.w100.z1000.ntLink.fa celegans_flye.fa.k32.w100.z1000.ntLink.ntLink.fa celegans_flye.fa.k32.w100.z1000.ntLink.3rounds.fa
If Option B was followed in Step 2:abyss-fac --count-ambig -G100e6 celegans_flye.fa celegans_flye.fa.k32.w100.z1000.ntLink.gap_fill.fa celegans_flye.fa.k32.w100.z1000.ntLink.ntLink.gap_fill.fa celegans_flye.fa.k32.w100.z1000.ntLink.3rounds.fa
See Basic Protocol 1, Step 8a for detailed information describing the abyss-fac output. -
Reference-based analysis of the ntLink output scaffolds using QUAST.If Option A was followed in Step 2:
quast -t 5 -o quast_ntlink_bp3 -r celegans_reference.fa.gz --fast --large --split-scaffold celegans_flye.fa celegans_flye.fa.k32.w100.z1000.ntLink.fa celegans_flye.fa.k32.w100.z1000.ntLink.ntLink.fa celegans_flye.fa.k32.w100.z1000.ntLink.3rounds.fa
If Option B was followed in Step 2:quast -t 5 -o quast_ntlink_bp3 -r celegans_reference.fa.gz --fast --large --split-scaffold celegans_flye.fa celegans_flye.fa.k32.w100.z1000.ntLink.gap_fill.fa celegans_flye.fa.k32.w100.z1000.ntLink.ntLink.gap_fill.fa celegans_flye.fa.k32.w100.z1000.ntLink.3rounds.fa
See Basic Protocol 1, Step 8b and Basic Protocol 2, Step 6b for detailed information about the QUAST output. The QUAST output will be written to the directory named quast_ntlink_bp3. Note that if QUAST was installed from source, the executable will be named quast.py.
Note that the final scaffolds file name will be the same whether Option A or Option B was followed, but the names of the files from intermediate rounds differ slightly.
-
ALTERNATE PROTOCOL 1
Generating long-read to contig mappings with ntLink
Although ntLink is most commonly used as a scaffolding tool, the minimizer-based mapping functionality that enables assembly scaffolding can also be run in isolation. In any mode, ntLink first maps the input long reads to the input draft genome. When using ntLink in the default scaffolding mode, these mappings are parsed to infer evidence that supports ordering and orienting contigs into scaffolds. However, this mapping information can also be simply output to a file in a standard format and used for other downstream applications. In Alternate Protocol 1, we demonstrate the mapping mode of ntLink, which outputs the mappings in a standard (Pairwise mApping Format) PAF format.
Necessary Resources:
Hardware
This protocol requires a 64-bit Linux or MacOS operating system with sufficient RAM and available disk space (See Strategic Planning for more information).
Software
The following software must be installed and available in your PATH:
SRA toolkit (v3.0.0+): (https://github.com/ncbi/sra-tools)
curl: (https://curl.se/)
Python 3.7+: (https://www.python.org/)
ntLink (v1.3.7+): (https://github.com/bcgsc/ntLink)
For more information about installing ntLink and other dependencies, please see detailed instructions in Basic Protocol 1, steps 1–2. Instructions for creating a conda environment that can be used for installing protocol-specific dependencies, as described below, are available in Support Protocol 1.
Files
The input files for mapping with ntLink are long genome sequencing reads and a draft genome assembly. The long sequencing reads can be provided in FASTA or FASTQ format, either compressed with gzip or uncompressed. The input draft assembly to be scaffolded should be in FASTA format (multi-line or single-line).
Sample Files
In this protocol, we will map the same Oxford Nanopore C. elegans long reads as used in the basic protocols to a short-read C. elegans ABySS (Jackman et al., 2017) assembly. The ABySS assembly is available from https://zenodo.org/record/7526395/files/celegans_abyss.fa.
Protocol steps with step annotations:
-
Install the protocol-specific dependency, miller.
Option A: Use conda package manager- If Option A of Support Protocol 1 was used to install ntLink, miller can be installed in the same conda environment.
conda activate ntlink_env conda install -y -c conda-forge miller
Option B: Install from source- Go to the miller releases page (https://github.com/johnkerl/miller/releases) and find the pre-built binary that is appropriate for your system.
- Download the binary from the releases page, and extract the compressed tarball. A Linux pre-built binary is shown for illustration purposes, but the path to any pre-built miller binary can be used in this step.
curl -L --output miller_download.tar.gz https://github.com/johnkerl/miller/releases/download/v6.5.0/miller-6.5.0-linux-amd64.tar.gz tar xvzf miller_download.tar.gz
- Append the path to the miller installation to your PATH environment variable.
export PATH=$(pwd)/miller-6.5.0-linux-amd64/:$PATH
-
Download the sample long reads. This file is the same as used for Basic Protocols 1, 2 and 3.
For detailed instructions describing downloading the sample long reads, see Basic Protocol 1, step 3.
- Download the sample draft ABySS (Jackman et al., 2017) short-read assembly.
curl -L --output celegans-abyss.fa https://zenodo.org/record/7526395/files/celegans_abyss.fa
- Run ntLink to map the sample long reads to the draft assembly. Ensure that the downloaded sample data files are in your current working directory.
ntLink pair target=celegans-abyss.fa reads=SRR10028109.fastq t=5 sensitive=True paf=True
-
Check the logs and output files to ensure that the run executed successfully.
If the ntLink mapping completed successfully, the terminal output messages from ntLink should finish with a time stamp and “DONE!”. Furthermore, a PAF-formatted mapping file called “celegans-abyss.fa.k32.w100.z1000.paf” should be in your current working directory.
- Assess the mapping file output from ntLink. See Table 5 for the expected mapping statistics.
- To make the generation of summary statistics more straightforward with miller, add column labels to the PAF file.
cat celegans-abyss.fa.k32.w100.z1000.paf | mlr --tsv --implicit-csv-header label read,read_len,read_start,read_end,strand,contig,contig_len,contig_start,contig_end,num_minimizers,len_mapping,mapping_qual > celegans-abyss.fa.paf.mlr.tsv
- Count the total number of mappings of the long reads to the query contigs.
mlr --tsv stats1 -a count -f read celegans-abyss.fa.paf.mlr.tsv
- Calculate the average mapping block length.
cat celegans-abyss.fa.paf.mlr.tsv | mlr --tsv stats1 -a mean -f len_mapping
- Calculate the average number of read mappings per draft assembly contig.
cat celegans-abyss.fa.paf.mlr.tsv | mlr --tsv cut -f read,contig then uniq -g read,contig then stats1 -a count -f contig -g contig then stats1 -a mean -f contig_count
Table 5.
Total number of read mappings | 622,975 |
Average mapping block length (bp) | 5,647.7 |
Average number of distinct mapped reads per contig | 105.2 |
ALTERNATE PROTOCOL 2
Using ntLink mappings for genome assembly correction with Tigmint-long
As described in Alternate Protocol 1, the mapping functionality in ntLink can be used to inform scaffolding, the most common use of ntLink, or separately to provide mapping information that can be used by other downstream pipelines. One such alternate application is Tigmint-long (Coombe et al., 2021), a de novo genome assembly correction tool which utilizes information in long reads to detect and cut at putative misassemblies. In the default mode, Tigmint-long simulates pseudo-linked reads from the long reads. This involves breaking the long reads into tiles, which represent short-read fragments, then generating paired-end reads from the fragments. Each read pair from the same long read is assigned the same barcode, adhering to the expected format for linked reads. These reads are then mapped to the draft assembly using minimap2 (H. Li, 2018), and these mappings are parsed to look for regions of the draft assembly that are not well-supported by the reads. However, as only approximate mappings are required, ntLink mapping can be used in place of minimap2. As ntLink uses more streamlined mapping logic, the reads do not need to be broken into pseudo-linked reads prior to mapping, thus eliminating a step in the pipeline. The output of Tigmint-long is a contigs file in FASTA format, where the sequences are broken at putative misassemblies.
Hardware
This protocol requires a 64-bit Linux or MacOS operating system with sufficient RAM and available disk space (See Strategic Planning for more detail).
Software
The following software must be installed and available in your PATH:
SRA toolkit (v3.0.0+): (https://github.com/ncbi/sra-tools)
curl: (https://curl.se/)
Python 3.7+: (https://www.python.org/)
ntLink (v1.3.7+): (https://github.com/bcgsc/ntLink)
QUAST (v5.2.0+): (https://github.com/ablab/quast)
For more information about installing ntLink and other dependencies, please see detailed instructions in Basic Protocol 1, steps 1–2. Instructions for creating a conda environment that can be used for installing protocol-specific dependencies, as described below, are available in Support Protocol 1.
Files
The input files for ntLink are long genome sequencing reads and a draft genome assembly. The long sequencing reads can be provided in FASTA or FASTQ format, either compressed with gzip or uncompressed. The input draft assembly to be corrected should be in FASTA format (multi-line or single-line).
Protocol steps with step annotations:
-
Install the protocol-specific dependency Tigmint.
Option A: Use conda package manager- If Option A of Support Protocol 1 was used to install ntLink, Tigmint can be installed in the same conda environment.
conda activate ntlink_env conda install -y -c bioconda -c conda-forge tigmint ‘samtools>=1.10’
Option B: Install from source- Consult the README in the Tigmint GitHub repository (https://github.com/bcgsc/tigmint) to ensure that the required dependencies are installed.
- Go to the releases page for Tigmint (https://github.com/bcgsc/tigmint/releases) and find the most recent release tarball. Download and extract this tarball in the directory where you would like Tigmint to be installed. To demonstrate the required commands, Tigmint v1.2.9 is shown below, but the URL can be substituted for any later release of Tigmint.
curl -L --output tigmint-1.2.9.tar.gz https://github.com/bcgsc/tigmint/releases/download/v1.2.9/tigmint-1.2.9.tar.gz tar xvzf tigmint-1.2.9.tar.gz cd tigmint-1.2.9
- Compile the required binaries
cd src make cd ../..
- Append the path to the Tigmint installation to your PATH environment variable.
export PATH=$(pwd)/tigmint-1.2.9/bin:$PATH
-
Download the sample C. elegans long reads and reference genome. These files are the same as used in Basic Protocols 1–3.
For detailed instructions describing downloading this sample data, see Basic Protocol 1, steps 3 and 5.
-
Download the sample draft C. elegans ABySS short-read assembly. This draft assembly FASTA is the same as used in Alternate Protocol 1.
For detailed instructions describing downloading the ABySS short-read assembly, see Alternate Protocol 1, step 3.
- Run Tigmint-long on the draft C. elegans ABySS assembly using ntLink mapping and the downloaded long reads to detect and cut at putative misassemblies. Ensure that the downloaded sample data files are in your current working directory.
tigmint-make tigmint-long draft=celegans-abyss reads=SRR10028109 mapping=ntLink t=5 span=2
-
Check the logs and output files from Tigmint-long to ensure that the run executed successfully.
A successful run of Tigmint will finish with the following terminal output messages: “Cutting assembly at breakpoints…
DONE!”. There will also be a file named “celegans-abyss.cut500.tigmint.fa” in your current working directory which contains the corrected draft assembly sequences.
-
Use the reference-based assessment tool QUAST to compare the contiguity and correctness of the corrected genome assembly and the initial baseline assembly. See Table 6 for results from running QUAST on these assemblies.
quast -t 5 -o quast_tigmint_ap2 -r celegans_reference.fa.gz --fast --large celegans-abyss.fa celegans-abyss.cut500.tigmint.fa
See Basic Protocol 1, Step 8b and Table 2 for detailed information about the QUAST output. Note that if QUAST was installed from the source code (as described in Basic Protocol 1, Step 2B), the executable will be named quast.py. The QUAST results will be written to the quast_tigmint_ap2 directory.
Table 6.
Assembly | Number of sequences >= 3 kbp | Scaffold NG50 length (bp) | Scaffold NGA50 length (bp) | Number of misassemblies |
---|---|---|---|---|
ABySS baseline | 4,552 | 30,293 | 27,015 | 433 |
ABySS + Tigmint-long | 4,793 | 27,564 | 27,015 | 162 |
SUPPORT PROTOCOL 1
Installing ntLink
ntLink can be installed using the conda package manager or from the source code. For a more straightforward installation process, and to ensure that all dependencies are properly installed, we recommend installing ntLink using conda.
Necessary Resources:
Hardware
ntLink requires a 64-bit Linux or MacOS operating system with sufficient RAM and available disk space (See Strategic Planning for more detail).
Software
Miniconda (https://docs.conda.io/en/latest/miniconda.html)
Protocol steps with step annotations:
Option A: Installing ntLink using the conda package manager
- If miniconda is not already installed:
- Download the miniconda3 bash installer: https://docs.conda.io/en/latest/miniconda.html
- Run the installer script:
- On MacOS:
bash Miniconda3-latest-MacOSX-x86_64.sh
- On Linux:
bash Miniconda3-latest-Linux-x86_64.sh
- Follow the installer prompts
- Close and re-open your terminal window to finalize the installation
- Create a new conda environment
conda create -n ntlink_env
- Activate the new conda environment
conda activate ntlink_env
- Install ntLink in the environment
conda install -y -c bioconda -c conda-forge ntlink
Option B: Installing ntLink from the source code
- Install the following dependencies, and ensure that each is available in your PATH environment variable. We recommend installing the dependencies using a package manager such as conda. Otherwise, visit the tool homepages for information about installing from the source code.
- Python 3.7+
- Python modules:
- Numpy: (https://numpy.org/)
- Python-igraph: (https://igraph.org/python/)
- btllib: (https://github.com/bcgsc/btllib)
- ABySS (v2.3.0+): (https://github.com/bcgsc/abyss)
- Zlib: (https://zlib.net/)
- Change your directory to the desired folder for the ntLink installation, then clone the ntLink repository from GitHub.
cd ~/bin/ntlink git clone https://github.com/bcgsc/ntLink.git
- Append the location of the ntLink installation to your PATH environment variable
export PATH=$(pwd)/ntLink:$PATH
Checking your installation
To verify that your installation is working properly, you can follow any of the basic protocols, or run the small demo provided on GitHub.
Running test demo
- If you haven’t already cloned ntLink during the installation process, clone the GitHub repository to download the small test demo.
git clone https://github.com/bcgsc/ntLink.git
- Change your working directory to the cloned ntLink repository, then to the directory containing the test demo script.
cd ntLink/tests
- Run the provided demo shell script
./test_installation.sh
If the test was successful, indicating that your installation is working as expected, you will see this message: “Done tests! Compare your generated files with the files in the expected_outputs folder to ensure the tests were successful.”
GUIDELINES FOR UNDERSTANDING RESULTS
For all ntLink runs, it is important to look through the terminal output messages to ensure that there are no errors. If there are error messages at any stage, the results are not reliable, and the error(s) needs to be resolved prior to any downstream genome analysis. See Table 7 for some common errors and suggested solutions.
Table 7.
Problem | Possible Cause | Solution |
---|---|---|
Error “make: *** No rule to make target” | Input files are not in the current working directory, or full paths to files are used. | Make soft links to ensure that input files are available in the current working directory, and do not use absolute or relative paths to input files. |
Error “zsh: no such option: pipefail” | An older version of zsh is installed, or zsh is missing. | Install zsh or update to the newest version. |
“UnsatisfiableError” when installing ntLink using conda | Incompatible versions of tools were previously installed in the conda environment. | Install ntLink in a fresh conda environment. |
“source file is in invalid format!” when running ntLink | Input long reads or draft assembly files are in an unexpected format. | Ensure that the input long reads are in correctly-formatted FASTA or FASTQ format (gzipped or uncompressed), and the draft assembly is in FASTA format (uncompressed). |
“ModuleNotFoundError: No module named <package_name>” error | A required python module is not installed properly. | Ensure that the expected python version is being used (ex. with which python3), and install the missing package (using conda or pip). |
“Error 127” in ntLink terminal output after abyss-scaffold step | ABySS is not installed, or not found in your PATH environment variable. | Install ABySS if needed, and ensure that the ABySS executables are found in your PATH. |
Running ntLink -h prints the make help page instead of the ntLink help page | ntLink requires parameters to be specified in the form “variable_name=variable_value”, so using options with the form “-<letter>” will specify options to make itself, not the ntLink program. | Run ntLink help to see the full ntLink help page. |
Contiguity gains post-ntLink are minimal | Incorrect selection of k/w | If ntLink makes minimal joins, it is likely that the k and w parameters specified are not optimal. See the Critical Parameters section for more details about setting k and w. |
Error when running Tigmint-long: “samtools: error while loading shared libraries” | An older version of samtools is installed. | Update the samtools installation. |
Running ntLink for scaffolding or mapping will generate various intermediate files. For scaffolding runs, the most important output file is the FASTA file containing the final, improved scaffolds. However, the other intermediate files contain useful information about both the evidence used for generating the final scaffolds as well as the composition of the output scaffolds themselves. The constructed scaffold graph is output in DOT format (“.scaffold.dot”). In this graph, the nodes are contigs, and the directed edges represent long-read evidence between the incident contigs. When running ntLink scaffolding, this graph is traversed using abyss-scaffold (Jackman et al., 2013) to produce the final ordered and oriented scaffolds. The ntLink output files with the suffices “trimmed_scafs.path” and “trimmed_scafs.agp” each describe the composition of the output scaffolds in different formats. These files allow the user to deduce the order and orientation of the input contigs in the output scaffolds, as well as any gap sequences between the contigs. The “.path” format describes one scaffold per tab-separated line, with the first column denoting the scaffold name, and the second listing the order and orientation of the contigs, with gap sizes indicated by “<number>N”. The “.agp” file follows the standard AGP specifications. The AGP file with the suffix “gap_fill.fa.agp” is only generated when the gap-filling step of ntLink is performed, and additionally reports the identity and coordinates of the input long reads used to fill gaps.
Following successful ntLink scaffolding, it is expected that there will be fewer sequences in the scaffolded assembly compared to the baseline assembly, since input contigs will be joined together to form scaffolds. Consequently, the contiguity of the scaffolded assembly (as assessed by abyss-fac, QUAST or other assembly assessment tools) is expected to increase. If there is no change in the contiguity or number of sequences, it is possible that parameters such as k and w (controlling the generation of the minimizers) need to be optimized (See Critical Parameters). For example, if using a long-read dataset with a high error rate, a smaller k value may be needed to increase the sensitivity of the long-read mapping. When running rounds of ntLink, it is expected that the contiguity will not increase after several rounds, as demonstrated in Basic Protocol 3. This is not a cause for concern, but just an indication that the long-read evidence leveraged by ntLink may be exhausted.
As demonstrated in the protocols, it is important to analyse the output scaffold FASTA files with tools such as abyss-fac or QUAST to assess the scaffolding success. While abyss-fac analysis does not require a reference, QUAST is reference-based, and is thus not suitable for all studies. For assembly projects without an available reference genome, or if many structural variants are expected, BUSCO (Manni, Berkeley, Seppey, Simão, et al., 2021; Manni, Berkeley, Seppey, & Zdobnov, 2021) is a useful tool for reference-free assessment of the assembly quality. BUSCO, or Benchmarking Universal Single Copy Orthologs, searches the input genome assembly for genes that are evolutionarily expected to be found in single copy. Since BUSCO assesses the assembly completeness in the gene space, it provides complementary information to the reference-free contiguity metrics.
Following ntLink scaffolding and quality control of the resulting assembly, there are a variety of downstream analyses that can be performed, from comparative genomics to annotation. The direction that these analyses take will be guided by the particular research lab and study focus, making this assembly stage broadly important.
COMMENTARY
Background Information:
Scaffolding tools, such as ntLink, can play important roles in de novo assembly pipelines through further improving upon draft assemblies. Multiple new features and modes have been integrated into ntLink to help users obtain the best possible assemblies and results from their sequencing data. The efficiency of the new and existing features of ntLink are largely attributable to the use of minimizer sketches for the various mapping tasks.
As described in Roberts et al. (2004) and implemented in btllib (Nikolić et al., 2022), ntLink generates ordered minimizer sketches by first breaking the input sequences into their constituent k-mers (substrings of length k), and computing a hash value for each k-mer using ntHash2 (Kazemi et al., 2022). Then, for each w (window size) k-mers, the k-mer with the smallest hash value is chosen as the minimizer for that window. Sliding this window across the entire sequence generates the ordered minimizer sketch, a particular subset of k-mers (represented by hash values) which is much smaller than the entire k-mer spectrum. Using this sketching approach for sequence mapping provides a great computational advantage for ntLink in both memory usage and time efficiency, enabling ntLink to scale to large genomes (Table 1).
The newly developed overlap detection, gap-filling and liftover-based round functionalities of ntLink benefit the final quality of the assemblies and allow the scaffolder to be more flexible to the specific needs of the users. Prior to the integration of overlap detection, ntLink would simply join sequences end-to-end with an intervening gap, whether the sequences had a putative overlap or not. This could lead to small insertion misassemblies being introduced at the join point, which could have a negative impact on such downstream applications as annotation, if the insertion is in a gene region, for example. The overlap detection feature resolves these overlaps, avoiding the introduction of small misassemblies at the join point and allowing for a cleaner join. When using ntLink to scaffold assemblies without the gap-filling feature, the regions between joined contigs that have a gap between them, or missing genomic sequence, are filled with ambiguous bases (“N”s). While it is valuable to have sequences ordered and oriented relative to one another, there is also genomic information in those gaps that will then be missing in the assembly. Gap-filling can be performed as a downstream, often computationally intensive, assembly step (Chu et al., 2019; Paulino et al., 2015), but performing gap-filling within ntLink is efficient and effective in recovering these missing regions. Currently, the gaps are filled with raw read sequence, which could negatively impact the base quality of the output assembly. Therefore, we recommend polishing the assembly after ntLink gap-filling, as demonstrated in Basic Protocol 3. Finally, running liftover-based rounds of ntLink enables additional improvements to the draft assembly by fully leveraging the long-read evidence, while also avoiding the computational burden of re-mapping the reads at each round.
Other state-of-the-art long-read assemblers, such as LRScaf (Qin et al., 2019) and OPERA-LG (Gao et al., 2016), rely on sequence alignments instead of mapping and do not provide users with the same features and flexibility as ntLink. Neither LRScaf nor OPERA-LG provides gap-filling functionality, nor an in-code approach for running rounds of scaffolding. Therefore, if a user wants to run multiple rounds of scaffolding, they would have to do so in a naïve manner (manually executing ntLink multiple times). Furthermore, while some long-read scaffolding tools such as LRScaf do also have logic to deal with overlapping joined sequences, their algorithms use alignments, while ntLink uses a more lightweight minimizer-mapping guided approach. OPERA-LG is not currently maintained (the last release was in 2016), so may not properly leverage more recent improvements in both sequencing and bioinformatics technologies.
Finally, we also demonstrate the flexibility of using the mapping functionality of ntLink for other applications in Alternate Protocols 1 and 2. Sequence mapping is a foundational process in bioinformatics, and often exact coordinates are not needed for the desired application. In this case, ntLink is a great resource for lightweight mapping, which can find numerous applications such as in misassembly correction (as demonstrated in Alternate Protocol 2), targeted assembly and targeted polishing, to name a few.
Critical Parameters:
k (k-mer size) and w (window size)
The k and w parameters control the generation of minimizers for mapping the long reads to the draft assembly in ntLink, and are therefore the most influential parameters. Generally, the default settings (k=32, w=100) produce good results for a variety of input assemblies and reads, but in order to obtain the best final scaffolds, these parameters can be optimized using a grid search. If undertaking a grid search, the approximate recommended ranges of k and w to test would be k=[24–80] and w=[75–1000]. Generally, we recommend a lower k and w setting when the long reads and/or draft assembly are more erroneous. However, if the draft assembly is very contiguous and/or the base quality of the input data is high, higher values can be successful.
Troubleshooting:
ntLink should complete with an exit code of 0 and a message indicating a successful run. If an error occurs, the pipeline should stop running and output an error message. Some common errors are documented in Table 7. If you encounter additional errors not discussed in Table 7, please create new issue at the ntLink GitHub repository (https://github.com/bcgsc/ntLink).
Advanced Parameters:
There are several ntLink parameters that can be tweaked in addition to k and w that may provide benefits for more advanced users. The default settings of each of these parameters have been found to work well for most assemblies.
z (minimum contig length)
By default, only sequences greater than 1 kb (z=1000) will be considered for integration into an output scaffold. Depending on the contiguity of the input draft assembly, the user may want to adjust this parameter if the input assembly is very contiguous (increase z) or very fragmented (decrease z).
a (minimum number of anchoring reads)
When ntLink parses the long-read evidence to create edges in the scaffold graph, it requires (by default) at least one ‘anchoring read’ for an edge to retained. An ‘anchoring read’ is defined as a read that has at least 2 mapped minimizers on each contig in the putative pair. If more stringent scaffold pairing is desired, this parameter can be increased to require more ‘anchoring reads’ before retaining an edge.
v (verbose benchmarking mode)
If the user specifies v=1 in their ntLink command, the time and peak memory will be tracked for each step, and output to separate files. This option can be useful when benchmarking the execution of ntLink.
soft_mask (soft mask filled gaps)
If soft_mask=True is specified in the ntLink command when gap-filling is enabled, the gaps will be filled with lowercase bases instead of uppercase bases. This soft masking could be useful for downstream analyses such as targeted polishing, for example.
Time Considerations:
Running all of the basic protocols on the C. elegans assembly/long read sequencing data showcased herein will take about 3h25m to complete: 75 m for Basic Protocol 1, 110 m for Basic Protocol 2 and 20 m for Basic Protocol 3. The vast majority of this time is attributable to the execution time of a few longer-running steps. The bulk of the time for Basic Protocol 1 (~60 m) is spent downloading the long-read data, while most of the time for Basic Protocol 2 is spent polishing the assembly using minimap2 and Racon (~90 m). Note that the polishing step of Basic Protocol 2 is optional, as noted in the step-by-step instructions. For each basic protocol, the ntLink step takes less than 15 minutes. The Alternate Protocols will take about 10 m to complete: 5 m for Alternate Protocol 1 and 5 m for Alternate Protocol 2.
The exact time required for each protocol will vary based on the specifications of the machine used to execute the commands and the installation methods chosen (conda vs. source code, where applicable). For steps where multiple threads (parameter t) are supported, increasing the number of specified CPU threads will generally reduce the runtime.
Supplementary Material
ACKNOWLEDGEMENTS:
This study is supported by the Canadian Institutes of Health Research (CIHR) [PJT-183608] and the National Institutes of Health [2R01HG007182-04A1]. The content of this article is solely the responsibility of the authors, and does not necessarily represent the official views of the National Institutes of Health or other funding organizations. The funding organizations did not have a role in the design of the study, the collection, analysis and interpretation of the data, or in writing the manuscript.
Footnotes
CONFLICT OF INTEREST STATEMENT:
The authors declare that they have no conflicts of interest.
DATA AVAILABILITY STATEMENT:
The sample assemblies that support the protocol are available from https://doi.org/10.5281/zenodo.7526395. The sample long reads and reference genome are publicly available under SRA accession SRR10028109 and GenBank accession GCA_000002985.3, respectively.
LITERATURE CITED:
- Coombe L, Li JX, Lo T, Wong J, Nikolic V, Warren RL, & Birol I (2021). LongStitch: High-quality genome assembly correction and scaffolding using long reads. BMC Bioinformatics, 22(1), 534. 10.1186/s12859-021-04451-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gao S, Bertrand D, Chia BKH, & Nagarajan N (2016). OPERA-LG: efficient and exact scaffolding of large, repeat-rich eukaryotic genomes with performance guarantees. Genome Biology, 17(1), 102. 10.1186/s13059-016-0951-y [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jackman SD, Coombe L, Chu J, Warren RL, Vandervalk BP, Yeo S, Xue Z, Mohamadi H, Bohlmann J, Jones SJM, & Birol I (2018). Tigmint: Correcting assembly errors using linked reads from large molecules. BMC Bioinformatics, 19(1), 393. 10.1186/s12859-018-2425-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jackman SD, Raymond AG, & Birol I (2013). Scaffolding a genome sequence assembly using ABySS. http://sjackman.ca/abyss-scaffold-paper/
- Jackman SD, Vandervalk BP, Mohamadi H, Chu J, Yeo S, Hammond SA, Jahesh G, Khan H, Coombe L, Warren RL, & Birol I (2017). ABySS 2.0: Resource-efficient assembly of large genomes using a Bloom filter. Genome Research, 27(5), 768–777. 10.1101/gr.214346.116 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kazemi P, Wong J, Nikolić V, Mohamadi H, Warren RL, & Birol I (2022). ntHash2: Recursive spaced seed hashing for nucleotide sequences. Bioinformatics, 38(20), 4812–4813. 10.1093/bioinformatics/btac564 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kolmogorov M, Yuan J, Lin Y, & Pevzner PA (2019). Assembly of long, error-prone reads using repeat graphs. Nature Biotechnology, 37(5), 540–546. 10.1038/s41587-019-0072-8 [DOI] [PubMed] [Google Scholar]
- Li H (2018). Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics, 34(18), 3094–3100. 10.1093/bioinformatics/bty191 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li JX, Coombe L, Wong J, Birol I, & Warren RL (2022). ntEdit+Sealer: Efficient Targeted Error Resolution and Automated Finishing of Long-Read Genome Assemblies. Current Protocols, 2(5), e442. 10.1002/cpz1.442 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Logsdon GA, Vollger MR, & Eichler EE (2020). Long-read human genome sequencing and its applications. Nature Reviews Genetics, 21(10), 597–614. 10.1038/s41576-020-0236-x [DOI] [PMC free article] [PubMed] [Google Scholar]
- Manni M, Berkeley MR, Seppey M, Simão FA, & Zdobnov EM (2021). BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Molecular Biology and Evolution, 38(10), 4647–4654. 10.1093/molbev/msab199 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Manni M, Berkeley MR, Seppey M, & Zdobnov EM (2021). BUSCO: Assessing Genomic Data Quality and Beyond. Current Protocols, 1(12), e323. 10.1002/cpz1.323 [DOI] [PubMed] [Google Scholar]
- Medaka: Sequence correction provided by ONT Research. (n.d.). Retrieved December 13, 2022, from https://github.com/nanoporetech/medaka
- Mikheenko A, Prjibelski A, Saveliev V, Antipov D, & Gurevich A (2018). Versatile genome assembly evaluation with QUAST-LG. Bioinformatics, 34(13), i142–i150. 10.1093/bioinformatics/bty266 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nikolić V, Kazemi P, Coombe L, Wong J, Afshinfard A, Chu J, Warren RL, & Birol I (2022). btllib: A C++ library with Python interface for efficient genomic sequence processing. Journal of Open Source Software, 7(79), 4720. [Google Scholar]
- Paulino D, Warren RL, Vandervalk BP, Raymond A, Jackman SD, & Birol I (2015). Sealer: A scalable gap-closing application for finishing draft genomes. BMC Bioinformatics, 16(1), 230. 10.1186/s12859-015-0663-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Qin M, Wu S, Li A, Zhao F, Feng H, Ding L, & Ruan J (2019). LRScaf: Improving draft genomes using long noisy reads. BMC Genomics, 20(1), 955. 10.1186/s12864-019-6337-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Roberts M, Hayes W, Hunt BR, Mount SM, & Yorke JA (2004). Reducing storage requirements for biological sequence comparison. Bioinformatics, 20(18), 3363–3369. 10.1093/bioinformatics/bth408 [DOI] [PubMed] [Google Scholar]
- Vaser R, Sović I, Nagarajan N, & Šikić M (2017). Fast and accurate de novo genome assembly from long uncorrected reads. Genome Research, 27(5), 737–746. 10.1101/gr.214270.116 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Warren RL, Coombe L, Mohamadi H, Zhang J, Jaquish B, Isabel N, Jones SJM, Bousquet J, Bohlmann J, & Birol I (2019). ntEdit: Scalable genome sequence polishing. Bioinformatics, 35(21), 4430–4432. 10.1093/bioinformatics/btz400 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Warren RL, Yang C, Vandervalk BP, Behsaz B, Lagman A, Jones SJM, & Birol I (2015). LINKS: Scalable, alignment-free scaffolding of draft genomes with long reads. GigaScience, 4(1), s13742-015-0076-3. 10.1186/s13742-015-0076-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wong J, Coombe L, Nikolić V, Zhang E, Nip KM, Sidhu P, Warren RL, & Birol I (2022). GoldRush: A de novo long read genome assembler with linear time complexity. BioRxiv, 2022.10.25.513734. 10.1101/2022.10.25.513734 [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The sample assemblies that support the protocol are available from https://doi.org/10.5281/zenodo.7526395. The sample long reads and reference genome are publicly available under SRA accession SRR10028109 and GenBank accession GCA_000002985.3, respectively.