Skip to main content
NIHPA Author Manuscripts logoLink to NIHPA Author Manuscripts
. Author manuscript; available in PMC: 2020 Oct 30.
Published in final edited form as: IEEE/ACM Trans Comput Biol Bioinform. 2021 Feb 3;18(1):365–372. doi: 10.1109/TCBB.2019.2913845

Analysis of Subtelomeric REXTAL Assemblies Using QUAST

Tunazzina Islam 1, Desh Ranjan 2, Mohammad Zubair 3, Eleanor Young 4, Ming Xiao 5, Harold Riethman 6
PMCID: PMC6940546  NIHMSID: NIHMS1048274  PMID: 31056507

Abstract

Genomic regions of high segmental duplication content and/or structural variation have led to gaps and misassemblies in the human reference sequence, and are refractory to assembly from whole-genome short-read datasets. Human subtelomere regions are highly enriched in both segmental duplication content and structural variations, and as a consequence are both impossible to assemble accurately and highly variable from individual to individual. Recently, we developed a pipeline for improved region-specific assembly called Regional Extension of Assemblies Using Linked-Reads (REXTAL) [1]. In this study, we evaluate REXTAL and genome-wide assembly (Supernova; [2]) approaches on 10X Genomics linked-reads data sets partitioned and barcoded using the Gel Bead in Emulsion (GEM) microfluidic method [3]. Our results describe the accuracy and relative performance of these two approaches using the reference-based assessment module of QUAST [4]. We show that REXTAL dramatically outperforms the Supernova whole genome assembler in subtelomeric segmental duplication regions, and results in highly accurate assemblies. Nearly all of the REXTAL “misassemblies” identified using default QUAST parameters simply pinpoint locations of tandem repeat arrays in the reference sequence where the repeat array length differs from that in the cognate REXTAL assembly by >1000 bp.

Keywords: regional assembly, quality metric, segmental duplication, subtelomere, tandem repeat, misassembly, genome gap

I. Introduction

It is currently impossible to get complete de-novo assembly of segmentally duplicated genome regions using genome-wide short-read datasets. Even using paired-end read approaches with input molecules of various lengths, de novo assembly of human genomes has remained problematic because of abundant interspersed repeats and especially segmental duplication regions which contain > 1 kbp segments of DNA with similar (> 90%) identity. A recently developed approach pioneered by 10X Genomics generates short-read datasets from large genomic DNA molecules first partitioned and barcoded using the Gel Bead in Emulsion (GEM) microfluidic method [3]. The bioinformatic pipeline for assembly of these reads called Supernova [2] takes advantage of a large number of sets of linked reads. Each set of linked reads is comprised of low-read coverage of a small number of large genomic DNA molecules (roughly 10) and is associated with a unique bar code. This approach enables efficient de novo assembly of the human genome, with large segments separable into haplotypes [2]. However, it does not solve the problem of segmental duplications such as those found in subtelomeres. To address this problem, we developed a new computational method called REXTAL [1] for improved region-specific assembly of segmental duplication-containing DNA, leveraging genomic short-read datasets generated from large DNA molecules partitioned and barcoded using the Gel Bead in Emulsion (GEM) microfluidic method.

In this paper, we do a more extensive analysis of REXTAL and evaluate our regional assemblies with the reference-based alignment tool (QUAST; [4]) on 17 subtelomeric DNA regions. We find dramatically improved coverage of subtelomeric segmental duplication regions in REXTAL vs. whole genome assemblies while maintaining accurate assemblies using REXTAL.

II. Experimental Details

In Subsection A, we present the input data description. Subsection B presents the overview of REXTAL methodology. In subsections C and D we describe QUAST analysis and visualization of our final assemblies.

A. Data

The input data for QUAST [4] is the respective regional assembly generated by REXTAL [1] and the cognate subtelomeric reference sequence from HG38.

B. REXTAL Methodology

REXTAL [1] uses linked-read genome sequencing to extend subtelomere assemblies. It differs from the genome-wide assembly method in that we used the barcode information for selection of reads from anticipated segmental duplication or gap regions adjacent to a specified 1-copy DNA segment before doing the assembly. We initially found reads matching the 1-copy DNA segment (bait DNA segment) based upon the reference human genome (HG38), then selected all reads for barcodes represented in these initial matching reads in reads selection step (Fig. 1). This set of reads should represent a very limited subset of all genomic reads, and approximately 10% of the barcode-selected reads should be derived specifically from the selected 1-copy DNA and 50 kbp-100 kbp segments of flanking DNA. Using barcode read frequency range selection and barcode clustering pattern selection steps (Fig. 1) we selected all reads from a subset of these initial barcodes for assembly [1], enabling the extension of existing assemblies into adjacent segmental duplication and gap regions using the Supernova assembler [2]. Fig. 1 shows the overall REXTAL workflow.

Fig. 1.

Fig. 1.

Overview of REXTAL workflow.

C. QUAST Analysis

QUAST [4] evaluates genome assemblies by computing various metrics from a global alignment of the test assembly with a reference sequence. To measure the quality of the assembly, we ran QUAST with --scaffolds option (keeping other parameters default) using assembled scaffolds generated by REXTAL and using as reference sequence specified subtelomeric regions of HG38 corresponding to our unmasked single-copy bait segments along with their flanking reference DNA segments (including segmental duplication regions).

1). Scaffold:

As REXTAL assemblies are scaffolds (rather than contigs) and we ran QUAST with --scaffolds option, this added split versions of assemblies to the comparison (named <assembly_name>_broken). Assemblies are split by continuous fragments of N’s of length ≥ 10. Scaffold gap size misassemblies are enabled in this case and we kept default --scaffold-gap-max-size (which is 10 kbp) for setting maximum gap length.

2). Misassembly Detection:

QUAST [4] generates a report with the number of misassemblies according to the defined misassembly breakpoint by Plantagora [5]. Misassembly breakpoint is a position in the assembled contigs where the left flanking sequence aligns over 1 kbp away from the right flanking sequence on the reference, or they overlap by >1 kbp, or the flanking sequences align on opposite strands or different chromosomes. While running QUAST we kept default threshold of 1 kbp for --extensive-mis-size parameter. Most of the “misassemblies” called in REXTAL generated assemblies relative to reference were due to the gap sizes in a contig slightly excluding the QUAST default gap limit of 1000 bp.

D. Visualization

We used Icarus [6] a genome visualizer for assessment and analysis of genomic assemblies, which is based on QUAST genome quality assessment tool. The contig alignment viewer of Icarus has 2 parts. The top part shows the detailed view of selected region from the bottom part which represents the assembly overview.

1). Broken scaffold:

To view only the split versions of assemblies in the Icarus viewer following steps were followed:

Step1: At first we ran the referenced-based QUAST from a command line with --scaffolds and --debug option [12].

Step2: If an output path is not specified manually (we can specify output path of QUAST by using -o option), QUAST generates its output into quast_results/result_<DATE> directory. We chose the <assembly_name>_broken version file under the quast_results/result_<DATE>/quast_corrected_input/directory [12].

Step3: We reran the referenced-based QUAST from a command line with the same reference but used <assembly_name>_broken instead of the original assembly and did not use --scaffolds option this time.

2). Tandem Repeat Marker:

Since there is no special visualization for repeats yet in QUAST [12], we used tandem repeat finder [9] to screen subtelomeric regions of reference DNA segment sequences. We then used this masked reference as input data for QUAST with the same unmasked subtelomeric regions as reference and followed the procedure described in subsubsection II-D1. We used the broken masked reference to locate the positions of tandem repeats in Icarus viewer (Fig. 2Fig. 5).

Fig. 2.

Fig. 2.

Contig alignment viewer of Icarus for the segmental duplication region region of 18p and 22q. Each viewer has 2 parts containing 3 rows in each part. The top green bars of the top part represents the masked reference as tandem repeat markers with white breaks in this track shows the positions and sizes of tandem repeats in the reference. The 2nd and 3rd rows show the REXTAL and the genome-wide assemblies respectively. The bottom part having 3 rows represents the assembly overview with the highlighted yellow box indicates the region expanded in the top 3 rows. A. The 2nd and the 3rd row of top part represent the contigs generated by REXTAL and genome-wide method for 18p correspondingly. B. The 2nd and the 3rd row of top part represent the contigs generated by REXTAL and genome-wide method for 22q correspondingly.

Fig. 5.

Fig. 5.

Contig alignment viewer of Icarus for for the bait segment into adjacent DNA including segmental duplication region of 17p and 2p. Each viewer has 2 parts containing 3 rows in each part. The top green bars of the top part represents the masked reference as tandem repeat markers with white breaks in this track shows the positions and sizes of tandem repeats in the reference. The 2nd and 3rd rows show the REXTAL and the genome-wide assemblies respectively. The bottom part having 3 rows represents the assembly overview with the highlighted yellow box indicates the region expanded in the top 3 rows. A. The 2nd row represents the contigs generated by REXTAL and the 3rd row represents the contigs generated by genome-wide method for 17p. There are four red blocks in a contig those are misassembled because of relocation with inconsistency value 1920, 1172, and 1055. B. The 2nd row represents the contigs generated by REXTAL and the 3rd row represents the contigs generated by genome-wide method for 2p. The two red block represents the misassembly because of 2935 bp gap between two blocks within a contig. Note that the “misassembled contig” is in fact a gap in the contig corresponds exactly to a tandem repeat. It is called a QUAST misassembly only because it exceeds the 1000 bp default when aligned to the reference sequence.

3). Comparative Analysis:

To visualize the comparative analysis of REXTAL and Genome-wide method as well as the tandem repeat marker, we ran reference based QUAST with 3 input files i.e. first one is broken masked reference file as tandem repeat marker, then broken REXTAL assembly and 3rd one is broken Genome-wide assembly.

III. Results And Discussions

UCSC browser [7] was used to access HG38 and select subtelomere DNA segments for analysis. We tested REXTAL and the QUAST analysis on 17 human subtelomere regions (base pair coordinates listed are from HG38). The 2p subtelomere is a 500 kbp sized segment of 1-copy DNA (10,001 to 500,000); 19p subtelomere has a very large segmental duplication region next to the telomere (10,001-259,447) followed by a 300 kbp sized 1-copy region (259,448-559,447), 10p has a smaller segmental duplication region near the telomere (10,001-88,570) followed by a 300 kbp 1-copy region (88,571-388,571); 5p has multiple segmental duplication regions (10,001-49,495 and 210,596-305,378) separated and flanked by two 1-copy regions (49,496-210,595 and 305,379-510,000). For 16p, 16q, 17p, 17q, 18p, 18q, 19q, 20p, 20q, 21q, 22q, Xp, Xq we extracted 100 kbp single copy bait sequences as close as possible to the telomere. Table I shows the details of subtelomeric region.

TABLE I.

co-ordinate of extracted subtelomeric region from UCSC browser

Regiona Refb Baitc SDd 1-copye
2p 10,001-700,000 10,001-500,000 N/A 10,001-700,000
5p 10,001-677,959 49,496-210,595 and 305,379-510,000 10,001-49,495 and 210,596-305,378 305,379-677,959
10p 10,001-588,571 88,571-388,571 10,001-88,570 88,571-588,571
16p 10,000-240,859 40,860-140,859 10,000-40,859 40,860-240,859
16q 89,857,010-90,228,345 89,857,010-89,965,857 and 89,968,061-90,057,009 89,965,858-89,968,060 and 90,057,010-90,228,345 89,857,010-89,965,857 and 89,968,061-90,057,009
17p 60,000-341,850 141,851-241850 60,000-141,850 141,851-341,850
17q 83,004,545-83,247,441 83,104,544-83,204,544 83,204,545-83,247,441 83,004,545-83,204,544
18p 10,000-331,693 131,694-231,693 10,000-131,693 131,694-331,693
18q 80,059,053-80,263,285 80159052-80,259,052 80,259,053-80,263,285 80,059,053-80,259,052
19p 10,001-759,447 259,448-559,447 10,001-259,447 259,448-759,447
19q 58,386,558-58,607,616 58486557-58,586,557 58,586,558-58,607,616 58,386,558-58,586,557
20p 66,335-266,334 66,335-166334 N/A 66,335-266,334
20q 64,073,499-64,334,167 64,173,498-64,273,498 and 64,276,019-64,282,623 64,273,499-64,276,018 and 64,282,624-64,334,167 64,073,499-64,273,498 and 64,276,019-64,282,623
21q 46,472,945-46,699,983 46572944-46,672,944 46,672,945-46,699,983 46,472,945-46,672,944
22q 50,540,514-50,808,468 50640513-50,740,513 50,740,514-50,808,468 50,540,514-50,740,513
Xp 222,347-527,305 222,347-320,315 and 327,306-427,306 320,316-327,305 222,347-320,315 and 327,306-527,305
Xq 155,783,780-156,030,894 155,883,778-155,983,778 and 155,987,225-156,000,330 155,983,779-155,987,224 and 156,000,331-156,030,894 155,783,780-155,983,778 and 155,987,225-156,000,330
a.

Subtelomeric region.

b.

HG38 Co-ordinates of reference subtelomeric region (HG38).

c.

HG38 Co-ordinates of 1-copy subtelomeric bait region.

d.

HG38 Co-ordinates of subtelomeric segmental duplication region.

e.

HG38 Co-ordinates of entire subtelomeric 1-copy region.

For a fair comparison of REXTAL with genome-wide assembly method, we extracted all contigs in the genome-wide assembly that overlap (including potential extensions into flanking DNA) with the 1-copy bait sequences using SAMtools [11].

A. QUAST report on genome fraction in segmental duplication region

As segmental duplication regions contain segments of DNA with near-identical duplicated subtelomere sequences, these regions are hard to assemble de novo with whole genome reads. We can extend REXTAL into subtelomere segmental duplication regions. To measure the quality of REXTAL vs. genome-wide assemblies in segmental duplication regions, we ran reference based QUAST for these regions (Table II). 2p and 20p do not have segmental duplication regions. 5p, 16q, 20q, and Xq have multiple segmental duplication regions. For 17p, genome-wide method could not extend the assembly up to segmental duplication region.

TABLE II.

Comparison of QUAST result in segmental duplication region

REXTAL Genome-wide
Regiona Genome
fraction
(%)
Misassemblyb Genome
fraction
(%)
Misassemblyc
2p N/A N/A N/A N/A
5p_1st 88.707 0 57.734 0
5p_2nd 88.038 0 3.034 0
10p 90.103 1 7.092 0
16p 94.18 0 25.493 0
16q_1st 100 0 100 0
16q_2nd 35.423 1 16.549 0
17p 21.52 0 N/A N/A
17q 85.12 1 0.494 0
18p 69.028 0 9.701 0
18q 82.731 0 60.761 0
19p 28.14 0 2.139 0
19q 92.906 0 1.22 0
20p N/A N/A N/A N/A
20q_1st 100 0 100 0
20q_2nd 98.462 0 6.955 0
21q 94.941 0 23.806 0
22q 96.087 0 5.055 0
Xp 59.828 0 16.753 0
Xq_1st 100 0 100 0
Xq_2nd 90.211 0 62.174 0
a.

Subtelomeric region

b.

“Misassemblies” are tandem repeat arrays in the reference sequence where the repeat array length differs from that in the cognate REXTAL assembly by > 1000 bp

c.

Number of misassembly in genome-wide method

It is easy to observe that the %genome fractions obtained by REXTAL (2nd Column of Table II) are significantly better than the %genome fractions obtained by genome-wide method (4th Column of Table II) for all loci that have been tested.

Fig. 2A and Fig. 2B show the visualization of QUAST analysis of 18p and 22q in segmental duplication regions.

B. QUAST report on misassembly in segmental duplication region

Generally, QUAST report contains a classification of misassembly events (using Plantagoras [5] definition) into three groups: relocations, translocations, and inversions (subsubsection II-C2).

The number of misassemblies obtained in segmental duplication region by REXTAL and genome-wide method are shown correspondingly in 3rd and 5th Column of Table II. QUAST generated one misassembly for 10p, 16q_2nd (2nd segmental duplication region of 16q), and 17q and all these three misassemblies happened because of relocation according to QUAST report. Fig. 3 shows the misassembled contig (two red blocks) for segmental duplication region of 16q_2nd. The cause of the misassembly was relocation with inconsistency = 1512. As the top green bars represent tandem repeat marker and the gap between green top bars represent the tandem repeat region, the misassembly happened in tandem repeat region. These misassembled blocks are in one contig. Genome-wide method could not extend the assembly up to this point. To run the QUAST we used default value of parameter --extensive-mis-size and that is 1000. If we set the parameter of --extensive-mis-size with higher value, we would not find these misassemblies.

Fig. 3.

Fig. 3.

Contig alignment viewer of Icarus for the segmental duplication region of 16q_2nd.The top green bars of the top part represents the masked reference as tandem repeat markers with white breaks in this track shows the positions and sizes of tandem repeats in the reference. The 2nd row represents the contigs generated by REXTAL and two red blocks represents the misassembled contig with gap 1512 bp. 3rd row is supposed to be the contigs generated by genome-wide method for segmental duplication region of 16q_2nd and this row shows nothing here because genome-wide method could not extend the assembly up to this point. The bottom three rows represent the assembly overview with the highlighted yellow box indicates the region expanded in the top 3 rows. Note that the “misassembled contig” is in fact a gap in the contig corresponds exactly to a tandem repeat. It is called a QUAST misassembly only because it exceeds the 1000 bp default when aligned to the reference sequence.

C. QUAST report on the bait segment into adjacent 1-copy region

We extracted subtelomeric region containing 1-copy and 1-copy bait region as reference from UCSC genome browser to compare the extending assemblies of the bait segment into adjacent 1-copy region for REXTAL and genome-wide method. To show the quality of REXTAL vs. genome-wide assembly, we ran QUAST with both assemblies (Table III).

TABLE III.

Comparison of QUAST result in the bait segment into adjacent 1-copy region

REXTAL Genome-wide
Regiona Genome
fraction
(%)
Misassemblyb Genome
fraction
(%)
Misassemblyc
2p 96.226 1 96.839 0
5p_1st 93.471 0 94.818 0
5p_2nd 91.68 1 93.789 1
10p 97.966 1 98.142 0
16p 75.873 0 55.733 0
16q_1st 96.759 3 96.146 0
16q 2nd 97.719 1 96.606 0
17p 68.839 5 39.774 0
17q 77.839 1 45.235 0
18p 93.448 0 99.71 0
18q 87.391 0 98.668 1
19p 87.173 2 85.141 0
19q 81.459 1 55.077 0
20p 86.243 1 100.00 0
20q_1st 71.204 0 53.714 0
20q_2nd 100.00 0 100.00 0
21q 73.621 0 99.974 0
22q 70.74 0 96.547 1
Xp_1st 72.381 1 26.064 1
Xp_2nd 72.103 0 13.322 0
Xq_1st 87.678 0 90.088 0
Xq_2nd 100.00 0 100.00 0
a.

Subtelomeric region.

b.

“Misassemblies” are tandem repeat arrays in the reference sequence where the repeat array length differs from that in the cognate REXTAL assembly by > 1000 bp.

c.

Number of misassembly in genome-wide method.

Fig. 4A and 4B show the contig alignment viewer of Icarus in the bait segment into adjacent 1-copy region for 19q and 17q correspondingly.

Fig. 4.

Fig. 4.

Contig alignment viewer of Icarus for 1-copy region of 19q and 17q. Each viewer has 2 parts containing 3 rows in each part. The top green bars of the top part represents the masked reference as tandem repeat markers with white breaks in this track shows the positions and sizes of tandem repeats in the reference. The 2nd and 3rd rows show the REXTAL and the genome-wide assemblies respectively. The bottom part having 3 rows represents the assembly overview with the highlighted yellow box indicates the region expanded in the top 3 rows. A. The 2nd row (including the expansion (yellow area)) represents the contigs generated by REXTAL and the 3rd row represents the contigs generated by genome-wide method for 19q. The expanded version of 2nd row shows that misassembled contig has seven blocks among them two blocks (red blocks) are misassembled because of relocation with inconsistency = 1115. This misassembled contig is located entirely within other higher-quality contig (1 green block in 2nd row). B. The 2nd row represents the contigs generated by REXTAL and the 3rd row represents the contigs generated by genome-wide method for 17q. The misassembled contig has four blocks (in assembly overview image there is a light yellow rectangle representing the selected region and four down arrows (↓) represent four blocks in one contig.). Among them two blocks (red blocks) are misassembled with inconsistency= 1168. These two misassembled blocks are in one contig in REXTAL assembly but two different contigs in genome-wide assembly. In the selected region of genome-wide method has seven different assembled contigs whether REXTAL has one contig with four blocks with gaps. Note that the “misassembled contig” is in fact a gap in the contig corresponds exactly to a tandem repeat. It is called a QUAST misassembly only because it exceeds the 1000 bp default when aligned to the reference sequence.

For 19q QUAST reports 1 misassembly in 1-copy region for REXTAL (3rd column Table III). Fig. 4A shows that the misassembled contig corresponds to a small contig matching less accurate (identity 96% – 98%) among seven small blocks to the reference than a larger, more closely matching contig (99.92% identity). Among the seven blocks two blocks (red blocks) are misassembled because of relocation with inconsistency value 1115. However, this misassembled contig is located entirely within other higher-quality (99.92% identity) contig (1 green block in 2nd row in Fig. 4A). To avoid this situation in our prior work we proposed a metric called Length-wise Assembled Fraction (LAF) [1] for quality measurement of the regional assemblies. Before measuring the quality, we extracted reference sequences from HG38 and then aligned them with corresponding assembled scaffolds using BLAST [10], requiring ≥98% of identity for retention of each local alignment. This generates positions of each local alignment including query start positions and query end positions. The starting positions of the query were sorted in increasing order. Local alignments were merged by (1) deleting local alignments located entirely within other higher-quality alignments; and (2) Local alignments with partial overlap, the overlap regions were merged by selecting the alignment with equivalent or higher % identity in the overlap region [1]. The LAF metric avoids the secondary more weakly matching assemblies like that shown above.

In Fig. 4B, for 17q the misassembled contig has four blocks (in assembly overview image there is a light yellow rectangle representing the selected region and four down arrows (↓) represent four blocks in one contig.). Among them two blocks (red blocks) are misassembled because of relocation with inconsistency value 1168. These two misassembled blocks are in one contig in REXTAL assembly but two different contigs in genome-wide assembly. Overall the selected region of genome-wide method has seven different assembled contigs in the genome-wide assembly whereas REXTAL has one contig with four blocks with gaps. The gaps all correspond to tandem repeat regions where REXTAL was able to assemble the tandem repeat region putting gaps in a contig rather than creating separate contigs. We can avoid these misassembly calls by setting the parameter of --extensive-mis-size with slightly higher value during running the QUAST.

D. QUAST report on the bait segment into adjacent DNA including segmental duplication region

We extracted subtelomeric region containing 1-copy, 1-copy bait and segmental duplication region as reference from UCSC genome browser to compare the extending assemblies of the bait segment into adjacent DNA for REXTAL and genome-wide method.

To show the quality of REXTAL vs. genome-wide assembly, we ran reference based QUAST for these regions and compared results in Table IV.

TABLE IV.

Comparison of QUAST result in the bait segment into adjacent DNA including segmental duplication region

REXTAL Genome-wide
Regiona Genome
fraction
(%)
Misassemblyb Genome
fraction
(%)
Misassemblyc
2p 75.535 1 71.337 0
5p 75.448 1 70.945 1
10p 72.933 2 53.707 0
16p 78.2 0 51.56 0
16q 68.623 2 59.263 0
17p 54.829 5 28.38 0
17q 78.813 2 37.142 0
18p 84.151 0 65.66 0
18q 87.138 0 96.764 1
19p 51.108 3 35.316 1
19q 78.18 0 47.087 0
20p 86.208 1 99.999 0
20q 77.442 0 46.099 0
21q 75.861 0 90.902 0
22q 77.156 0 73.266 1
Xp 70.569 1 16.982 0
Xq 88.709 0 88.709 0
a.

Subtelomeric region.

b.

“Misassemblies” are tandem repeat arrays in the reference sequence where the repeat array length differs from that in the cognate REXTAL assembly by > 1000 bp.

c.

Number of misassembly in genome-wide method.

1). Analysis of genome fraction and misassmblies:

REXTAL has better % of genome fraction than whole genome assembly except for 18q, 20p, and 21q (2nd Column of Table IV). In subsection III-A we showed that 18q and 21q have noticeably good extension in segmental duplication region (2nd Column of Table II). 20p is all single copy region and the genome-wide method gave a better genome fraction here than the REXTAL.

Fig. 5A and Fig. 5B show the contig alignment viewer of Icarus in the bait segment into adjacent DNA including segmental duplication region for 17p and 2p correspondingly.

For 17p, QUAST generates total 5 misassemblies (3rd Column of Table IV) on the bait segment into adjacent DNA region. Fig. 5A shows that there are four red blocks in a contig that were misassembled because of relocation with inconsistency value 1920, 1172, and 1055, where the genome-wide method has five separate contigs instead and does not have these misassemblies.

In Fig. 5B for 2p similar case happened in tandem repeat region where misassembly happened because of the gap (inconsistency = 2935) between two blocks within a contig. Genome-wide method considered these two blocks as two separate contigs.

Both for 17p and 2p (Fig. 5), it is noticeable that the “misassembled contig” is in fact a gap in the contig corresponds exactly to a tandem repeat. It is called a QUAST misassembly only because it exceeds the 1000 bp default when aligned to the reference sequence. We can avoid these errant misassembly calls by setting the parameter of --extensive-mis-size with higher value during running the QUAST.

IV. Conclusion

We successfully used REXTAL [1] on 17 subtelomeric bait regions and extended the assembly of single-copy diploid DNA into adjacent including inaccessible subtelomere segmental duplication regions. We evaluated REXTAL and genome-wide assembly using the reference-based assessment module of QUAST and showed that REXTAL dramatically outperformed the Supernova whole genome assembler in subtelomeric segmental duplication regions, and produced in highly accurate assemblies. In future experiments, we will combine REXTAL and Nanopore single-read datasets to achieve complete long-range assemblies throughout all human subtelomere regions.

Acknowledgments

The work in this paper is supported in part by NIH R21CA177395 (HR and MX), and Modeling and Simulation Scholarship (to TI) from Old Dominion University.

Contributor Information

Tunazzina Islam, Department of Computer Science, Old Dominion University, Norfolk, VA, USA.

Desh Ranjan, Department of Computer Science, Old Dominion University, Norfolk, VA, USA.

Mohammad Zubair, Department of Computer Science, Old Dominion University, Norfolk, VA, USA.

Eleanor Young, School of Biomedical Engineering, Drexel University, Philadelphia, PA, USA.

Ming Xiao, Institute of Molecular Medicine and Infectious Disease, Drexel University, Philadelphia, PA, USA.

Harold Riethman, School of Medical Diagnostic & Translational Sciences, Old Dominion University, Norfolk, VA, USA.

References

  • [1].Islam T et al. , “REXTAL: Regional Extension of Assemblies Using Linked-Reads,” International Symposium on Bioinformatics Research and Applications, pp. 63–78, 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [2].Weisenfeld NI, Kumar V, Shah P, Church DM, Jaffe DB, “Direct determination of diploid genome sequences,” Genome research, 27, pp. 757–767, 2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [3].Zheng GX-L-P et al. , “Haplotyping germline and cancer genomes with high-throughput linked-read sequencing,” Nature biotechnology, 34, pp. 303–311, 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [4].Gurevich A, Saveliev V, Vyahhi N, Tesler G, “QUAST: quality assessment tool for genome assemblies,” Bioinformatics, 29, pp. 1072–1075, 2013. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [5].Barthelson R, et al. , “Plantagora: modeling whole genome sequencing and assembly of plant genomes,” PLoS One, 6:e28436, 2011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [6].Mikheenko A, Valin G, Prjibelski A, Saveliev V, Gurevich A,“Icarus: visualizer for de novo assembly evaluation,” Bioinformatics, 32, pp. 3321–3323, 2016. [DOI] [PubMed] [Google Scholar]
  • [7].Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. (2002). The human genome browser at UCSC. Genome research, 12, 996–1006. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [8].Smit AF (1996). 2010 RepeatMasker Open-3.0. http://www.repeatmasker.org/.
  • [9].Benson G (1999). Tandem repeats finder: a program to analyze DNA sequences. Nucleic acids research, 27, 573. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [10].Altschul SF, Madden TL, Schffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research, 25, 3389–3402. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [11].Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup (2009). The sequence alignment/map format and SAMtools. Bioinformatics, 25, 2078–2079. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [12].Gurevich Alexey, email = ”alexeigurevich@gmail.com”, Affiliation = Research Scientist at Center for Algorithmic Biotechnology, Saint Petersburg State University. [Google Scholar]

RESOURCES