Abstract
Genomic regions of high segmental duplication content and/or structural variation have led to gaps and misassemblies in the human reference sequence, and are refractory to assembly from whole-genome short-read datasets. Human subtelomere regions are highly enriched in both segmental duplication content and structural variations, and as a consequence are both impossible to assemble accurately and highly variable from individual to individual. Recently, we developed a pipeline for improved region-specific assembly called Regional Extension of Assemblies Using Linked-Reads (REXTAL) [1]. In this study, we evaluate REXTAL and genome-wide assembly (Supernova; [2]) approaches on 10X Genomics linked-reads data sets partitioned and barcoded using the Gel Bead in Emulsion (GEM) microfluidic method [3]. Our results describe the accuracy and relative performance of these two approaches using the reference-based assessment module of QUAST [4]. We show that REXTAL dramatically outperforms the Supernova whole genome assembler in subtelomeric segmental duplication regions, and results in highly accurate assemblies. Nearly all of the REXTAL “misassemblies” identified using default QUAST parameters simply pinpoint locations of tandem repeat arrays in the reference sequence where the repeat array length differs from that in the cognate REXTAL assembly by >1000 bp.
Keywords: regional assembly, quality metric, segmental duplication, subtelomere, tandem repeat, misassembly, genome gap
I. Introduction
It is currently impossible to obtain complete de novo assemblies of segmentally duplicated genome regions from genome-wide short-read datasets. Even with paired-end reads from input molecules of various lengths, de novo assembly of human genomes has remained problematic because of abundant interspersed repeats and, especially, segmental duplication regions, which contain > 1 kbp segments of DNA with high (> 90%) sequence identity. A recently developed approach pioneered by 10X Genomics generates short-read datasets from large genomic DNA molecules that are first partitioned and barcoded using the Gel Bead in Emulsion (GEM) microfluidic method [3]. The bioinformatic pipeline for assembling these reads, Supernova [2], takes advantage of a large number of sets of linked reads. Each set of linked reads comprises low read coverage of a small number of large genomic DNA molecules (roughly 10) and is associated with a unique barcode. This approach enables efficient de novo assembly of the human genome, with large segments separable into haplotypes [2]. However, it does not solve the problem of segmental duplications such as those found in subtelomeres. To address this problem, we developed a computational method called REXTAL [1] for improved region-specific assembly of segmental duplication-containing DNA, leveraging the same GEM-partitioned and barcoded short-read datasets.
In this paper, we carry out a more extensive analysis of REXTAL and evaluate our regional assemblies with the reference-based assessment tool QUAST [4] on 17 subtelomeric DNA regions. We find dramatically improved coverage of subtelomeric segmental duplication regions in REXTAL assemblies relative to whole-genome assemblies, while assembly accuracy is maintained.
II. Experimental Details
In Subsection A, we present the input data description. Subsection B presents the overview of REXTAL methodology. In subsections C and D we describe QUAST analysis and visualization of our final assemblies.
A. Data
The input data for QUAST [4] are, for each region, the regional assembly generated by REXTAL [1] and the cognate subtelomeric reference sequence from HG38.
B. REXTAL Methodology
REXTAL [1] uses linked-read genome sequencing to extend subtelomere assemblies. It differs from the genome-wide assembly method in that barcode information is used, before assembly, to select reads from anticipated segmental duplication or gap regions adjacent to a specified 1-copy DNA segment. We first found reads matching the 1-copy DNA segment (the bait DNA segment) based upon the reference human genome (HG38), and then, in the read selection step, selected all reads carrying barcodes represented among these initial matching reads (Fig. 1). This set of reads should represent a very limited subset of all genomic reads, and approximately 10% of the barcode-selected reads should be derived specifically from the selected 1-copy DNA and 50 kbp-100 kbp segments of flanking DNA. Using the barcode read frequency range selection and barcode clustering pattern selection steps (Fig. 1), we selected all reads from a subset of these initial barcodes for assembly [1], enabling the extension of existing assemblies into adjacent segmental duplication and gap regions using the Supernova assembler [2]. Fig. 1 shows the overall REXTAL workflow.
Fig. 1.
Overview of REXTAL workflow.
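The read selection idea can be illustrated with a minimal sketch. It is not the REXTAL implementation: the function name, the `(read_id, barcode)` input layout, and the `min_count`/`max_count` cutoffs are hypothetical, and the frequency filter reflects only one plausible reading of the "barcode read frequency range selection" step. The barcode clustering pattern selection step is not modeled.

```python
from collections import Counter


def select_reads_by_barcode(reads, bait_read_ids, min_count=1, max_count=None):
    """Return IDs of all reads that share a barcode with a bait-matching read.

    reads         -- list of (read_id, barcode) pairs (hypothetical layout)
    bait_read_ids -- set of read IDs that matched the 1-copy bait segment
    min_count, max_count -- hypothetical cutoffs on the number of
                     bait-matching reads a barcode must carry to be kept
    """
    reads = list(reads)
    # Count, for every barcode, how many of its reads matched the bait.
    bait_hits = Counter(bc for rid, bc in reads if rid in bait_read_ids)
    # Keep barcodes whose bait-read count lies inside the chosen range.
    kept = {
        bc for bc, n in bait_hits.items()
        if n >= min_count and (max_count is None or n <= max_count)
    }
    # Select every read carrying a kept barcode, bait-matching or not.
    return [rid for rid, bc in reads if bc in kept]


# Toy data: three barcodes, three reads matching the bait.
toy_reads = [
    ("r1", "AAAC"), ("r2", "AAAC"), ("r3", "GGTT"),
    ("r4", "GGTT"), ("r5", "CCGA"), ("r6", "AAAC"),
]
toy_bait = {"r1", "r2", "r3"}

all_selected = select_reads_by_barcode(toy_reads, toy_bait)
strict_selected = select_reads_by_barcode(toy_reads, toy_bait, min_count=2)
```

With the toy data, reads r4 and r6 are pulled in through their barcodes even though they did not match the bait themselves, which is how flanking DNA becomes available for assembly.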
C. QUAST Analysis
QUAST [4] evaluates genome assemblies by computing various metrics from a global alignment of the test assembly with a reference sequence. To measure the quality of the assembly, we ran QUAST with the --scaffolds option (keeping all other parameters at their defaults), using the assembled scaffolds generated by REXTAL as input and, as reference sequence, the specified subtelomeric regions of HG38 corresponding to our unmasked single-copy bait segments along with their flanking reference DNA segments (including segmental duplication regions).
1). Scaffold:
Because REXTAL assemblies are scaffolds (rather than contigs) and we ran QUAST with the --scaffolds option, split versions of the assemblies (named <assembly_name>_broken) were added to the comparison. Assemblies are split at continuous runs of N’s of length ≥ 10. Scaffold gap size misassemblies are enabled in this mode, and we kept the default --scaffold-gap-max-size (10 kbp) as the maximum gap length.
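The splitting rule can be sketched in a few lines. This is an illustration of the rule as stated above (split at runs of at least 10 N's), not QUAST's own code; the function name and the naming of the resulting pieces are assumptions.

```python
import re


def break_scaffold(name, sequence, min_gap=10):
    """Split a scaffold at every run of >= min_gap N characters.

    Shorter N runs are left inside the resulting pieces.
    Piece names (name_1, name_2, ...) are hypothetical.
    """
    gap_pattern = re.compile(r"[Nn]{%d,}" % min_gap)
    # Drop empty strings produced by gaps at the scaffold ends.
    pieces = [p for p in gap_pattern.split(sequence) if p]
    return [(f"{name}_{i}", piece) for i, piece in enumerate(pieces, start=1)]


# A 10-N run splits the scaffold; the 3-N run does not.
toy_scaffold = "ACGT" + "N" * 10 + "GGCC" + "N" * 3 + "TT"
broken = break_scaffold("scaf1", toy_scaffold)
```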
2). Misassembly Detection:
QUAST [4] reports the number of misassemblies according to the misassembly breakpoint definition of Plantagora [5]. A misassembly breakpoint is a position in an assembled contig where the left flanking sequence aligns more than 1 kbp away from the right flanking sequence on the reference, or the two overlap by > 1 kbp, or the flanking sequences align on opposite strands or on different chromosomes. While running QUAST we kept the default threshold of 1 kbp for the --extensive-mis-size parameter. Most of the “misassemblies” called in REXTAL-generated assemblies relative to the reference arose because gap sizes within a contig slightly exceeded the QUAST default limit of 1000 bp.
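The breakpoint definition can be written as a simplified classifier. This sketch is not QUAST's algorithm: it assumes the two flanks are adjacent in the contig, that both are reported in forward reference order, and that the signed reference distance between them can stand in for QUAST's inconsistency value. The `Flank` type and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Flank:
    chrom: str
    ref_start: int  # 1-based, inclusive
    ref_end: int    # 1-based, inclusive
    strand: str     # "+" or "-"


def classify_breakpoint(left: Flank, right: Flank,
                        extensive_mis_size: int = 1000) -> Optional[str]:
    """Classify the junction between two adjacent flanks of one contig."""
    if left.chrom != right.chrom:
        return "translocation"
    if left.strand != right.strand:
        return "inversion"
    # Signed distance on the reference: positive is a gap, negative an overlap.
    distance = right.ref_start - left.ref_end - 1
    if abs(distance) > extensive_mis_size:
        return "relocation"
    return None  # within tolerance, not an extensive misassembly


# A 1512 bp gap is called a relocation at the default threshold
# but not once the threshold is raised to 2000.
left = Flank("chr16", 1, 10000, "+")
right = Flank("chr16", 11513, 20000, "+")
default_call = classify_breakpoint(left, right)
relaxed_call = classify_breakpoint(left, right, extensive_mis_size=2000)
```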
D. Visualization
We used Icarus [6], a genome visualizer for the assessment and analysis of genomic assemblies that is based on the QUAST quality assessment tool. The contig alignment viewer of Icarus has 2 parts: the bottom part shows the assembly overview, and the top part shows a detailed view of the region selected in the bottom part.
1). Broken scaffold:
To view only the split versions of the assemblies in the Icarus viewer, we followed these steps:
Step 1: We ran reference-based QUAST from the command line with the --scaffolds and --debug options [12].
Step 2: If an output path is not specified manually (with the -o option), QUAST writes its output to the quast_results/results_<DATE> directory. We chose the <assembly_name>_broken file in the quast_results/results_<DATE>/quast_corrected_input/ directory [12].
Step 3: We reran reference-based QUAST from the command line with the same reference, using <assembly_name>_broken instead of the original assembly and omitting the --scaffolds option.
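The two QUAST passes can be sketched as argument lists built in Python, without executing anything. The file names and output directories are placeholders. The `-r` (reference) and `-o` (output directory) options and the `quast.py` entry point are standard QUAST usage but are not named in the text above, and flag spellings may differ between QUAST releases, so the result should be checked against the installed version.

```python
import shlex


def first_pass_command(assembly, reference, out_dir):
    """Step 1: reference-based run that also emits the *_broken assembly."""
    return ["quast.py", assembly, "-r", reference, "-o", out_dir,
            "--scaffolds", "--debug"]


def second_pass_command(broken_assembly, reference, out_dir):
    """Step 3: rerun on the broken assembly, without --scaffolds."""
    return ["quast.py", broken_assembly, "-r", reference, "-o", out_dir]


# Placeholder paths; the broken file is the one picked in Step 2.
step1 = first_pass_command("rextal_scaffolds.fasta", "subtelomere_ref.fasta", "quast_pass1")
step3 = second_pass_command("path/to/broken_assembly.fasta", "subtelomere_ref.fasta", "quast_pass2")

# shlex.join renders each list as a shell-safe command line for inspection.
step1_line = shlex.join(step1)
step3_line = shlex.join(step3)
```

Keeping the commands as lists makes it straightforward to pass them to `subprocess.run` later without shell quoting issues.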
2). Tandem Repeat Marker:
Since QUAST does not yet provide a dedicated visualization for repeats [12], we used Tandem Repeats Finder [9] to screen the subtelomeric reference DNA segment sequences and mask their tandem repeats. We then used this masked reference as the input assembly for QUAST, with the same unmasked subtelomeric region as the reference, and followed the procedure described in subsubsection II-D1. We used the resulting broken masked reference to locate the positions of tandem repeats in the Icarus viewer (Fig. 2 – Fig. 5).
Fig. 2.
Contig alignment viewer of Icarus for the segmental duplication regions of 18p and 22q. Each viewer has 2 parts with 3 rows each. In the top part, the green bars of the first row represent the masked reference used as a tandem repeat marker; the white breaks in this track show the positions and sizes of tandem repeats in the reference. The 2nd and 3rd rows show the REXTAL and the genome-wide assemblies, respectively. The bottom part, also with 3 rows, represents the assembly overview, with the highlighted yellow box indicating the region expanded in the top 3 rows. A. The 2nd and 3rd rows of the top part represent the contigs generated by REXTAL and by the genome-wide method, respectively, for 18p. B. The 2nd and 3rd rows of the top part represent the contigs generated by REXTAL and by the genome-wide method, respectively, for 22q.
Fig. 5.
Contig alignment viewer of Icarus for the extension of the bait segment into adjacent DNA, including the segmental duplication region, for 17p and 2p. Each viewer has 2 parts with 3 rows each. In the top part, the green bars of the first row represent the masked reference used as a tandem repeat marker; the white breaks in this track show the positions and sizes of tandem repeats in the reference. The 2nd and 3rd rows show the REXTAL and the genome-wide assemblies, respectively. The bottom part, also with 3 rows, represents the assembly overview, with the highlighted yellow box indicating the region expanded in the top 3 rows. A. The 2nd row represents the contigs generated by REXTAL and the 3rd row represents the contigs generated by the genome-wide method for 17p. Four red blocks in one contig are called misassembled because of relocations with inconsistency values 1920, 1172, and 1055. B. The 2nd row represents the contigs generated by REXTAL and the 3rd row represents the contigs generated by the genome-wide method for 2p. The two red blocks represent a misassembly call caused by a 2935 bp gap between two blocks within a contig. Note that the “misassembled contig” is in fact a contig containing a gap that corresponds exactly to a tandem repeat. It is called a QUAST misassembly only because the gap exceeds the 1000 bp default when aligned to the reference sequence.
3). Comparative Analysis:
To visualize the comparison of REXTAL and the genome-wide method together with the tandem repeat marker, we ran reference-based QUAST with 3 input files: first the broken masked reference file as the tandem repeat marker, then the broken REXTAL assembly, and finally the broken genome-wide assembly.
III. Results And Discussions
The UCSC browser [7] was used to access HG38 and select subtelomere DNA segments for analysis. We tested REXTAL and the QUAST analysis on 17 human subtelomere regions (base pair coordinates listed are from HG38). The 2p subtelomere is a 500 kbp segment of 1-copy DNA (10,001-500,000); the 19p subtelomere has a very large segmental duplication region next to the telomere (10,001-259,447) followed by a 300 kbp 1-copy region (259,448-559,447); 10p has a smaller segmental duplication region near the telomere (10,001-88,570) followed by a 300 kbp 1-copy region (88,571-388,571); and 5p has multiple segmental duplication regions (10,001-49,495 and 210,596-305,378) separated and flanked by two 1-copy regions (49,496-210,595 and 305,379-510,000). For 16p, 16q, 17p, 17q, 18p, 18q, 19q, 20p, 20q, 21q, 22q, Xp, and Xq we extracted 100 kbp single-copy bait sequences as close as possible to the telomere. Table I lists the coordinates of each subtelomeric region.
TABLE I.
Coordinates of the subtelomeric regions extracted from the UCSC browser
| Region (a) | Ref (b) | Bait (c) | SD (d) | 1-copy (e) |
|---|---|---|---|---|
| 2p | 10,001-700,000 | 10,001-500,000 | N/A | 10,001-700,000 |
| 5p | 10,001-677,959 | 49,496-210,595 and 305,379-510,000 | 10,001-49,495 and 210,596-305,378 | 305,379-677,959 |
| 10p | 10,001-588,571 | 88,571-388,571 | 10,001-88,570 | 88,571-588,571 |
| 16p | 10,000-240,859 | 40,860-140,859 | 10,000-40,859 | 40,860-240,859 |
| 16q | 89,857,010-90,228,345 | 89,857,010-89,965,857 and 89,968,061-90,057,009 | 89,965,858-89,968,060 and 90,057,010-90,228,345 | 89,857,010-89,965,857 and 89,968,061-90,057,009 |
| 17p | 60,000-341,850 | 141,851-241,850 | 60,000-141,850 | 141,851-341,850 |
| 17q | 83,004,545-83,247,441 | 83,104,544-83,204,544 | 83,204,545-83,247,441 | 83,004,545-83,204,544 |
| 18p | 10,000-331,693 | 131,694-231,693 | 10,000-131,693 | 131,694-331,693 |
| 18q | 80,059,053-80,263,285 | 80,159,052-80,259,052 | 80,259,053-80,263,285 | 80,059,053-80,259,052 |
| 19p | 10,001-759,447 | 259,448-559,447 | 10,001-259,447 | 259,448-759,447 |
| 19q | 58,386,558-58,607,616 | 58,486,557-58,586,557 | 58,586,558-58,607,616 | 58,386,558-58,586,557 |
| 20p | 66,335-266,334 | 66,335-166,334 | N/A | 66,335-266,334 |
| 20q | 64,073,499-64,334,167 | 64,173,498-64,273,498 and 64,276,019-64,282,623 | 64,273,499-64,276,018 and 64,282,624-64,334,167 | 64,073,499-64,273,498 and 64,276,019-64,282,623 |
| 21q | 46,472,945-46,699,983 | 46,572,944-46,672,944 | 46,672,945-46,699,983 | 46,472,945-46,672,944 |
| 22q | 50,540,514-50,808,468 | 50,640,513-50,740,513 | 50,740,514-50,808,468 | 50,540,514-50,740,513 |
| Xp | 222,347-527,305 | 222,347-320,315 and 327,306-427,306 | 320,316-327,305 | 222,347-320,315 and 327,306-527,305 |
| Xq | 155,783,780-156,030,894 | 155,883,778-155,983,778 and 155,987,225-156,000,330 | 155,983,779-155,987,224 and 156,000,331-156,030,894 | 155,783,780-155,983,778 and 155,987,225-156,000,330 |
(a) Subtelomeric region.
(b) HG38 coordinates of the reference subtelomeric region.
(c) HG38 coordinates of the 1-copy subtelomeric bait region.
(d) HG38 coordinates of the subtelomeric segmental duplication region.
(e) HG38 coordinates of the entire subtelomeric 1-copy region.
For a fair comparison of REXTAL with the genome-wide assembly method, we used SAMtools [11] to extract all contigs in the genome-wide assembly that overlap the 1-copy bait sequences (including potential extensions into flanking DNA).
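The overlap criterion behind this extraction can be sketched in plain Python. The extraction itself was done with SAMtools; the sketch below only illustrates the interval test on already-parsed alignment records. The tuple layout and the contig names and coordinates are hypothetical; only the 17p bait coordinates are taken from Table I.

```python
def contigs_overlapping(alignments, chrom, bait_start, bait_end):
    """Return names of contigs with an alignment overlapping the bait.

    alignments -- list of (contig_name, chrom, ref_start, ref_end) tuples
                  with 1-based inclusive coordinates (hypothetical layout)
    """
    selected = []
    for name, aln_chrom, start, end in alignments:
        # Two closed intervals overlap when each starts before the other ends.
        overlaps = aln_chrom == chrom and start <= bait_end and end >= bait_start
        if overlaps and name not in selected:
            selected.append(name)
    return selected


# 17p bait coordinates from Table I; the contig records are made up.
toy_alignments = [
    ("contigA", "chr17", 100000, 150000),  # reaches into the bait
    ("contigB", "chr17", 250000, 300000),  # entirely beyond the bait
    ("contigC", "chr17", 240000, 260000),  # overlaps the bait end
    ("contigD", "chr2", 141851, 241850),   # same coordinates, other chromosome
]
bait_contigs = contigs_overlapping(toy_alignments, "chr17", 141851, 241850)
```

A contig such as contigA is retained in full, so any part of it extending past the bait into flanking DNA stays available for the comparison.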
A. QUAST report on genome fraction in segmental duplication region
Because segmental duplication regions contain segments of DNA with near-identical duplicated subtelomere sequences, these regions are hard to assemble de novo from whole-genome reads. REXTAL can extend assemblies into subtelomere segmental duplication regions. To measure the quality of REXTAL vs. genome-wide assemblies in segmental duplication regions, we ran reference-based QUAST for these regions (Table II). 2p and 20p do not have segmental duplication regions. 5p, 16q, 20q, and Xq have multiple segmental duplication regions. For 17p, the genome-wide method could not extend the assembly into the segmental duplication region.
TABLE II.
Comparison of QUAST result in segmental duplication region
| Region (a) | REXTAL: genome fraction (%) | REXTAL: misassemblies (b) | Genome-wide: genome fraction (%) | Genome-wide: misassemblies (c) |
|---|---|---|---|---|
| 2p | N/A | N/A | N/A | N/A |
| 5p_1st | 88.707 | 0 | 57.734 | 0 |
| 5p_2nd | 88.038 | 0 | 3.034 | 0 |
| 10p | 90.103 | 1 | 7.092 | 0 |
| 16p | 94.18 | 0 | 25.493 | 0 |
| 16q_1st | 100 | 0 | 100 | 0 |
| 16q_2nd | 35.423 | 1 | 16.549 | 0 |
| 17p | 21.52 | 0 | N/A | N/A |
| 17q | 85.12 | 1 | 0.494 | 0 |
| 18p | 69.028 | 0 | 9.701 | 0 |
| 18q | 82.731 | 0 | 60.761 | 0 |
| 19p | 28.14 | 0 | 2.139 | 0 |
| 19q | 92.906 | 0 | 1.22 | 0 |
| 20p | N/A | N/A | N/A | N/A |
| 20q_1st | 100 | 0 | 100 | 0 |
| 20q_2nd | 98.462 | 0 | 6.955 | 0 |
| 21q | 94.941 | 0 | 23.806 | 0 |
| 22q | 96.087 | 0 | 5.055 | 0 |
| Xp | 59.828 | 0 | 16.753 | 0 |
| Xq_1st | 100 | 0 | 100 | 0 |
| Xq_2nd | 90.211 | 0 | 62.174 | 0 |
(a) Subtelomeric region.
(b) “Misassemblies” are tandem repeat arrays in the reference sequence where the repeat array length differs from that in the cognate REXTAL assembly by > 1000 bp.
(c) Number of misassemblies in the genome-wide method.
Table II shows that the genome fractions obtained by REXTAL (2nd column) are substantially higher than those obtained by the genome-wide method (4th column) for every tested segmental duplication region except 16q_1st, 20q_1st, and Xq_1st, where both methods reach 100%. For 17p, only REXTAL extended into the segmental duplication region.
Fig. 2A and Fig. 2B show the visualization of QUAST analysis of 18p and 22q in segmental duplication regions.
B. QUAST report on misassembly in segmental duplication region
Generally, the QUAST report classifies misassembly events (using the Plantagora [5] definition) into three groups: relocations, translocations, and inversions (subsubsection II-C2).
The numbers of misassemblies obtained in segmental duplication regions by REXTAL and by the genome-wide method are shown in the 3rd and 5th columns of Table II, respectively. QUAST reported one misassembly each for 10p, 16q_2nd (the 2nd segmental duplication region of 16q), and 17q, and according to the QUAST report all three were relocations. Fig. 3 shows the misassembled contig (two red blocks) for the segmental duplication region of 16q_2nd. The cause of the misassembly call was a relocation with inconsistency = 1512. Because the top green bars represent the tandem repeat marker and the gaps between them represent tandem repeat regions, the figure shows that the misassembly call falls in a tandem repeat region. The two misassembled blocks are in one contig. The genome-wide method could not extend the assembly up to this point. We ran QUAST with the default value of the --extensive-mis-size parameter, which is 1000. With a higher value of --extensive-mis-size, these misassemblies would not be called.
Fig. 3.
Contig alignment viewer of Icarus for the segmental duplication region of 16q_2nd. In the top part, the green bars of the first row represent the masked reference used as a tandem repeat marker; the white breaks in this track show the positions and sizes of tandem repeats in the reference. The 2nd row represents the contigs generated by REXTAL; the two red blocks represent the misassembled contig, with a gap of 1512 bp. The 3rd row is reserved for the contigs generated by the genome-wide method for the segmental duplication region of 16q_2nd and is empty because the genome-wide method could not extend the assembly up to this point. The bottom three rows represent the assembly overview, with the highlighted yellow box indicating the region expanded in the top 3 rows. Note that the “misassembled contig” is in fact a contig containing a gap that corresponds exactly to a tandem repeat. It is called a QUAST misassembly only because the gap exceeds the 1000 bp default when aligned to the reference sequence.
C. QUAST report on the bait segment into adjacent 1-copy region
As reference, we extracted from the UCSC genome browser the subtelomeric region containing the 1-copy bait region and the adjacent 1-copy region, in order to compare how REXTAL and the genome-wide method extend assemblies from the bait segment into the adjacent 1-copy region. To show the quality of the REXTAL vs. genome-wide assemblies, we ran QUAST with both assemblies (Table III).
TABLE III.
Comparison of QUAST result in the bait segment into adjacent 1-copy region
| Region (a) | REXTAL: genome fraction (%) | REXTAL: misassemblies (b) | Genome-wide: genome fraction (%) | Genome-wide: misassemblies (c) |
|---|---|---|---|---|
| 2p | 96.226 | 1 | 96.839 | 0 |
| 5p_1st | 93.471 | 0 | 94.818 | 0 |
| 5p_2nd | 91.68 | 1 | 93.789 | 1 |
| 10p | 97.966 | 1 | 98.142 | 0 |
| 16p | 75.873 | 0 | 55.733 | 0 |
| 16q_1st | 96.759 | 3 | 96.146 | 0 |
| 16q_2nd | 97.719 | 1 | 96.606 | 0 |
| 17p | 68.839 | 5 | 39.774 | 0 |
| 17q | 77.839 | 1 | 45.235 | 0 |
| 18p | 93.448 | 0 | 99.71 | 0 |
| 18q | 87.391 | 0 | 98.668 | 1 |
| 19p | 87.173 | 2 | 85.141 | 0 |
| 19q | 81.459 | 1 | 55.077 | 0 |
| 20p | 86.243 | 1 | 100.00 | 0 |
| 20q_1st | 71.204 | 0 | 53.714 | 0 |
| 20q_2nd | 100.00 | 0 | 100.00 | 0 |
| 21q | 73.621 | 0 | 99.974 | 0 |
| 22q | 70.74 | 0 | 96.547 | 1 |
| Xp_1st | 72.381 | 1 | 26.064 | 1 |
| Xp_2nd | 72.103 | 0 | 13.322 | 0 |
| Xq_1st | 87.678 | 0 | 90.088 | 0 |
| Xq_2nd | 100.00 | 0 | 100.00 | 0 |
(a) Subtelomeric region.
(b) “Misassemblies” are tandem repeat arrays in the reference sequence where the repeat array length differs from that in the cognate REXTAL assembly by > 1000 bp.
(c) Number of misassemblies in the genome-wide method.
Fig. 4A and 4B show the contig alignment viewer of Icarus in the bait segment into adjacent 1-copy region for 19q and 17q correspondingly.
Fig. 4.
Contig alignment viewer of Icarus for the 1-copy regions of 19q and 17q. Each viewer has 2 parts with 3 rows each. In the top part, the green bars of the first row represent the masked reference used as a tandem repeat marker; the white breaks in this track show the positions and sizes of tandem repeats in the reference. The 2nd and 3rd rows show the REXTAL and the genome-wide assemblies, respectively. The bottom part, also with 3 rows, represents the assembly overview, with the highlighted yellow box indicating the region expanded in the top 3 rows. A. The 2nd row (including the expansion (yellow area)) represents the contigs generated by REXTAL and the 3rd row represents the contigs generated by the genome-wide method for 19q. The expanded version of the 2nd row shows that the misassembled contig has seven blocks, two of which (red blocks) are called misassembled because of a relocation with inconsistency = 1115. This misassembled contig is located entirely within another, higher-quality contig (1 green block in the 2nd row). B. The 2nd row represents the contigs generated by REXTAL and the 3rd row represents the contigs generated by the genome-wide method for 17q. The misassembled contig has four blocks (in the assembly overview image, a light yellow rectangle marks the selected region and four down arrows (↓) mark the four blocks in one contig). Two of these blocks (red blocks) are called misassembled, with inconsistency = 1168. These two blocks are in one contig in the REXTAL assembly but in two different contigs in the genome-wide assembly. In the selected region, the genome-wide method has seven separately assembled contigs, whereas REXTAL has one contig with four blocks separated by gaps. Note that the “misassembled contig” is in fact a contig containing a gap that corresponds exactly to a tandem repeat. It is called a QUAST misassembly only because the gap exceeds the 1000 bp default when aligned to the reference sequence.
For 19q, QUAST reports 1 misassembly in the 1-copy region for REXTAL (3rd column of Table III). Fig. 4A shows that the misassembled contig is a small contig whose seven small blocks match the reference less accurately (96% – 98% identity) than a larger, more closely matching contig (99.92% identity). Two of the seven blocks (red blocks) are called misassembled because of a relocation with inconsistency value 1115. However, this misassembled contig is located entirely within the other, higher-quality (99.92% identity) contig (1 green block in the 2nd row of Fig. 4A). To avoid this situation, in our prior work we proposed a metric called Length-wise Assembled Fraction (LAF) [1] for measuring the quality of regional assemblies. Before measuring quality, we extracted reference sequences from HG38 and aligned them with the corresponding assembled scaffolds using BLAST [10], requiring ≥98% identity for retention of each local alignment. This yields the position of each local alignment, including query start and query end positions. The local alignments were sorted by query start position in increasing order and then merged by (1) deleting local alignments located entirely within other higher-quality alignments; and (2) for local alignments with partial overlap, resolving the overlap region by selecting the alignment with equivalent or higher % identity in that region [1]. The LAF metric thus disregards secondary, more weakly matching assemblies like the one shown above.
In Fig. 4B, for 17q, the misassembled contig has four blocks (in the assembly overview image, a light yellow rectangle marks the selected region and four down arrows (↓) mark the four blocks in one contig). Two of these blocks (red blocks) are called misassembled because of a relocation with inconsistency value 1168. These two blocks are in one contig in the REXTAL assembly but in two different contigs in the genome-wide assembly. Overall, in the selected region the genome-wide assembly has seven separate contigs, whereas REXTAL has one contig with four blocks separated by gaps. The gaps all correspond to tandem repeat regions, which REXTAL was able to span by placing gaps within a contig rather than creating separate contigs. These misassembly calls can be avoided by running QUAST with a slightly higher value of the --extensive-mis-size parameter.
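The alignment filtering and merging described for LAF can be sketched as follows. This is a minimal illustration, not the published implementation [1]. The `(start, end, identity)` tuple layout and the function name are hypothetical, and the final ratio (covered length divided by region length) is an assumption about how the fraction is formed. Rule (2) only decides which alignment is credited for a shared stretch, so the covered length equals the union of the retained intervals, which is what the sketch computes.

```python
def length_wise_assembled_fraction(alignments, region_length, min_identity=98.0):
    """Filter and merge local alignments, then return (kept, fraction).

    alignments    -- list of (start, end, identity) tuples in reference
                     coordinates, 1-based inclusive (hypothetical layout)
    region_length -- length of the reference region being assessed
    """
    # Retain alignments meeting the identity cutoff; sort by start position.
    alns = sorted({a for a in alignments if a[2] >= min_identity})
    # Rule (1): drop alignments lying entirely inside another alignment
    # of equal or higher identity.
    kept = [
        a for a in alns
        if not any(b != a and b[0] <= a[0] and a[1] <= b[1] and b[2] >= a[2]
                   for b in alns)
    ]
    # Covered length is the union of the retained intervals.
    covered = 0
    cur_start = cur_end = None
    for start, end, _ in kept:
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return kept, covered / region_length


# Toy alignments: one weak hit below the cutoff, one nested secondary hit.
toy = [
    (1, 5000, 99.92),
    (1000, 2000, 98.2),   # nested inside the 99.92% alignment, dropped
    (1500, 1800, 97.0),   # below the 98% cutoff, dropped
    (4500, 8000, 99.0),   # partial overlap with the first alignment
    (9000, 10000, 98.5),
]
kept_alignments, laf = length_wise_assembled_fraction(toy, region_length=10000)
```

In the toy data the nested 98.2% alignment plays the role of the small secondary contig in Fig. 4A: it is discarded before any coverage is counted.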
D. QUAST report on the bait segment into adjacent DNA including segmental duplication region
As reference, we extracted from the UCSC genome browser the subtelomeric region containing the 1-copy bait region, the adjacent 1-copy region, and the segmental duplication region, in order to compare how REXTAL and the genome-wide method extend assemblies from the bait segment into adjacent DNA.
To show the quality of the REXTAL vs. genome-wide assemblies, we ran reference-based QUAST for these regions and compared the results in Table IV.
TABLE IV.
Comparison of QUAST result in the bait segment into adjacent DNA including segmental duplication region
| Region (a) | REXTAL: genome fraction (%) | REXTAL: misassemblies (b) | Genome-wide: genome fraction (%) | Genome-wide: misassemblies (c) |
|---|---|---|---|---|
| 2p | 75.535 | 1 | 71.337 | 0 |
| 5p | 75.448 | 1 | 70.945 | 1 |
| 10p | 72.933 | 2 | 53.707 | 0 |
| 16p | 78.2 | 0 | 51.56 | 0 |
| 16q | 68.623 | 2 | 59.263 | 0 |
| 17p | 54.829 | 5 | 28.38 | 0 |
| 17q | 78.813 | 2 | 37.142 | 0 |
| 18p | 84.151 | 0 | 65.66 | 0 |
| 18q | 87.138 | 0 | 96.764 | 1 |
| 19p | 51.108 | 3 | 35.316 | 1 |
| 19q | 78.18 | 0 | 47.087 | 0 |
| 20p | 86.208 | 1 | 99.999 | 0 |
| 20q | 77.442 | 0 | 46.099 | 0 |
| 21q | 75.861 | 0 | 90.902 | 0 |
| 22q | 77.156 | 0 | 73.266 | 1 |
| Xp | 70.569 | 1 | 16.982 | 0 |
| Xq | 88.709 | 0 | 88.709 | 0 |
(a) Subtelomeric region.
(b) “Misassemblies” are tandem repeat arrays in the reference sequence where the repeat array length differs from that in the cognate REXTAL assembly by > 1000 bp.
(c) Number of misassemblies in the genome-wide method.
1). Analysis of genome fraction and misassemblies:
REXTAL has a higher genome fraction than the whole-genome assembly for all regions except 18q, 20p, and 21q, where the genome-wide method is higher, and Xq, where the two are equal (2nd and 4th columns of Table IV). In subsection III-A we showed that, for 18q and 21q, REXTAL nevertheless extends well into the segmental duplication region (2nd column of Table II). 20p consists entirely of single-copy sequence, and there the genome-wide method gave a better genome fraction than REXTAL.
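The comparison can be checked directly from the values in Table IV. The short sketch below transcribes the two genome fraction columns and sorts the regions into three groups; the variable names are illustrative only.

```python
# Genome fraction (%) from Table IV: region -> (REXTAL, genome-wide).
TABLE_IV = {
    "2p": (75.535, 71.337), "5p": (75.448, 70.945), "10p": (72.933, 53.707),
    "16p": (78.2, 51.56), "16q": (68.623, 59.263), "17p": (54.829, 28.38),
    "17q": (78.813, 37.142), "18p": (84.151, 65.66), "18q": (87.138, 96.764),
    "19p": (51.108, 35.316), "19q": (78.18, 47.087), "20p": (86.208, 99.999),
    "20q": (77.442, 46.099), "21q": (75.861, 90.902), "22q": (77.156, 73.266),
    "Xp": (70.569, 16.982), "Xq": (88.709, 88.709),
}

# Partition the regions by which method covers more of the reference.
rextal_higher = [r for r, (rex, gw) in TABLE_IV.items() if rex > gw]
genome_wide_higher = [r for r, (rex, gw) in TABLE_IV.items() if rex < gw]
tied = [r for r, (rex, gw) in TABLE_IV.items() if rex == gw]
```

REXTAL is higher in 13 of the 17 regions, the genome-wide method in 3, and the two are tied in 1.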
Fig. 5A and Fig. 5B show the contig alignment viewer of Icarus in the bait segment into adjacent DNA including segmental duplication region for 17p and 2p correspondingly.
For 17p, QUAST reports a total of 5 misassemblies (3rd column of Table IV) for the bait segment extended into adjacent DNA. Fig. 5A shows four red blocks in one contig that were called misassembled because of relocations with inconsistency values 1920, 1172, and 1055; the genome-wide method instead has five separate contigs here and therefore has no such misassembly calls.
In Fig. 5B, a similar case occurs for 2p in a tandem repeat region, where the misassembly call is caused by the gap (inconsistency = 2935) between two blocks within a contig. The genome-wide method assembled these two blocks as two separate contigs.
For both 17p and 2p (Fig. 5), the “misassembled contig” is in fact a contig containing a gap that corresponds exactly to a tandem repeat. It is called a QUAST misassembly only because the gap exceeds the 1000 bp default when aligned to the reference sequence. These errant misassembly calls can be avoided by running QUAST with a higher value of the --extensive-mis-size parameter.
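The effect of the threshold can be illustrated with the inconsistency values quoted in the text and figure captions. The sketch below applies only the size criterion to those values; it does not rerun QUAST, and the function name and the alternative thresholds (2000 and 3000) are chosen purely for illustration.

```python
# Inconsistency values (bp) reported in the text for REXTAL relocations.
REPORTED_INCONSISTENCY = {
    "16q_2nd": [1512],
    "19q": [1115],
    "17q": [1168],
    "17p": [1920, 1172, 1055],
    "2p": [2935],
}


def relocations_called(inconsistencies, extensive_mis_size=1000):
    """Keep only the values that still exceed the relocation threshold."""
    called = {}
    for region, values in inconsistencies.items():
        remaining = [v for v in values if v > extensive_mis_size]
        if remaining:
            called[region] = remaining
    return called


at_default = relocations_called(REPORTED_INCONSISTENCY)        # threshold 1000
at_2000 = relocations_called(REPORTED_INCONSISTENCY, 2000)
at_3000 = relocations_called(REPORTED_INCONSISTENCY, 3000)
```

At the default threshold all seven values are called; at 2000 only the 2935 bp gap in 2p remains; at 3000 none remain. A higher threshold also tolerates genuine relocations of the same size, so the value is best chosen with the expected tandem repeat length variation in mind.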
IV. Conclusion
We successfully used REXTAL [1] on 17 subtelomeric bait regions and extended the assembly of single-copy diploid DNA into adjacent DNA, including otherwise inaccessible subtelomere segmental duplication regions. We evaluated REXTAL and genome-wide assembly using the reference-based assessment module of QUAST and showed that REXTAL dramatically outperformed the Supernova whole-genome assembler in subtelomeric segmental duplication regions and produced highly accurate assemblies. In future experiments, we will combine REXTAL and Nanopore single-read datasets to achieve complete long-range assemblies throughout all human subtelomere regions.
Acknowledgments
The work in this paper is supported in part by NIH R21CA177395 (HR and MX), and Modeling and Simulation Scholarship (to TI) from Old Dominion University.
Contributor Information
Tunazzina Islam, Department of Computer Science, Old Dominion University, Norfolk, VA, USA.
Desh Ranjan, Department of Computer Science, Old Dominion University, Norfolk, VA, USA.
Mohammad Zubair, Department of Computer Science, Old Dominion University, Norfolk, VA, USA.
Eleanor Young, School of Biomedical Engineering, Drexel University, Philadelphia, PA, USA.
Ming Xiao, Institute of Molecular Medicine and Infectious Disease, Drexel University, Philadelphia, PA, USA.
Harold Riethman, School of Medical Diagnostic & Translational Sciences, Old Dominion University, Norfolk, VA, USA.
References
- [1] Islam T et al., “REXTAL: Regional Extension of Assemblies Using Linked-Reads,” International Symposium on Bioinformatics Research and Applications, pp. 63–78, 2018.
- [2] Weisenfeld NI, Kumar V, Shah P, Church DM, Jaffe DB, “Direct determination of diploid genome sequences,” Genome research, 27, pp. 757–767, 2017.
- [3] Zheng GX et al., “Haplotyping germline and cancer genomes with high-throughput linked-read sequencing,” Nature biotechnology, 34, pp. 303–311, 2016.
- [4] Gurevich A, Saveliev V, Vyahhi N, Tesler G, “QUAST: quality assessment tool for genome assemblies,” Bioinformatics, 29, pp. 1072–1075, 2013.
- [5] Barthelson R et al., “Plantagora: modeling whole genome sequencing and assembly of plant genomes,” PLoS One, 6:e28436, 2011.
- [6] Mikheenko A, Valin G, Prjibelski A, Saveliev V, Gurevich A, “Icarus: visualizer for de novo assembly evaluation,” Bioinformatics, 32, pp. 3321–3323, 2016.
- [7] Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D, “The human genome browser at UCSC,” Genome research, 12, pp. 996–1006, 2002.
- [8] Smit AF, “RepeatMasker Open-3.0,” 1996–2010. http://www.repeatmasker.org/.
- [9] Benson G, “Tandem repeats finder: a program to analyze DNA sequences,” Nucleic acids research, 27, p. 573, 1999.
- [10] Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ, “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs,” Nucleic acids research, 25, pp. 3389–3402, 1997.
- [11] Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup, “The sequence alignment/map format and SAMtools,” Bioinformatics, 25, pp. 2078–2079, 2009.
- [12] Gurevich A, Research Scientist, Center for Algorithmic Biotechnology, Saint Petersburg State University. Email: alexeigurevich@gmail.com.





