Abstract
It is currently impossible to get complete de-novo assembly of segmentally duplicated genome regions using genome-wide short-read datasets. Here, we devise a new computational method called Regional Extension of Assemblies Using Linked-Reads (REXTAL) for improved region-specific assembly of segmental duplication-containing DNA, leveraging genomic short-read datasets generated from large DNA molecules partitioned and barcoded using the “Gel Bead in Emulsion” (GEM) microfluidic method (Zheng et al., 2016). We show that using REXTAL, it is possible to extend assembly of single-copy diploid DNA into adjacent, otherwise inaccessible subtelomere segmental duplication regions and other subtelomeric gap regions. Moreover, REXTAL is computationally more efficient for the directed assembly of such regions from multiple genomes (e.g., for the comparison of structural variation) than genome-wide assembly approaches.
Keywords: 10X sequencing, Linked-read sequencing, Subtelomere, assembly, segmental duplication, structural variation, genome gaps
1. INTRODUCTION
Massively parallel short-read DNA sequencing has dramatically reduced the cost and increased the throughput of DNA sequence acquisition; it is now cheap and straightforward to do a variety of whole-genome analyses by comparing datasets of newly sequenced genomes with the human reference sequence. However, even with the use of paired-end read approaches using input molecules of various lengths, de novo assembly of human genomes has remained problematic because of abundant interspersed repeats. A recently developed approach pioneered by 10X Genomics generates short-read datasets from large genomic DNA molecules first partitioned and barcoded using the “Gel Bead in Emulsion” (GEM) microfluidic method (Zheng et al., 2016). The bioinformatic pipeline for assembly of these reads (“Supernova”; Weisenfeld et al., 2017) takes advantage of the very large number of sets of “linked reads”. Each set of linked reads comprises low-coverage reads from a small number of large genomic DNA molecules (roughly 10) and is associated with a unique barcode. This approach enables efficient de novo assembly of much of the human genome, with large segments separable into haplotypes (Weisenfeld et al., 2017). However, even with these new methods, evolutionarily recent segmentally duplicated DNA such as that found in subtelomere regions remains inaccessible to de novo assembly because of the long stretches of highly similar (> 95% identity) DNA. The problem for subtelomere DNA analysis is amplified by the relative lack of high-quality reference assemblies and the abundance of structural variation in these regions. To address this problem and better assemble human subtelomere regions, we have developed a computational approach designed to leverage linked-reads from genomic GEM datasets to extend de novo assemblies from subtelomeric 1-copy DNA regions into adjacent segmentally duplicated and gap regions of human subtelomeres.
Figure 1 illustrates conceptually what the “Gel Bead in Emulsion” (GEM) microfluidic method (Zheng et al., 2016) enables. There are approximately one million partitions, each with a unique barcode. Each partition receives approximately 10 molecules, each approximately 50 kb–100 kb long. Short reads of length 150 bases are obtained from these molecules, with the barcode for the partition attached at the beginning of the first read in a pair (Weisenfeld et al., 2017). Sets of read pairs sharing the same barcode are called linked-reads.
Figure 1.
Conceptual description of the GEM microfluidic method. Circles (blue, magenta) represent gel beads. Each bead contains many copies of a 16-base barcode (rectangles inside the circle) unique to that bead. Each partition gets one gel bead. The 10 curved lines inside the large square (which represents a partition) represent molecules of length approximately 50 kb–100 kb. The green and orange ovals represent short reads of length 150 bases obtained from these molecules (curved lines).
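The linked-read concept described above can be illustrated with a minimal Python sketch. It assumes raw read pairs in which the 16-base barcode is still the prefix of the first read; the function name `group_by_barcode` and the toy sequences are hypothetical, and the sketch ignores the barcode error correction and whitelisting that Long Ranger Basic performs on real data.

```python
from collections import defaultdict

BARCODE_LEN = 16  # 16-base barcode, as stated in the Figure 1 caption


def group_by_barcode(read_pairs):
    """Group read pairs into linked-read sets keyed by barcode.

    read_pairs: iterable of (read_id, read1_seq, read2_seq) tuples in which
    read1_seq still carries the barcode as its first 16 bases (assumption).
    """
    linked = defaultdict(list)
    for read_id, read1, read2 in read_pairs:
        barcode = read1[:BARCODE_LEN]
        # Store the first read without its barcode prefix.
        linked[barcode].append((read_id, read1[BARCODE_LEN:], read2))
    return dict(linked)


# Toy example: two pairs share one barcode, a third pair has another.
pairs = [
    ("r1", "ACGTACGTACGTACGT" + "TTTTGGGG", "CCCCAAAA"),
    ("r2", "ACGTACGTACGTACGT" + "GGGGCCCC", "AAAATTTT"),
    ("r3", "TTTTAAAACCCCGGGG" + "ACACACAC", "GTGTGTGT"),
]
linked_reads = group_by_barcode(pairs)
```

Here `linked_reads` maps each barcode to the read pairs that carry it, which is the grouping that all later selection steps rely on.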
Supernova assembly (Weisenfeld et al., 2017) takes advantage of linked reads to separate haplotypes over long distances, and these separated haplotypes are represented as “megabubbles” in the assembly. Chains of megabubbles generate scaffolds (Weisenfeld et al., 2017). Supernova uses the barcode information after initial whole-genome assembly to bridge long gaps. It finds all reads whose barcodes are present in the sequence adjacent to the left and right sides of an assembly gap, then assembles this set of reads and tries to fill the gap (Weisenfeld et al., 2017). We refer to this method as the “genome-wide assembly method”. As in all genome-wide assemblies, reads from evolutionarily recent segmental duplications such as those near subtelomeres are collapsed into artifactual DNA segment assemblies; these assembly artifacts are typically either located at a single genomic locus or excluded entirely from the initially assembled genome (Alkan et al., 2011). REXTAL differs from the genome-wide assembly method in that we use the barcode information to select reads from anticipated segmental duplication or gap regions adjacent to a specified 1-copy DNA segment before doing the assembly. We initially find reads matching the 1-copy DNA segment (“bait DNA segment”) based upon the reference human genome (HG38), then select all reads for barcodes represented in these initial matching reads. This set of reads should represent a very limited subset of all genomic reads, and approximately 10% of the barcode-selected reads should be derived specifically from the selected 1-copy DNA and 50–100 kb segments of flanking DNA. We show here that this is indeed the case, enabling extension of existing assemblies into adjacent segmental duplication and gap regions.
While the primary motivation of our work is to improve the assembly of subtelomeric gap regions and extend the assembly to inaccessible subtelomere segmental duplication regions of genomes of human individuals from their 10X genomic data, REXTAL can be applied more generally for enriching region-specific linked reads and improving the assembly of any specified 1-copy genome region of an individual from any species for which a reference genome exists. For targeted region-specific assemblies from many individuals for which 10X datasets are available (e.g., analysis of structural variation at specific loci), REXTAL is faster and more accurate than the genome-wide assembly method. In this scenario, the genome-wide approach requires assembling the whole genome of each individual and then extracting the assembled portion of the specific region. With REXTAL, we first extract the reads for the specific region from the 10X dataset by aligning with a 1-copy segment of the reference genome, and then use our bioinformatic pipeline to do the assembly.
2. METHOD
In subsection 2.1, we present the input data description. Subsection 2.2 presents processing of raw data to get our key input data. In subsections 2.3, 2.4, and 2.5 we show our assembly pipeline step by step. Subsection 2.6 shows further analysis after assembly.
2.1. Data
The key input data are 10X Genomics linked-reads from individual human genomes, in our case from the genome of the publicly available cell line GM19440. Our dataset has approximately 1.49 billion 10X Genomics linked-reads in paired-end format, with each read about 150 bp. The Supernova whole-genome assembly using these data had an overall coverage of 103 and a scaffold N50 of 19.1 Mb. The Loupe file shows a mean depth coverage of 67.4. Human reference genome assembly HG38 was used to select test subtelomere regions for the targeted assemblies.
2.2. Data Processing
We processed the raw 10X Genomics data using the Long Ranger Basic software developed by 10X Genomics (and freely available to any researcher) to generate barcode-filtered 10XG linked-reads. The Long Ranger Basic pipeline performs basic read and barcode processing including read trimming, barcode error correction, barcode whitelisting, and attaching barcodes to reads. We used the UCSC browser (Kent W. J. et al., 2002) to access HG38 and selected subtelomere DNA segments for analysis.
2.3. Alignment of Subtelomeric Region with Linked-Reads
2.3.1. Masking out repeats
We used RepeatMasker (Smit, 1996) and Tandem Repeats Finder (Benson, 1999) to screen bait DNA segment sequences for interspersed repeats, low complexity DNA sequences, and tandem repeats in order to minimize the possibility of false-positive contaminant read identification in the initial selection of reads matching specified 1-copy DNA segments.
2.3.2. Alignment using BLAT
We used BLAT (BLAST-like alignment tool) (Kent W. J., 2002) with default parameters to align the masked subtelomeric region against the genome-wide reads from GM19440.
2.3.3. Reads Selection
The output from 2.3.2 gives the reads that have a good “match” with a given subtelomeric bait region. However, many reads that originated within this subtelomeric region could have been missed because of the repeat masking done in 2.3.1. More importantly, we were especially interested in capturing reads from the large source DNA molecules extending from the flanks of the targeted 1-copy bait segment. We therefore initially collected all reads that shared a barcode with any read matching the 1-copy segment.
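The barcode-based collection step just described can be sketched as follows. This is an illustrative sketch only: the function name `select_reads_by_barcode`, the read identifiers, and the in-memory dictionary are assumptions, whereas a real dataset of this size would be streamed from FASTQ files.

```python
def select_reads_by_barcode(matching_read_ids, read_to_barcode):
    """Return the bait barcodes and every read that shares one of them.

    matching_read_ids: IDs of reads that matched the masked bait segment.
    read_to_barcode: dict mapping each read ID in the dataset to its barcode.
    """
    # Barcodes seen on at least one bait-matching read.
    bait_barcodes = {
        read_to_barcode[read_id]
        for read_id in matching_read_ids
        if read_id in read_to_barcode
    }
    # All reads carrying those barcodes, including reads that did not match
    # the bait themselves (masked repeats, flanking DNA).
    selected = sorted(
        read_id
        for read_id, barcode in read_to_barcode.items()
        if barcode in bait_barcodes
    )
    return bait_barcodes, selected


read_to_barcode = {"r1": "BC_A", "r2": "BC_A", "r3": "BC_B", "r4": "BC_C", "r5": "BC_B"}
blat_hits = ["r1", "r3"]
bait_barcodes, selected_reads = select_reads_by_barcode(blat_hits, read_to_barcode)
```

In the toy example, reads `r2` and `r5` are recovered through their barcodes even though they did not match the bait, while `r4` is excluded.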
2.4. Barcode Frequency Range and Clustering Pattern Selection
We further reduced this subset of selected reads based on the frequency of occurrence and the clustering pattern of reads from each barcode identified as matching within the specified 1-copy segment. We estimated that each barcode should have approximately 800 reads based upon the following calculation: we assumed 1 million partitions per library, with each partition containing 10 molecules of 50 kb each (Weisenfeld et al., 2017). With a read length of 150 bp and 0.25X coverage of each single molecule in the partition, we should have approximately (0.25 × 500000 bp)/150 = 833 reads with each barcode. For each barcode, approximately 1/10 of these reads (about 80) should originate from a single locus, and since about 50% of the bait locus (the specified 1-copy region used for BLAT) is masked, about 40 reads/partition should be matched if the entire 50 kb is within the bait locus. If the source DNA molecule partially overlaps the bait locus and extends into the adjacent region, then this number would be smaller and dependent on the extent of the overlap. So, a key challenge was to identify the range of matching reads for each barcode that would minimize inclusion of false positive barcodes while maximizing inclusion of true positive barcodes that would permit extension of the assembly into adjacent DNA. Histogram analysis of the frequency of occurrence of each barcode revealed vast over-representation of barcodes with one or two reads, so we required a minimum of three reads per barcode in order to include that barcode for read selection. In addition, we required all matching reads from a single barcode to originate within less than the estimated maximum input molecule size of 100 kb within a given bait region in order to qualify for inclusion.
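The read-count estimate above can be reproduced with a few lines of Python. The constants are the values assumed in the text; the variable names are only illustrative.

```python
# Assumptions taken from the text above.
MOLECULES_PER_PARTITION = 10
MOLECULE_LENGTH_BP = 50_000
READ_LENGTH_BP = 150
PER_MOLECULE_COVERAGE = 0.25
MASKED_FRACTION = 0.5  # about 50% of the bait locus is masked

# Total DNA in one partition: 10 molecules x 50 kb = 500,000 bp.
partition_dna_bp = MOLECULES_PER_PARTITION * MOLECULE_LENGTH_BP

# Expected reads per barcode at 0.25X coverage with 150-base reads.
reads_per_barcode = PER_MOLECULE_COVERAGE * partition_dna_bp / READ_LENGTH_BP

# About one tenth of those reads come from any single molecule (locus).
reads_per_molecule = reads_per_barcode / MOLECULES_PER_PARTITION

# Only the unmasked half of the bait can be matched by BLAT.
expected_bait_matches = reads_per_molecule * (1 - MASKED_FRACTION)

print(round(reads_per_barcode))      # 833
print(round(reads_per_molecule))     # 83
print(round(expected_bait_matches))  # 42
```

The unrounded values (about 833, 83, and 42) agree with the rounded figures of roughly 800, 80, and 40 quoted in the text.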
We then empirically tested a variety of barcode frequency ranges meeting both of the above requirements for final read selection, using the ability of the selected reads to assemble the original bait region and extend into flanking DNA as the metric for optimization as described below.
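A minimal sketch of the two barcode requirements (frequency range and clustering within 100 kb) is given below. The function name `filter_barcodes`, the use of alignment start positions to measure the span, and the default range of 3–70 (one of the ranges tested in this work) are assumptions made for illustration, not a description of the actual implementation.

```python
from collections import defaultdict


def filter_barcodes(hits, min_reads=3, max_reads=70, max_span_bp=100_000):
    """Keep barcodes whose bait-matching reads satisfy both requirements.

    hits: iterable of (barcode, bait_position) pairs, one per matching read,
    where bait_position is the read's alignment start on the bait segment.
    """
    positions = defaultdict(list)
    for barcode, position in hits:
        positions[barcode].append(position)

    kept = set()
    for barcode, pos_list in positions.items():
        # Requirement 1: number of matching reads within the chosen range.
        in_range = min_reads <= len(pos_list) <= max_reads
        # Requirement 2: all matching reads fall within less than 100 kb.
        clustered = max(pos_list) - min(pos_list) < max_span_bp
        if in_range and clustered:
            kept.add(barcode)
    return kept


hits = [
    ("A", 1_000), ("A", 20_000), ("A", 60_000),    # 3 reads, 59 kb span: kept
    ("B", 500), ("B", 900),                        # only 2 reads: dropped
    ("C", 1_000), ("C", 50_000), ("C", 250_000),   # 249 kb span: dropped
]
kept_barcodes = filter_barcodes(hits)
```

Changing `min_reads` and `max_reads` corresponds to testing different barcode frequency ranges, for example 10–60 or 3–70.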
2.5. Assembly of subset of reads
We used Supernova (Weisenfeld et al., 2017) to assemble the selected paired-end barcoded reads. Supernova can generate assembled scaffolds in four styles: raw, megabubbles, pseudohap, and pseudohap2. We used the pseudohap2 style here. An overview of our assembly strategy is shown in Figure 2.
Figure 2.
A: Flowchart. B: Details of the read selection algorithm are shown inside the dotted box.
2.6. Alignment of assembled scaffolds with reference
To measure the quality of the assembly, we aligned specified subtelomeric regions of the HG38 reference sequence corresponding to our unmasked single-copy bait segments along with their flanking reference DNA segments as query with our generated assembled scaffolds as subject using NCBI BLAST (Altschul et al., 1997), requiring high identity matches (≥ 98%) for retention of each local alignment. The resulting output “hit table” of these local alignments lists the sequence identifier, the start and stop points for each local stretch of sequence similarity, and the percent identity of the match. From this information one can map high-similarity alignments of our regional assembly (prepared using barcode-selected linked reads) across the query reference sequence and, by merging the high-quality local alignments, evaluate assembly coverage relative to regions of the reference sequence using a parameter we define as the Lengthwise Assembled Fraction (LAF; see Figure 5). Intuitively, LAF is defined as the fraction of a targeted reference sequence that is accurately assembled by the regional sequence assembly. Regions of the reference query sequence with highest LAF have the best coverage of assembled sequence, and the limit of assembly extension regions corresponding to flanking reference sequence can be ascertained by a sudden decrease in LAF. Details of LAF calculation are presented in 3.4.1.
Figure 5.
The top magenta rectangle represents the query sequence. A: Partially overlapping local alignment regions and gaps in coverage of the query sequence. B: Partially overlapping local alignment regions are treated as sequence contigs, and each sequence contig region (C) is followed by one sequence gap (G). Dotted blue lines represent the starting and ending positions of a gap.
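As an illustration of how the hit table can be reduced to high-identity query intervals, the sketch below assumes the standard 12-column tab-separated BLAST tabular layout (query ID, subject ID, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score). The function name `high_identity_intervals` and the example rows are hypothetical.

```python
def high_identity_intervals(hit_table_text, min_identity=98.0):
    """Extract (query_start, query_end, percent_identity) for retained hits.

    Assumes the 12-column tab-separated BLAST tabular format; comment lines
    starting with '#' and blank lines are skipped.
    """
    intervals = []
    for line in hit_table_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        percent_identity = float(fields[2])
        q_start, q_end = int(fields[6]), int(fields[7])
        if percent_identity >= min_identity:
            # Normalize so that start <= end on the query.
            intervals.append((min(q_start, q_end), max(q_start, q_end), percent_identity))
    return sorted(intervals)


hit_table = "\n".join([
    "ref\tscaf1\t99.5\t1000\t5\t0\t1\t1000\t1\t1000\t0.0\t1800",
    "ref\tscaf2\t97.0\t500\t15\t0\t900\t1400\t1\t500\t0.0\t800",
    "ref\tscaf3\t98.2\t700\t12\t1\t1500\t2200\t700\t1\t0.0\t1200",
])
retained = high_identity_intervals(hit_table)
```

In this toy table the second hit is discarded because its identity (97.0%) falls below the 98% threshold.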
3. RESULTS AND DISCUSSIONS
We tested our read selection and regional assembly strategy (Figure 2) on four human subtelomere regions with representative patterns of sequence organization (base pair coordinates listed are from HG38; Figure 3). The 2p subtelomere is a 500 kb segment of 1-copy DNA (10,001–500,000); the 19p subtelomere has a very large segmental duplication region next to the telomere (10,001–259,447) followed by a 300 kb 1-copy region (259,448–559,447); 10p has a smaller segmental duplication region near the telomere (10,001–88,570) followed by a 300 kb 1-copy region (88,571–388,571); and 5p has multiple segmental duplication regions (10,001–49,495 and 210,596–305,378) separated and flanked by two 1-copy regions (49,496–210,595 and 305,379–510,000).
Figure 3.
Four different chromosomes with different characteristics. The blue rectangle represents single copy region and the magenta rectangle represents segmental duplication region.
We processed the raw input data from GM19440 as described in subsection 2.2. Table 1 presents some characteristics of the output obtained after processing the raw data with Long Ranger Basic software.
Table 1:
Some characteristics of the obtained data
| Number of reads | 1.49 × 10^9 | Number of reads without barcode | 9.8 × 10^7 |
| Number of paired-end reads | 0.75 × 10^9 | Barcode whitelist | 0.933959 |
| Number of barcoded reads | 1.39 × 10^9 | Barcode diversity | 743369.62 |
Interspersed repeats and tandem repeats from the 1-copy regions of these subtelomeres were masked and used as bait segments to select matching reads from the GM19440 linked-read dataset using BLAT. Barcodes for matching reads were identified and characterized according to occurrence frequency and clustering within the bait DNA segments.
3.1. Barcode Range and Clustering Analysis
We empirically tested a wide variety of barcode ranges for their ability to select read sets capable of generating high-quality regional assemblies corresponding to the bait segment itself (Figure 4) as well as extending assemblies of the bait segment into adjacent DNA (Figure 7). In all cases, a secondary filter was applied requiring that barcodes used for read selection contained only reads mapping to a single 100 kb segment of the bait DNA (cluster), as anticipated from linked-read library preparation (Table 2). Initial experiments with 2p focused on selection of reads from barcode ranges that produced high-quality assemblies of the 500 kb bait segment, and follow-up work with all four subtelomeres fine-tuned these parameters to optimize both high-quality assembly of bait segments and maximal extension into adjacent segmental duplication regions and single-copy regions.
Figure 4.
A: Alignment of 2p 500kb as query with assembled scaffolds of 2p for range 10–70 as subject in BLAST. B: Alignment of 19p 1-copy 300kb as query with assembled scaffolds of 19p 1-copy for range 3–70 as subject in BLAST. C: Alignment of 10p 1-copy 300kb as query with assembled scaffolds of 10p 1-copy for range 3–70 as subject in BLAST. D: Alignments of two 1-copy regions of 5p as query with assembled scaffolds of 5p 1-copy regions for range 3–70 as subject in BLAST.
Figure 7.
A: Alignment of 2p with assembled scaffolds of 2p for range 10–60 of REXTAL. B: Alignment of 2p as query with assembled scaffolds of 2p extracted from genome-wide assembly. C: Alignment of 19p with assembled scaffolds of 19p 1-copy for range 3–70 of REXTAL. D: Alignment of 19p with assembled scaffolds of 19p 1-copy region extracted from genome-wide assembly. E: Alignment of 10p with assembled scaffolds of 10p 1-copy for range 3–70 of REXTAL. F: Alignment of 10p with assembled scaffolds of 10p 1-copy region extracted from genome-wide assembly. G: Alignment of 5p with assembled scaffolds of 5p 1-copy regions for range 3–70 of REXTAL. H: Alignment of 5p with assembled scaffolds of 5p 1-copy regions extracted from genome-wide assembly.
Table 2:
Results after range selection and clustering step
| Chromosomal region | Barcode frequency range | bc (1) | bc (2) | read (3) | Chromosomal region | Barcode frequency range | bc (1) | bc (2) | read (3) |
|---|---|---|---|---|---|---|---|---|---|
| 2p | 10–50 | 1639 | 1223 | 2074096 | 19p | 3–70 | 1493 | 1378 | 2482446 |
| 2p | 10–60 | 1726 | 1281 | 2177142 | 19p | 5–70 | 1142 | 1026 | 1870206 |
| 2p | 10–70 | 1807 | 1330 | 2265538 | 19p | 10–70 | 770 | 662 | 1265324 |
(1) Number of selected barcodes after range selection.
(2) Number of selected barcodes after clustering.
(3) Number of collected reads of corresponding barcodes.
Table 2 shows the number of selected barcodes and the number of reads after thresholding for the 2p and 19p 1-copy regions for our chosen ranges.
3.2. Generate Assembled Scaffolds
After pulling out reads according to our selected range and clustering parameters, we used the Supernova assembler to assemble the collected paired-end reads. We analyzed the assembled scaffolds in pseudohap2 style and calculated the length of each assembled scaffold.
3.3. Alignment of the Scaffolds with reference
We aligned the 2p 1-copy region, 19p 1-copy region, 10p 1-copy region and 5p 1-copy regions as query with corresponding generated assembled scaffolds of 2p, 19p, 10p and 5p as subject using BLAST with default parameters and retaining only local alignments with ≥ 98% identity. Figure 4 shows a graphical representation (using the NCBI BLAST output visualization tool) of these BLAST alignments with near-optimal barcode frequencies for retention of linked-reads prior to assembly. While the respective assemblies cover most of each of the 1-copy bait regions, the extent of coverage as well as the number of scaffolds contributing substantially to coverage vary according to subtelomere. We therefore developed a more quantitative metric for assembly coverage in order to better quantify the assembly quality and compare them with the assemblies generated de novo from the whole-genome dataset using Supernova.
3.4. Assembly Quality Measurement
Standard assembly quality measurements (“QUAST” (Gurevich et al., 2013)) are not suitable for our case, as we are doing region-specific assemblies rather than genome-wide assemblies. We are focused on the coverage and accuracy of our assembly over the targeted region and have developed a metric called Lengthwise Assembled Fraction (LAF) for quality measurement of our regional assemblies. As mentioned previously, LAF measures the fraction of a targeted reference sequence that is accurately assembled by the regional sequence assembly.
3.4.1. Quality in single copy region
We extracted reference sequences of 2p, 19p, 10p, and 5p from HG38 and then aligned them with the corresponding assembled scaffolds using BLAST, requiring ≥ 98% identity for retention of each local alignment. This generates the positions of each local alignment, including query start and query end positions. The local alignments were sorted by query start position in increasing order and then merged by (1) deleting local alignments located entirely within other higher-quality alignments; and (2) for local alignments with partial overlap, resolving the overlap region by selecting the alignment with equivalent or higher % identity in that region. The regions of the query sequence not aligned with sequences in the assembly scaffolds are designated as gaps.
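The merging step can be sketched as a standard interval merge over query coordinates. This is a simplification: it takes the union of the retained alignments and does not track which alignment supplies the sequence in an overlap (the identity-based choice described above), because only the resulting contig and gap coordinates matter for coverage. The function names and the 1-based inclusive coordinate convention are assumptions.

```python
def merge_alignments(intervals):
    """Merge overlapping or contained (start, end) query intervals.

    Coordinates are treated as 1-based and inclusive (assumption).
    Returns merged contigs sorted by start position.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps (or directly abuts) the previous contig: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(contig) for contig in merged]


def gaps_between(merged, query_length):
    """Return the uncovered (start, end) stretches of the query."""
    gaps = []
    previous_end = 0
    for start, end in merged:
        if start > previous_end + 1:
            gaps.append((previous_end + 1, start - 1))
        previous_end = end
    if previous_end < query_length:
        gaps.append((previous_end + 1, query_length))
    return gaps


alignments = [(1, 1000), (200, 400), (900, 1500), (2000, 2500)]
contigs = merge_alignments(alignments)
gaps = gaps_between(contigs, query_length=3000)
```

In this example the alignment (200, 400) is contained in (1, 1000) and disappears, (900, 1500) extends the first contig, and two gaps remain on the 3000-base query.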
For the LAF calculation, we considered a number of subsequences of the assembly. More precisely, we considered subsequences of the assembly whose end points are the start and end positions of the n contigs (Figure 5).
In Figure 6 we present an algorithm to compute the LAF from given contig and gap lengths. The inputs to the algorithm are two arrays C and G, each of size n. C[i] is the length of the ith contig and G[i] is the length of the gap before the ith contig. The algorithm computes the LAF and outputs an array S of size 2n. The values in S correspond to the LAF for 2n different subsequences of the assembled sequence, all starting at the reference start position and ending at the end of each contig and gap.
Figure 6.
Algorithm to calculate LAF.
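The following Python sketch is one reading of the textual description of the LAF algorithm (arrays C and G of size n, output S of size 2n); it is not a transcription of Figure 6. It assumes that G[i] precedes C[i], that the prefix LAF is the assembled length divided by the total prefix length, and that an empty prefix is reported as 0.0 by convention.

```python
def laf_prefixes(C, G):
    """Compute LAF for the 2n prefixes ending at each gap and each contig.

    C[i]: length of the i-th contig; G[i]: length of the gap before it.
    Returns S with S[2*i] = LAF at the end of gap i and
    S[2*i + 1] = LAF at the end of contig i.
    """
    if len(C) != len(G):
        raise ValueError("C and G must have the same length")

    S = []
    assembled = 0  # bases covered by contigs so far
    total = 0      # reference bases traversed so far
    for gap_length, contig_length in zip(G, C):
        total += gap_length
        S.append(assembled / total if total else 0.0)
        assembled += contig_length
        total += contig_length
        S.append(assembled / total if total else 0.0)
    return S


# Toy example: a 100-base gap, a 900-base contig, a 500-base gap, a 500-base contig.
S = laf_prefixes(C=[900, 500], G=[100, 500])
print(S)  # [0.0, 0.9, 0.6, 0.7]
```

The LAF drops from 0.9 to 0.6 across the second gap and recovers only partly after the second contig, which is the kind of behavior used later to locate the limit of an extension.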
To assess the accuracy of REXTAL in the subtelomeric region, we calculated the LAF at regular intervals. For example, for all ranges of 2p, we took the intervals as the distance from coordinate 1 of the reference query sequence to the starting position of the 1st gap after 200kb, 300kb, 400kb, and 500kb, respectively. For range 10–60 of the 2p subtelomeric region we achieve good LAF (Table 3).
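One way to read the interval rule above is sketched below: starting from coordinate 1, accumulate merged contigs until reaching the first contig that ends at or beyond the requested interval size; the gap that follows it is the "1st gap after" that size, and the LAF is the assembled length divided by the position where that gap starts. This interpretation, the function name, and the toy coordinates are assumptions.

```python
def laf_at_first_gap_after(merged_contigs, interval_size):
    """LAF from coordinate 1 up to the start of the first gap after interval_size.

    merged_contigs: sorted, non-overlapping (start, end) query intervals,
    1-based inclusive. Returns (prefix_end, laf), or None if no contig
    reaches interval_size.
    """
    assembled = 0
    for start, end in merged_contigs:
        assembled += end - start + 1
        if end >= interval_size:
            # The next gap starts at end + 1, so the prefix is [1, end].
            return end, assembled / end
    return None


contigs = [(1, 40_000), (45_001, 120_000), (130_001, 260_000)]
prefix_end, laf = laf_at_first_gap_after(contigs, 100_000)
print(prefix_end, round(laf, 2))  # 120000 0.96
```

With these toy contigs, the first gap after 100 kb starts just past position 120,000, and 115,000 of those 120,000 bases are assembled.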
Table 3:
Quality comparison for 1-copy region
| Chromosomal region | Interval size (a) | LAF (b) | LAF (c) | Chromosomal region | Interval size (a) | LAF (b) | LAF (c) |
|---|---|---|---|---|---|---|---|
| 19p | 50kb | 0.90 | 0.91 | 5p (1st 1-copy) | 30kb | 0.97 | 0.98 |
| 19p | 100kb | 0.91 | 0.91 | 5p (1st 1-copy) | 60kb | 0.94 | 0.90 |
| 19p | 150kb | 0.89 | 0.87 | 5p (1st 1-copy) | 90kb | 0.94 | 0.91 |
| 19p | 200kb | 0.88 | 0.86 | 5p (1st 1-copy) | 120kb | 0.94 | 0.92 |
| 19p | 250kb | 0.88 | 0.86 | 5p (1st 1-copy) | 150kb | 0.95 | 0.93 |
| 19p | 300kb | 0.89 | 0.87 | | | | |
| 10p | 50kb | 0.99 | 0.99 | 5p (2nd 1-copy) | 30kb | 0.99 | 0.99 |
| 10p | 100kb | 0.99 | 0.99 | 5p (2nd 1-copy) | 60kb | 0.96 | 0.96 |
| 10p | 150kb | 0.99 | 0.99 | 5p (2nd 1-copy) | 90kb | 0.93 | 0.96 |
| 10p | 200kb | 0.99 | 0.99 | 5p (2nd 1-copy) | 120kb | 0.92 | 0.95 |
| 10p | 250kb | 0.98 | 0.97 | 5p (2nd 1-copy) | 150kb | 0.93 | 0.95 |
| 10p | 300kb | 0.97 | 0.68 | 5p (2nd 1-copy) | 180kb | 0.93 | 0.94 |
| | | | | 5p (2nd 1-copy) | 210kb | 0.93 | 0.93 |
| 2p | 200kb | 0.99 | 0.98 | | | | |
| 2p | 300kb | 0.98 | 0.98 | | | | |
| 2p | 400kb | 0.97 | 0.97 | | | | |
| 2p | 500kb | 0.97 | 0.97 | | | | |
(a) Starting position of 1st gap after the given interval size.
(b) LAF for REXTAL. For 2p the range is 10–60, and for the 19p, 10p, and 5p 1-copy regions the range is 3–70.
(c) LAF for genome-wide assembly method.
For all ranges of 19p 1-copy, we calculated the LAF from coordinate 1 of the reference query sequence up to the starting position of the 1st gap after 50kb, 100kb, 150kb, 200kb, 250kb, and 300kb, respectively. We achieve good LAF for range 3–70 of 19p 1-copy (Table 3). We therefore fixed the range at 3–70 for 10p and 5p. Table 3 shows the LAF of 10p 1-copy using the same intervals as for 19p 1-copy.
The 5p subtelomere has multiple segmental duplication regions as well as multiple single-copy regions. The 1st segmental duplication region is 10,001–49,495 bp, the 1st 1-copy region is 49,496–210,595 bp, the 2nd segmental duplication region is 210,596–305,378 bp, and the 2nd 1-copy region is 305,379–677,959 bp. We applied our assembly pipeline to both the 1st 1-copy region and the 2nd 1-copy region (305,379–510,000 bp). Because the two 1-copy regions differ in length, we chose a different set of intervals for each. For the 1st 1-copy region, we calculated the LAF from coordinate 1 of the reference query sequence up to the starting position of the 1st gap after 30kb, 60kb, 90kb, 120kb, and 150kb, respectively; for the 2nd 1-copy region, we used the starting position of the 1st gap after 30kb, 60kb, 90kb, 120kb, 150kb, 180kb, and 210kb (Table 3).
3.4.2. Quality in extended region
We can extend our assembly of single-copy diploid DNA into adjacent DNA and other subtelomeric gap regions. To examine the extension of our assembly into the extended single-copy region, we extracted from HG38 the reference 2p (10,001–700,000 bp) with length 700kb, 19p (259,448–759,447 bp) with length 500kb, 10p (88,571–588,571 bp) with length 500kb, and 5p 2nd 1-copy (305,379–677,959 bp) with length 372,580 bp. Following BLAST analysis using the extended reference sequence as the query and the assembled scaffolds as subject, we used Algorithm 1 (Figure 6) to measure the LAF only for the extended region, i.e. > 500 kb for 2p, > 300 kb for 19p and 10p 1-copy, and > 204,621 bp for 5p 2nd 1-copy.
We calculated the LAF at regular intervals only from the edge of the bait segment into the extended region. We took the intervals from the end of the bait segment to the starting position of the 1st gap after 10kb, 20kb, 30kb, 40kb, and 50kb, respectively. To decide the cut-off point for the extended region, we checked all LAFs of the extended region and stopped where we noticed a sharp drop in the LAF. The sharp drop occurs because the contig at that point is followed by a large gap, after which there is no assembled contig of significant length to raise the LAF again (Table 4).
Table 4:
Quality comparison for extended 1-copy region
| Chromosomal region | (EL, LAF) (a) | (EL, LAF) (b) | Chromosomal region | (EL, LAF) (a) | (EL, LAF) (b) |
|---|---|---|---|---|---|
| 2p | (33798, 0.99) | (16954, 1.00) | 10p | (52022, 0.93) | (12437, 1.00) |
| 19p | (43666, 0.93) | (6738, 0.99) | 5p (2nd 1-copy) | (42326, 0.98) | (22485, 0.97) |
(a) Extension length (in bases) and LAF for REXTAL. For 19p, 10p, and 5p 1-copy region the range is 3–70.
(b) Extension length (in bases) and LAF for genome-wide assembly method.
3.4.3. Quality in segmental duplication region
Because segmental duplication regions contain segments of DNA with near-identical duplicated subtelomere sequence, they are hard to assemble de novo from whole-genome reads. REXTAL can extend the assembly into subtelomere segmental duplication regions. Following BLAST analysis using the HG38 reference subtelomere assembly, including the segmental duplication region along with the adjacent bait region, we used Algorithm 1 (Figure 6) to measure the LAF only for the segmental duplication regions of 19p, 10p, and 5p and then chose the cut-off point. Table 5 shows the analysis of the segmental duplication regions, with extension length as well as LAF.
Table 5:
Quality comparison for segmental duplication region
| Chromosomal region | SD_L (a) | (EL, LAF) (b) | (EL, LAF) (c) |
|---|---|---|---|
| 19p | 249446 | (67099, 0.98) | (5549, 1.00) |
| 10p | 78569 | (40089, 0.98) | (4606, 1.00) |
| 5p (1st 1-copy extends to 1st SD) | 39495 | (36477, 0.98) | (23129, 0.99) |
| 5p (1st 1-copy extends to 2nd SD) | 94782 | (51860, 0.96) | (65, 1.00) |
| 5p (2nd 1-copy extends to 2nd SD) | 94782 | (43090, 0.92) | (1307, 1.00) |
(a) Length of SD (segmental duplication) region (in bases) of corresponding chromosomal region.
(b) Extension length (in bases) and LAF for REXTAL. For 19p, 10p, and 5p 1-copy region the range is 3–70.
(c) Extension length (in bases) and LAF for genome-wide assembly method.
3.5. Comparison with Genome-Wide Assembly
For a fair comparison with the genome-wide assembly method, we need to extract all contigs in the genome-wide assembly that overlap (including potential extensions into flanking DNA) with the reference sequence. To do so, we built a BWA index (Li H. and Durbin R., 2009) of the reference genome (HG38). We generated the genome-wide assembly of our input data using Supernova, aligned the genome-wide assembly against the indexed reference genome with BWA, and generated a .sam file. Using SAMtools (Li H. et al., 2009), we converted the .sam file into a .bam file, then sorted and indexed the results. We extracted the specific regions of the chromosomes of interest (here 2p, 19p, 10p, and 5p) from the indexed results using SAMtools and aligned them with the same reference queries used for analysis of the barcode-selected read assemblies using BLAST, requiring ≥ 98% identity (see Figure 7B, Figure 7D, Figure 7F, and Figure 7H).
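The extraction steps above map onto a short sequence of BWA and SAMtools commands. The sketch below only builds the command strings and does not run them. Several details are assumptions not stated in the text: the use of `bwa mem` as the alignment algorithm, the file names, and the `chr2:10001-700000` region string.

```python
def extraction_commands(reference_fa, scaffolds_fa, region, prefix="genomewide"):
    """Build the shell commands for extracting one region of the assembly.

    All file names are placeholders; `bwa mem` is an assumed choice of
    BWA algorithm, since the text does not name one.
    """
    sam = f"{prefix}.sam"
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    region_bam = f"{prefix}.region.bam"
    return [
        # Index the reference genome.
        f"bwa index {reference_fa}",
        # Align the assembled scaffolds to the reference, writing SAM.
        f"bwa mem {reference_fa} {scaffolds_fa} > {sam}",
        # Convert SAM to BAM, then sort and index.
        f"samtools view -b -o {bam} {sam}",
        f"samtools sort -o {sorted_bam} {bam}",
        f"samtools index {sorted_bam}",
        # Pull out alignments overlapping the region of interest.
        f"samtools view -b -o {region_bam} {sorted_bam} {region}",
    ]


commands = extraction_commands("hg38.fa", "supernova_scaffolds.fa", "chr2:10001-700000")
for command in commands:
    print(command)
```

The printed commands can be reviewed and adapted before running them in a shell; the sequence names in the region string must match those in the reference FASTA.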
3.5.1. Comparison in single copy region
To measure the quality of the subtelomeric region assembly of the 2p, 19p, 10p, and 5p 1-copy regions extracted from the genome-wide assembly, we followed the same steps discussed in 3.4.1 and calculated the LAF at regular intervals. Table 3 shows the comparison of LAF between the genome-wide assembly method and REXTAL. For 2p and 5p 2nd 1-copy we get LAF similar to that of the genome-wide method (Table 3). We get better LAF using REXTAL for 19p, 10p 1-copy, and 5p 1st 1-copy than with the genome-wide method (Table 3).
3.5.2. Comparison in extended region
To show the extension of the single-copy region in the genome-wide assembly method, we calculated the LAF following the same steps mentioned in 3.4.2 and then decided the cut-off point. We compared our results for the extended 1-copy region with the genome-wide method in Table 4. At all four loci, REXTAL extends the assembly substantially farther than the genome-wide method.
3.5.3. Comparison in segmental duplication region
To compare the extension into the segmental duplication region between the genome-wide assembly method and REXTAL, we calculated the LAF for the genome-wide assembly following the same steps mentioned in 3.4.3. Table 5 shows the comparison of the REXTAL results for the segmental duplication region with the genome-wide method. Here too, REXTAL extends much farther into the segmental duplication regions than the genome-wide method at all loci tested. In particular, the extensions from the 5p 1st 1-copy and 2nd 1-copy regions together (94950 bp) cover the entire 2nd segmental duplication region (Table 5).
Figure 8 shows the comparison of the extension into the segmental duplication regions of 19p and 10p using REXTAL and the genome-wide assembly method.
Figure 8.
A: Alignment of 19p segmental duplication region with assembled scaffolds of 19p 1-copy for range 3–70 of REXTAL. B: Alignment of 19p segmental duplication region with assembled scaffolds of 19p 1-copy region extracted from genome-wide assembly. C: Alignment of 10p segmental duplication region with assembled scaffolds of 10p 1-copy for range 3–70 of REXTAL. D: Alignment of 10p segmental duplication region with assembled scaffolds of 10p 1-copy region extracted from genome-wide assembly.
4. CONCLUSION
We successfully used a new computational method called Regional Extension of Assemblies Using Linked-Reads (REXTAL) for improved region-specific assembly of segmental duplication-containing DNA, leveraging genomic short-read datasets generated from large DNA molecules partitioned and barcoded using the “Gel Bead in Emulsion” (GEM) microfluidic method (Zheng et al., 2016). We showed that using REXTAL, it is possible to extend assembly of single-copy diploid DNA into adjacent, otherwise inaccessible subtelomere segmental duplication regions. In future experiments, using larger source DNA molecules for barcode sequencing approaches could further extend assemblies into and through segmental duplications, and optical maps of large single molecules extending from the 1-copy regions through segmental duplications and gaps could be used to optimally guide and validate these assemblies.
CCS Concepts.
Applied computing ➝ Computational genomics
Applied computing ➝ Bioinformatics.
5. ACKNOWLEDGMENTS
The work in this paper is supported in part by NIH R21CA177395 (HR and MX), and Modeling and Simulation Scholarship (to TI) of Old Dominion University.
6. REFERENCES
- Alkan C, Sajjadian S, Eichler EE (2011). Limitations of next-generation genome sequence assembly. Nature methods, 8, 61.
- Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research, 25, 3389–3402.
- Benson G (1999). Tandem repeats finder: a program to analyze DNA sequences. Nucleic acids research, 27, 573.
- Gurevich A, Saveliev V, Vyahhi N, Tesler G (2013). QUAST: quality assessment tool for genome assemblies. Bioinformatics, 29, 1072–1075.
- Kent WJ (2002). BLAT—the BLAST-like alignment tool. Genome research, 12, 656–664.
- Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D (2002). The human genome browser at UCSC. Genome research, 12, 996–1006.
- Li H, Durbin R (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754–1760.
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup (2009). The sequence alignment/map format and SAMtools. Bioinformatics, 25, 2078–2079.
- Smit AF (1996). 2010 RepeatMasker Open-3.0. URL: http://www.repeatmasker.org.
- Weisenfeld NI, Kumar V, Shah P, Church DM, Jaffe DB (2017). Direct determination of diploid genome sequences. Genome research, 27, 757–767.
- Zheng GX et al. (2016). Haplotyping germline and cancer genomes with high-throughput linked-read sequencing. Nature biotechnology, 34, 303–311.








