Abstract
Genome graphs are emerging as an important novel approach to the analysis of high-throughput sequencing data. By explicitly representing genetic variants and alternative haplotypes in a mappable data structure, they can enable the improved analysis of structurally variable and hyperpolymorphic regions of the genome. In most existing approaches, graphs are constructed from variant call sets derived from short-read sequencing. As long-read sequencing becomes more cost-effective and enables de novo assembly for increasing numbers of whole genomes, a method for the direct construction of a genome graph from sets of assembled human genomes would be desirable. Such assembly-based genome graphs would encompass the wide spectrum of genetic variation accessible to long-read-based de novo assembly, including large structural variants and divergent haplotypes.
Here we present NovoGraph, a method for the construction of a genome graph directly from a set of de novo assemblies. NovoGraph constructs a genome-wide multiple sequence alignment of all input contigs and uses a simple criterion of homologous-identical recombination to convert the multiple sequence alignment into a graph. NovoGraph outputs resulting graphs in VCF format that can be loaded into third-party genome graph toolkits. To demonstrate NovoGraph, we construct a genome graph with 23,478,835 variant sites and 30,582,795 variant alleles from de novo assemblies of seven ethnically diverse human genomes (AK1, CHM1, CHM13, HG003, HG004, HX1, NA19240). Initial evaluations show that mapping against the constructed graph reduces the average mismatch rate of reads from sample NA12878 by approximately 0.2%, albeit at a slightly increased rate of reads that remain unmapped.
Keywords: Genome graph, de novo assembly, alignment, multiple sequence alignment, population reference graph, NovoGraph
Introduction
Since the completion of the human reference genome in 2003, genomic sequencing has been established as a key tool for both fundamental research and personalized medicine. Sequencing costs have fallen dramatically, and the whole genomes of tens of thousands of individuals have been sequenced and analyzed. Although long-read sequencing is becoming more cost-effective and popular, the sequencing technologies that currently dominate cohort sequencing produce millions of short reads between 100 and 250 base pairs in length. As the first step of data analysis, these reads are typically mapped to the human reference genome to determine their genomic locations.
This approach works well for the large majority of reads; critically, however, it fails for reads that come from regions in the sequenced genome that are strongly divergent from the reference genome. Important examples include immunogenetic regions known to harbour important disease-associated variants like the major histocompatibility complex (MHC) and the killer-cell immunoglobulin-like receptor (KIR) genes ( Kuśnierczyk, 2013; Trowsdale & Knight, 2013), as well as regions affected by large or complex structural variants, which together account for more than 50% of total base pair differences between individuals ( Sudmant et al., 2015). The total proportion of the human genome inaccessible to classical reference-based analysis is estimated to be greater than 1% ( Dilthey et al., 2015).
Instead of mapping reads to a single reference genome, it is now possible to map reads to a reference genome graph ( Computational Pan-Genomics Consortium, 2018; Paten et al., 2017). A reference genome graph can be thought of as a data structure that provides a unified representation of multiple genomes and their potential recombinants. As more genomes are added to the graph, the probability that any given region in a sequenced genome has a sufficiently close homolog in the graph (so as to allow for reliable mapping) increases. Technically, a genome graph is an acyclic or cyclic graph structure with nucleotide-labeled edges or nodes; each input genome can typically be reconstructed as a traversal of the graph, and nodes with more than one incoming or outgoing edge represent recombination points between the input genomes. Like linear reference genomes, genome graphs can serve as the basis for read mapping and variant calling.
The utility of reference genome graphs in the field of human genetics was first demonstrated in the field of immunogenetics and subsequently for the entire human genome. Specifically, a reference graph approach to model local haplotype structures enabled improved genotyping accuracy in the MHC ( Dilthey et al., 2015) and, for the first time, reliable typing of the Human Leukocyte Antigen (HLA) genes from standard whole-genome sequencing data ( Dilthey et al., 2016). More recently, multiple graph approaches and software toolkits suitable for genome-wide application have been published ( Eggertsson et al., 2017; Garrison et al., 2017; Rakocevic et al., 2017; Sibbesen et al., 2018), showing, for example, that graph genome approaches can enable a fivefold reduction of missed SNP calls ( Eggertsson et al., 2017) and enable the genotyping of thousands of additional variants longer than 50 base pairs per genome ( Sibbesen et al., 2018).
These developments notwithstanding, the field is still in its infancy. One particularly important open question is how to integrate information from long-read sequencing into the graph construction process. In existing approaches, graph construction typically relies on call sets derived from short-read sequencing experiments. As discussed above, however, short-read sequencing has limited sensitivity in the hypervariable and structural-variation-rich regions where graph genomes can be expected to provide the greatest benefit. Therefore, graphs constructed via existing methods likely miss substantial proportions of relevant variation. By contrast, long-read-sequencing enables the assembly of complex sequences ( Jain et al., 2018) in a reference-bias-free way and the detection of structural variants at high sensitivity ( Sedlazeck et al., 2018). Even though the number of long-read-sequenced samples is still limited, rendering their sequences available via a genome graph would be highly desirable.
Here we introduce NovoGraph, a pipeline for the direct construction of acyclic genome graphs from de novo assembly contigs. NovoGraph constructs a whole-genome graph by connecting the input assembly sequences at positions of homology; that is, it implements a model of homologous recombination between the input genomes. This approach has the advantage that the resulting graph will generally include the complete set of sequences present in the input assemblies, including (at base-pair resolution) the sequences that correspond to structural variants and divergent haplotypes. Graphs constructed by NovoGraph will therefore be comparatively enriched in large-scale structural and complex variants. In the spirit of modularity, constructed graphs are represented in VCF format, which enables them to be used with any of the established genome-wide graph toolkits. We also utilize the standard CRAM format for representing the output of intermediate steps, in particular a multiple sequence alignment of all input sequences.
We demonstrate NovoGraph by constructing a genome graph from seven ethnically diverse human genomes and the canonical reference. In a mapping experiment with vg ( Garrison et al., 2017), we show that using this graph instead of the standard reference genome increases the average alignment identity of genome-wide short reads.
This project was initiated at an NCBI hackathon ( Busby et al., 2016) held before the 2016 Biological Data Science meeting at Cold Spring Harbor Laboratory in October, 2016. The seven co-authors gathered for 3 days at CSHL to quickly develop and prototype the pipeline. As with all NCBI hackathons, the only stipulations for the event were (1) that the data be publicly available and (2) that any resulting software be open-source.
Methods
Pipeline overview
The NovoGraph pipeline ( Biederstedt et al., 2018) for constructing a genome graph from a set of assembly contigs consists of the following steps (see Figure 1):
-
1.
For each input contig, compute a global pairwise alignment to the GRCh38 primary assembly. This alignment determines the approximate placement of each input contig relative to the reference.
-
2.
Compute a global multiple sequence alignment (MSA) between all input contigs and the reference genome. This multiple sequence alignment embodies the joint sequence homology relationships between all input sequences and the reference genome. The pairwise contig-to-reference alignments from Step 1 are used to guide this process.
-
3.
Compute an acyclic directed graph structure from the global MSA, connecting contigs at homologous-identical positions.
The outputs from Steps 1 and 2 are represented in SAM/CRAM format ( Hsi-Yang Fritz et al., 2011; Li et al., 2009). The output from Step 3 is a VCF ( Danecek et al., 2011), which may be provided as input to various existing graph genome frameworks.
Step 1 – Pairwise global alignments between individual input contigs and GRCh38
For each input contig, we compute a global pairwise alignment between the input contig sequence and the GRCh38 primary assembly. This process is illustrated in Figure 2.
Exact global alignment scales quadratically with the length of the input sequences and therefore quickly becomes computationally intractable as the input sequences increase in size. We therefore adopt a heuristic approach:
First, we use bwa-mem ( Li, 2013) to identify high-scoring local alignments between the input contig and the reference genome (GRCh38). These represent diagonal (or near-diagonal) moves in a global alignment matrix, i.e. regions of high pairwise alignment identity between the input contig and a reference genome. We refer to the identified local alignments as “diagonals”.
Next, to obtain a global pairwise alignment, we identify the highest-scoring consistent combination of the identified diagonals into a global alignment by dynamic programming. Note that pairwise alignments by definition comprise two sequences in defined orientations; only diagonals that align to the same reference contig in the same orientation (strandedness) can therefore contribute to a consistent global alignment.
We now give a formal definition of the algorithm. For simplicity, we assume that all identified diagonals align to the same reference contig in the same orientation; if this is not the case, the following algorithm can be executed independently for all reference contig/orientation pairs and their corresponding diagonals, and the best global alignment between the input contig and the reference genome is the best identified alignment over all considered pairs of reference contigs and orientation.
We define a set P_ENTRY of “path entry” points and a set P_EXIT of “path exit” points. Each element (diagonal_id, (reference_coordinate, input_contig_coordinate)) of these sets consists of a diagonal identifier and a pair of coordinates that specify positions along the reference and input sequences, similar to the coordinates in the classical Needleman-Wunsch dynamic programming matrix. For example, the coordinate pair (3, 2) refers to a state in which 3 characters of the reference and 2 characters of the input sequence have been consumed. We also define the special points ORIGIN as (NA , (0, 0)) and TERMINUS as (NA , (n, m)), where n is the length of the reference sequence, m is the length of the input contig ID, and “NA” stands for an undefined diagonal identifier.
We populate the sets P_ENTRY and P_EXIT based on the identified diagonals. Each diagonal represents a local pairwise alignment between the reference and the input contig, and is therefore associated with two pairs of coordinates that specify the start and stop of the alignment in the reference and in the contig sequence. Specifically, let (d 1, d 2) denote the start coordinates of a given diagonal d in the reference and contig sequences, and let (d 3, d 4) denote the stop coordinates of the alignment in the reference and contig sequences. Both coordinate pairs are 1-based. To give an example, if diagonal d represents an alignment between positions 4 and 10 of the reference sequence and positions 3 and 11 of the contig sequence, d 1 = 4, d 2 = 3, d 3 = 10, and d 4 = 11. For each diagonal d, we add (d, (d 1, d 2)) as a member of the set P_ENTRY and (d, (d 3, d 4)) as a member of the set P_EXIT. We refer to these as “start-of-diagonal” entry and “end-of-diagonal” exit points. We also add “within-diagonal” path exit points that horizontally or vertically align with start-of-diagonal entry points of other diagonals, and “within-diagonal” path entry points that horizontally or vertically align with end-of-diagonal exit points of other diagonals. Specifically, we add a within-diagonal path exit point (d, (d x, d y)) for diagonal d if and only if (i) the coordinates (d x, d y) correspond to a column in the local alignment associated with d and (ii) there is another diagonal g with g 1 = d x or g 2 = d y. The definition of within-diagonal path entry points follows symmetrically. The different types of entry and exit points are illustrated in Figure 3.
The set of valid path traversals is defined as the set of sequences x 0,x 1,x 2,...,x n that meet the following conditions:
-
(i)
for all i such that i is even, x i is a member of { ORIGIN ∪ P_EXIT}
-
(ii)
for all i such that i is odd, x i is a member of { TERMINUS ∪ P_ENTRY}
-
(iii)
x 0 = ORIGIN and x n = TERMINUS
-
(iv)
for all x i=(g, (g a, g b)) and x i+1= (h, (h a, h b)), g a ≤ h a and g b ≤ h b
-
(v)
for all x i=(g, (g a, g b)) and x i+1=(h, (h a, h b)) with uneven i, g = h
-
(vi)
each element of the sequence is unique.
Each traversal can be scored iteratively from left to right by combining the scores of the traversed diagonals with gap-incurred penalties from the jumps between exit and entry points. We initialize by setting score(ORIGIN) = 0. For uneven i with x i=(g, (g a, g b)) and x i-1=(h, (h a, h b)), we set score(x i) = score(x i-1) + gap_score x [(h a - g a)+(h b - g b)]. For even i with x i=(g, (g a, g b)) and x i-1=(h, (h a, h b)), g is equal to h by definition and we set score(x i) to be score(x i-1) plus the score of the local alignment on diagonal g between coordinates (h a,h b) and (g a,g b). In the current implementation, gap_score is -1, matches within local alignments are scored as +1, and mismatches/gaps within local alignments as -1. Jumps to ORIGIN and TERMINUS are not penalized along the reference dimension (i.e., ends-free alignment).
A dynamic programming formulation for finding the highest-scoring traversal follows immediately from these definitions. In brief, order the union set S := {ORIGIN ∪ P_ENTRY ∪ P_EXIT ∪ TERMINUS} by coordinates and for the i-th element of the ordered set S, compute the maximum achievable score max_score(x i) of x i by
-
(i)
identifying the subset S’ ⊆ {x 0 , .., x i-1} of possible predecessor elements
-
(ii)
for each s ϵ S’, scoring the transition from s to x i by replacing ‘score’ with ‘max_score’ in the definitions of the preceding paragraph
-
(iii)
selecting the maximum-scoring transition as the value for max_score(x i).
Step 2 – Global multiple sequence alignment
We now turn the pairwise input-contig-to-reference alignments created in Step 1 into a set of global multiple sequence alignments.
We split the GRCh38 reference contigs into non-overlapping windows of approximately 10,000 bases. A window size of 10,000 is chosen to be both sufficiently large to include the majority of human structural variants and small enough to allow for efficient processing of individual windows; see below for a precise definition of how window boundaries are determined. For each window, we extract the reference sequence and, based on the pairwise input-contig-to-reference alignments, the input contig sequences overlapping the window. We use MAFFT ( Katoh & Standley, 2013) to generate an MSA for the sequences of each window (including the reference). This step is trivially parallelizable. After having computed an MSA for each window, we concatenate the per-window MSAs in the correct order. For each GRCh38 reference contig, this yields a combined MSA of the reference sequence and all input contigs initially aligned to it.
In this approach, the initial pairwise alignments determine in which window a given part of an input sequence ends up for the MSA computation. Ideally we would like to choose the window boundary positions so as to avoid regions of high uncertainty in the initial pairwise alignments. The placement of gaps in sequence alignments is often ambiguous and gaps are generally associated with increased alignment uncertainty. We therefore adopt a simple heuristic to avoid the crossing of gaps when choosing window boundaries: First we partition the reference into windows of exactly 10,000 bases in length. For each window boundary position independently, we scan the surrounding ± 100 reference positions. For each reference position, we identify the columns corresponding to that reference position in the pairwise sequence alignments, and count the number of gaps across the identified columns. We then choose the reference position with the lowest proportion of gaps as the final window boundary.
The output from this step is encoded in CRAM format. Reference gaps are represented using the ‘P’ CIGAR character.
Step 3 – Graph construction
As a last step, the multiple sequence alignment generated during the previous step is transformed into a graph. An important design decision for this operation is where to allow for recombination between the input sequences, i.e. where to allow for transitions between sequences encoded on different input contigs. The applied recombination rules shape the topology of the graph and determine the set of genomes that could be sampled from the graph.
Graph topology is also constrained by our requirement that the constructed graph be, for interoperability reasons, representable in VCF format. Some third-party inference methods support fully general VCFs with overlapping variant alleles; other frameworks, for example gramtools ( Maciuca et al., 2016), require that the encoded variants be non-overlapping. To achieve full interoperability with different downstream inference methods, NovoGraph therefore implements two separate algorithms for VCF generation.
The first graph generation algorithm, referred to as NovoGraph-Simple, implements a straightforward conversion of each individual sequence from the multiple sequence alignment into VCF format; that is, the positions at which an input sequence deviates from the reference are identified and represented as variant alleles in VCF format. Identical variant alleles from different MSA sequences are merged and sorted by position to obtain a valid VCF. This algorithm implements a recombination model that allows for recombination between pairs ( a, b) of MSA sequences immediately prior to positions at which both a and b align to the reference in the joint MSA with either a match or a mismatch.
The second graph generation algorithm, referred to as NovoGraph-Universal, is more complex and ensures that the created VCF does not contain overlapping variant alleles; that is, it ensures universal compatibility with third-party inference methods.
NovoGraph-Universal allows for recombination (A) between pairs of input contigs wherever input contigs start or end along the canonical reference, and (B) at positions at which all contig sequences agree with the canonical reference. The graph collapses into a uniformly homozygous state at positions whereby condition (B) applies. The resulting graph structure (composed of reference-identical, collapsed stretches interspersed with sets of alternative haplotypes) lends itself directly to representation in VCF format. Also note that criterion B (sequence identity across all input sequences) is stronger than the recombination condition (sequence identity across pairs of input sequences) of a related algorithm ( Dilthey et al., 2015).
An overview of NovoGraph-Universal is given in Figure 4. At a high level, NovoGraph-Universal constructs a graph by processing the input MSA for each reference contig in a column-by-column fashion from left to right in the order of genomic position, accounting for the entry and exit of input contigs as well as for potential recombination between them. As the algorithm moves along the MSA, it keeps track of the set of haplotypes compatible with the input contigs and their potential recombinants. In the graph, each haplotype is generally represented as its own branch; however, these are collapsed at positions at which all haplotypes agree with the canonical reference. The sequences corresponding to this “collapsed homozygous” state are reference-identical and therefore not explicitly represented in VCF.
In the following, we provide a more detailed description of the algorithm:
NovoGraph-Universal is executed for each GRCh38 reference contig independently. The set of input sequences for each reference contig is represented by the multiple sequence alignment constructed during Step 2. Each non-reference contig in the MSA has a first and a last column in which both the input contig and reference bases are non-gap. We refer to these columns as the entry and exit positions of the contig, and all bases outside the entry and exit columns are ignored during the following steps.
We keep a set of current haplotypes, denoted as R. Each element h of R (a “current haplotype”) consists of two elements: (i) the “current sequence” of h (which is updated as we move along the MSA) and (ii) the contig ID of the contig that the current haplotype is copying from (the “source haplotype”)—this can be either the reference or one of the input contigs. We initialize R such that R has one element that has a zero-length sequence and that is set to copy its sequence from the reference.
When we process a column of the MSA, for each element h ∈ R, we append the corresponding MSA character of the source haplotype to the current sequence of h. This step is called “extension” (see Figure 4).
After having carried out the extension step, the current sequences of the elements of R up to the second-last character are sent to the VCF generator if and only if (A) all appended characters are non-gap reference identical and (B) the length of the current sequences of the elements of R is greater than or equal to 2 non-gap bases. This step is called “flushing” (see Figure 4). The VCF generator writes a variant-encoding line to the output VCF file if the received sequences contain at least one non-reference sequence; otherwise, the output is empty. After processing by the VCF generator, the processed strings are removed from their corresponding source elements—that is, after flushing, the current sequences of all elements of R have a length of 1 (the last added base, which was not sent to the VCF generator). Note that R is a set so that, by definition, duplicate elements are collapsed. This process is carried out up to the rightmost column of the MSA, at which point the graph construction and VCF generation process is complete.
Two special cases corresponding to the entry and exit of non-reference contigs conclude the definition of the graph construction algorithm. First, if an MSA position being processed corresponds to the entry position of a contig, we duplicate all elements of R prior to the extension step, set the source haplotype of the duplicate elements to the ID of the starting contig, and add the modified duplicates to R. Second, if an MSA position being processed corresponds to the exit position of a contig, we execute the following algorithm after the extension step:
-
1.
Compile a list E of all elements of R which use the existing contig as their source haplotype (i.e. the elements R of affected by the contig exit).
-
2.
Compile a list C of non-reference contig IDs that a) are the source haplotype of any current element in R and b) don’t exit at the current MSA position (i.e. C is a list of non-exhausted current contig IDs).
-
3.
For each element (e, c) ∈ { E x C}, we add a new element to R with a) its current sequence set to the current sequence of e and b) its source haplotype set to c. After having processed all elements of the set { E x C} we set the source haplotypes of all e E to the reference.
Clearly the size of R increases as non-reference contigs enter and exit and, conversely, the size of R can only decrease during the flushing step. To limit computational demands, we impose an upper limit U1 on the size of R. If |R| ≥ U1, we prohibit the entry of new contigs, and when exiting a contig, we only allow the transition to the reference as source haplotype.
Furthermore, due to the requirement of reference identity, gaps in the input MSA along the contig sequence dimension (i.e. corresponding to columns in the MSA in which the input contig sequence is a gap and the reference is not) prevent flushing. We therefore also place an upper limit U2 on the maximum number of contiguous contig gaps in the input alignments. If a contiguous gap along the input contig dimension in an input contig alignment exceeds U2 in size, we break the alignment, i.e. we split the alignment in two. U1 limits the complexity of the graph in terms of the number of per-site variant haplotypes, U2 limits the maximum size of deletions represented in the graph. In the current implementation, we use U1 = U2 = 5000 bp, but both parameters can be easily modified by the user.
Implementation and computational requirements
Steps 1 and 2 are implemented in Perl 5. Step 3 is implemented in C++, with a wrapper Perl script. Our pipeline utilizes bwa (version 0.7.15 and above), SAMtools (version 1.4 and above), and MAFFT (version 7). The minimum computational requirement for NovoGraph is a workstation computer with at least 32 Gb of RAM; we recommend, however, that the MSA generation steps be executed within a multi-node cluster environment. NovoGraph natively supports SGE-compatible grid environments, although this could be easily adapted to other platforms.
Human input assemblies
We used contigs from seven recent de novo assemblies of human genomes ( Table 1), the data of which are publicly available. The total size and contig lengths of each input assembly are shown in Figure 5. In order to quantify the sequencing and alignment quality of each input assembly, we relied upon the edit distance (Levenshtein distance) encoded via the BAM NM tag, i.e. the number of nucleotide changes within each contig necessary to equal the reference. The results of dividing this value by the length of each aligned contig (NM/Length) are shown in Figure 6. We note that we have made no effort to classify variants within each assembly as genuine variation or errors.
Table 1. Input assemblies for the whole-genome human graph.
Sample ID | Ethnicity | Citation | Download URL |
---|---|---|---|
AK1 | Korean | ( Seo et al., 2016) | https://www.ncbi.nlm.nih.gov/Traces/wgs?val=LPVO02#contigs |
CHM1 | European | (
Chaisson
et al., 2015;
Steinberg et al., 2014) |
https://www.ncbi.nlm.nih.gov/assembly/GCA_001297185.1/ |
CHM13 | European | ( Schneider et al., 2017) | https://www.ncbi.nlm.nih.gov/assembly/GCA_001015385.3 |
HG003 | Ashkenazi | ( Zook et al., 2016) |
ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/
analysis/MtSinai_PacBio_Assembly_falcon_03282016/hg003_ p_and_a_ctg.fa |
HG004 | Ashkenazi | ( Zook et al., 2016) |
ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/
analysis/MtSinai_PacBio_Assembly_falcon_03282016/hg004_ p_and_a_ctg.fa |
HX1 | Han
Chinese |
( Shi et al., 2016) | http://hx1.wglab.org/data/hx1f4.3rdfixedv2.fa.gz |
NA19240 | Yoruba | ( Steinberg et al., 2016) | https://www.ncbi.nlm.nih.gov/assembly/GCA_001524155.1/ |
vg mapping experiment
We used the variation graph toolkit vg ( Garrison et al., 2017) to assess the effect of mapping against the constructed human genome graph (based on the NovoGraph-Universal algorithm). Short-read sequencing data of sample NA12878 were obtained from the Platinum Genomes project (2 x 100bp paired-end sequencing reads; European Nucleotide Archive accession ERR194147) and randomly subsampled to 2% of read pairs. We mapped the subsampled reads to the genome graph constructed by us and against a genome graph constructed from the GRCh38 primary reference and assessed the resulting alignment metrics (alignment score, alignment identity, number of mapped reads).
Results
We have presented NovoGraph, a pipeline for the construction of genome graphs from de novo assemblies and applied the pipeline to construct a genome graph from seven high-quality, ethnically diverse human assemblies ( Biederstedt, 2018). The graph constructed by NovoGraph-Universal has a size of 17 Gb when stored in uncompressed VCF format and contains 23,478,835 bubbles (i.e. sites with multiple alternative alleles) representing 30,582,795 variant alleles. The graph constructed by NovoGraph-Simple has an uncompressed size of 1.2 GB in VCF format and contains 33,309,666 bubbles representing 34,519,145 variant alleles. Both graphs and intermediate files are available for download and can be used for genome inference with a variety of tools.
We manually assessed a small set of hyperpolymorphic regions in the human genome. Figure 7 shows an IGV-based visualization ( Robinson et al., 2011; Thorvaldsdóttir et al., 2013) of the multiple sequence alignment of the input sequences in the HLA-B region of the MHC. HLA-B is the most polymorphic gene of the human genome and sequence polymorphisms are known to cluster around the peptide-binding-site encoding exons 2 and 3 ( Marsh et al., n.d.); consistent with this, high rates of polymorphism are observed in our multiple sequence alignment around these loci.
To measure the extent to which mapping against the constructed graph influences alignment metrics, we used the variation graph toolkit vg to map a randomly selected subset of NA12878 reads (see Methods) against a) the genome graph constructed by us (based on the NovoGraph-Universal algorithm) and b) a simple non-branching reference graph constructed from the primary GRCh38 reference alone. Alleles longer than 10 kb in size were removed to ensure successful loading of the graphs into vg. Results of the mapping experiment are shown in Table 2; while mean alignment identity is increased by approximately 0.2%, the number of mapped reads decreases by 0.04%. This somewhat counterintuitive result is probably explained by greater alignment ambiguity for a subset of reads, caused by the presence of non-unique branches in the graph; reads with multiple optimal mapping locations will be assigned a mapping quality score of 0 and count as unmapped.
Table 2. Read alignment quality metrics for the NA12878 mapping experiment.
Genome graph | Reference graph | |
---|---|---|
Mean scores | 108.859 | 108.100 |
Mean identity value | 0.9913 | 0.9891 |
Total mapped reads | 31125004 | 31138410 |
Conclusion and next steps
NovoGraph enables the construction of a graph genome from multiple de novo assemblies. The pipeline is available under an open source license and will scale to at least a few dozen input assemblies without major modifications. It would also be straightforward to adapt NovoGraph to non-human species, given the appropriate reference and input assemblies.
It is instructive to contrast the MSA-based NovoGraph approach with possible alternative approaches in which one creates a separate VCF for each assembly and then builds a graph by combining the individual VCFs. First, carrying out the multiple sequence alignment prior to the VCF generation step enables the sharing of information across multiple samples during the alignment process, potentially improving overall alignment quality and providing more consistent variant definitions across samples. Secondly, the constructed multiple sequence alignment of all input assemblies can be repurposed for other applications, for example as an input to other graph construction algorithms like the Population Reference Graph ( Dilthey et al., 2015). Finally, as the number of input genomes increases in size, it will become increasingly necessary to establish the mutual homology relationships between variant alleles from different samples and to represent these in the form of nested graphs; the MSA contains the information necessary for this. As an example, consider the case of two large insertion variants that differ from each other by a single base: in the field of graph genomes, these are most naturally represented as one large insertion with an additional SNP nested into it (instead of two near-identical branches). These points notwithstanding, multiple sequence alignments come at a computational cost, and might prove to be computationally prohibitive if the number of input genomes increases by more than one order of magnitude.
There are two important directions for future work. First, in the spirit of a hackathon, we have focused our efforts on the software development process. A comprehensive empirical evaluation of the constructed human genome graph is still outstanding. This could be achieved by loading the graph into multiple graph-based inference frameworks and by measuring genome inference accuracy. Secondly, it would be important to better understand the impact on the graph construction process of various parameter settings and trade-offs. For example, in the interest of simplicity, we implemented a simple gap scoring scheme that is neither affine nor convex; we relied on the default settings of MAFFT for the generation of the multiple sequence alignments; and we implemented a naive algorithm to split the reference genome into windows for MSA generation. Exploring alternative choices in each of these cases would be straightforward and could lead to valuable insights. A convex gap scoring scheme would probably improve the alignment of large and complex structural variants ( Sedlazeck et al., 2018) and therefore be the most important point to address.
These limitations notwithstanding, we believe that NovoGraph represents a useful addition to the field of graph genomes. A strength of NovoGraph is its ability to generate genome graphs for all major genome graph approaches directly from de novo assembly data. The graphs constructed with NovoGraph are available for download and could, for example, inform comparisons of different genome graph construction methods and the improved calling of structural variation.
Data availability
Input assemblies are publicly available and carry the NCBI assembly accession numbers GCA_001750385.2 (AK1), http://identifiers.org/ncbigi/GI:1078263188; GCA_001297185.1 (CHM1), http://identifiers.org/ncbigi/GI:929855629; GCA_001015385.3 (CHM13), http://identifiers.org/ncbigi/GI:953917559; GCA_001549605.1 (HG003), http://identifiers.org/ncbigi/GI:985741195; GCA_001549595.1 (HG004), http://identifiers.org/GI:985734877; GCA_001524155.1 (NA19240), http://identifiers.org/ncbigi/GI:1057722128; GCA_001708065.2 (HX1), http://identifiers.org/ncbigi/GI:1087879108. Full assembly data access details are given in Table 1.
All NovoGraph output data are available on OSF: https://doi.org/10.17605/OSF.IO/3VS42 ( Biederstedt, 2018). The genome graphs of seven ethnically diverse human genomes in VCF format can be downloaded from https://osf.io/t5czk/?view_only=fedd8437d96c4d688f6c40150903d857 (constructed with NovoGraph-Universal) and https://osf.io/pgq52/?view_only=fedd8437d96c4d688f6c40150903d857 (constructed with NovoGraph-Simple). The global multiple sequence of all input sequences in CRAM format can be downloaded from https://osf.io/jhbwx/?view_only=fedd8437d96c4d688f6c40150903d857. OSF data are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication).
Software availability
Source code for the pipeline is available from: https://github.com/NCBI-Hackathons/NovoGraph.
Archived source code at time of publication: https://doi.org/10.5281/zenodo.1342485 ( Biederstedt et al., 2018).
License: MIT license.
Acknowledgements
The authors would like to acknowledge the NCBI for facilitating the hackathon and thank Valerie Schneider, Justin Zook, and Lisa Federer for technical discussions. The authors thank Amazon Web Services for the Cloud Credits provided to hackathon participants during the October 2016 hackathon. The authors would also like to acknowledge Richard K. Wilson and the McDonnell Genome Institute, Washington University School of Medicine, for making available the assembly of NA19240.
Funding Statement
This work was supported by the Intramural Research Program of the National Library of Medicine, National Institutes of Health; by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health; and by the Jürgen Manchot Foundation.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
[version 1; referees: 1 approved
References
- Biederstedt E: NovoGraph.2018. 10.17605/OSF.IO/3VS42 [DOI] [Google Scholar]
- Biederstedt E, Oliver J, Dunn N, et al. : NCBI-Hackathons/NovoGraph: NovoGraph 1.0.0 (Version v1.0.0). Zenodo. 2018. 10.5281/zenodo.1342485 [DOI] [Google Scholar]
- Busby B, Lesko M, August 2015 and January 2016 Hackathon participants, et al. : Closing gaps between open software and public data in a hackathon setting: User-centered software prototyping [version 2; referees: not peer reviewed]. F1000Res. 2016;5:672. 10.12688/f1000research.8382.2 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chaisson MJ, Huddleston J, Dennis MY, et al. : Resolving the complexity of the human genome using single-molecule sequencing. Nature. 2015;517(7536):608–611. 10.1038/nature13907 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Computational Pan-Genomics Consortium: Computational pan-genomics: status, promises and challenges. Brief Bioinform. 2018;19(1):118–135. 10.1093/bib/bbw089 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Danecek P, Auton A, Abecasis G, et al. : The variant call format and VCFtools. Bioinformatics. 2011;27(15):2156–2158. 10.1093/bioinformatics/btr330 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Dilthey A, Cox C, Iqbal Z, et al. : Improved genome inference in the MHC using a population reference graph. Nat Genet. 2015;47(6):682–688. 10.1038/ng.3257 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Dilthey AT, Gourraud A, Mentzer AJ, et al. : High-Accuracy HLA Type Inference from Whole-Genome Sequencing Data Using Population Reference Graphs. PLoS Comput Biol. 2016;12(10):e1005151. 10.1371/journal.pcbi.1005151 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Eggertsson HP, Jonsson H, Kristmundsdottir S, et al. : Graphtyper enables population-scale genotyping using pangenome graphs. Nat Genet. 2017;49(11):1654–1660. 10.1038/ng.3964 [DOI] [PubMed] [Google Scholar]
- Garrison E, Sirén J, Novak AM, et al. : Sequence variation aware genome references and read mapping with the variation graph toolkit. bioRxiv. 2017. 10.1101/234856 [DOI] [Google Scholar]
- Hsi-Yang Fritz M, Leinonen R, Cochrane G, et al. : Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res. 2011;21(5):734–740. 10.1101/gr.114819.110 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jain M, Koren S, Miga KH, et al. : Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat Biotechnol. 2018;36(4):338–345. 10.1038/nbt.4060 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Katoh K, Standley DM: MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013;30(4):772–780. 10.1093/molbev/mst010 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kuśnierczyk P: Killer cell immunoglobulin-like receptor gene associations with autoimmune and allergic diseases, recurrent spontaneous abortion, and neoplasms. Front Immunol. 2013;4:8. 10.3389/fimmu.2013.00008 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li H: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv E-prints. 2013. Reference Source [Google Scholar]
- Li H, Handsaker B, Wysoker A, et al. : The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25(16):2078–2079. 10.1093/bioinformatics/btp352 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Maciuca S, del Ojo Elias C, McVean G, et al. : A Natural Encoding of Genetic Variation in a Burrows-Wheeler Transform to Enable Mapping and Genome Inference. In: Frith M, Storm Pedersen CN. eds. Algorithms in Bioinformatics. Lecture notes in computer science Cham: Springer International Publishing:2016;222–233. 10.1007/978-3-319-43681-4_18 [DOI] [Google Scholar]
- Marsh SGE, Parham P, Barber LD: The' ' HLA factsbook. San Diego: Academic Press. 10.1016/B978-0-12-545025-6.X5127-2 [DOI] [Google Scholar]
- Paten B, Novak AM, Eizenga JM, et al. : Genome graphs and the evolution of genome inference. Genome Res. 2017;27(5):665–676. 10.1101/gr.214155.116 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rakocevic G, Semenyuk V, Spencer J, et al. : Fast and Accurate Genomic Analyses using Genome Graphs. bioRxiv. 2017. 10.1101/194530 [DOI] [PubMed] [Google Scholar]
- Robinson JT, Thorvaldsdóttir H, Winckler W, et al. : Integrative genomics viewer. Nat Biotechnol. 2011;29(1):24–26. 10.1038/nbt.1754 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schneider VA, Graves-Lindsay T, Howe K, et al. : Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 2017;27(5):849–864. 10.1101/gr.213611.116 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sedlazeck FJ, Rescheneder P, Smolka M, et al. : Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods. 2018;15(6):461–468. 10.1038/s41592-018-0001-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Seo JS, Rhie A, Kim J, et al. : De novo assembly and phasing of a Korean human genome. Nature. 2016;538(7624):243–247. 10.1038/nature20098 [DOI] [PubMed] [Google Scholar]
- Shi L, Guo Y, Dong C, et al. : Long-read sequencing and de novo assembly of a Chinese genome. Nat Commun. 2016;7:12065. 10.1038/ncomms12065 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sibbesen JA, Maretty L, Danish Pan-Genome Consortium, et al. : Accurate genotyping across variant classes and lengths using variant graphs. Nat Genet. 2018;50(7):1054–1059. 10.1038/s41588-018-0145-5 [DOI] [PubMed] [Google Scholar]
- Steinberg KM, Graves-Lindsay T, Schneider VA, et al. : High-Quality Assembly of an Individual of Yoruban Descent. bioRxiv. 2016. 10.1101/067447 [DOI] [Google Scholar]
- Steinberg KM, Schneider VA, Graves-Lindsay TA, et al. : Single haplotype assembly of the human genome from a hydatidiform mole. Genome Res. 2014;24(12):2066–2076. 10.1101/gr.180893.114 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sudmant PH, Rausch T, Gardner EJ, et al. : An integrated map of structural variation in 2,504 human genomes. Nature. 2015;526(7571):75–81. 10.1038/nature15394 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Thorvaldsdóttir H, Robinson JT, Mesirov JP: Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform. 2013;14(2):178–192. 10.1093/bib/bbs017 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Trowsdale J, Knight JC: Major histocompatibility complex genomics and human disease. Annu Rev Genomics Hum Genet. 2013;14(1):301–323. 10.1146/annurev-genom-091212-153455 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zook JM, Catoe D, McDaniel J, et al. : Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci Data. 2016;3:160025. 10.1038/sdata.2016.25 [DOI] [PMC free article] [PubMed] [Google Scholar]