Version Changes
Revised. Amendments from Version 1
We would like to thank our reviewers for their constructive, comprehensive reviews. We sincerely apologise for the time it took to provide a response, which was partially due to difficulty with coordination during the global covid-19 pandemic. We now submit a revised version, addressing the reviewers’ comments, improving the general readability of the manuscript, and replacing pre-print references with their corresponding peer-reviewed references. - We have addressed the concerns related to the readability of the manuscript. Given the nature of the hackathon, the revised manuscript does lack the narrative continuity of traditional papers. The end result may still leave readers unsatisfied, but we are trying to follow the F1000 format. - We have updated the sections relating to plant genomics, explaining the motivation behind the importance of applying pangenomic methods to plant genomes (as opposed to vertebrate genomes) and detail the problems encountered. - We have corrected various typos within the manuscript, and corrected unclear captions for diagrams within. - We have tried to provide a clearer motivation for why certain methods were chosen in the piece for computational experiments. We have also attempted to detail the results of these experiments; given the nature of the hackathon, some of these computational experiments proved computationally intractable (or simply too expensive given the resources at hand) to continue, and were therefore abandoned. - Given the time since our original publication, other publications (motivated by this hackathon) have addressed larger questions of the best practices for applications of graph genomes. We have cited these papers within short explanations in the relevant sections. Naturally, there are still many open questions in this field, which will drive more methods and analyses
Abstract
In March 2019, 45 scientists and software engineers from around the world converged at the University of California, Santa Cruz for the first pangenomics codeathon. The purpose of the meeting was to propose technical specifications and standards for a usable human pangenome as well as to build relevant tools for genome graph infrastructures. During the meeting, the group held several intense and productive discussions covering a diverse set of topics, including advantages of graph genomes over a linear reference representation, design of new methods that can leverage graph-based data structures, and novel visualization and annotation approaches for pangenomes. Additionally, the participants self-organized themselves into teams that worked intensely over a three-day period to build a set of pipelines and tools for specific pangenomic applications. A summary of the questions raised and the tools developed are reported in this manuscript.
Keywords: Hackathon, Pangenome, Graph Genome, RNAseq, Structural Variant
Introduction
What is pangenomics?
The current human reference genome, GRCh38 ( Schneider et al., 2017), derives from a draft sequence that was constructed from a handful of individuals ( Lander et al., 2001) likely of African and European ancestries ( Reich et al., 2009). Today, GRCh38 captures a limited amount of additional genetic variation by providing alternative sequence representations (“alt loci”) for complex or highly variable regions, such as the SMA and MAPT loci on chromosomes 5 and 17, respectively ( Schneider et al., 2017), whose sequence is derived from additional DNA samples. However, analyses of other individual human genome assemblies from Europeans ( Ameur et al., 2018; Audano et al., 2019; Kidd et al., 2010; Levy et al., 2007; Wheeler et al., 2008), East Asians ( Audano et al., 2019; Kidd et al., 2010; Li et al., 2010; Seo et al., 2016; Shi et al., 2016), South Asians ( Audano et al., 2019; Kitzman et al., 2011), Amerindians ( Audano et al., 2019) and Africans ( Audano et al., 2019; Kidd et al., 2010; Li et al., 2010; Sherman et al., 2019) have still revealed a substantial amount of genomic information not represented in the reference assembly. Indeed, large re-sequencing projects showed an extensive human genetic diversity, even within the genomic content captured in the reference sequence ( Bycroft et al., 2018; 1000 Genomes Project Consortium et al., 2010; Mallick et al., 2016). Although GRCh38 is the most complete human reference to date, it is not clear how to construct a linear reference that can capture diversity and address population biases that impact analysis ( Brandt et al., 2015; Degner et al., 2009).
Why is a pangenome representation superior to the current human reference assembly model?
The diploid structure of human DNA is not currently represented in the current reference model, which is instead an arbitrary linear combination of different haplotypes (i.e., a mosaic) from multiple individuals. A human “pangenome” is a representation of all genomic variation observed in human populations ( Computational Pan-Genomics Consortium, 2018). In this context, a pangenome is a more comprehensive representation of genetic diversity than an individual diploid genome or a reference comprised of linear chromosomes built from multiple individuals, such as GRCh38. By extension, pangenomics encompasses approaches that utilize a pangenome reference. Pangenomics is designed to address the limitations of current standards, such as reference bias during the identification of genomic variants, population stratification and admixture, or ancestry-specific functional variants—among others, which impact evolutionary, agricultural and health genetics research. For example, reference bias in the sequence alignment to GRCh38 (excluding its alt loci) reduces our ability to correctly genotype regions that are likely to significantly diverge from the reference chromosome representations—e.g. immune regions such as the major histocompatibility complex (MHC) and killer cell immunoglobulin-like receptors (KIR), and the CYP2D6-8 loci involved in drug metabolism ( Dilthey et al., 2015). Alignment around indels becomes more challenging as their size increases with soft-clipping being preferred over split-read alignment ( Garrison et al., 2018; Narzisi et al., 2014). Variants cannot be identified within regions completely missing from the reference sequence, many of which have been recently identified to be common across individuals ( Taliun et al., 2019). Although bias and missing sequence may still persist in a pangenome, their effects should be substantially less, and may even be ameliorated by adding new content to the framework. In addition to these issues with the current reference, several studies using long reads have reported an average of ~20,000 structural variants (SV) per human genome, most of which fall within repetitive elements and segmental duplications (HGSVC) ( Audano et al., 2019; Chaisson et al., 2015). Many of these SVs intersect genes and regulatory elements, harbor transposable elements, and affect gene expression ( Audano et al., 2019; Chiang et al., 2017). Although they are largely inaccessible to short-read sequence with current methods, these variants can be more easily re-identified using a pangenome ( Chen et al., 2019; Hickey et al., 2020). Complex loci that harbor multiple repeats are also quite challenging to detect and genotype by aligning reads to a linear reference. Important disease-linked repeats, such as the CAG repeat in the HTT gene that causes HD and the CAG repeat in ATXN8 that causes Spinocerebellar ataxia type 8 (SCA8), are both flanked by other polymorphic repeats making them particularly difficult to accurately genotype. Sequence graphs offer again a general and a more flexible approach to handle these complex loci ( Dolzhenko et al., 2019).
What is a haplotype?
The International HapMap Consortium defines a haplotype as “a particular combination of alleles along a chromosome” ( International HapMap Consortium, 2005). A diploid individual has two haplotypes for any given genomic sequence—up to the complete genome itself—since it inherits a set of homologous chromosomes from each parent ( Crawford & Nickerson, 2005). At the population level, there may be more than two haplotypes for any given sequence. The definition of haplotype will vary in the scientific literature depending on discipline-specific questions and applications ( Hoehe, 2003). For evolutionary and population geneticists, haplotype may be short for haplotype block, which is a group of alleles that are inherited together across multiple generations and results from recombination and selection; the arrangement and length of haplotype blocks will inform about past population history ( Wang et al., 2002). For medical geneticists, haplotype may represent a functional haplotype at the gene level, i.e. genetic markers linked to a disease-associated allele in so-called linkage disequilibrium, or LD ( Slatkin, 2008). For livestock and crop breeders, a haplotype may be the minimal genomic region that influences a trait of interest ( Hayes, et al., 2013; Qian et al., 2017). Whatever the definition of a haplotype, haplotypic information can simultaneously provide clues about population history and disease or trait association ( Martin et al., 2018).
Why is phasing important?
Today’s widespread use of short-read sequencing provides easy access to genotypes but does not necessarily directly inform about the parental origin of each allele. However, the real power of haplotypes resides in phasing, which is the assignment of a given combination of alleles to each homologous chromosome ( Browning & Browning, 2011). Beyond the methodological challenge of phasing genomes ( Choi et al., 2018), the two haploid sequences in a diploid genome cannot be captured simultaneously in one linear sequence. However, a genome graph representation of a pangenome provides a spatial framework to embed multiple haplotypes at once and preserve phasing information ( Paten et al., 2017). This property of graph representations of a genome is critical. At the gene scale, phasing information can be used to recognize compound heterozygosity, whereby the two homologous copies of a gene are each affected by a distinct recessive mutation ( Snyder et al., 2015). Phenotype prediction depends heavily on the ability to distinguish point mutations or deletions between chromosomes ( Cirulli & Goldstein, 2010; Tewhey et al., 2011), making the retention of phasing information fundamental for the interpretation of results in a personalised medicine setting. Other applications of phasing include the inference of past population demographic history by looking at the distribution and size of haplotype blocks along chromosomes ( Schiffels & Durbin, 2014). Variant imputation also depends heavily on the availability of phasing information and becomes a key approach in large cohort studies with missing genotypes ( Das et al., 2018). Finally, the sequencing of fetal cell-free DNA in maternal plasma is a very promising way to study fetal genomes in a non-invasive manner. However, it is first essential to phase haplotypes from at least one of the parents ( Fan et al., 2012; Kitzman et al., 2012).
Methods
Here we describe the data sets and graph construction techniques used during the codeathon, as well as the pipelines and software that were developed.
Implementation
Graph coordinates system
To establish protocols to build pangenomic graphs from chromosome-level and ultra-long assemblies, we constructed graphs using the human reference genome GRCh38.p13, CHM1 cell-line data, and two primate references: chimpanzee (PanTro PTRv2; Clint; GenBank assembly accession GCA_002880755.3) and Sumatran orangutan (PonAbe3 PABv2; Susie; GenBank assembly accession GCA_002880775.3). Additionally, we built human-only graphs using the human reference genome (GRCh38.p13; GenBank assembly accession GCA_000001405.28) and the Japanese reference genome (JG1; available at https://jmorp.megabank.tohoku.ac.jp/201902/downloads/).
We used three different methods to build the graph and explore the potential limitations and advantages of each method. These methods were chosen as they allow us to explore evolutionary questions, such as ancestral states, large structural variations between groups, and complex gene genealogies. They were are used for their computational tractaibility in the limited time frame of the 3-day hackathon. We first created graphs based on sequences from chromosome 21 from GRCh38 (CM000683.2), Clint the chimpanzee (CM009259.2), Susie the Sumatran orangutan (CM009283.2), and CHM1 (AC244111.3, AC244144.2, AC244518.2, AC245051.3, AC245314.2, AC246819.2, AC255431.1, AC256301.1, AC277730.1, AC277802.1, AC277887.1).
Graph method 1: We used minimap2 (v2.16-r922) ( Li, 2018) with the parameter preset asm5 to do an all-vs-all alignment of the sequences. We then used seqwish (6e4fe705;) to induce a graph in GFAv1 (Graphical Fragment Assembly) format, and converted this to VG format ( Garrison et al., 2018) for further investigation.
Graph method 2: We used Cactus ( Paten et al., 2011), which is designed to build genome graphs of different taxa while accounting for the phylogenetic relationship between the organisms included. The generated Cactus graph in HAL format was converted to VG format using hal2vg for mapping and visualization.
Graph method 3: We used SibeliaZ ( Minkin & Medvedev, n.d.) to build a graph from chr1 of JG1 and GRCh38.
Finally, we designed a prototype of a graph coordinates system based on previously proposed ideas ( Rand et al., 2017) that streamlines the incorporation of new haplotypes into the graph, while preserving a structure that is retro-compatible with the GRCh38 linear reference coordinates ( Figure 1). Such a coordinate system offers a host of advantages, as it allows easier surjection/projection of graph coordinates onto the linear reference coordinates. It also streamlines variant discovery and improves annotation portability.
A faster, better short-read mapper with hit chaining
Our work modifies vg ( Garrison et al., 2018; Hickey et al., 2020) to create a fast and efficient read mapper. During the codeathon, we have improved a prototype minimizer-based mapper by adding a faster clustering function to cluster minimizer hits and hit extension logic for handling clusters that have no good full-length gapless alignment ( Figure 2).
The clustering algorithm has been improved by reducing the amount of data copying in the clustering implementation. Alignments may be output from the extender if chaining is not necessary. Additionally, we have devised an improved algorithm for comparing sets of clusters.
We also implement hit chaining which allows us to deal with crossovers and indels. When the extender cannot find a full-length gapless extension of the read alignment to some haplotype with below a threshold number of mismatches, where it previously would leave the read unaligned, it will instead now compute maximal unambiguous-path exact matches between the read and the graph’s embedded haplotypes and feed them to an extension step. The extension step will trace out the haplotype segments that could connect between those matches, perform gapped alignment of the relevant read sequence against each, and take the best for each possible connection. Then the resulting multipath alignment will be linearized into an optimal gapped single-path alignment for the read.
Pipeline for mapper evaluation on maize graphs
Pangenomics naturally has applications outside of human genomics, and we sought to test how current graph genome methods would apply to genomes more complex in terms of ploidy and variation.tWe also sought to test a plant mode For this, we chose the maize (Zea mays) genome, which is 2.3 Gb in length with 10 chromosomes and contains over 32,000 protein-coding genes ( Schnable et al., 2009). A total of 85% of this genome has been estimated to contain transposable elements (TE) ( Schnable et al., 2009). Using chr 10, we composed a graph using vg construct and compared it to a graph created with minimap2 for alignment and seqwish (for converted iting to GFA1 format with seqwish (Graph method 1) ( Figure 3).
We could not index the minimap2/seqwish graph for mapping because it contained extremely large snarls, with hundreds of thousands of net graph nodes. One of the indexes we needed to produce, the distance index, which is used for identifying nearby seed hits for clustering, requires doing an all-against-all distance computation on the net graph of each snarl, and that process tried to allocate more memory to hold its result than waswe had provisioned for the hackathonon our machine. We thus aborted the experiment at that step.
We believe that the graph we generated, shown in Supplemental Figure 1 (available as Extended data) as an odgi visualization, was pathologically complex and intractable, because we did not remove spurious, short alignments from the minimap2 output. The intractability of this graph precluded further analysis.
Mapping RNA sequencing data to variant graphs
Using known variants and haplotypes during mapping of RNA sequencing (RNA-seq) data have shown to be important for reducing reference bias and thus improving downstream analyses. Reference bias is known to negatively impact estimation of allele-specific expression ( Degner et al., 2009) and variant-aware mapping is one of the best ways to mitigate this problem ( Castel et al., 2015). Furthermore, it has been shown that inference of gene expression in the highly polymorphic MHC can be improved by using the alternative reference haplotypes during mapping ( Lee et al., 2018). A few variant-aware methods for mapping of RNA-seq reads exist, including GSNAP (Genomic Short-read Nucleotide Alignment Program) ( Wu & Nacu, 2010) and Hisat2 ( Kim et al., n.d.). Hisat2 is similar to vg in that it is also based off of a graph representation.
We wanted to test whether we could also use vg to map RNA-seq reads to a graph containing both known variants, splice-junctions and haplotype-specific transcript paths. We called this a spliced variation graph. We further wanted to show that we could use the reads mapped to the graph to get unbiased estimates of allele-specific transcript expression. The pipeline would serve as a proof of concept for a graph based approach for inferring allele-specific transcript expression when an individual's haplotypes are available, similar to the personal genome approach ( Rozowsky et al., 2011).
Assessment of mutation rates in and around structural variants using graph genomes
Mutation rates vary across the genome with certain hotspots associated with accessible regions as well as other genomic features. This is also discussed in the presence of gene duplication where in a single copy gene case the mutations are rare due to the selection pressure. However, this selection pressure is reduced when there are two or more copies of the gene, and higher mutation rates are possible for at least one copy of the gene.
To assess the presence of SNPs inside SVs, we constructed a graph genome in vg ( Garrison et al., 2018) to incorporate the SVs found in a recent Cell paper ( Audano et al., 2019). This highlights one application where graph genomes might provide improved insight over traditional mapping approaches. To assess this we used SNP calls for HG002, a gold standard in genomics reported to be present based on the Genome In A Bottle (GIAB) consortium ( Zook et al., 2016). We compared the power of vg over short Illumina reads and Pacific Biosciences (PacBio) Circular consensus sequencing (CCS) reads and PacBio continuous long reads (CLR). Subsequently we extended our project to additional samples, focusing on the assessment of mutation rates inside common SVs between the Caucasian and African populations. This revealed changes in mutation rates when looking at tandem duplications between the flanking and the affected regions. It would be interesting to scale this project further for larger cohort samples to assess the mutation rate across multiple samples and ethnicities. This could help understand if SVs are indeed the driver for certain phenotypes, or if the variations within the SVs are more likely to impact the phenotypes.
The code to generalize this analysis for larger cohorts such as the 1000 Genomes Project or Simons Genome Project samples is available on GitHub (See “Data and software availability”).
Implementing annotations on pangenome graphs
Linear genomes currently rely on genomic intervals as a core formalism for annotation but it is difficult to generalize this formalism to reference graphs. A genomic interval corresponds to a path through a graph. However, if we restrict the annotation to one path in the graph, the alternate alleles in the graph are not included in the annotation. We argue that connected subgraphs are a more appropriate formalism for annotating genome graphs. Using a new core formalism for annotation necessarily means that infrastructure to manipulate it does not yet exist. We need stable and exchangeable representations of the data, software support, and analysis tools to make the formalism useful for practitioners. We have developed a proof-of-concept system for projecting linear reference annotations onto genome graphs and utilizing them in downstream visualizers and analyses. The standard file format, named gGFF, has been defined on GitHub and code to manipulate and use this file format has been included in vg. We also developed a tool for performing utility operations on gGFF files, such as intersection and union.
A common use of annotations is generating gene or transcript-level counts of RNAseq read mappings for differential expression analysis. We have implemented an example RNA-seq quantification pipeline using a graph constructed from GRCh38 ch21 and variants from the 1000 Genomes Project. We converted this to a splice site-aware graph with vg rna. The next step would be to map RNA-seq reads to this graph and estimate coverage per base-pair using vg pack and gene-level quantification computed using GENCODE 29 annotation.
Operation
The software should run on most Linux installations. Interested parties are encouraged to clone the GitHub repository and follow the workflow/instructions provided for the individual implementations of the Use cases listed below. Pull requests and contacting the authors is strongly encouraged.
Results and use cases
Fundamentally, the motivation behind exploring graph genomes lies in the novel insights we may gain with their applications ( Eizenga et al., 2020). There are also regions— -- outside the alternative loci that are defined for GRCh38— -- that cannot easily be reduced to a single linear reference, and telomere-to-telomere de novo assembly of each individual genomes ( Logsdon et al., 2021; Miga et al., 2020) will likely be implausible on a large scale for the foreseeable future. Graph genomes can be used for inference of extension and phasing from sparse information derived from SNP chips and RNA-seq. They can also be used to infer allele- specific expression on an individual level. Additionally, there is development of methods to represent variation in the clinically important MHC locus, and explore this locus at a population level ( Chin et al., 2020; Dilthey, 2021).
Finally, in theory, having clusters of haplotypes within and across populations will allow us to efficiently determine the relationships of proximal and distal phenotype-relevant events.
Taking these points together, a pangenome graph would likely result in a reduction in the “total cost of ownership of genomes”; i.e. people can use information derived from graphs instead of remapping to a linear genome over and over again, expending computer resources needlessly to createing novel .bams/.vcfs files ad infinitum.
Use case: Integrating haplotype information into reference genome with retro-compatibility
Representing haplotype information in reference genomes is beneficial in increasing mappability and reducing bias. The major concerns for representing haplotypes in the existing reference genome are the alteration of coordinates, redundant representation, and ambiguous sequence inference. Our proposed notation tackles these issues with the following design philosophies:
-
1-
The haplotype contigs are coordinated and defined as an add-on outside the extant reference genome coordinates. This allows the set of haplotype contigs to be updated separately, and the inclusion of haplotype sequence does not alter the underlying reference genomic coordinates. This design also allows the user to include fix patches [i.e. updates that correct errors or add sequence associated with gaps in the reference sequence; ( Schneider et al., 2017)] in the graph or to recreate custom sequence using their haplotype of interest.
-
2-
Each haplotype and nested haplotype are defined as a unique segment based on the reference genome or the closest haplotype; therefore, the number of bases that need to be stored for each haplotype sequence is minimalized.
-
3-
Each haplotype can be uniquely represented using GFA-like notation that can track back into the node storing specific sequence for each haplotype.
Our proposed model allows nodes and edges represented in the GFA to change without changing the sequence corresponding to each haplotype ( Figure 4). Such an approach will be essential for future methods to both manipulate graphs that have already been constructed, as well as do comparative analyses between graphs using a common coordinate system as methods improve.
Use case: What about plants?
The potential for applying pangenomic methods to analyze plant genomes is immense. Several new plant genomes have recently been sequenced and built upon the previously produced model plant assemblies, providing a foundation for research and end-use applications in agriculture. Crop plants form the foundation for the world's natural food and textile resources, and plant breeding efforts are often focused on improving several quality traits. A graph-based sequence-centric view of genomes sets the stage for facilitating key decisions that can be made to improve crop infrastructure.
Diversity in plants comprises an array of genome types with regard to species identity, genome size, chromosome number, and ploidy level. Pangenome studies have commenced for many of the model plant species, such as Arabidopsis thaliana (flowering plants) ( Clark et al., 2007), Medicago truncatula (legumes) ( Miller et al., 2017; Zhou et al., 2017), and Brachypodium distachyon (grasses) ( Gordon et al., 2017), due to the attractive attributes of their small genomes and short generation-times. Likewise, pangenome studies now target on their corresponding larger cousins, which include crop plants of economic importance such as crucifers, soybean, and wheat ( Montenegro et al., 2017). These pangenome studies used highly developed sequence analyses, but not a graph-based approach. Several pangenome-related papers appear to be in preparation for other important plant species (e.g., maize); whether they all use graph-based methods remain to be seen. The exercise of testing graph-based sequence views will help formulate use-case scenarios.
Many challenges exist of course in terms of applications of graph representations of plant genomes, mostly due to their inherent complexity. One challenge is working with highly-divergent sequences during the construction of the pangenome, given the tradeoff between computational expediency and accuracy. Taking into account the transposons within plant genomes (e.g., maize as discussed above), methods relying upon global sequence alignment for whole genomes would need to address the issues of large translocations and inversions between chromosomes. Plants are often not only diploid as well, as opposed to the human genome. In sum, many pangenomic methods have had some success for verterbrate genomes (as detailed in this paper), but it is unclear how applicable these methods will be for highly complex plant genomes.
Immediate uses for graphs of plant genomes would be to validate hypothetical evolutionary tree diagrams assigned to species, and perhaps address instances where species are proposed to be ancient polyploids, or to gauge genome changes in current polyploid genomes. RNA-Seq methods may also be matched against graph-based maps to quantify expression from the genomes. For instance, it would be interesting to assess whether nutritional- or medicinal-related trait changes can be tracked to genomic structural variation using graph-based methods targeted on key metabolic pathway-associated genes. The tracking of highly repetitive transposon-initiated events may also explain some of the alterations observed in different genome species and their evolutionary consequence resulting in gene duplication, rearrangements, and the like. Use of graph-based methods to map out highly variable regions may also provide strategies toward implementing targeted engineering of species, or assist in classic breeding strategies where known attributes are known to structurally exist. Similarly, many wild ancestor lines are sought to bring in new gene function to serve as sources for disease resistance, quality traits, and nutrition; their inclusion in the graph will enable an understanding of their contributions on the whole genome scale. The construction of pangenomes by graph-based methods, and the subsequent visualization of these graphs therefore appear likely to have a valuable role in the future of agricultural improvements.
Use case: RNA-seq mapping
Within the realm of RNA-seq, graphs can also be used to validate and benchmark analytical methods. For example, we created a spliced variation graph of chr21 using the rna submodule in vg (see WDL pipeline for more details) to test the RNA-seq mapping performance of vg. We used variants from the NA12878 individual in the 1000 Genomes Project ( 1000 Genomes Project Consortium et al., 2015) and transcripts from the GENCODE v29 annotation ( Frankish et al., 2019). The paired-end RNA-seq reads were simulated using RSEM ( Li & Dewey, 2011) from the haplotype-specific transcripts generated from vg rna. vg’s two mapping algorithms map and mpmap were able to align 71.6% and 73.8% of the simulated reads with a mapping quality of at least 30, respectively. This is similar to the value observed for Hisat2 using the same data. We also tested both algorithms on graphs only consisting of exonic sequences. Using these graphs, the performance increased slightly (1.5 to 2%). Due to a lack of time we were not able to finish the second part of the pipeline that involved estimating allele-specific expression from the mapped reads.
This is very much a work in progress, and work so far has only been a proof of principle. For example, all splice-junctions and variations present in the reads were also present in the graph. In addition, due to time constraints we only used the number of mapped reads as a proxy of performance and did not assess whether the reads were correctly mapped. These issues will need to be addressed in future benchmarks in order to get a more accurate estimate of vg performance on spliced variation graphs and applications for RNAseq in general.
Use case: Producing a fully phased diploid assembly of the HG002 MHC region
The MHC, located on human chr6, is a region highly enriched for genes and variation, including the human leukocyte antigen (HLA) which is involved in immune system function. Genetic associations between variants in this region involve different diseases, including autoimmune diseases. MHC haplotypes differ substantially, making it challenging to map reads from this region and call variants with conventional methods on a linear reference. We sought to generate a base-level accurate, fully phased, diploid assembly of the MHC of GIAB HG002 (NA24385, Ashkenazi son). The only previous studies producing fully phased, contiguous diploid assemblies for the MHC involved the NA12878 genome with PacBio reads (non-CCS) ( Jain et al., 2018; Koren et al., 2018). In this work, we use newer PacBio CCS and ultralong Oxford Nanopore reads, along with 10x Genomics linked-reads, to produce and carefully evaluate a targeted MHC diploid assembly for a second individual from GIAB.
The data for this work relied on sequencing results from three different PacBio CCS libraries with average read lengths of 9 kb and 13 kb for the Sequel I chemistry and 11 kb for the Sequel II chemistry, and each dataset having ~25 to 30X coverage. We also used “ultralong” data from Oxford Nanopore Technology (ONT) with total coverage of 16X (4X coverage by reads > 100 kb), and Promethion ONT data with total coverage of ~40X (~6X coverage by reads > 100 kb). We also used 10x Genomics data for phasing. HG002 reads were extracted from the MHC (HLA1/HLA2) region on GRCh37/hg19 chr6:28,477,797-33,448,354. Illumina data for the Ashkenazi father (HG003, NA24149) and mother (HG004, NA24143) from this trio was also used to bin the CCS reads by haplotype. The HLA typing reports for HG002/HG003/HG004 were generated at Stanford Blood Center on December 16, 2016.
The first data processing step involved finding reads from each haplotype mapped to MHC regions. An initial inspection of the HG002 MHC region occurred on the whole-genome de novo assembly of trio binned reads produced using the CCS data. The MHC region initially appeared to be well-assembled, with 1 contig derived from the father and 2 contigs derived from the mother, but further inspection revealed that the results were not coherent and that some of the haplotypes may possibly have been compressed. A second approach used 15 kb PacBio CCS reads that were mapped to the MHC and then selected for each haplotype. A local de novo assembly of these reads resulted in 10-15 contigs with many gaps between, although the assembly was close to the full length of the MHC. PacBio CCS reads were processed with Whatshap v0.19 to generate a phased VCF, which was then used to partition CCS reads by haplotypes. Reads for each haplotype were assembled independently into contigs that were then aligned to 10x Genomics s linked-read GemCode WGS contigs (whereby contigs were assembled with Supernova) to generate scaffolded CCS contigs for the diploid assembly. This diploid assembly was then used as the input for vg to build a genome graph via all versus all alignment (by Minimap2) followed by seqwish.
Confirmation of the two haplotypes
The CCS and ONT long reads were aligned to the genome graph to confirm the diploid assembly using the PedMEC phasing pipeline ( Garg et al., 2016). In addition, phasing of HLA typing results in the diploid MHC assembly were also checked against the independent HLA typing results from Stanford Typing Lab, based on the proband phased haplotypes derived from the typing results of the parents (HG003 and HG004), as shown in Table 1 (the parents’ typings are not phased).
Table 1. Genotyping results for proband HG002 and parents HG003 and HG004.
Proband | Father | Mother | ||||
---|---|---|---|---|---|---|
HLA | HG002 | HG002 | HG003 | HG003 | HG004 | HG004 |
A | *26:01:01:01 | *01:01:01:01 | *30:01:01 | *26:01:01:01 | *01:01:01:01 | *33:01:01 |
B | *38:01:01 | *35:08:01 | *13:02:01 | *38:01:01 | *35:08:01 | *14:02:01:01 |
Bw | 4 | 6 | 4 | 6 | ||
Cw | *12:03:01:01 | *04:01:01:06 | *06:02:01:01 | *12:03:01:01 | *04:01:01:06 | *08:02:01:01 |
DRB1 (DR) | *04:02:01 | *10:01:01 | *07:01:01:01 | *04:02:01 | *04:04:01 | *10:01:01 |
DQB1 (DQ) | *03:02:01 | *05:01:01:02 | *02:02:01:01 | *03:02:01 | *04:02:01 | *05:01:01:02 |
DQA1 | *03:01:01 | *01:05:01 | *02:01 | *03:01:01 | *01:05:01 | *03:03:01 |
DRB3,4,5 (DR) | 4*01:03:01:01 | 4*01:03:01:01 | 4*01:03:01:01 | |||
DPA1 (DP) | *01:03:01:04 | *01:03:01:02 | *01:03:01:04 | *01:03:01:05 | *01:03:01:02 | *01:03:01:04 |
DPB1 (DP) | *04:01:01:01 | *X | *04:01:01:01 | *04:02:01:02 | *04:01:01:01 | *X |
We will continue to explore ways graph-based analyses could be used to benchmark methods used to characterize the MHC. It will be important to identify if these haplotypes can be represented in standard VCF files with respect to the primary GRCh37/38 references in GIAB benchmark sets, or whether existing benchmarks will need new representations and benchmarking tools. Although vg can project haplotypes into a VCF file with respect to the primary reference, it remains to be determined whether this is compatible with current benchmarking tools for small variants and structural variants. Other future work will entail examining whether fully phased diploid assembly is possible in other more complex, yet medically important regions, such as those of the killer-cell immunoglobulin receptor and spinal muscular atrophy.
Conclusion
Ongoing improvements in sequencing technology and diminishing costs make the generation of high-quality genome assemblies from diverse populations possible in a way today that could only have been imagined during the Human Genome Project (HGP). These new data are likely to form the basis for a new pangenome representation for the reference assembly that includes a graph, but they also raise many as-yet unanswered questions. We must consider the sample content, data/file formats that will be used, graph construction algorithms, how relevant metadata about quality and content will be communicated to users, and whether and how changes will be managed and tracked. New tools and validation sets must be built and community education will be essential, as will long-term curation, as is currently performed by the Genome Reference Consortium for the HGP reference. Ensuring the reference assembly remains a FAIR resource ( Wilkinson et al., 2016), accessible to users world-wide is also critical, and for the first time, some ethical and privacy concerns around the reference may need to be addressed. The new software developed here provide a preview of the use cases and potential for a new pangenome reference and play an important role in developing answers to these many questions.
Gratifyingly, since this first pangenomics hackathon took place a great deal of work in the domain has been started. For example, the Human Pangenome Reference Consortium (HPRC; https://humanpangenome.org/) has been initiated by the National Human Genome Research Institute. The HPRC aims to create an updated human reference genome structure—a pangenome—good enough to replace the existing human reference, GRCh38, as a basis that will alleviate bias and so much more equally represent all of humanity. Through audacious efforts like this and other global initiatives, much work is taking place to: (i) create high-quality, reference quality genomes of a diversity of humans, (ii) organize these individuals genomes within a pangenome, (iii) develop the essential tooling that can utilize this information, and (iv) deliver compelling applications. The pipelines and tooling described in this paper represents starting points for much of this future work, and were started at the hackathon meeting.
Data availability
Underlying data
All data underlying the results are available as part of the article and no additional source data are required.
Extended data
Open Science Framework: The Human Pangenome. https://doi.org/10.17605/OSF.IO/24K9N ( Busby & Biederstedt, 2019).
Folder ‘images’, contained within folder ‘Giraffe’ contains odgi.png (Supplemental Figure 1). This file is an odgi visualization of the Zea mays chr10 minimap2/seqwish graph for two species. The pink and purple bars at the top represent regions of the linearized graph that are visited by each species’ chromosome path. The black lines forming an impenetrable morass below the bars represent adjacencies between graph nodes. This graph has pathologically high connectivity.
This file is available under the MIT license.
Software availability
For graph building and observing the GRCh38 path through a primate graph, source code and directions can be found here: https://github.com/NCBI-Hackathons/TheHumanPangenome/tree/master/DS
For ultra-fast read mapping to graph structures, source code and directions can be found here: https://github.com/NCBI-Hackathons/TheHumanPangenome/tree/master/Giraffe
Code for converting from gff3 annotations to graph annotations can be found here: https://github.com/NCBI-Hackathons/TheHumanPangenome/tree/master/annotation
WDL pipeline for mapping of RNA-seq data to spliced variant graphs can be found here:
https://github.com/NCBI-Hackathons/TheHumanPangenome/tree/master/RNA
Code for assessing structural variants with graphs can be found here:
https://github.com/NCBI-Hackathons/TheHumanPangenome/tree/master/SV
Code used to graph the MHC region can be found here:
https://github.com/NCBI-Hackathons/TheHumanPangenome/tree/master/MHC
Archived source code is available at:https://doi.org/10.17605/OSF.IO/24K9N ( Busby & Biederstedt, 2019).
License:MIT License.
Acknowledgements
Certain commercial equipment, instruments, or materials are identified to specify adequately experimental conditions or reported results. Such identification does not imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the equipment, instruments, or materials identified are necessarily the best available for the purpose. We would like to thank the administrative staff of the UCSC Genome Institute, Brad Plecs, Carl Leubsdorf and the NIH STRIDES initiative.
Funding Statement
The work of V.A.S., B.B. and C.L. was supported by the Intramural Research Program of the National Library of Medicine, National Institutes of Health. B.L. and Y.S. were supported by the Australian Research Council. F.J.S. was supported by US National Institutes of Health (UM1 HG008898). The work of A.M.N., G.H., J.M.E., X.C., J.S., J.M., and B.P. was supported by the National Institutes of Health (5U41HG007234), the W.M. Keck Foundation (DT06172015) and the Simons Foundation (SFLIFE# 35190). The work of J.A.S. was supported by the Carlsberg Foundation.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
[version 2; peer review: 2 approved]
References
- 1000 Genomes Project Consortium, Abecasis GR, Altshuler D, et al. : A map of human genome variation from population-scale sequencing. Nature. 2010;467(7319):1061–73. 10.1038/nature09534 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 1000 Genomes Project Consortium, Auton A, Brooks LD, et al. : A global reference for human genetic variation. Nature. 2015;526(7571):68–74. 10.1038/nature15393 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ameur A, Che H, Martin M, et al. : De Novo Assembly of Two Swedish Genomes Reveals Missing Segments from the Human GRCh38 Reference and Improves Variant Calling of Population-Scale Sequencing Data. Genes (Basel). 2018;9(10):486. 10.3390/genes9100486 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Audano PA, Sulovari A, Graves-Lindsay TA, et al. : Characterizing the Major Structural Variant Alleles of the Human Genome. Cell. 2019;176(3):663–75.e19. 10.1016/j.cell.2018.12.019 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Brandt DY, Aguiar VR, Bitarello BD, et al. : Mapping bias overestimates reference allele frequencies at the HLA genes in the 1000 Genomes Project Phase I Data. G3 (Bethesda). 2015;5(5):931–941. 10.1534/g3.114.015784 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Browning SR, Browning BL: Haplotype phasing: existing methods and new developments. Nat Rev Genet. 2011;12(10):703–14. 10.1038/nrg3054 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Busby B, Biederstedt E: The Human Pangenome.2019. 10.17605/OSF.IO/24K9N [DOI] [Google Scholar]
- Bycroft C, Freeman C, Petkova D, et al. : The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562(7726):203–9. 10.1038/s41586-018-0579-z [DOI] [PMC free article] [PubMed] [Google Scholar]
- Castel SE, Levy-Moonshine A, Mohammadi P, et al. : Tools and best practices for data processing in allelic expression analysis. Genome Biol. 2015;16(1):195. 10.1186/s13059-015-0762-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chaisson MJ, Huddleston J, Dennis MY, et al. : Resolving the complexity of the human genome using single-molecule sequencing. Nature. 2015;517(7536):608–11. 10.1038/nature13907 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chen S, Krusche P, Dolzhenko E, et al. : Paragraph: A graph-based structural variant genotyper for short-read sequence data. Genome Biol. 2019;20(1):291. 10.1186/s13059-019-1909-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chiang C, Scott AJ, Davis JR, et al. : The impact of structural variation on human gene expression. Nat Genet. 2017;49(5):692–99. 10.1038/ng.3834 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chin CS, Wagner J, Zeng Q, et al. : A diploid assembly-based benchmark for variants in the major histocompatibility complex. Nat Commun. 2020;11(1):4794. 10.1038/s41467-020-18564-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Choi Y, Chan AP, Kirkness E, et al. : Comparison of phasing strategies for whole human genomes. PLoS Genet. 2018;14(4):e1007308. 10.1371/journal.pgen.1007308 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cirulli ET, Goldstein DB: Uncovering the roles of rare variants in common disease through whole-genome sequencing. Nat Rev Genet. 2010;11(6):415–25. 10.1038/nrg2779 [DOI] [PubMed] [Google Scholar]
- Clark RM, Schweikert G, Toomajian C, et al. : Common sequence polymorphisms shaping genetic diversity in Arabidopsis thaliana. Science. 2007;317(5836):338–42. 10.1126/science.1138632 [DOI] [PubMed] [Google Scholar]
- Computational Pan-Genomics Consortium: Computational pan-genomics: status, promises and challenges. Brief Bioinform. 2018;19(1):118–35. 10.1093/bib/bbw089 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Crawford DC, Nickerson DA: Definition and clinical importance of haplotypes. Annu Rev Med. 2005;56:303–20. 10.1146/annurev.med.56.082103.104540 [DOI] [PubMed] [Google Scholar]
- Das S, Abecasis GR, Browning BL: Genotype Imputation from Large Reference Panels. Annu Rev Genomics Hum Genet. 2018;19:73–96. 10.1146/annurev-genom-083117-021602 [DOI] [PubMed] [Google Scholar]
- Degner JF, Marioni JC, Pai AA, et al. : Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics. 2009;25(24):3207–12. 10.1093/bioinformatics/btp579 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Dilthey AT: State-of-the-art genome inference in the human MHC. Int J Biochem Cell Biol. 2021;131:105882. 10.1016/j.biocel.2020.105882 [DOI] [PubMed] [Google Scholar]
- Dilthey A, Cox C, Iqbal Z, et al. : Improved genome inference in the MHC using a population reference graph. Nat Genet. 2015;47(6):682–88. 10.1038/ng.3257 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Dolzhenko E, Deshpande V, Schlesinger F, et al. : ExpansionHunter: A sequence-graph based tool to analyze variation in short tandem repeat regions. Bioinformatics. 2019;35(22):4754–XXXX. 10.1093/bioinformatics/btz431 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Eizenga JM, Novak M, Sibbesen JA, et al. : Pangenome Graphs. Annu Rev Genomics Hum Genet. 2020;21:139–162. 10.1146/annurev-genom-120219-080406 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fan HC, Gu W, Wang J, et al. : Non-invasive prenatal measurement of the fetal genome. Nature. 2012;487(7407):320–24. 10.1038/nature11251 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Frankish A, Diekhans M, Ferreira AM, et al. : GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 2019;47(D1):D766–73. 10.1093/nar/gky955 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Garg S, Martin M, Marschall T: Read-based phasing of related individuals. Bioinformatics. 2016;32(12):i234–42. 10.1093/bioinformatics/btw276 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Garrison E, Sirén J, Novak AM, et al. : Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat Biotechnol. 2018;36(9):875–79. 10.1038/nbt.4227 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gordon SP, Contreras-Moreira B, Woods DP, et al. : Extensive gene content variation in the Brachypodium distachyon pan-genome correlates with population structure. Nat Commun. 2017;8(1): 2184. 10.1038/s41467-017-02292-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hayes BJ, Lewin HA, Goddard ME: The future of livestock breeding: genomic selection for efficiency, reduced emissions intensity, and adaptation. Trends Genet. 2013;29(4):206–14. 10.1016/j.tig.2012.11.009 [DOI] [PubMed] [Google Scholar]
- Hickey G, Heller D, Monlong J, et al. : Genotyping Structural Variants in Pangenome Graphs Using the vg Toolkit. Genome Biol 2020;21(1):35. 10.1186/s13059-020-1941-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hoehe MR: Haplotypes and the systematic analysis of genetic variation in genes and genomes. Pharmacogenomics. 2003;4(5):547–70. 10.2217/14622416.4.5.547 [DOI] [PubMed] [Google Scholar]
- International HapMap Consortium: A haplotype map of the human genome. Nature. 2005;437(7063):1299–1320. 10.1038/nature04226 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jain M, Koren S, Miga KH, et al. : Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat Biotechnol. 2018;36(4):338–45. 10.1038/nbt.4060 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kidd JM, Sampas N, Antonacci F, et al. : Characterization of missing human genome sequences and copy-number polymorphic insertions. Nat Methods. 2010;7(5):365–71. 10.1038/nmeth.1451 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kim D, Paggi JM, Salzberg SL: HISAT-Genotype: Next Generation Genomic Analysis Platform on a Personal Computer. bioRxiv. 2018. 10.1101/266197 [DOI] [Google Scholar]
- Kitzman JO, Mackenzie AP, Adey A, et al. : Haplotype-resolved genome sequencing of a Gujarati Indian individual. Nat Biotechnol. 2011;29(1):59–63. 10.1038/nbt.1740 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kitzman JO, Snyder MW, Ventura M, et al. : Noninvasive whole-genome sequencing of a human fetus. Sci Transl Med. 2012;4(137): 137ra76. 10.1126/scitranslmed.3004323 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Koren S, Rhie A, Walenz BP, et al. : De novo assembly of haplotype-resolved genomes with trio binning. Nat Biotechnol. 2018. 10.1038/nbt.4277 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lander ES, Linton LM, Birren B, et al. : Initial sequencing and analysis of the human genome. Nature. 2001;409(6822):860–921. 10.1038/35057062 [DOI] [PubMed] [Google Scholar]
- Lee W, Plant K, Humburg P, et al. : AltHapAlignR: improved accuracy of RNA-seq analyses through the use of alternative haplotypes. Bioinformatics. 2018;34(14):2401–8. 10.1093/bioinformatics/bty125 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Levy S, Sutton G, Ng PC, et al. : The diploid genome sequence of an individual human. PLoS Biol. 2007;5(10):e254. 10.1371/journal.pbio.0050254 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li H: Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34(18):3094–3100. 10.1093/bioinformatics/bty191 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li B, Dewey CN: RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011;12:323. 10.1186/1471-2105-12-323 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li R, Li Y, Zheng H, et al. : Building the sequence map of the human pan-genome. Nat Biotechnol. 2010;28(1):57–63. 10.1038/nbt.1596 [DOI] [PubMed] [Google Scholar]
- Logsdon GA, Vollger MR, Hsieh P, et al. : The structure, function and evolution of a complete human chromosome 8. Nature. 2021;593(7857):101–107. 10.1038/s41586-021-03420-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mallick S, Li H, Lipson M, et al. : The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature. 2016;538(7624):201–6. 10.1038/nature18964 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Martin AR, Karczewski KJ, Kerminen S, et al. : Haplotype Sharing Provides Insights into Fine-Scale Population History and Disease in Finland. Am J Hum Genet. 2018;102(5):760–75. 10.1016/j.ajhg.2018.03.003 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Miga KH, Koren S, Rhie A, et al. : Telomere-to-telomere assembly of a complete human X chromosome. Nature. 2020;585(7823):79–84. 10.1038/s41586-020-2547-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Miller JR, Zhou P, Mudge J, et al. : Hybrid assembly with long and short reads improves discovery of gene family expansions. BMC Genomics. 2017;18(1):541. 10.1186/s12864-017-3927-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Minkin I, Medvedev P: Scalable Multiple Whole-Genome Alignment and Locally Collinear Block Construction with SibeliaZ. bioRxiv. n.d. 10.1101/548123 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Montenegro JD, Golicz AA, Bayer PE, et al. : The pangenome of hexaploid bread wheat. Plant J. 2017;90(5):1007–13. 10.1111/tpj.13515 [DOI] [PubMed] [Google Scholar]
- Narzisi G, O’Rawe JA, Iossifov I, et al. : Accurate de novo and transmitted indel detection in exome-capture data using microassembly. Nat Methods. 2014;11(10):1033–36. 10.1038/nmeth.3069 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Paten B, Earl D, Nguyen N, et al. : Cactus: Algorithms for genome multiple sequence alignment. Genome Res. 2011;21(9):1512–28. 10.1101/gr.123356.111 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Paten B, Novak AM, Eizenga JM, et al. : Genome Graphs and the Evolution of Genome Inference. Genome Res. 2017;27(5):665–76. 10.1101/gr.214155.116 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Qian L, Hickey LT, Stahl A, et al. : Exploring and Harnessing Haplotype Diversity to Improve Yield Stability in Crops. Front Plant Sci. 2017;8:1534. 10.3389/fpls.2017.01534 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rand KD, Grytten I, Nederbragt AJ, et al. : Coordinates and intervals in graph-based reference genomes. BMC Bioinformatics. 2017;18(1):263. 10.1186/s12859-017-1678-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Reich D, Nalls MA, Kao WH, et al. : Reduced neutrophil count in people of African descent is due to a regulatory variant in the Duffy antigen receptor for chemokines gene. PLoS Genet. 2009;5(1):e1000360. 10.1371/journal.pgen.1000360 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rozowsky J, Abyzov A, Wang J, et al. : AlleleSeq: analysis of allele-specific expression and binding in a network framework. Mol Syst Biol. 2011;7:522. 10.1038/msb.2011.54 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schiffels S, Durbin R: Inferring human population size and separation history from multiple genome sequences. Nat Genet. 2014;46(8):919–25. 10.1038/ng.3015 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schnable PS, Ware D, Fulton RS, et al. : The B73 Maize Genome: Complexity, Diversity, and Dynamics. Science. 2009;326(5956):1112–1115. 10.1126/science.1178534 [DOI] [PubMed] [Google Scholar]
- Schneider VA, Graves-Lindsay T, Howe K, et al. : Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 2017;27(5):849–64. 10.1101/gr.213611.116 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Seo JS, Rhie A, Kim J, et al. : De novo assembly and phasing of a Korean human genome. Nature. 2016;538(7624):243–47. 10.1038/nature20098 [DOI] [PubMed] [Google Scholar]
- Sherman RM, Forman J, Antonescu V, et al. : Assembly of a pan-genome from deep sequencing of 910 humans of African descent. Nat Genet. 2019;51(1):30–35. 10.1038/s41588-018-0273-y [DOI] [PMC free article] [PubMed] [Google Scholar]
- Shi L, Guo Y, Dong C, et al. : Long-read sequencing and de novo assembly of a Chinese genome. Nat Commun. 2016;7:12065. 10.1038/ncomms12065 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Slatkin M: Linkage disequilibrium--understanding the evolutionary past and mapping the medical future. Nat Rev Genet. 2008;9(6):477–85. 10.1038/nrg2361 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Snyder MW, Adey A, Kitzman JO, et al. : Haplotype-resolved genome sequencing: experimental methods and applications. Nat Rev Genet. 2015;16(6):344–58. 10.1038/nrg3903 [DOI] [PubMed] [Google Scholar]
- Taliun D, Harris DN, Kessler MD, et al. : Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. bioRxiv. 2019. 10.1101/563866 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tewhey R, Bansal V, Torkamani A, et al. : The importance of phase information for human genomics. Nat Rev Genet. 2011;12(3):215–23. 10.1038/nrg2950 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wang N, Akey JM, Zhang K, et al. : Distribution of recombination crossovers and the origin of haplotype blocks: the interplay of population history, recombination, and mutation. Am J Hum Genet. 2002;71(5):1227–34. 10.1086/344398 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wheeler DA, Srinivasan M, Egholm M, et al. : The complete genome of an individual by massively parallel DNA sequencing. Nature. 2008;452(7189):872–76. 10.1038/nature06884 [DOI] [PubMed] [Google Scholar]
- Wilkinson MD, Dumontier M, Aalbersberg JJJ, et al. : The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016;3:160018. 10.1038/sdata.2016.18 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wu TD, Nacu S: Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics. 2010;26(7):873–81. 10.1093/bioinformatics/btq057 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhou P, Silverstein KAT, Ramaraj T, et al. : Exploring structural variation and gene family architecture with De Novo assemblies of 15 Medicago genomes. BMC Genomics. 2017;18(1):261. 10.1186/s12864-017-3654-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zook JM, Catoe D, McDaniel J, et al. : Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci Data. 2016;3:160025. 10.1038/sdata.2016.25 [DOI] [PMC free article] [PubMed] [Google Scholar]