A high-quality bonobo genome refines the analysis of hominid evolution

Yafei Mao; Claudia R Catacchio; LaDeana W Hillier; David Porubsky; Ruiyang Li; Arvis Sulovari; Jason D Fernandes; Francesco Montinaro; David S Gordon; Jessica M Storer; Marina Haukness; Ian T Fiddes; Shwetha Canchi Murali; Philip C Dishuck; PingHsun Hsieh; William T Harvey; Peter A Audano; Ludovica Mercuri; Ilaria Piccolo; Francesca Antonacci; Katherine M Munson; Alexandra P Lewis; Carl Baker; Jason G Underwood; Kendra Hoekzema; Tzu-Hsueh Huang; Melanie Sorensen; Jerilyn A Walker; Jinna Hoffman; Françoise Thibaud-Nissen; Sofie R Salama; Andy W C Pang; Joyce Lee; Alex R Hastie; Benedict Paten; Mark A Batzer; Mark Diekhans; Mario Ventura; Evan E Eichler

doi:10.1038/s41586-021-03519-x

Nature. 2021 May 5;594(7861):77–81. doi: 10.1038/s41586-021-03519-x



PMCID: PMC8172381 NIHMSID: NIHMS1695658 PMID: 33953399

Abstract

The divergence of chimpanzee and bonobo provides one of the few examples of recent hominid speciation^1,2. Here we describe a fully annotated, high-quality bonobo genome assembly, which was constructed without guidance from reference genomes by applying a multiplatform genomics approach. We generate a bonobo genome assembly in which more than 98% of genes are completely annotated and 99% of the gaps are closed, including the resolution of about half of the segmental duplications and almost all of the full-length mobile elements. We compare the bonobo genome to those of other great apes^1,3–5 and identify more than 5,569 fixed structural variants that specifically distinguish the bonobo and chimpanzee lineages. We focus on genes that have been lost, changed in structure or expanded in the last few million years of bonobo evolution. We produce a high-resolution map of incomplete lineage sorting and estimate that around 5.1% of the human genome is genetically closer to chimpanzee or bonobo and that more than 36.5% of the genome shows incomplete lineage sorting if we consider a deeper phylogeny including gorilla and orangutan. We also show that 26% of the segments of incomplete lineage sorting between human and chimpanzee or human and bonobo are non-randomly distributed and that genes within these clustered segments show significant excess of amino acid replacement compared to the rest of the genome.

Subject terms: Genome informatics, Evolutionary genetics, Genome evolution, Sequencing

A high-quality bonobo genome assembly provides insights into incomplete lineage sorting in hominids and its relevance to gene evolution and the genetic relationship among living hominids.

Main

The bonobo or pygmy chimpanzee (Pan paniscus) and the common chimpanzee (Pan troglodytes) are among the most-recently diverged ape species (around 1.7 million years ago)^1,2. Both species represent the closest living species to humans and, therefore, offer the potential to pinpoint genetic changes that are also unique to human. The first bonobo sequence, which was generated using short-read whole-genome sequencing¹, resulted in a genome assembly (panpan1.1) with more than 108,000 gaps in which the vast majority of segmental duplications were not incorporated and few structural variants were identified (Supplementary Table 1). As a result of the lower accuracy of early next-generation sequencing technology and the fragmentary nature of the original chimpanzee genome, large fractions of the genomes of great apes could not be compared and gene models were often incomplete^3–8. In the past few years, long-read genome-sequencing technologies have considerably enhanced our ability to generate contiguous, high-quality genomes in which most genes and common repeat elements are fully annotated⁹. Here, we apply a multiplatform approach to produce a highly contiguous, accurate bonobo reference genome. Our analysis highlights the extent to and rapidity at which hominid genomes can differ and provides insights into incomplete lineage sorting (ILS) and its relevance to gene evolution and the genetic relationship among living hominids.

Sequence and assembly

We sequenced DNA from a female bonobo (Mhudiblu, P. paniscus) to 74-fold sequence coverage using the long-read PacBio RS II platform (Supplementary Tables 2, 3 and Supplementary Fig. 1). We generated a 3.0-gigabase assembly (contig N50 of 16.58 megabases (Mb)) (Supplementary Table 4) and constructed a chromosomal-level AGP (a golden path) assembly (Mhudiblu_PPA_v0) using Bionano Genomics optical maps and a clone-order framework using fluorescent in situ hybridization (FISH) of bacterial artificial chromosomes (BACs)¹⁰ (Fig. 1). The Mhudiblu_PPA_v0 assembly assigns 74 Mb of new sequence to chromosomes, closing 99.5% of the original 108,095 gaps (Supplementary Table 5). This assembly has been annotated by NCBI and is available in the UCSC Genome Browser (panPan3, Methods, Supplementary Data and Extended Data Fig. 1). We estimate the sequence accuracy of the bonobo assembly to be 99.97–99.99% (Supplementary Table 6 and Supplementary Data). The overall nucleotide divergence between chimpanzee and bonobo based on these new long-read assemblies is 0.421 ± 0.086% for autosomes and 0.311 ± 0.060% for the X chromosome (Supplementary Table 7). Using these new assemblies, we genotyped 27 previously sequenced great ape genomes, which resulted in slight adjustments in median effective population sizes for the great apes (Extended Data Fig. 2).
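The contig N50 quoted above is a standard contiguity summary. As a minimal sketch of how such a value is usually derived (the contig lengths below are hypothetical toy numbers, not taken from the bonobo assembly), one sorts contigs from longest to shortest and reports the length at which the running total first reaches half of the assembled bases:

```python
def n50(lengths):
    """Return the N50: the contig length L such that contigs of
    length >= L together hold at least half of all assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs given")


# Hypothetical toy contig lengths (arbitrary units), for illustration only.
toy_contigs = [50, 40, 30, 20, 10]
print(n50(toy_contigs))  # prints 40
```

A larger N50 means that more of the assembly sits in long contigs, which is why the value is reported alongside the total assembly size.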

Fig. 1 — a, Schematic of the Mhudiblu_PPA_v0 assembly depicting the centromere location (red rhombus), FISH probes used to create assembly backbone (black dots), fixed bonobo-specific insertions (blue) and deletions (red) (Supplementary Data), remaining gaps (black horizontal lines) and large-scale inversions (arrows). We distinguish bonobo-specific inversions (dark orange, PPA) from *Pan*-specific inversions (dark green, PTR-PPA). b, FISH validation of the bonobo chromosome 2a and 2b fusion and the 2b pericentric inversion (probes: RP11-519H15 in red, RP11-67L14 in green, RP11-1146A22 in blue, RP11-350P7 in yellow) (top left); the chromosome 9 pericentric inversion (probes: RP11-1006E22 in red, RP11-419G16 in green, RP11-876N18 in blue, RP11-791A8 in yellow) (top right); and the inversion Strand-seq_chr7_inv4a (probes: RP11-118D11 in green, WI2-3210F8 in red, RP11-351B3 in blue) (bottom).

Extended Data Fig. 1 — Processing steps to create the reference sequences Mhudiblu_PPA_v0, Mhudiblu_PPA_v1 and Mhudiblu_PPA_v2.

Extended Data Fig. 2 — a–c, Pairwise sequentially Markovian coalescent (PSMC) plots based on an analysis of Illumina WGS genomes of 10 bonobos (a; red), 10 chimpanzees (b; green) and 7 gorillas (c; blue). The y axis represents the effective population size (N_e) (×10⁴) inferred by the PSMC and the x axis represents the time in years. N_e values and time are scaled with generation time g = 25 years and a mutation rate of μ = 1.2 × 10⁻⁸ per bp per generation¹⁶. d, Values in boxes refer to median and 95% confidence interval N_e (×10⁴) values inferred through PSMC analysis considering bonobo (red boxes) and chimpanzee (purple boxes). We extracted size estimates from time intervals between 4 and 7 million years ago for the *Homo*–*Pan* *N*_e and between 1 and 2.5 million years ago for the *P. paniscus*–*P. troglodytes* *N*_e, considering μ = 0.5 × 10⁻⁹ mutations per bp per year and a generation time of 25 years. Values using μ = 1 × 10⁻⁹ mutations per bp per year are reported in Supplementary Data.
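The scaling mentioned in this caption can be sketched as follows. The function below follows the conventional PSMC rescaling (N0 = θ0 / (4μ) / s, time = 2·N0·t·g, N_e = λ·N0); the function name, the default bin size of 100 bp and the example inputs are assumptions for illustration, not values reported in the paper. Note also that 1.2 × 10⁻⁸ per generation divided by 25 years gives 4.8 × 10⁻¹⁰ per year, close to the 0.5 × 10⁻⁹ per-year rate quoted for panel d.

```python
def psmc_scale(theta0, t_k, lambda_k, mu=1.2e-8, g=25, bin_size=100):
    """Convert PSMC's scaled output into years and effective population size.

    theta0   : scaled mutation rate per bin reported by PSMC
    t_k      : scaled time of interval k (units of 2*N0 generations)
    lambda_k : relative population size in interval k
    mu       : mutation rate per bp per generation (value from the caption)
    g        : generation time in years (value from the caption)
    bin_size : bases per PSMC bin (assumed default of 100)
    """
    n0 = theta0 / (4 * mu) / bin_size   # reference population size
    years = 2 * n0 * t_k * g            # scaled time -> years
    n_e = lambda_k * n0                 # relative size -> absolute size
    return years, n_e


# Per-year rate implied by the per-generation rate and generation time.
mu_per_year = 1.2e-8 / 25   # 4.8e-10, close to 0.5e-9

# Hypothetical inputs chosen so that N0 is 10,000.
years, n_e = psmc_scale(theta0=0.048, t_k=1.0, lambda_k=2.0)
```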

Gene annotation

We predict 22,366 full-length protein-coding genes and 9,066 noncoding genes using the NCBI Eukaryotic Genome Annotation Pipeline. We also generated 867,690 full-length bonobo cDNAs (Supplementary Table 8) and applied the Comparative Annotation Toolkit¹¹ to identify 20,478 protein-coding and 36,880 noncoding bonobo gene models; 99.5% of the protein-encoding models show no frameshift errors¹² and 38.4% of the protein-coding isoforms are now more complete. We identify 119 genes that have potential frameshifting insertions or deletions that disrupt the primary isoform relative to the human reference (GRCh38) (Supplementary Table 9). Respectively, 206 and 1,576 protein-coding genes are part of gene families that contracted or expanded in the bonobo genome compared to the human genome (Supplementary Tables 10, 11). We identify 65 putatively previously undescribed exons with support from full-length cDNA (Supplementary Tables 12–14), such as the protein-coding exon in ANAPC2, which is found in the bonobo but not in the chimpanzee sequence (Supplementary Fig. 2). Using other great ape genomes^13,14 and a genome-wide analysis from 20 bonobo and chimpanzee samples, we identified genes that showed an excess of amino acid replacement, balancing selection and potential selective sweeps (Tajima’s D and SweepFinder2)¹⁵. Most of the genes that showed selective sweeps in bonobo (DIRC1, GULP1 and ERC2) (Supplementary Tables 15–18) or chimpanzee (KIAA040, TM4SF4 and FOXP2) (Supplementary Tables 19–22) genomes are novel.

Mobile element insertions

The number of full-length (retrotransposition-competent), lineage-specific long interspersed nuclear element-1 (L1) copies in the bonobo genome (413 chimpanzee-specific L1 elements (L1Pt)) is similar to that in the chimpanzee genome (383 L1Pt) and 15–25% greater than the number of elements in the human genome (330 human-specific L1 elements (L1Hs)) (Supplementary Figs. 3–5). An analysis of Alu short interspersed nuclear element (SINE) repeats leads to a refined subfamily classification and we find that the number of bonobo-specific elements (n = 1,492) is nearly identical to that in the chimpanzee genome (n = 1,431). Pan lineages, therefore, show among the lowest rates of Alu insertions compared to the human genome (in which the rate has doubled) and the rhesus macaque genome (which shows a tenfold increased rate) (Extended Data Fig. 3). Although the bonobo genome shows a reduced genetic diversity of single-nucleotide variants^7,16 compared to the chimpanzee genome, we find that bonobo SINE–variable number tandem repeat (VNTR)–Alu (SVA) elements are more copy number polymorphic (45%) (Extended Data Fig. 3) than those in the chimpanzee genome (35%; P < 6.5 × 10⁻⁴). By contrast, the chimpanzee-specific endogenous retrovirus (PtERV1) shows a similarly low rate of polymorphism in both species (7% for bonobo and 9% for chimpanzee), which suggests relatively little activity since the divergence of Pan (Supplementary Data).

Segmental duplications

We identified 87.4 Mb of segmental duplications (≥1 kilobase (kb) and ≥90% identity) (Extended Data Fig. 3, Supplementary Figs. 6, 7 and Supplementary Table 23), most of which was previously unassembled. Segmental duplications are interspersed with an excess of large (≥10 kb) intrachromosomal duplications, which is consistent with the burst of segmental duplications that occurred at the root of the hominid lineage¹⁷. Despite the approximately sixfold improvement, the largest and most identical duplications were still not initially resolved (around 84 Mb). Using the Segmental Duplication Assembler algorithm^18,19, we successfully resolved an additional 56 Mb (Supplementary Table 24) and used these data to identify recent gene family expansions (Extended Data Fig. 4 and Supplementary Tables 25–31). We show, for example, that the eukaryotic translation initiation factor 4 subunit A3 (EIF4A3) gene family has expanded in both chimpanzee and bonobo genomes. There is evidence that five out of the six paralogues are expressed and encode a full-length open-reading frame (Fig. 2 and Extended Data Fig. 5). We estimate that the initial EIF4A3 gene duplication occurred in the ancestral lineage approximately 2.9 million years ago. It then subsequently expanded and experienced gene conversion events independently in the chimpanzee and bonobo lineages, creating five and six copies of the EIF4A3 gene family, respectively. Notably, some of the gene conversion signals correspond to a set of specific amino acid changes in the basic ancestral structure that are now common to only chimpanzee and bonobo (Fig. 2 and Extended Data Fig. 5).
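The operational definition of a segmental duplication used above (at least 1 kb long and at least 90% identical) amounts to a simple two-threshold filter. The sketch below is illustrative only: the `PairwiseHit` record and the example hits are hypothetical, and real pipelines apply this filter to genome-wide self-alignments after repeat masking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairwiseHit:
    """Hypothetical record for one pairwise alignment between two loci."""
    name: str
    aligned_bp: int     # aligned length in base pairs
    identity: float     # fraction of identical aligned bases, 0..1


def is_segmental_duplication(hit, min_bp=1_000, min_identity=0.90):
    """Apply the >=1 kb and >=90% identity thresholds from the text."""
    return hit.aligned_bp >= min_bp and hit.identity >= min_identity


# Hypothetical alignments, for illustration only.
hits = [
    PairwiseHit("long_high_identity", 25_000, 0.985),
    PairwiseHit("too_short", 800, 0.99),
    PairwiseHit("too_divergent", 12_000, 0.87),
]
duplications = [h.name for h in hits if is_segmental_duplication(h)]
print(duplications)  # prints ['long_high_identity']
```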

Extended Data Fig. 4 — a, *Pan*-specific duplication of the *CLN3* locus and bonobo-specific deletion of *IGFL1*. HiFi read depth and whole-genome shotgun detection of bonobo, chimpanzee, orangutan, gorilla and human individuals relative to GRCh38 detect these events (top), which are validated by interphase FISH of each species using fosmid clones spanning the region (bottom). b, *Pan*-specific duplication of the *EIF3C* locus and bonobo-specific deletion of *SAMD9*. HiFi read depth and whole-genome shotgun detection of bonobo, chimpanzee, orangutan, gorilla and human individuals relative to GRCh38 detect these events (top), which are validated by interphase FISH of each species using fosmid clones spanning the region (bottom). Genomes were included from the following individuals (from top to bottom): bonobo (Pan_paniscus_A915_Kosana, A927_Salonga, A922_Catherine, A917_Dzeeta, A918_Hermien, A924_Chipita, A926_Natalie, A928_Kumbuka, A914_Hortense, A919_Desmond, A925_Bono); chimpanzee (Pan_troglodytes_troglodytes_A958_Doris, A957_Vaillant, A960_Clara, Pan_troglodytes_verus_Clint); orangutan (Pongo_abelii_A950_Babu, Pongo_pygmaeus_A944_Napoleon); gorilla (Gorilla_gorilla_gorilla_KB4986_Katie); human (AFR_Aari_ETAR005_F, AMR_Nahua_Mex20_M, EA_Mongola_HGDP01228_M, SA_Kalash_HGDP00328_M, WEA_FinlandFIN_HG00360_M).

Fig. 2 — a, Multiple sequence alignment shows *EIF4A3* amino acid differences between the human, Mhudiblu_PPA and chimpanzee assembled paralogues, and sequences of other great apes. A polymorphic 18-bp motif VNTR is located at the 5′ UTR of nonhuman primate *EIF4A3* and accounts for most of the differences between various isoforms. A phylogenetic tree is built from neutral sequences of *EIF4A3* paralogues using Bayesian phylogenetic inference. This analysis is conducted using BEAST2 software. Numbers on each major node denote estimated divergence time. Ma, million years ago. The blue error bar on each node indicates the 95% confidence interval of the age estimation. Bayesian posterior probabilities are reported using asterisks for nodes with posterior probability >99%. b, FISH on metaphase chromosomes and interphase nuclei with human probe WI2-3271P14 confirms an *EIF4A3* subtelomeric expansion of chromosome 17 in bonobo and chimpanzee relative to human, gorilla and orangutan.

Extended Data Fig. 5 — a, A comparison of *EIF4A3* copy number among great apes based on a sequence-read-depth analysis confirms a variable copy number expansion in the bonobo and chimpanzee lineages (9–33 diploid copies). This recent duplication was not fully resolved initially in the bonobo reference genome (Mhudiblu_PPA_v0) because high-identity duplicated sequences were collapsed. b, Bonobo Iso-Seq full-length transcript reads map with higher identity to four of the paralogues compared to Mhudiblu_PPA_v0. c, Contigs that encompass *EIF4A3* expansions and 100 kb of the flanking regions were assembled using bonobo and chimpanzee PacBio HiFi data. The 12-kb genomic sequence of human *EIF4A3* was mapped onto the assembled contigs. Six tandem copies of *EIF4A3* spanning 310 kb in bonobo and five tandem copies spanning 262 kb in chimpanzee are recovered. Schematics show structural differences in *EIF4A3* in primate genomes. Grey, black and striped arrows show different alignment blocks across the samples. A solid line connecting alignment blocks indicates an insertion event. d, Paralogues are expressed and show evidence of gene conversion in both bonobo and chimpanzee lineages. Analysis of bonobo Iso-Seq data confirms that five of the six *EIF4A3* copies are expressed and maintain an open-reading frame (heat map indicates the number of Iso-Seq transcripts supporting each copy; command: minimap2 -ax splice -G 3000 -f 1000 --sam-hit-only --secondary=no --eqx -K 100M -t 20 --cs -2 | samtools view -F 260). GENECONV software (version 1.81a) shows significant signals (P ≤ 0.05 after multiple-test correction) of gene conversion for 16 out of 67 kb of the paralogous locus (grey bars). The multiple sequence alignment was performed using MAFFT version 7.453 (command: mafft --adjustdirection [input.fasta] > [output.msa_fasta]). A subset of gene conversion events overlap with sites of amino acids that are specific to the *Pan* lineage. Triangles indicate the sites of amino acid change in each of the primate genomes compared to GRCh38. Different colours mark different changes: purple marks phenylalanine to leucine; yellow marks arginine to cysteine; red marks serine to arginine; teal marks tyrosine to serine. The same phylogenetic tree as in Fig. 2 is reshaped to show the inferred evolutionary relationships among the paralogues. Nodes with >99% Bayesian posterior probabilities are indicated by asterisks; otherwise the actual number is shown. e, A phylogenetic tree was constructed from 16-kb noncoding *EIF3C* paralogues using Bayesian phylogenetic inference. This analysis was conducted using BEAST2 software. Numbers in bold on each major node denote estimated divergence time. The other numbers (not bold) indicate posterior probabilities. The blue error bar on each node indicates the 95% confidence interval of the age estimation. Asterisks mark nodes with posterior probability >99%. f, Gene models for transcribed loci based on Iso-Seq data (top). Human *EIF3C* and *EIF3CL* are compared to predicted open-reading frames for bonobo paralogues and Liftoff gene predictions for chimpanzee, orangutan and gorilla paralogues from contigs assembled from HiFi reads (bottom).

Structural variation and gene disruption

As part of the assembly curation, we validated nine larger inversions that distinguish human and bonobo karyotypes, created a FISH-based chromosomal backbone (Fig. 1) and used single-cell DNA template strand sequencing (Strand-seq) to assign orphan contigs to chromosomes (36 Mb) (Mhudiblu_PPA_v1) (Supplementary Tables 32–38). We identify 17 fixed inversions that differentiate bonobo from chimpanzee, of which 11 are bonobo-specific (Supplementary Table 39) and 22 regions that probably represent bonobo inversion polymorphisms (Supplementary Table 40). Moreover, we assign 38 fixed inversions that occurred in the common Pan ancestor (Supplementary Table 39). We annotated and validated the breakpoint intervals of each tested inversion (Supplementary Table 41) and found segmental duplications or long interspersed nuclear elements at the breakpoints of inversions in 82% and 86% of cases, respectively (Supplementary Table 40). We also compared the bonobo genome to the human, chimpanzee and gorilla genomes to identify deletions and insertions (>50 base pairs (bp)). We classify 15,786 insertions and 7,082 deletions as bonobo-specific and genotyped these in a population of great ape samples^7,16,20 to identify 3,604 fixed insertions and 1,965 fixed deletions, of which only a small fraction (2.66% or 148 out of 5,569) intersect with genic functional elements (Supplementary Tables 42–45).

Bonobo-specific events that delete ENCODE regulatory elements²¹ (n = 381), for example, are enriched in membrane-associated genes with extracellular domains whereas chimpanzee-specific events (n = 187) are associated with cadherin-related genes (Supplementary Table 46). Deletions (n = 1,040) shared between the chimpanzee and bonobo genomes show an enrichment of the loss of putative regulatory elements associated with post-synaptic genes (3.32 enrichment; P = 1.2 × 10⁻⁷) and pleckstrin homology-like domains (6.15 enrichment; P = 1.20 × 10⁻⁹). We validate 110 events that disrupt protein-coding genes by generating high-fidelity genomic sequencing for each of the great ape reference genomes and restricting to those events that could be genotyped in a population of genomes (Supplementary Data). As expected, many fixed gene-loss events occurred in genes that are tolerant to mutation, redundant duplicated genes or genes in which the event simply altered the structure of the protein. For example, we validate a 25.7-kb gene loss of one of the keratin-associated genes (KRTAP19-6) associated with hair production in the ancestral lineage of chimpanzee and bonobo (Supplementary Fig. 8). In the bonobo lineage, we identify five fixed structural variants that affect protein-coding genes (Supplementary Table 47), only two of which completely ablate the gene. For example, LYPD8, which encodes a secreted protein that prevents invasion of the colonic epithelium by Gram-negative bacteria, has been completely deleted by a 24.3-kb bonobo-specific deletion. Similarly, SAMD9 (SAMD family member 9) is a fixed gene loss in bonobo as a result of a 41.46-kb bonobo-specific deletion. The other three bonobo-specific fixed structural variant events in protein-coding regions all maintain the open-reading frame, including a 49-amino acid deletion of ADAR1, which encodes a protein that is critical for RNA editing and is implicated in human disease^22–24 (Extended Data Fig. 6).

Extended Data Fig. 6 — a, Size distribution of fixed (left) and polymorphic (right) structural variant (SV) insertions and deletions in the bonobo genome for structural variants of 50–1,000 bp (top) or >1,000 bp (bottom) in length. Events are deemed to be specific to the bonobo lineage based on copy number genotyping against a panel of 27 ape genomes and a threshold of F_ST > 0.8 to define fixed events in bonobo. Modes are observed corresponding to full-length L1 (6 kb) and Alu (300 bp) mobile elements and are predominantly insertions reflecting the homoplasy-free nature of this class of mutation. b, A small fixed deletion predicts a 49 amino acid deletion in *ADAR1* in the bonobo lineage. RefSeq *ADAR1* structure is shown (top) compared with the Iso-Seq coverage of gorilla, human, chimpanzee and bonobo (middle). The protein alignment (bottom) shows that an in-frame deletion is created. c, A 24.3-kb fixed deletion results in the complete loss of *LYPD8* in bonobo. Gene structure, duplication and repeat annotations are shown with respect to gorilla, human, chimpanzee and bonobo genomes. A lineage-specific duplication adjacent to *LYPD8* is present in the gorilla genome (large grey triangles). d, A 41.5-kb fixed deletion mediated by directly orientated L1 repeats ablates *SAMD9* leaving only *SAMD9L* in the bonobo lineage. e, Short-read whole-genome shotgun detection genotyping shows that *LYPD8* was lost in the bonobo lineage. f, Short-read whole-genome shotgun detection genotyping shows *SAMD9* was lost in the bonobo lineage.
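The caption above defines fixed events by an F_ST threshold of 0.8. The exact estimator is not stated here, so the sketch below assumes the simplest textbook form for one biallelic variant in two equally weighted populations, F_ST = (H_T − H_S) / H_T, where H_S is the mean within-population heterozygosity and H_T the heterozygosity of the pooled allele frequency. The function names and example frequencies are hypothetical.

```python
def fst_two_populations(p1, p2):
    """Wright-style F_ST for one biallelic variant, given the variant
    allele frequency in each of two equally weighted populations.
    This estimator is an assumption; the source does not name one."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)                     # pooled heterozygosity
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    if h_total == 0:
        return 0.0   # variant absent or fixed everywhere: no differentiation
    return (h_total - h_within) / h_total


def is_fixed(p_bonobo, p_other, threshold=0.8):
    """Apply the F_ST > 0.8 cutoff quoted in the caption."""
    return fst_two_populations(p_bonobo, p_other) > threshold


# Hypothetical allele frequencies, for illustration only.
print(is_fixed(1.0, 0.0))   # prints True  (completely differentiated)
print(is_fixed(0.6, 0.4))   # prints False (frequencies nearly shared)
```

Under this form, a variant present in every bonobo chromosome and absent elsewhere gives F_ST = 1, while a variant at equal frequency in both groups gives 0.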

A comparison of ILS in hominids

The higher quality and more contiguous nature of the bonobo genome provide an opportunity to generate a higher-resolution ILS map. In comparison to the original bonobo assembly in which only around 800 Mb (27%) could be analysed, it is now possible to align approximately 76% of the genome in a four-way ape genome alignment (2,357 Mb within 10-kb windows) (Supplementary Table 48) owing to long-read genome assemblies¹⁴. We performed a genome-wide phylogenetic window-based analysis to systematically identify regions that are inconsistent with the species tree and classified these as human–bonobo and human–chimpanzee ILS topologies (Fig. 3). We predict that 5.07% of the human genome is genetically closer to chimpanzee or bonobo (Table 1); 2.52% of the human genome is more closely related to the bonobo genome (human–bonobo ILS segments) than the chimpanzee genome whereas 2.55% of the human genome is more closely related to the chimpanzee genome (human–chimpanzee ILS) than the bonobo genome (Fig. 3a). This proportion of ILS is roughly 1.5-fold higher than previous estimates (3.3%)¹ (Supplementary Table 1). Consistent with previous observations¹, the largest ILS segments are biased (around 1.8-fold) to intergenic regions, depleted for genes (>35%) and are particularly enriched in L1 content. Notably, the distribution of ILS segments is highly non-random based on simulation experiments. We specifically measured the distance between ILS segments (see below) and identified a subset (around 26%) of sites that are significantly more clustered than expected by chance (Fig. 3b).
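To make the window classification concrete, the sketch below assigns one aligned window to the species tree or to one of the two ILS topologies. It is a deliberately simplified stand-in: it counts parsimony-informative sites with gorilla as the outgroup rather than inferring a tree per window as the analysis described above does, and the function name, labels and toy sequences are all hypothetical.

```python
from collections import Counter


def classify_window(human, chimp, bonobo, gorilla):
    """Label an aligned window by the grouping most sites support.

    A site is informative when two of human/chimp/bonobo share a base
    that differs from gorilla while the third matches gorilla.
    """
    support = Counter()
    for h, c, b, g in zip(human, chimp, bonobo, gorilla):
        if any(base in "-N" for base in (h, c, b, g)):
            continue                                  # skip gaps and unknowns
        if c == b != g and h == g:
            support["species_tree"] += 1              # (G, (H, (C, B)))
        elif h == b != g and c == g:
            support["human_bonobo_ILS"] += 1          # (G, ((B, H), C))
        elif h == c != g and b == g:
            support["human_chimp_ILS"] += 1           # (G, ((H, C), B))
    if not support:
        return "uninformative"
    ranked = support.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "unresolved"                           # tie between topologies
    return ranked[0][0]


# Toy four-species window: human and bonobo share the derived base at the last site.
print(classify_window("ACGA", "ACGT", "ACGA", "ACGT"))  # prints human_bonobo_ILS
```

Summing the lengths of windows labelled with either ILS topology and dividing by the total aligned length would give a percentage comparable in spirit to those reported in Table 1.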

Fig. 3 — a, A whole-genome ILS cladogram analysis (left) for bonobo–human (red) and chimpanzee–human (blue) and a schematic map (right) of clustered ILS segments (500-bp resolution) specifically for chromosomes 3, 4 and 7. The lighter density plot represents the clustered ILS events mapping to intragenic regions, whereas the vertical lines represent the subset that overlap with protein-coding exons. b, Distribution of distances between ILS segments (inter-ILS) (500-bp resolution) compared with a simulated (null) expectation (from 400,000 simulations) reveals a bimodal pattern with a subset (26%) that is clustered and significantly non-randomly distributed. A two-sample Wilcoxon rank-sum test was used to calculate the P value in R. c, ILS exons show a significant excess of amino acid replacement (dN/dS) for both human–bonobo (H–B; red line; P = 0.004778) and human–chimpanzee (H–C; blue line; P = 0.03924) ILS. In particular, exons mapping to the ILS clustered segments (b) show the most significant excess of amino acid replacements dN/dS (dotted purple line; P = 0.001015) compared to the genome-wide null distribution (grey density plot). This shift is not observed for the non-clustered ILS segments (NC ILS; dotted black line; P = 0.3161). Significance was analysed using the one-sample Student’s t-test in R. The silhouette of the chimpanzee in a is created by T. Michael Keesey and Tony Hisgett (http://phylopic.org/; image is under a Creative Commons Attribution 3.0 Unported licence); silhouettes of bonobo and gorilla are from http://phylopic.org/ under a Public Domain Dedication 1.0 licence.
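The inter-ILS comparison in panel b can be sketched as follows: compute distances between adjacent segment starts, then build a null distribution by placing the same number of segments uniformly at random. This is a toy model under stated assumptions: uniform placement on a single linear sequence, hypothetical coordinates, and a plain median comparison in place of the Wilcoxon rank-sum test used for the figure.

```python
import random
from statistics import median


def inter_segment_distances(starts):
    """Distances between adjacent segment start coordinates."""
    ordered = sorted(starts)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def simulate_null_distances(n_segments, genome_length, n_sim, seed=0):
    """Pool inter-segment distances from n_sim uniform random placements."""
    rng = random.Random(seed)
    pooled = []
    for _ in range(n_sim):
        starts = rng.sample(range(genome_length), n_segments)
        pooled.extend(inter_segment_distances(starts))
    return pooled


# Hypothetical segment starts: two tight clusters on a 1-Mb toy chromosome.
observed_starts = [1_000, 1_500, 2_000, 2_500, 900_000, 900_500, 901_000]
observed = inter_segment_distances(observed_starts)
null = simulate_null_distances(len(observed_starts), 1_000_000, n_sim=200)

# A much smaller observed median points towards clustering.
print(median(observed), median(null))
```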

Table 1.

Hominid genome-wide ILS estimates

Window size	Number of ILS segments, (G, ((B, H), C))	Number of ILS segments, (G, ((H, C), B))	Percentage of ILS, (G, ((B, H), C))	Percentage of ILS, (G, ((H, C), B))	Total ILS^a	GC^a	Intergenic/intragenic	Alu^a	L1^a	Exon^a
20 kb	218	218	0.19	0.19	0.38	37.7	1.79	6.37	31.44	0.49
10 kb	1,143	1,138	0.49	0.48	0.97	38.39	1.73	7.35	27.08	0.47
5 kb	4,314	4,373	0.91	0.92	1.83	38.95	1.64	7.85	24.67	0.58
2 kb	18,218	18,334	1.52	1.53	3.05	39.58	1.49	8.71	21.51	0.72
1 kb	46,584	46,938	2.06	2.07	4.13	40.06	1.37	9.8	19.85	0.8
500 bp	102,197	103,338	2.52	2.55	5.07	40.54	1.33	11.24	18.66	0.75
Genome average						40.89	1.21	10.17	17.42	1.17


B, bonobo; C, chimpanzee; G, gorilla; H, human. (G, ((B, H), C)) and (G, ((H, C), B)) represent two different ILS topologies. Intergenic/intragenic indicates the intergenic to intragenic ratio.

^aContent is shown as a percentage; the GC, Alu, L1 and exon contents are based on the GRCh38 genome.

We focused specifically on protein-coding exons based on the human RefSeq annotation²⁵ and identified 1,446 exons that mapped to ILS topologies (713 exons to a human–bonobo topology and 733 exons to a human–chimpanzee topology) (Supplementary Table 49). As a whole, genes corresponding to these ILS exons are significantly enriched in both glycoprotein function (P = 1.30 × 10⁻¹⁴ for human–bonobo and P = 5.60 × 10⁻¹¹ for human–chimpanzee) and calcium-binding epidermal growth factor (EGF) domain function (P = 4.40 × 10⁻¹² for human–bonobo and P = 9.40 × 10⁻⁷ for human–chimpanzee) (Supplementary Table 50). We considered multiple occurrences in the same gene and identified 84 genes with at least two exons under ILS (Supplementary Table 51) with some enrichment in photoreceptor activity (P = 1.6 × 10⁻⁴) (Supplementary Table 51 and Supplementary Fig. 9) as well as EGF-like (P = 1.9 × 10⁻⁶) and transmembrane (P = 2.4 × 10⁻³) functions. Overall, we observe a significant excess of amino acid replacement (dN/dS) for all 1,446 ILS exons compared to non-ILS exons (P = 0.0048 for human–bonobo, P = 0.039 for human–chimpanzee) (Fig. 3c), which is consistent with either the action of relaxed selection or positive selection. Exons mapping to the clustered ILS segments show greater dN/dS with respect to exons in the non-clustered ILS segments, which suggests that these clustered ILS segments are contributing disproportionately to accelerated amino acid evolution in the hominid genome.

We extended the ILS analysis (Supplementary Data) across 15 million years of hominid evolution through the inclusion of genome data from orangutan and gorilla. As expected, ILS estimates for the human genome increase to more than 36.5% (Extended Data Fig. 7 and Supplementary Table 52) similar to (albeit still greater than) previous estimates^3,14. We measured the inter-ILS distance and observed a consistent non-random pattern of clustered ILS for these deeper topologies with more ancient ILS showing an even greater proportion of clustered sites (Extended Data Fig. 7). Once again, we observe a significantly increased mean dN/dS in clustered human–chimpanzee and human–bonobo topologies (P < 2.2 × 10⁻¹⁶, mean = 0.366) as well as clustered orangutan–human and orangutan–gorilla–human topologies (P < 2.2 × 10⁻¹⁶, mean = 0.316) compared to the null distribution (Supplementary Fig. 10). A Gene Ontology analysis²⁶ of the genes that intersect these combined data confirms not only the most significant signals for immunity (for example, glycoprotein (P = 1.3 × 10⁻²⁵) and immunoglobulin-like fold/FN3 (P = 2.4 × 10⁻²⁰)), but also genes related to EGF signalling (P = 1.6 × 10⁻¹³), solute transporter function (for example, transmembrane region (P = 1.3 × 10⁻²⁵)) and, specifically, calcium transport (P = 3.7 × 10⁻⁸) (Supplementary Table 53). Although ILS regions, in general, show diversity patterns of single-nucleotide polymorphisms that are consistent with balancing selection, it is noteworthy that both clustered and non-clustered ILS exons show a significant excess of polymorphic gene-disruptive events that are consistent with the action of relaxed as well as balancing selection (Supplementary Fig. 11). An examination of these gene-rich clustered ILS regions reveals a complex pattern of diverse ILS topologies that suggests deep coalescence operating across specific regions of the human genome as has previously been reported for the major histocompatibility complex^1,3 (Extended Data Fig. 8).

Extended Data Fig. 7 — The distance between adjacent ILS segments (inter-ILS) (500-bp resolution) was calculated and the distribution was compared to a simulated expectation based on a random distribution. The analysis reveals a bimodal (and possibly an emerging trimodal) pattern in which a distinct subset of ILS segments are clustered (that is, clustered ILS sites). Four different topologies were considered. a, A (orangutan, (((bonobo, chimpanzee), gorilla), human)) ILS topology in which 31.58% of inter-ILS is clustered is shown. b, A (orangutan, ((bonobo, chimpanzee), (gorilla, human))) ILS topology in which 33.5% is clustered is shown. c, A (orangutan, (((bonobo, human), chimpanzee), gorilla)) ILS topology in which 8.14% is clustered is shown. d, A (orangutan, ((bonobo, (chimpanzee, human)), gorilla)) ILS topology in which 9.89% of sites is clustered is shown. e, An example of a cluster of human–bonobo (red triangles) and human–chimpanzee (blue triangles) ILS corresponding to a group of genes. A four-species alignment of one exon from *EGF* (exon 5) is shown with a nominal signal of positive selection.

Extended Data Fig. 8 — a, The four main ILS topologies are colour-coded. The four colour lines representing ILS segments are shown above the chromosome coordinate (GRCh38). The clustered ILS segments are shown above the four colour lines (black). The MHC region (red bar) corresponds to genomic coordinates on chromosome 6: 28510120–33480577. b, A magnified view of the MHC region (chromosome 6: 32786501–33103000) depicting clustered ILS nearby *HLA* genes. c, Nucleotide diversity of bonobo (green) and chimpanzee (blue) is shown based on human genomic coordinates (GRCh38, chromosome 6: 25000000–29000000). The mean (dashed line) is shown for bonobo (mean = 4.45 × 10⁻⁴) and chimpanzee (mean = 9.35 × 10⁻⁴). A region of reduced diversity (grey) is shown that corresponds to a segmental duplication in which single-nucleotide polymorphisms were excluded due to potential mismapping. d, Same as c but merged onto the same scale and highlighting five regions (red arrows) in which diversity is reduced in bonobo compared to chimpanzee. Three of these correspond to previously identified regions¹; however, they are not among the top 1% of genome candidates showing positive selection by Tajima’s D and SweepFinder2¹⁵. The overall diversity of single-nucleotide polymorphisms is reduced across the region in bonobo compared to chimpanzee.
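For readers unfamiliar with the statistic plotted in panels c and d, nucleotide diversity is commonly computed as the mean number of pairwise differences per site among sampled haplotypes. The sketch below shows that standard definition on toy sequences; the function name and the input format (equal-length aligned strings) are assumptions, and it does not reproduce the windowing or filtering used for the figure.

```python
from itertools import combinations

def nucleotide_diversity(haplotypes):
    """Mean pairwise differences per site among equal-length aligned haplotypes."""
    if len(haplotypes) < 2:
        raise ValueError("at least two haplotypes are required")
    length = len(haplotypes[0])
    if length == 0 or any(len(h) != length for h in haplotypes):
        raise ValueError("haplotypes must be non-empty and of equal length")
    pairs = list(combinations(haplotypes, 2))
    # Count mismatching positions for every pair of haplotypes.
    total_diffs = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total_diffs / (len(pairs) * length)

# Toy alignment (hypothetical): three haplotypes over four sites.
toy = ["AAAA", "AAAT", "AATT"]
print(nucleotide_diversity(toy))
```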

Discussion

High-quality hominid genomes are a critical resource for understanding the genetic differences that make us human as well as the diversification of the Pan lineage over the past two million years of evolution. The bonobo represents the last of the great ape genomes to be sequenced using long-read sequencing technology. Its sequence will facilitate more systematic genetic comparisons between human, chimpanzee, gorilla and orangutan without the limitations of technological differences in sequencing and assembly of the original reference^1,3–5,14. As a result, we now predict that a greater fraction (around 5.1%) of the human genome is genetically closer to chimpanzee or bonobo compared to previous studies (3.3%)¹. We estimate that more than 36.5% of the hominid genome shows ILS if we consider a deeper phylogeny that includes gorilla and orangutan. Notably, 26% of the ILS regions are clustered and exons that underlie these clustered ILS signals show elevated rates of amino acid replacement. These findings support a previous study in gorilla that showed a subtler correlation in which genes with higher dN/dS values are enriched in ILS segments³. In that study, however, the authors explained the observation as a result of stronger purifying selection in non-ILS sites or background selection that reduced the effective population size and, as a result, led to a depletion of ILS. Our genome-wide exon analyses specifically show that only a subset of clustered ILS exons is driving this effect and that these genes are enriched in glycoprotein and EGF-like calcium signalling functions owing to the action of either relaxed selection or positive selection of genes in these pathways (Supplementary Data).

Methods

We sequenced and assembled the genome of a single female bonobo (Mhudiblu, also known as Mhudibluy, who was obtained from the San Diego Zoo, ISIS 601152, born 15 April 2001 and who was later transferred to the Wuppertal Zoo in Germany where she was referred to as Muhdeblu) using long-read PacBio RS II sequencing chemistry and the Falcon genome assembler. The assembly was error-corrected using Quiver²⁷, Pilon²⁸ and an in-house FreeBayes-based²⁹ insertion or deletion correction pipeline optimized to improve continuous long-read assemblies¹⁴. We also generated Illumina whole-genome sequencing (WGS) data using the Illumina TruSeq PCR-Free library preparation kit. Genome assembly contigs were ordered and oriented into scaffolds using Bionano optical maps (Supplementary Table 54 and Supplementary Data) (HybridScaffolds suite, Bionano Genomics Saphyr platform) and four-colour FISH of 324 BAC clones. Cell lines from chimpanzee, bonobo, gorilla and orangutan were obtained from Coriell (S006007) or from a collection developed by M. Rocchi; no approval from ethics committees was required for use of these established lines. We assigned each contig and scaffold into unique groups corresponding to individual chromosomal homologues using SaaRclust^30,31 while applying Strand-seq to detect inversions, assign orphan contigs and orient contigs^32,33. To estimate genome-wide sequence accuracy, we applied Merqury³⁴ using Illumina WGS data. We also generated a bonobo large-insert BAC library (VMRC74) and selected at random 17 clones for complete PacBio insert sequencing³⁵. The Comparative Annotation Toolkit (CAT)¹¹ was used for genome annotation using human GENCODE v.33 and RNA-sequencing data.
We also generated more than 860,000 full-length non-chimeric transcripts from full-length isoform sequencing (Iso-Seq) data generated from induced pluripotent stem cell and derived neuronal progenitor cell lines³⁶ from bonobo sample AG05253 and we searched for gene structures split over multiple contigs (Supplementary Table 55). Repeat content of the assembled genome was analysed using RepeatMasker (RepeatMasker-Open-4.1.0) and the Dfam3 repeat library. We assigned lineage-specific Alu and full-length long interspersed nuclear element, SVA_D and PtERV elements to subfamilies by applying COSEG (http://www.repeatmasker.org/COSEGDownload.html) to determine the lineage-specific subfamily composition. For cross-species analysis of mobile element insertions (MEIs), we performed liftOver on the basis of the chains built from the Cactus whole-genome alignments generated during CAT annotation. For cross-assembly analyses of bonobo MEIs and a specific subset of other analyses (Supplementary Data), we used Bowtie 2 to map MEI flanking sequences between genomes. We estimated the duplication content in the bonobo assembly, applying the whole-genome analysis comparison method³⁷ and targeted collapsed duplications for assembly using Segmental Duplication Assembler¹⁹. Insertions and deletions were detected in bonobo, chimpanzee and gorilla using PBSV, Sniffles³⁸ and Smartie-sv¹⁴ and genotyped using Paragraph³⁹ against a panel of 27 Illumina WGS genomes. We searched for evidence of ILS among the chimpanzee, gorilla and human lineages, applying Prank (v.140110) to construct multiple sequence alignments and using the ete3 module to identify segments and exons under ILS (Supplementary Table 56).
For consistency, NCBI reference genome nomenclature has been used throughout the manuscript and corresponds to the following UCSC IDs (NCBI/UCSC): panpan1.1/panPan2, Mhudiblu_PPA_v0/panPan3, Clint_PTRv2/panTro6, Kamilah_GGO_v0/gorGor6, Susie_PABv2/ponAbe3 and GRCh38/hg38 (details of the methods used are provided in the Supplementary Data).
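The topology-based identification of ILS mentioned above can be sketched as follows. The study used the ete3 module; this illustration instead uses only the Python standard library, with rooted trees written as nested tuples, and all function names and labels are hypothetical. It classifies a local tree by asking which pair among human, chimpanzee and bonobo forms a clade: the bonobo–chimpanzee pairing matches the species tree, whereas a human–bonobo or human–chimpanzee pairing indicates ILS.

```python
def leaves(tree):
    """Leaf names under a node; a leaf is a string, an internal node a tuple."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def clades(tree):
    """All clades (leaf sets of internal nodes) in a rooted nested-tuple tree."""
    if isinstance(tree, str):
        return set()
    found = {leaves(tree)}
    for child in tree:
        found |= clades(child)
    return found

def classify_topology(tree):
    """Label a local tree as the species topology or one of two ILS topologies."""
    found = clades(tree)
    if frozenset({"bonobo", "chimpanzee"}) in found:
        return "species tree"
    if frozenset({"bonobo", "human"}) in found:
        return "ILS: human-bonobo"
    if frozenset({"chimpanzee", "human"}) in found:
        return "ILS: human-chimpanzee"
    return "unresolved"

# Topologies written as in the Extended Data Fig. 7 legend (panels a, c and d).
tree_a = ("orangutan", ((("bonobo", "chimpanzee"), "gorilla"), "human"))
tree_c = ("orangutan", ((("bonobo", "human"), "chimpanzee"), "gorilla"))
tree_d = ("orangutan", (("bonobo", ("chimpanzee", "human")), "gorilla"))

for tree in (tree_a, tree_c, tree_d):
    print(classify_topology(tree))
```

Note that this sketch labels only the relationships within the human, chimpanzee and bonobo group; the deeper topologies involving gorilla (as in panels a and b of Extended Data Fig. 7) would need additional clade checks.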

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.

Online content

Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at 10.1038/s41586-021-03519-x.

Supplementary information

Supplementary Information (3.8 MB, PDF)

The Supplementary Information file includes: Supplementary Figures 1-11, Supplementary Discussion, and Supplementary References.

Reporting Summary (74 KB, PDF)

Supplementary Data (9.4 MB, PDF)

Supplementary Tables (7 MB, XLSX)

The file contains Supplementary Tables 1-57.

Peer Review File (4.1 MB, PDF)

Acknowledgements

We thank L. Carbone and O. Ryder for providing the Mhudiblu bonobo cell line; R. Gage and C. Marchetto for the preparation of the RNA sequence and access to bonobo induced pluripotent stem cell lines; A. Rhie for assistance with Merqury analysis; and T. Brown for manuscript proofreading and editing. The silhouettes of the bonobo and gorilla in Fig. 3 and Extended Data Fig. 2 (and, additionally, the human silhouette in Extended Data Fig. 2) are downloaded from phylopic.org under a Public Domain Dedication 1.0 licence. The silhouette of the chimpanzee in Fig. 3 and Extended Data Fig. 2 is downloaded from phylopic.org under a Creative Commons Attribution 3.0 Unported licence (https://creativecommons.org/licenses/by/3.0/, credit to T. M. Keesey and T. Hisgett). This work was supported, in part, by National Institutes of Health (NIH) grants HG002385 and 1U24HG009081 to E.E.E.; Futuro in Ricerca 2010-RBFR103CE3 to M.V.; R01 GM59290 to M.A.B.; 2U41HG007234 to M.D. and B.P.; U01HG010961, U41HG010972, R01HG010485, U01HL137183 and 5U54HG007990 to B.P.; Arian Smit’s NHGRI grant RO1 HG002939 to J.M.S.; 5T32HG008345-04 to B.P.; R01HG010329-01 to S.R.S.; and European Regional Development Fund 2014-2020.4.01.16-0030 to F.M. E.E.E. is an investigator of the Howard Hughes Medical Institute. The work of J.H. and F.T.-N. was supported by the Intramural Research Program of the National Library of Medicine, National Institutes of Health. P.H. is supported by an NIH Pathway to Independence Award (NHGRI, K99HG011041).

Author contributions

K.M.M. and A.P.L. generated long-read sequencing data; C.B. created the BAC library (VMRC74); K.H., M.S., P.A.A., D.S.G., L.W.H., S.C.M., M.D., C.R.C., L.M., I.P., M.V., E.E.E. and D.P. completed the de novo assembly, its curation and quality assessment; P.C.D., R.L., Y.M., W.T.H., T.-H.H., D.S.G., M.V. and E.E.E. performed segmental duplication and gene family analyses; D.P. performed Strand-seq single-cell data analysis; C.R.C., I.P., L.M., F.A. and M.V. performed FISH analyses; A.W.C.P., J.L. and A.R.H. led the Bionano Genomics analyses; I.T.F., M.D., Y.M., M.H. and B.P. performed gene annotation and gene analyses; J.H. and F.T.-N. performed the RefSeq annotation; J.G.U. performed Iso-Seq; F.M., P.H. and Y.M. performed population genetic analyses; J.D.F., J.M.S., S.R.S., Y.M., J.A.W. and M.A.B. performed MEI analyses; Y.M. performed structural variant analyses; Y.M., A.S. and P.H. performed ILS analyses; M.V. and E.E.E. supervised the project and finalized the manuscript. All authors read and approved the manuscript.

Data availability

The Mhudiblu_PPA_v0 (GCA_013052645.1), Mhudiblu_PPA_v1 (GCA_013052645.2) and Mhudiblu_PPA_v2 (GCA_013052645.3) assemblies are deposited in the NCBI under BioProject accession number PRJNA526933. The raw PacBio continuous long-read, Strand-seq, Illumina and Iso-Seq data of bonobo are deposited in the NCBI under SRA accession number SRP188441. The Bionano map of bonobo Mhudiblu is deposited in the NCBI under BioProject accession number PRJNA526933. The raw PacBio HiFi data of bonobo Mhudiblu and gorilla Kamilah are deposited in the NCBI under SRA accession number SRP301932 under BioProject accession number PRJNA691628. The BACs used in this study are listed in Supplementary Table 57 in the NCBI with BioProject accession PRJNA634395.

Code availability

Custom scripts used in this study are available at GitHub (https://github.com/EichlerLab and https://github.com/MaoYafei).

Competing interests

J.G.U. is an employee of Pacific Biosciences. A.W.C.P., J.L. and A.R.H. are employees of Bionano Genomics. The other authors declare no competing interests.

Footnotes

Peer review information Nature thanks Marcela Uliano-Silva and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

These authors contributed equally: Yafei Mao, Claudia R. Catacchio

Contributor Information

Mario Ventura, Email: mario.ventura@uniba.it.

Evan E. Eichler, Email: eee@gs.washington.edu

Extended data

is available for this paper at 10.1038/s41586-021-03519-x.

Supplementary information

The online version contains supplementary material available at 10.1038/s41586-021-03519-x.

References

1. Prüfer K, et al. The bonobo genome compared with the chimpanzee and human genomes. Nature. 2012;486:527–531. doi: 10.1038/nature11128.
2. Takemoto H, Kawamoto Y, Furuichi T. How did bonobos come to range south of the Congo River? Reconsideration of the divergence of Pan paniscus from other Pan populations. Evol. Anthropol. 2015;24:170–184. doi: 10.1002/evan.21456.
3. Scally A, et al. Insights into hominid evolution from the gorilla genome sequence. Nature. 2012;483:169–175. doi: 10.1038/nature10842.
4. Locke DP, et al. Comparative and demographic analysis of orang-utan genomes. Nature. 2011;469:529–533. doi: 10.1038/nature09687.
5. The Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature. 2005;437:69–87. doi: 10.1038/nature04072.
6. Luo C, Tsementzi D, Kyrpides N, Read T, Konstantinidis KT. Direct comparisons of Illumina vs. Roche 454 sequencing technologies on the same microbial community DNA sample. PLoS ONE. 2012;7:e30087. doi: 10.1371/journal.pone.0030087.
7. Prado-Martinez J, et al. Great ape genetic diversity and population history. Nature. 2013;499:471–475. doi: 10.1038/nature12228.
8. Sudmant PH, et al. Global diversity, population stratification, and selection of human copy-number variation. Science. 2015;349:aab3761. doi: 10.1126/science.aab3761.
9. Logsdon GA, Vollger MR, Eichler EE. Long-read human genome sequencing and its applications. Nat. Rev. Genet. 2020;21:597–614. doi: 10.1038/s41576-020-0236-x.
10. Ventura M, et al. Gorilla genome structural variation reveals evolutionary parallelisms with chimpanzee. Genome Res. 2011;21:1640–1649. doi: 10.1101/gr.124461.111.
11. Fiddes IT, et al. Comparative Annotation Toolkit (CAT)-simultaneous clade and personal genome annotation. Genome Res. 2018;28:1029–1038. doi: 10.1101/gr.233460.117.
12. Stanke M, Diekhans M, Baertsch R, Haussler D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics. 2008;24:637–644. doi: 10.1093/bioinformatics/btn013.
13. Gordon D, et al. Long-read sequence assembly of the gorilla genome. Science. 2016;352:aae0344. doi: 10.1126/science.aae0344.
14. Kronenberg ZN, et al. High-resolution comparative analysis of great ape genomes. Science. 2018;360:eaar6343. doi: 10.1126/science.aar6343.
15. Pavlidis P, Alachiotis N. A survey of methods and tools to detect recent and strong positive selection. J. Biol. Res. (Thessalon.) 2017;24:7. doi: 10.1186/s40709-017-0064-0.
16. de Manuel M, et al. Chimpanzee genomic diversity reveals ancient admixture with bonobos. Science. 2016;354:477–481. doi: 10.1126/science.aag2602.
17. Marques-Bonet T, et al. A burst of segmental duplications in the genome of the African great ape ancestor. Nature. 2009;457:877–881. doi: 10.1038/nature07744.
18. Vollger MR, et al. Improved assembly and variant detection of a haploid human genome using single-molecule, high-fidelity long reads. Ann. Hum. Genet. 2020;84:125–140. doi: 10.1111/ahg.12364.
19. Vollger MR, et al. Long-read sequence and assembly of segmental duplications. Nat. Methods. 2019;16:88–94. doi: 10.1038/s41592-018-0236-3.
20. Sudmant PH, et al. Evolution and diversity of copy number variation in the great ape lineage. Genome Res. 2013;23:1373–1382. doi: 10.1101/gr.158543.113.
21. The ENCODE Project Consortium et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature. 2020;583:699–710. doi: 10.1038/s41586-020-2493-4.
22. Rice GI, et al. Mutations in ADAR1 cause Aicardi–Goutières syndrome associated with a type I interferon signature. Nat. Genet. 2012;44:1243–1248. doi: 10.1038/ng.2414.
23. Savva YA, Rieder LE, Reenan RA. The ADAR protein family. Genome Biol. 2012;13:252. doi: 10.1186/gb-2012-13-12-252.
24. Gallo A, Vukic D, Michalík D, O’Connell MA, Keegan LP. ADAR RNA editing in human disease; more to it than meets the I. Hum. Genet. 2017;136:1265–1278. doi: 10.1007/s00439-017-1837-0.
25. O’Leary NA, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016;44(D1):D733–D745. doi: 10.1093/nar/gkv1189.
26. Huang W, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protocols. 2009;4:44–57. doi: 10.1038/nprot.2008.211.
27. Chin CS, et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat. Methods. 2013;10:563–569. doi: 10.1038/nmeth.2474.
28. Walker BJ, et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS ONE. 2014;9:e112963. doi: 10.1371/journal.pone.0112963.
29. Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. Preprint at https://arxiv.org/abs/1207.3907 (2012).
30. Ghareghani M, et al. Strand-seq enables reliable separation of long reads by chromosome via expectation maximization. Bioinformatics. 2018;34:i115–i123. doi: 10.1093/bioinformatics/bty290.
31. Porubsky D, et al. Fully phased human genome assembly without parental data using single-cell strand sequencing and long reads. Nat. Biotechnol. 2021;39:302–308. doi: 10.1038/s41587-020-0719-5.
32. Falconer E, et al. DNA template strand sequencing of single-cells maps genomic rearrangements at high resolution. Nat. Methods. 2012;9:1107–1112. doi: 10.1038/nmeth.2206.
33. Sanders AD, et al. Characterizing polymorphic inversions in human genomes by single-cell sequencing. Genome Res. 2016;26:1575–1587. doi: 10.1101/gr.201160.115.
34. Rhie A, Walenz BP, Koren S, Phillippy AM. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 2021;21:245. doi: 10.1186/s13059-020-02134-9.
35. Huddleston J, et al. Reconstructing complex regions of genomes using long-read sequencing technology. Genome Res. 2014;24:688–696. doi: 10.1101/gr.168450.113.
36. Marchetto MC, et al. Species-specific maturation profiles of human, chimpanzee and bonobo neural cells. eLife. 2019;8:e37527. doi: 10.7554/eLife.37527.
37. Bailey JA, Yavor AM, Massa HF, Trask BJ, Eichler EE. Segmental duplications: organization and impact within the current human genome project assembly. Genome Res. 2001;11:1005–1017. doi: 10.1101/gr.GR-1871R.
38. Sedlazeck FJ, et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat. Methods. 2018;15:461–468. doi: 10.1038/s41592-018-0001-7.
39. Chen S, et al. Paragraph: a graph-based structural variant genotyper for short-read sequence data. Genome Biol. 2019;20:291. doi: 10.1186/s13059-019-1909-7.
