eLife. 2019 Jun 25;8:e42989. doi: 10.7554/eLife.42989

Haplotypes spanning centromeric regions reveal persistence of large blocks of archaic DNA

Sasha A Langley 1,2, Karen H Miga 3, Gary H Karpen 1,2, Charles H Langley 4
Editors: Magnus Nordborg 5, Diethard Tautz 6
PMCID: PMC6592686  PMID: 31237235

Abstract

Despite critical roles in chromosome segregation and disease, the repetitive structure and vast size of centromeres and their surrounding heterochromatic regions impede studies of genomic variation. Here we report the identification of large-scale haplotypes (cenhaps) in humans that span the centromere-proximal regions of all metacentric chromosomes, including the arrays of highly repeated α-satellites on which centromeres form. Cenhaps reveal deep diversity, including entire introgressed Neanderthal centromeres and equally ancient lineages among Africans. These centromere-spanning haplotypes contain variants, including large differences in α-satellite DNA content, which may influence the fidelity and bias of chromosome transmission. The discovery of cenhaps creates new opportunities to investigate their contribution to phenotypic variation, especially in meiosis and mitosis, as well as to more incisively model the unexpectedly rich evolution of these challenging genomic regions.

Research organism: Human

Introduction

The centromere is the unique chromosomal locus that forms the kinetochore, which interacts with spindle microtubules and directs segregation of replicated chromosomes to daughter cells (McKinley and Cheeseman, 2016). Human centromeres assemble on a subset of large blocks (many Mbps) of highly repeated (171 bp) α-satellite arrays found on all chromosomes. These repetitive arrays and the flanking heterochromatin (together the Centromere Proximal Regions or CPRs) play critical roles in the integrity of mitotic and meiotic inheritance (Janssen et al., 2018). In somatic tissues, chromosome instability, including loss and gain of chromosomes, plays large and complex roles in aging, cancer (Naylor and van Deursen, 2016), and human embryonic survival (McCoy, 2017). Sequence variation in CPRs can affect meiotic pairing (Dernburg et al., 1996; Karpen et al., 1996), kinetochore formation (Rosin and Mellone, 2017) and nonrandom segregation (Karpen et al., 1996). A large component of genetic disease stems from aneuploidies arising during meiosis (Nagaoka et al., 2012). Further, the unique asymmetry of transmission in female meiosis, where only one parental chromosome is transmitted, presents an opportunity for the evolution of strong deviations from mendelian segregation ratios (meiotic drive) (Pardo-Manuel de Villena and Sapienza, 2001; Chmátal et al., 2014). Recurrent meiotic drive is a potential cause of the evolutionarily rapid divergence of satellite DNAs and centromeric chromatin proteins (Malik and Henikoff, 2001; Talbert et al., 2004; Rudd et al., 2006; Malik and Henikoff, 2009), as well as the observed high levels of meiotic aneuploidy arising from the tradeoff between fidelity and drive (Zwick et al., 1999). However, the challenges inherent to assessing genomic variation in these repetitive and dynamic regions remain a significant barrier to incisive functional and evolutionary investigations.

Results

Recognizing the potential research value of well-genotyped diversity across human CPRs, we hypothesized that the low rates of meiotic exchange in these regions (Nambiar and Smith, 2016) might result in large haplotypes in populations, perhaps even spanning the α-satellite arrays. To test this, we examined the Single Nucleotide Polymorphism (SNP) linkage disequilibrium (LD) and haplotype variation surrounding the centromeres among the diverse collection of genotyped individuals in Phase 3 of the 1000 Genomes Project (Auton et al., 2015). Figure 1a depicts the predicted patterns of strong LD (red) and associated unbroken haplotypic structures surrounding the gap of unassembled satellite DNA of a metacentric chromosome. Unweighted Pair Group Method with Arithmetic Mean (UPGMA) clustering on 800 SNPs immediately flanking the chrX centromeric gap in males (Figure 1c) reveals a clear haplotypic structure that spans the gap and extends, as predicted, to a much larger region (≈7 Mbp, Figure 1b). Similar clustering of the imputed genotypes of females also falls into the same distinct high-level haplotypes (Figure 1—figure supplement 1). This discovery of the predicted haplotypes spanning CPRs (hereafter referred to as cenhaps) on chrX and most metacentric chromosomes (Figure 2) opens a new window into their evolutionary history and functional associations.
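The clustering step described above can be illustrated with a minimal sketch: pairwise hamming distances between haplotypes, followed by average-linkage (UPGMA) merging. This is not the authors' pipeline. The function names `hamming` and `upgma` and the four toy haplotypes are hypothetical, and an analysis of thousands of haplotypes would normally rely on an optimized library rather than this quadratic pure-Python loop.

```python
from itertools import combinations


def hamming(a, b):
    """Number of SNP positions at which two equal-length haplotypes differ."""
    if len(a) != len(b):
        raise ValueError("haplotypes must cover the same SNPs")
    return sum(x != y for x, y in zip(a, b))


def upgma(haplotypes):
    """Average-linkage clustering; returns a nested tuple of haplotype indices."""
    if not haplotypes:
        raise ValueError("need at least one haplotype")
    # Active clusters: id -> (subtree, number of haplotypes it contains).
    clusters = {i: (i, 1) for i in range(len(haplotypes))}
    dist = {
        frozenset((i, j)): float(hamming(haplotypes[i], haplotypes[j]))
        for i, j in combinations(range(len(haplotypes)), 2)
    }
    next_id = len(haplotypes)
    while len(clusters) > 1:
        # Pick the closest pair; ties are broken by the smaller cluster ids.
        pair = min(dist, key=lambda p: (dist[p], sorted(p)))
        a, b = sorted(pair)
        (tree_a, n_a), (tree_b, n_b) = clusters.pop(a), clusters.pop(b)
        del dist[pair]
        # Distance to the merged cluster is the size-weighted mean (UPGMA rule).
        for c in clusters:
            d_ac = dist.pop(frozenset((a, c)))
            d_bc = dist.pop(frozenset((b, c)))
            dist[frozenset((next_id, c))] = (n_a * d_ac + n_b * d_bc) / (n_a + n_b)
        clusters[next_id] = ((tree_a, tree_b), n_a + n_b)
        next_id += 1
    (tree, _size), = clusters.values()
    return tree


# Toy haplotypes (0 = major allele, 1 = minor allele); purely illustrative.
toy = ["0000", "0001", "1110", "1111"]
print(upgma(toy))  # ((0, 1), (2, 3))
```

In this toy case the two pairs of near-identical haplotypes merge first and the two resulting clades join last, which is the kind of deep split that separates diverged cenhaps in the dendrograms.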

Figure 1. Strong LD across centromeric gaps forms large-scale centromere-spanning haplotypes, or cenhaps.

A full resolution version of this figure is available as Figure 1—source data 2. (a) The predicted patterns of the magnitude of linkage disequilibrium (LD) (triangle at top) for a Centromere Proximal Region (CPR) in a metacentric human chromosome (bottom) in a large outbreeding population. Central blue bands represent clustered haplotypes expected if crossing-over declines to zero in and around the highly repeated α-satellite DNA (central assembly gap) and the SNP-rich flanking regions (light blue). (b) Triangle (top) shows the LD between pairs of 17702 SNPs (Left: chrX:55623011–58563685, Right: chrX:61725513–68381787; hg19) flanking the centromere and α-satellite assembly gap (red vertical line) from 1231 human male X chromosomes from the 1000 Genomes Project. The color maps (see adjacent legend) to the -log10(p) where the p value derives from the 2×2 χ2 for independence of alleles at each pair of SNPs. Below, a broad haplotypic representation of these same data. SNPs were filtered for minor allele count (MAC) ≥ 60, but not by 4gt_dco. Minor alleles shown in black. Poorly genotyped SNPs near edges of the gap (red line) were masked. Superpopulation (SP; AFRica, AMeRicas, East ASia, EURope, South ASia) and scaled estimate of chrX-specific α-satellite array size (AS) indicated at left side. Approximate position of HuRef chrX indicated by black asterisk at right of the tree. Dendrogram represents UPGMA clustering based on the hamming distance between haplotypes comprised of 800 filtered SNPs immediately flanking the centromere (Left: chrX:58374895–58563685, Right: chrX:61725513–61921419; hg19), indicated by red bar at bottom and shown in detail in c. The three most common X cenhaps are highlighted with colored vertical bars. (d) A UPGMA tree based on the synonymous divergence in 21 genes (see Figure 1—source data 1) in the three major chrX cenhaps (indicated in c), assuming the TMRCA of humans and chimps is 6.5MY. 
The bars at each node represent ±two standard deviations of distributions of estimated TMRCAs across the genes. Widths of the triangles are proportional to the log10 of number of members of each cenhap, and the height is proportional to the average divergence within each cenhap.
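As a rough illustration of the LD statistic described for panel b, the sketch below computes the 2×2 χ2 test of independence for the alleles at one pair of SNPs and returns -log10(p). It assumes no continuity correction and one degree of freedom, neither of which is stated in the legend; the function name `ld_neglog10_p` and the 0/1 allele coding are hypothetical.

```python
import math


def ld_neglog10_p(snp_a, snp_b):
    """-log10(p) of the 2x2 chi-square test of independence between two SNPs.

    snp_a and snp_b hold 0/1 alleles for the same haplotypes, in the same order.
    """
    n = len(snp_a)
    if n == 0 or n != len(snp_b):
        raise ValueError("SNP vectors must be non-empty and of equal length")
    # Counts of the four possible two-locus haplotypes.
    n11 = sum(1 for a, b in zip(snp_a, snp_b) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(snp_a, snp_b) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(snp_a, snp_b) if a == 0 and b == 1)
    n00 = n - n11 - n10 - n01
    margins = (n11 + n10, n01 + n00, n11 + n01, n10 + n00)
    if 0 in margins:
        return 0.0  # a monomorphic SNP carries no LD information
    chi2 = n * (n11 * n00 - n10 * n01) ** 2 / math.prod(margins)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return -math.log10(p) if p > 0 else math.inf


# Two perfectly associated SNPs versus two independent SNPs (toy data).
print(ld_neglog10_p([0, 0, 1, 1], [0, 0, 1, 1]))
print(ld_neglog10_p([0, 0, 1, 1], [0, 1, 0, 1]))
```

With only four haplotypes even perfect association gives a modest value; with the 1231 male X chromosomes analyzed here, strong LD produces far larger -log10(p) values.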

Figure 1—source data 1. The 21 chrX coding genes in the CPR (8 left and 13 right of the centromere gap) used in the UPGMA clustering and estimation of TMRCAs.
Gene models and alignments from Ensembl release 92 (April 2018). Numbers of sites divergent (human-chimp): div_sites. Numbers of sites polymorphic: polym_sites. Average nonsynonymous divergence: nonsyn_div. Average synonymous divergence: syn_div. Average nonsynonymous diversity: nonsyn_π. Average synonymous diversity: syn_π.
DOI: 10.7554/eLife.42989.007
Figure 1—source data 2. Full resolution version of Figure 1.
DOI: 10.7554/eLife.42989.008

Figure 1—figure supplement 1. X chromosome cenhaps from phased female data align with those from haploid males.

A full resolution version of this figure is available as Figure 1—figure supplement 1—source data 1. (a) Haplotypic representation of 17702 SNPs flanking the gap in the assembly where the centromere typically forms (Left: chrX:55623011–58563685, Right: chrX:61725513–68381787; hg19) in 2542 phased human female X chromosomes (1271 individuals) from the 1000 Genomes Project. SNPs were filtered for minor allele count (MAC) ≥ 60. Minor alleles shown in black. The assembly gap is indicated by the red line. Poorly genotyped SNPs near edges of the gap were masked (see Materials and methods). Superpopulation (SP; AFRica, AMeRicas, East ASia, EURope, South ASia) is indicated on the left side. Tree represents UPGMA clustering based on the hamming distance for haplotypes comprised of 800 SNPs immediately flanking the centromere, indicated by red bar at bottom and shown in detail in b.
Figure 1—figure supplement 1—source data 1. Full resolution version of Figure 1—figure supplement 1.
DOI: 10.7554/eLife.42989.004

Figure 1—figure supplement 2. Filtering of chrX CPR recombinants for CDS divergence, expected heterozygosity and TMRCAs.

A full resolution version of this figure is available as Figure 1—figure supplement 2—source data 1. To more reliably infer the average divergence in the CDSs in the region, the male X chromosome haplotypes in Figure 1b with apparent ancestral exchange in the CPR were filtered to yield a subset of 620. Haplotypic representation of 12458 SNPs flanking the gap in the assembly (Left: chrX:55623011–58563685, Right: chrX:61725513–68381787; hg19) in these 620 male X chromosomes. Minor alleles shown in black, assembly gap is indicated by red line. The three most common X cenhaps highlighted with colored vertical bars at right. The tree is based on the UPGMA clustering of hamming distance of 800 SNPs immediately flanking the centromere, as indicated by the red bar at bottom.
Figure 1—figure supplement 2—source data 1. Full resolution version of Figure 1—figure supplement 2.
DOI: 10.7554/eLife.42989.006

Figure 2. Cenhap diversity is found on many chromosomes.

A full resolution version of this figure is available as Figure 2—source data 2. SNPs were filtered for MAC ≥ 80 and passing the 4gt_dco with a tolerance of 0 (see Materials and methods). Minor alleles shown in black, assembly gap is indicated by red line. Panel (a) for each chromosome shows the diversity in a subset of SNPs immediately surrounding the gap. SNPs from panel a were used for UPGMA clustering based on the hamming distance (see Materials and methods and Figure 2—source data 1). Panel (b) for each chromosome is the haplotypic representation of SNPs in the CPR of each chromosome based on imputed genotypes from the 1000 Genomes Project (see Figure 2—source data 1 for coordinates), using the clustering as for panel (a). The red bar at the bottom of panel (b) shows the position of the clustering region depicted in (a). Superpopulation is indicated in bar at far left.
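The two SNP filters named in this legend can be sketched as follows. The minor allele count (MAC) filter is straightforward. The 4gt_dco filter and its tolerance are defined in the Materials and methods and are not reproduced here, so the second function shows only the classical four-gamete test on which such filters are usually built; that substitution is an assumption, and all function names are hypothetical.

```python
def minor_allele_count(snp):
    """Count of the rarer allele for one biallelic SNP coded as 0/1."""
    ones = sum(snp)
    return min(ones, len(snp) - ones)


def filter_by_mac(snps, min_mac):
    """Keep SNPs whose minor allele count is at least min_mac."""
    return [snp for snp in snps if minor_allele_count(snp) >= min_mac]


def four_gametes_present(snp_a, snp_b):
    """Classical four-gamete test for a pair of biallelic SNPs.

    True when all four two-locus haplotypes (00, 01, 10, 11) are observed,
    which implies recombination or recurrent mutation between the sites.
    This is NOT the authors' 4gt_dco filter, only the underlying idea.
    """
    return len(set(zip(snp_a, snp_b))) == 4


# Each inner list is one SNP scored across the same five toy haplotypes.
snps = [[0, 0, 0, 1, 1], [0, 0, 0, 0, 1], [0, 1, 0, 1, 1]]
print(filter_by_mac(snps, min_mac=2))
print(four_gametes_present([0, 0, 1, 1], [0, 1, 0, 1]))
```

A pair of SNPs that fails the four-gamete test is compatible with a single unrecombined genealogy, which is the property the cenhap analysis depends on.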

Figure 2—source data 1. Centromere-Proximal Regions examined.
The hg19 coordinates (p_begin to p_end and q_begin to q_end) of the CPRs in which SNPs in the 1000 Genomes (Phase 3) were investigated, panel b in Figure 2. Imputed haplotypes were UPGMA clustered based on filtered SNPs in a symmetrical central region immediately flanking the centromeric gap in the assembly (p_c to p_end and q_begin to p_c).
DOI: 10.7554/eLife.42989.010
Figure 2—source data 2. Full resolution version of Figure 2.
DOI: 10.7554/eLife.42989.011

The pattern of geographic differentiation across the inferred chrX CPR (Figure 1b,c) exhibits higher diversity in African samples, as observed throughout the genome (Auton et al., 2015). Despite being fairly common among Africans today, a distinctly diverged chrX cenhap (cenhap 1, highlighted in purple, Figure 1b,c) is rare outside of Africa. Examination of the haplotypic clustering and estimated synonymous divergence in the coding regions of 21 genes included in the chrX cenhap region (see Figure 1—source data 1) yields a parallel relationship among the three major cenhaps and an estimated Time of the Most Recent Common Ancestor (TMRCA) of ≈600 KYA (Figure 1d) for this most diverged example. While ancient segments have been inferred in African genomes (Hammer et al., 2011; Hsieh et al., 2016), this cenhap stands out as genomically (if not genetically) large. The presence of such polymorphic ancient cenhaps is inconsistent with the predicted hitchhiking effect of sequential fixation of new meiotically driven centromeres (Malik and Henikoff, 2001). Further, the detection of near-ancient segments spanning the centromere contrasts with the observation of substantially more recent ancestry across the remainder of chrX and with the expectation of reduced archaic sequences on chrX (Dutheil et al., 2015). The large block on the right in Figure 1b is comprised of SNPs in exceptionally high frequency in Africans. The synonymous divergence in coding genes in this block indicates it too is quite old (data not shown) and may share ancestry with the ancient African cenhap. Putative distal recombinants of this block are observed outside of Africa and may contribute to associations of SNPs in this region with a diverse set of phenotypes, including male pattern hair loss (Hagenaars et al., 2017) and prostate cancer (Al Olama et al., 2014).

This deep history of the chrX CPR raises the possibility of even more ancient lineages on other chromosomes, either derived by admixture with archaic hominins or maintained by balancing forces. Although putatively introgressed archaic segments in African genomes have been inferred from genome-wide demographic modeling (Hammer et al., 2011; Hsieh et al., 2016; Durvasula and Sankararaman, 2019; Speidel et al., 2019), ancient cenhaps could also persist within the ancestral population due to natural selection. The relatively recent origin of anatomically modern humans (AMHs) outside of Africa and the availability of Neanderthal and Denisovan genomes derived from fossil DNAs support more direct methods for detection of enrichment of archaic segments outside of Africa (Green et al., 2010). Such studies firmly establish genome-wide evidence of recent introgression into Eurasian populations of AMHs (Green et al., 2010; Patterson et al., 2012; Sankararaman et al., 2012; Prüfer et al., 2014; Sankararaman et al., 2014; Prüfer et al., 2017).

To identify likely Neanderthal or Denisovan introgressed cenhaps, we looked for highly diverged examples in non-African populations (see Figure 2) that shared a strong excess of derived alleles with those archaic hominids and not with African genomes, using chimpanzee as the outgroup (Green et al., 2010; Patterson et al., 2012; Prüfer et al., 2014; Prüfer et al., 2017). Applying this approach to the CPR of chr11 revealed a compelling example of Neanderthal introgression, which is illustrated in Figure 3a in the context of the seven most common chr11 cenhaps. The most diverged lineage contains a small basal group of primarily out-of-Africa genomes (cenhap 1, highlighted in green). This cenhap carries a large proportion of the derived alleles assigned to the Neanderthal lineage, DM/(DM+DN)=0.98, where DM is the cenhap mean number of shared Neanderthal Derived Matches and DN is the cenhap mean number of Neanderthal Derived Non-matches (Figure 3a, at left). The ratio DM/(DM+AN)=0.91, where AN is the number of Neanderthal-cenhap Non-matches that are Ancestral in the Neanderthal, suggests that this large cenhap lineage, including the centromere and satellite sequences, shared most of its evolutionary history with Neanderthals. This diverged cenhap is limited to populations outside of Africa, supporting the conclusion that it is an introgressed archaic centromere. Figure 3b shows these mean counts for each SNP class by cenhap group, confirming that the affinity to Neanderthals is slightly stronger than to Denisovans. A second basal lineage found principally in Africa (cenhap 2, highlighted in purple, Figure 3a) separates shortly after the inferred Neanderthal. It is unclear if this cenhap represents an introgression from a distinct archaic hominin in Africa or a surviving ancient lineage within the population that gave rise to AMHs.
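The SNP classes behind these ratios can be made concrete with a short sketch. Each site is scored by whether the archaic allele is derived or ancestral relative to the outgroup (D or A) and whether the haplotype matches the archaic allele (M or N). The paper reports cenhap means of these counts; the sketch scores a single haplotype for simplicity, and the function names and the five-site toy alignment are hypothetical.

```python
from collections import Counter


def classify_site(hap_allele, archaic_allele, outgroup_allele):
    """Return 'DM', 'DN', 'AM' or 'AN' for one SNP."""
    state = "A" if archaic_allele == outgroup_allele else "D"
    match = "M" if hap_allele == archaic_allele else "N"
    return state + match


def archaic_affinity(haplotype, archaic, outgroup):
    """Counts per class plus DM/(DM+DN) and DM/(DM+AN) for one haplotype."""
    counts = Counter(
        classify_site(h, a, o) for h, a, o in zip(haplotype, archaic, outgroup)
    )
    dm, dn, an = counts["DM"], counts["DN"], counts["AN"]
    derived_ratio = dm / (dm + dn) if dm + dn else float("nan")
    history_ratio = dm / (dm + an) if dm + an else float("nan")
    return counts, derived_ratio, history_ratio


# Toy alignment: haplotype, archaic genome and chimpanzee outgroup at 5 SNPs.
counts, derived_ratio, history_ratio = archaic_affinity("ACGTA", "ACGTT", "GCGAT")
print(dict(counts), derived_ratio, round(history_ratio, 3))
```

A ratio DM/(DM+DN) near 1 means the haplotype carries nearly all archaic-derived alleles, while DM/(DM+AN) near 1 means few sites separate the haplotype from the archaic lineage since their common ancestor.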

Figure 3. Archaic cenhaps are found in AMH populations.

A full resolution version of this figure is available as Figure 3—source data 3. (a) Haplotypic representation of 8816 SNPs from 5008 imputed chr11 genotypes from the 1000 Genomes Project (Left: chr11:50509493–51594084, Right: chr11:54697078–55326684; hg19). SNPs were filtered for MAC ≥ 35 and passing the 4gt_dco with a tolerance of three (see Materials and methods). Minor alleles shown in black and assembly gap indicated by red line. Haplotypes were clustered with UPGMA based on the hamming distance between haplotypes comprised of 1000 SNPs surrounding the gap (Left: chr11:51532172–51594084, Right: chr11:54697078–54845667; hg19, indicated by red bar at bottom). Superpopulation and cenhap partitioning are indicated by bars at far left. Log2 counts of DM (derived in archaic, shared by haplotype), DN (derived in archaic, not shared by haplotype) and AN (ancestral in archaic, not shared by haplotype) for each cenhap relative to Altai Neanderthal (NEA) and Denisovan (DEN) at left. Gray horizontal bar (top) indicates region included in analysis of archaic content; black bars indicate SNPs with data for archaic and ancestral states. (b) Bar plots indicating the mean and 95% confidence intervals of DM, DN, AM (ancestral in archaic, shared by cenhap) and AN counts for cenhap groups (as partitioned in a. and c.) relative to Altai Neanderthal and Denisovan genomes, using chimpanzee as an outgroup (Speidel et al., 2019). (c) Haplotypic representation, as above, of 21950 SNPs from 5008 imputed chr12 genotypes from the 1000 Genomes Project (Left: chr12:33939700–34856380, Right: chr12:37856765–39471374; hg19). SNPs were filtered for MAC ≥ 35. Haplotypes were clustered with UPGMA based on 1000 SNPs surrounding the gap (Left: chr12:34821738–34856670, Right: chr12:37856765–37923684; hg19). Bars at side, top and bottom same as in a. 
(d) A UPGMA tree based on the synonymous divergence for 37 genes in the seven major chr11 cenhaps (see Figure 3—source data 1), assuming the TMRCA of humans and chimpanzee is 6.5MY (see Materials and methods and legend for Figure 1d). The error bars at each node represent ±two standard deviations of distributions of estimated TMRCAs across the genes.

Figure 3—source data 1. The 37 chr11 coding genes in the CPR (2 left and 35 right of the centromere gap) used in the UPGMA clustering and estimation of TMRCAs.
Gene models and alignments from Ensembl release 92 (April 2018). Numbers of nonsynonymous differences in the two basal cenhaps (1, 2 and both, 1_&_2; see Figure 3) from the other cenhaps of the 5008 imputed chr11 CPR haplotypes (see Materials and methods). Numbers of sites divergent (human-chimp): div_sites. Numbers of sites polymorphic: polym_sites. Average nonsynonymous divergence: nonsyn_div. Average synonymous divergence: syn_div. Average nonsynonymous diversity: nonsyn_π. Average synonymous diversity: syn_π.
DOI: 10.7554/eLife.42989.019
Figure 3—source data 2. The eight chr8 coding genes in the CPR (8 left and 0 right of the centromere gap) used in the UPGMA clustering and estimation of TMRCAs.
Gene models and alignments from Ensembl release 92 (April 2018). Numbers of sites divergent (human-chimp): div_sites. Numbers of sites polymorphic: polym_sites. Average nonsynonymous divergence: nonsyn_div. Average synonymous divergence: syn_div. Average nonsynonymous diversity: nonsyn_π. Average synonymous diversity: syn_π.
DOI: 10.7554/eLife.42989.020
Figure 3—source data 3. Full resolution version of Figure 3.
DOI: 10.7554/eLife.42989.021

Figure 3—figure supplement 1. Region of chromosome 11 used for cenhap coding region divergence.

A full resolution version of this figure is available as Figure 3—figure supplement 1—source data 1. (a) Haplotypic representation of 38644 SNPs from 5008 imputed chr11 genotypes from the 1000 Genomes project (Left: chr11:46509551–51594084, Right: chr11:54695707–59326455; hg19). Green lines indicate the region used for analysis of divergence in coding regions (Left: 49952369, Right: 56643039; hg19). SNPs were filtered for MAC ≥ 35 and passing the 4gt_dco with a tolerance of three (see Materials and methods). Minor alleles are shown in black, and assembly gap is indicated by red line. Haplotypes were clustered with UPGMA based on the hamming distances between haplotypes comprised of 1000 SNPs surrounding the gap, indicated by the red bar at bottom. Superpopulation and cenhap partitioning is shown at left.
Figure 3—figure supplement 1—source data 1. Full resolution version of Figure 3—figure supplement 1.
DOI: 10.7554/eLife.42989.014

Figure 3—figure supplement 2. Evidence of an archaic cenhap within Africa on chromosome 8.

A full resolution version of this figure is available as Figure 3—figure supplement 2—source data 1. (a) Haplotypic representation of 14000 SNPs from 5008 imputed chr8 genotypes from the 1000 Genomes Project (Left: chr8:42178101–43838849, Right: chr8:46839548–49217022; hg19). SNPs were filtered for MAC ≥ 35 and passing the 4gt_dco with a tolerance of three (see Materials and methods). Minor alleles shown in black. Centromeric gap is indicated by red line. Haplotypes were clustered with UPGMA based on the hamming distances between haplotypes comprised of 1000 SNPs surrounding the gap (indicated by red bar at bottom). Superpopulation is indicated at left. (b) Filtered cenhaps show very little evidence of recombination and support archaic ancestry of a basal cenhap found in Africa. Cenhaps with putative exchange in their ancestry were filtered from the data in a by clustering SNPs on the low recombination regions on the left and right side of the gap separately (Left: chr8:42668082–43838849, Right: chr8:46839548–48639846, indicated by green and red lines; hg19). Left-side and right-side clades with little evidence of recombination were intersected to yield 1661 cenhaps used in downstream analysis of archaic contribution and TMRCA. Analysis of possible archaic descent was limited to an internal window of 10602 SNPs, indicated by green lines (chr8:43202774–47755914; hg19). At the left are log2 counts of DM (derived in archaic, shared by cenhap), DN (derived in archaic, not shared by cenhap) and AN (ancestral in archaic, not shared by cenhaps) based on the Altai Neanderthal (NEA) and Denisovan (DEN) sequence using chimpanzee as an outgroup (Prüfer et al., 2017). Gray bar at top indicates region included in analysis of archaic content and black bars indicate SNPs with data for archaic and outgroup state. Red bar at bottom indicates 1000 SNP region used for clustering.
(c) A UPGMA tree based on the synonymous divergence in eight genes (see Figure 3—source data 2) in the three major chr8 cenhaps, assuming the TMRCA of humans and chimps is 6.5MYA. The error bars at each node represent ±two standard deviations of distributions of estimated TMRCAs across the genes. (d) Bar plots indicating the mean and 95% confidence intervals of DM, DN, AM (ancestral in archaic, shared by cenhap) and AN counts for cenhap groups (as partitioned b) relative to the archaic genomes (Prüfer et al., 2017).
Figure 3—figure supplement 2—source data 1. Full resolution version of Figure 3—figure supplement 2.
DOI: 10.7554/eLife.42989.016

Figure 3—figure supplement 3. Evidence of archaic cenhap introgression on chromosome 10.

A full resolution version of this figure is available as Figure 3—figure supplement 3—source data 1. Haplotypic representation of 14000 SNPs from 5008 imputed chr10 genotypes from the 1000 Genomes Project (Left: chr10:37341777–39154888, Right: chr10:42354982–43762908; hg19). SNPs were filtered for MAC ≥ 35 and passing the 4gt_dco with a tolerance of three (see Materials and methods). Minor alleles are shown in black, and centromeric gap is indicated by red line. Haplotypes were clustered with UPGMA based on the hamming distance between haplotypes comprised of 1400 SNPs surrounding the gap (indicated by red bar at bottom). Superpopulation is indicated at left. Analysis of possible archaic descent was limited to an internal window of 7221 SNPs showing little evidence of exchange in the most centromere-distal regions, indicated by green lines. At the left are log2 counts of DM (derived in archaic, shared by cenhap), DN (derived in archaic, not shared by cenhap) and AN (ancestral in archaic, not shared by cenhaps) based on the Altai Neanderthal (NEA) and Denisovan (DEN) sequence using chimpanzee as an outgroup (Prüfer et al., 2017). Gray bar at top indicates region included in analysis of archaic content and black bars indicate SNPs with data for archaic and outgroup state.
Figure 3—figure supplement 3—source data 1. Full resolution version of Figure 3—figure supplement 3.
DOI: 10.7554/eLife.42989.018

The relatively large expanses of these chr11 cenhaps and unexpectedly sparse evidence of recombination could be explained by either relatively recent introgressions or cenhap-specific suppression of crossing over with other AMH genomes in this CPR (e.g., an inversion). As with chrX above, the clustering of cenhaps based on coding synonymous SNPs in an extended window containing 37 genes (Figure 3d, Figure 3—figure supplement 1) yields a congruent topology and estimates of TMRCAs for the two basal cenhaps of ~1 and ~0.5 MYA, consistent with relatively ancient origins. Among the 37 genes in the chr11 CPR are 34 of the ~300 known (Malnic et al., 2004) odorant receptors (ORs). Nonsynonymous SNPs in these chr11 ORs are associated with variation in human olfactory perception of particular volatile chemicals (Trimmer et al., 2019). Seventy-three amino acid replacement polymorphisms are observed in 25 of these ORs (Figure 3—source data 1). The vast majority (63) are on the lineages to the two putative archaic cenhaps. Indeed, 60 are on the lineage to the Neanderthal haplotype, suggesting this cenhap encodes Neanderthal-specific determinants of smell and taste. Similarly, in the second putative archaic African cenhap, seven of these ORs harbor ten amino acid replacements, of which only one is shared with cenhap 1 (see Figure 3—source data 1). The frequencies of the Neanderthal cenhap in Europe, South Asia and the Americas (0.061, 0.032 and 0.033 respectively), and of the second ancient cenhap in Africa (0.036), are sufficiently high that together they contribute ≈ 18% of the amino acid replacement diversity in these 34 ORs among the 1000 Genomes. Thus, a substantial and rather unique part of the variation in chemical perception among AMH may be contributed by these two ancient cenhaps.

The most diverged, basal clade in the chr12 CPR (Figure 3c, indicated in brown) is common in Africa, but, like the most diverged chrX cenhap, is not represented among the descendants of the out-of-Africa migrations (Bae et al., 2017). The great depth of the lineage of this cenhap is further supported by comparison to homologous archaic sequences (Green et al., 2010; Prüfer et al., 2014; Prüfer et al., 2017). Consistent with the hypothesis that this branch split off before that of Neanderthals/Denisovans, members of this cenhap share fewer matches with derived SNPs on the Neanderthal and Denisovan lineages (DM) and exhibit strikingly more ancestral non-matches (AN) than other chr12 cenhaps (see Figure 3b). This putatively archaic chr12 cenhap represents a large and obvious example of the potentially introgressed sequences within African populations inferred from model-based analyses of the distributions of sequence divergence (Hammer et al., 2011; Hsieh et al., 2016; Durvasula and Sankararaman, 2019). The small out-of-Africa cenhap nested within a mostly African subclade (indicated in blue in Figure 3c) appears to be a typical Eurasian archaic introgression with higher affinity to Neanderthals (DM/(DN + DM)=0.91 and DM/(DM +AN)=0.90) than to Denisovans (Figure 3b). This bolsters the conclusion that the basal African cenhap represents a distinctly older archaic lineage. Unfortunately, there are too few coding bases in this region to support confident estimation of the TMRCAs of these ancient chr12 cenhaps. Based on the numbers of SNPs underlying the cenhaps, this basal cenhap is twice as diverged as the apparent introgressed Neanderthal cenhap, placing the TMRCA at ~1.1 MYA, assuming the Neanderthal TMRCA was 575KYA (Prüfer et al., 2017). 
While there is no direct evidence of recent introgression, the large genomic scale of the most diverged chr12 cenhap (relative to apparent exchanges in other cenhaps) is consistent with recent admixture with an extinct archaic in Africa; yet, again, selective maintenance of ancient cenhaps with associated suppression of crossing over is an alternative explanation.
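The arithmetic behind the ~1.1 MYA estimate for the basal chr12 cenhap is a simple proportional scaling, sketched below under the assumption of a constant substitution rate (a strict molecular clock). The function name `clock_tmrca` is hypothetical; the inputs are the twofold relative divergence and the 575 KYA Neanderthal TMRCA quoted in the text.

```python
def clock_tmrca(divergence, calibration_divergence, calibration_time):
    """Scale a calibration time by relative divergence (strict-clock assumption)."""
    if calibration_divergence <= 0:
        raise ValueError("calibration divergence must be positive")
    return divergence / calibration_divergence * calibration_time


# Basal chr12 cenhap: twice as diverged as the introgressed Neanderthal cenhap,
# whose TMRCA is taken as 575 KYA.
print(clock_tmrca(2.0, 1.0, 575.0))  # 1150.0 (KYA), i.e. roughly 1.1 MYA
```

The same proportional scaling would apply to the gene-based estimates if synonymous divergence between cenhaps were compared with human-chimpanzee divergence and a 6.5 MY calibration, although the exact estimator used for those trees is described in the Materials and methods.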

Chromosomes X, 11 and 12 harbor a diversity of large cenhaps, including those representing archaic lineages. Notably, the CPRs of other chromosomes include diverged, basal lineages that are likely to be relatively old, if not archaic (Figure 2). Two examples are chromosome 8, containing an ancient cenhap limited to Africa with an estimated TMRCA of ~730 KYA (Figure 3—figure supplement 2), and chr10 that appears to harbor another clear Neanderthal cenhap introgression (Figure 3—figure supplement 3). Obtaining genomic sequence from African archaics would shed light on the evolutionary origins of the ancient cenhaps not associated with Neanderthal and Denisovan introgression. It should be noted that the very large genomic sizes of these ancient cenhaps could allow identification of archaic homology even with modest genomic sequence coverages from archaic fossils.

These SNP-based cenhaps portray a rich, highly structured view of the diversity in the unique segments flanking repetitive regions. While the divergence of satellites may be dynamic on a shorter time scale (Smith, 1976), we reasoned that the paucity of exchange in these regions would create cenhap associations with satellite divergence in both sequence and array size. Miga et al. (2014) generated chromosome-specific graphical models of the α-satellite arrays and reported a bimodal distribution in estimated chrX-specific α-satellite array (DXZ1) sizes (Willard et al., 1983) for a subset of the 1000 Genomes males. Figure 1b extends this observation to the entire data set. The cumulative distributions of estimated array sizes of the three common chrX cenhaps designated in Figure 1c show substantial differences (Figure 4a). α-satellite array sizes in cenhap-homozygous females are parallel to males, and imputed cenhap heterozygotes are intermediate, as expected. Similarly, Figure 4b shows an even more striking example of variation in array size between cenhap homozygotes on chr11, and Figure 4c demonstrates that heterozygotes of the two most common cenhaps are reliably intermediate in size. While we confirmed that reference bias does not explain the observed cenhaps with large array size on chrX and chr11 (see Materials and methods, Figure 1b, Figure 4b and Figure 4—figure supplement 1), it is a potential explanation for particular instances of cenhaps with small estimated array sizes, for example the relatively low chrX-specific α-satellite content in the highly diverged African cenhap (see Figure 1b,c and Figure 4a, cenhap 1, highlighted in purple). Importantly, our results demonstrate that cenhaps do robustly tag a substantial component of the genetic variation in array size.
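The comparison of array sizes among cenhap genotypes rests on empirical cumulative distributions. A minimal sketch of that computation follows; the array-size values are invented for illustration, the function names are hypothetical, and the halving of female values follows the normalization stated in the Figure 4 legend.

```python
def ecdf(values):
    """Empirical cumulative distribution: sorted (value, fraction <= value) pairs."""
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one value")
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]


def normalize_female(array_size):
    """Halve a diploid female estimate so it is comparable to hemizygous males."""
    return 0.5 * array_size


# Invented array-size estimates (arbitrary scaled units), one list per group.
male_cenhap = [4.0, 1.0, 3.0, 2.0]
female_homozygotes = [normalize_female(v) for v in [6.0, 4.0]]

print(ecdf(male_cenhap))
print(ecdf(female_homozygotes))
```

Plotting such curves for each cenhap genotype shows whether one distribution is shifted relative to another, and whether heterozygotes fall between the two homozygous classes.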

Figure 4. Cenhaps differ in α-satellite array size.

A full resolution version of this figure is available as Figure 4—source data 1. (a) Empirical cumulative densities (ecdf) of chrX α-satellite array size for cenhap homozygotes and heterozygotes (see Figure 1b for cenhap designations). 1_2 and 1_3 heterozygotes were excluded due to insufficient data. Female (F) values were normalized (x 0.5) to facilitate plotting with hemizygote male (M) data. (b) Haplotypic representation of 1000 SNPs from 1546 imputed chr11 genotypes from 773 cenhap-homozygous individuals. SNPs were filtered for MAC ≥ 35 and passing the 4gt_dco with a tolerance of 3. Minor alleles shown in black. Assembly gap indicated by red line. Superpopulation (SP) and scaled chr11-specific α-satellite array size (AS) at left. Cenhap partitions at right; most common cenhap ‘3’ and cenhap with larger mean array size ‘4’ are highlighted. Most probable HuRef cenhap genotypes are indicated by black asterisks at right. (c) Empirical cumulative density of array size for chr11 cenhap (from b) homozygotes (3_3 and 4_4) and heterozygotes (3_4).

Figure 4—source data 1. Full resolution version of Figure 4.
DOI: 10.7554/eLife.42989.025

Figure 4—figure supplement 1. HuRef’s chr11 cenhap genotype.

A full resolution version of this figure is available as Figure 4—figure supplement 1—source data 1. (a) Haplotypic representation of 1000 SNPs flanking the gap from 1546 imputed haploid chr11 genotypes from 773 cenhap-homozygous individuals. SNPs were filtered for MAC ≥ 35 and passing the 4gt_dco with a tolerance of 3. Minor alleles are shown in black. Assembly gap is indicated by the red line. Scaled chr11-specific α-satellite array size (AS) at left. Cenhap partitions at right; most common cenhap ‘3’, the cenhap with largest mean array size (‘4’) and an additional likely HuRef cenhap ‘5’ are highlighted. (b) Box plots of the distributions of numbers of non-matching SNPs between HuRef and the indicated cenhap genotypes. The same 8816 SNPs as in Figure 3 were genotyped in both the 1000 Genomes and HuRef. HuRef cenhap genotype is most likely 3_5 and does not involve the cenhap with the largest mean array size.
Figure 4—figure supplement 1—source data 1. Full resolution version of Figure 4—figure supplement 1.
DOI: 10.7554/eLife.42989.024

Discussion

The potential impact of sequence variation in CPRs and their associated satellites on centromere and heterochromatin functions has been long recognized but difficult to study (Pardo-Manuel de Villena and Sapienza, 2001). Both binding of the centromere-specific histone, CENPA (Sullivan et al., 2011), and kinetochore size (Iwata-Otsubo et al., 2017) are known to scale with the size of arrays and to fluctuate with sequence variation in satellite DNAs (Aldrup-MacDonald et al., 2016). Through these interactions with kinetochore function and other roles for heterochromatin in chromosome segregation (Dernburg et al., 1996; Karpen et al., 1996; Peng and Karpen, 2008), α-satellite array variations can affect mitotic stability in human cells (Sullivan et al., 2017), as well as meiotic drive systems in the mouse (Chmátal et al., 2014). Meiotic drive has been proposed as the likely explanation for the saltatory divergence of satellite sequences and the excess of nonsynonymous divergence of several centromere proteins, some of which interact directly with the DNA (Malik and Henikoff, 2001; Talbert et al., 2004; Malik and Henikoff, 2009). However, the high levels of haplotypic diversity and deep cenhap lineages observed (Figure 2) conflict with the predictions of a naïve turnover model based on recurrent strong directional selection yielding sequential fixation of new centromeric haplotypes. Indeed, the levels of synonymous diversity, πs, in the few coding genes in the CPR of chrX, 0.00062 (0.00043–0.00128) and chr11, 0.00128 (0.00088–0.00217), are not different from levels of diversity in non-CPR regions (Dutheil et al., 2015). The genes in the CPR of chr8 show a considerably lower mean πs, 0.00010 (0.00007–0.00019); but we note there are only eight genes and their mean divergence from Pan orthologs is also low (Figure 3—source data 2). 
The inherent frequency-dependence of meiotic drive (Charlesworth and Hartl, 1978), associative overdominance (Ohta, 1971), a likely tradeoff between meiotic transmission bias and the fidelity of segregation of driven centromeres (Zwick et al., 1999), and the expected impact of unlinked suppressors (Charlesworth and Hartl, 1978) are plausible forces that would mitigate the impact of hitchhiking and background selection on the levels of standing polymorphism in CPRs.

The identification of human cenhaps raises new questions about the evolution of these unique genomic regions, but also provides the resolution and framework necessary to quantitatively address them. Our results transform large, previously obscure and shunned genomic regions into genetically rich and tractable resources, revealing unexpected diversity, including immense ancient CPRs, several of which are apparent Neanderthal introgressions. Most importantly, cenhaps can now be investigated for associations with variation in evolutionarily important chromosome functions, such as meiotic drive (Meyer et al., 2012) and recombination (Nambiar and Smith, 2016), as well as disease-related functions, such as aneuploidy in the germline (Nagaoka et al., 2012) and in development (McCoy, 2017), cancer and aging (Naylor and van Deursen, 2016).

Materials and methods

Identification of cenhaps in 1000 genomes

SNPs from the 1000 Genomes (Phase 3) (Auton et al., 2015) data (ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/) were examined for linkage disequilibria and haplotypic structure in the Centromere-Proximal Regions of each chromosome (see Figure 2—source data 1).

4gt_dco filter of putative genotyping errors

While the imputation used in 1000 Genomes (Phase 3) typically yielded calls in the CPR that fit clearly into the cenhaps, occasionally, in particular regions on particular chromosomes, the called haplotypes appear random (do not associate with the larger scale haplotypes). To filter such SNPs, we applied the 4gt_dco algorithm based on the following rationale. In a region of very low exchange, genotyping errors can appear as apparent gene conversions at a single site (double crossovers) in the context of a sample comprised of clear haplotypes. In a set of homologous genomic sequences, randomly sampled from an outbreeding diploid population, the equilibrium scale of linkage disequilibrium is approximately 1/[4N(r + g)], where N is the diploid population size, r is the rate of crossing over, and g is the rate of gene conversion (see Song et al., 2007). If we assume a selectively neutral infinite sites model and that the genomic scale of crossing over is much larger than that of gene conversion (r ≫ g, and the gene conversion tract length is also small), both gene conversion and genotyping errors can be inferred based on a simple test, the observation of all four gametotypes at two linked loci. 4gt_dco is positive, and the focal SNP is filtered, if all four possible two-locus, two-allele gametes are observed between the focal SNP and either of a pair of flanking SNPs (on opposite sides), while the flanking SNPs do not exhibit all four gametes between themselves. We applied 4gt_dco across the target genomic regions from 5’ to 3’ in successively larger windows of surrounding, surviving flanking SNPs. In preliminary analyses (results not shown) we found that applying 4gt_dco first in a window of ±10, then ±20, ±30 and finally ±40 flanking SNPs (as yet unfiltered) can eliminate SNPs that are not well represented in centromeric haplotypes (cenhaps). 
We also found that incorporating ‘tolerance’ (maximum number of pairs of flanking SNPs in a window failing before filtering the focal SNP) improved the performance of 4gt_dco. On the X chromosome, the data were too sparse to support the 4gt_dco test. Instead, small regions of contiguous unreliable genotyping at the edges of the assembly flanking the centromeric gap were hard masked (chrX:5856368–61725513). For chromosomes 8, 10 and 11 we applied the 4gt_dco with a tolerance of 3.
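
The logic of the test described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names, the 0/1 allele coding, the reading of "either" as a logical or, the rule that a focal SNP is dropped once the number of failing flanking pairs exceeds the tolerance, and the choice to evaluate each pass against the SNPs surviving the previous pass (rather than strictly sequentially from 5’ to 3’) are all assumptions made for the example.

```python
from itertools import product

def four_gametes(a, b):
    """True if all four two-locus gametes (00, 01, 10, 11) are observed.

    a and b are equal-length sequences of 0/1 alleles, one entry per haplotype.
    """
    return len(set(zip(a, b))) == 4

def fails_4gt_dco(snps, focal, window, tolerance=0):
    """Flag the focal SNP if more than `tolerance` flanking pairs fail.

    A pair (one SNP on each side of the focal SNP, within +/- window) fails
    when the focal SNP shows all four gametes with either member of the pair
    while the two flanking SNPs do not show all four gametes with each other.
    """
    left = range(max(0, focal - window), focal)
    right = range(focal + 1, min(len(snps), focal + window + 1))
    failing = 0
    for i, j in product(left, right):
        focal_conflict = (four_gametes(snps[focal], snps[i])
                          or four_gametes(snps[focal], snps[j]))
        if focal_conflict and not four_gametes(snps[i], snps[j]):
            failing += 1
    return failing > tolerance

def filter_snps(snps, windows=(10, 20, 30, 40), tolerance=3):
    """Return indices of SNPs surviving successive passes with wider windows."""
    keep = list(range(len(snps)))
    for w in windows:
        current = [snps[k] for k in keep]  # only SNPs that survived so far
        keep = [k for pos, k in enumerate(keep)
                if not fails_4gt_dco(current, pos, w, tolerance)]
    return keep
```

In this sketch a SNP at the edge of the region has no flanking pair on one side and is therefore never flagged; how such edge cases were treated in the original analysis is not stated.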

Haplotype clustering and visualization

To examine haplotypic structure of CPRs in the filtered 1000 Genomes data, we used UPGMA cluster analysis based on the Hamming distance between haplotypes comprised of the indicated central subsets of SNPs flanking the assembly gap. Resultant dendrograms were cut to generate cenhap groups. In some instances, dendrograms were cut at multiple heights to isolate groups of interest. Haplotypic representations were plotted in R using the gplots package.
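
As an illustration of the clustering step, the following Python sketch performs average-linkage (UPGMA) clustering on Hamming distances and stops merging once the smallest average distance between clusters exceeds a chosen cut height, which is equivalent to cutting the dendrogram at that height. The function names and the cut-height parameter are hypothetical, the original analysis was carried out in R, and the naive cubic-time loop here is only suitable for small examples.

```python
from itertools import combinations

def hamming(a, b):
    """Number of SNP positions at which two haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))

def upgma_groups(haplotypes, cut_height):
    """Average-linkage clustering of haplotypes, stopped at cut_height.

    Returns a sorted list of clusters, each a sorted list of haplotype indices.
    """
    clusters = [[i] for i in range(len(haplotypes))]

    def avg_dist(c1, c2):
        # UPGMA distance: mean of all between-cluster pairwise distances.
        total = sum(hamming(haplotypes[i], haplotypes[j])
                    for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        (a, b), d = min(
            (((a, b), avg_dist(clusters[a], clusters[b]))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if d > cut_height:
            break  # every remaining merge lies above the cut
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (a, b)] + [merged]
    return sorted(clusters)
```

Calling the function with several cut heights mimics cutting a dendrogram at multiple heights to isolate groups of interest.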

Inference of introgression of Neanderthal and/or Denisovan cenhaps

The boundaries of highly diverged cenhaps were determined by excluding flanking regions with apparent exchanges in the history of the 1000 Genomes. 1000 Genomes SNPs (MAC ≥ 35 and passing the 4gt_dco with a tolerance of three) from cenhap regions were classified relative to Altai Neanderthal and Denisovan assemblies (using Pan troglodytes as an outgroup) as DM (derived in archaic, match in the imputed haplotype), DN (derived in archaic, no match in haplotype), AM (ancestral in archaic and matching the haplotype) and AN (ancestral in archaic, no match in haplotype, that is derived in the haplotype) (Prüfer et al., 2017). Bar plots, generated using ggplot2, depict the mean and standard error of each class for cenhap groups.
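
The four-way classification can be expressed compactly. The sketch below is illustrative only: the function names, the representation of alleles as single characters, and the decision to skip sites with a missing archaic or outgroup call are assumptions, not details taken from the original pipeline.

```python
from collections import Counter

def classify_site(archaic, haplotype, ancestral):
    """Classify one SNP as DM, DN, AM or AN.

    archaic   -- allele in the archaic (Neanderthal or Denisovan) genome
    haplotype -- allele carried by the imputed 1000 Genomes haplotype
    ancestral -- allele in the outgroup, taken here as the ancestral state
    Returns None when the archaic or outgroup allele is missing.
    """
    if archaic is None or ancestral is None:
        return None
    state = "D" if archaic != ancestral else "A"    # derived or ancestral in archaic
    match = "M" if haplotype == archaic else "N"    # haplotype matches archaic or not
    return state + match

def class_counts(archaic_alleles, haplotype_alleles, ancestral_alleles):
    """Tally the four classes along one haplotype, ignoring unclassified sites."""
    calls = (classify_site(a, h, o) for a, h, o in
             zip(archaic_alleles, haplotype_alleles, ancestral_alleles))
    return Counter(c for c in calls if c is not None)
```

A haplotype of archaic origin would be expected to show a high proportion of DM relative to DN sites in such a tally.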

Estimates of cenhap divergence

For estimates of synonymous and nonsynonymous divergence between, and diversity (πs) within, cenhaps (as well as divergence from Pan), coding sequences of genes located in the central regions of cenhaps (i.e., where there is little or no evidence of exchange in the descent) were chosen for analyses. We identified the transcript with the longest CDS for each coding cenhap gene in Gencode Release 27 (Harrow et al., 2012) annotations and extracted multi-fasta files for these CDS regions from the 1000 Genomes. The corresponding Ensembl (release 23) genes were used to retrieve orthologs in Pan troglodytes (if not available, then from Pan paniscus). Ensembl orthologous sequences were aligned using CLUSTAL_W on coding portions of the cDNAs. Small edits were introduced to facilitate the computations. Tables in Figure 1—source data 1 and Figure 3—source data 1 and 2 list the annotated coding genes for the CPR of chromosomes X, 11 and 8, respectively. Estimates of pairwise average nonsynonymous and synonymous divergence and diversity (expected heterozygosity) are based on method I of Nei and Gojobori (1986) and Aguadé et al. (1992). The 95% confidence limits for estimates of diversity were based on bias-corrected bootstrapping (R package bootstrap v2017.2). Estimates of the average divergence were derived from the UPGMA clustering. In cases where significant numbers (>10) of apparent recombinants between cenhaps were observed, these were identified and filtered from downstream cenhap group analyses of CDS divergence, expected heterozygosity and TMRCAs (Figure 1—figure supplement 2 and Figure 3—figure supplement 2).

Estimation of TMRCA for cenhaps

To assign estimates of the ages of TMRCAs, the MRCA of each gene in Homo and Pan was assumed to be 6.5 MYA (Dutheil et al., 2015). The estimates of TMRCAs of various cenhaps were calculated from the height of the relevant node on the UPGMA dendrograms based on the average divergence. Approximate confidence intervals of the TMRCAs were estimated as ±two standard deviations in the observed variation across genes in the estimated TMRCA at each node.
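
One way to read this calibration is as a simple molecular-clock scaling: if divergence accumulates linearly with time, the age of a node is the Homo–Pan calibration age multiplied by the ratio of the divergence at that node to the Homo–Pan divergence. The Python sketch below encodes that reading. The function names, the use of per-gene ratios, and taking the mean of the per-gene ages as the centre of the interval are assumptions for illustration; the text does not specify these details.

```python
from statistics import mean, stdev

HOMO_PAN_MRCA_MYA = 6.5  # calibration age stated in the text

def tmrca_mya(node_divergence, pan_divergence, calibration=HOMO_PAN_MRCA_MYA):
    """Scale a node's divergence to years under a strict molecular clock."""
    return calibration * node_divergence / pan_divergence

def tmrca_interval(per_gene_node, per_gene_pan):
    """Mean per-gene TMRCA with an approximate +/- 2 SD interval across genes.

    per_gene_node -- divergence at the cenhap node, one value per gene
    per_gene_pan  -- divergence from the Pan ortholog for the same genes
    Returns (centre, lower, upper) in millions of years.
    """
    ages = [tmrca_mya(n, p) for n, p in zip(per_gene_node, per_gene_pan)]
    centre = mean(ages)
    spread = 2 * stdev(ages)  # sample standard deviation across genes
    return centre, centre - spread, centre + spread
```

With few genes in a CPR the across-gene standard deviation is itself poorly estimated, so intervals obtained this way are best treated as rough guides.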

Chromosome-specific array size estimates

Array size estimates were generated using the publicly available mapping of 1000 Genomes sequencing data to GRCh38, including models of CEN regions for each chromosome (Zheng-Bradley et al., 2017). For each sample, counts were computed for reads mapping to CEN regions, either uniquely or to multiple sites on a specific chromosome. Chromosome-specific read counts were then normalized by the mean coverage of chr1 unique regions for the sample. We observed significant sample-to-sample variation in array size across chromosomes. Such variation might arise from differences in library preparation, sequencing technology and sequencing center for 1000 Genomes samples. To moderate this issue, we performed a second normalization, dividing estimated array sizes for each chromosome within a sample by the sample sum of array size over all chromosomes.
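
The two normalization steps can be written out for a single sample as follows. This is a sketch under stated assumptions: the function name and the dictionary layout are hypothetical, and the read counts and coverage are taken as already computed from the alignments.

```python
def normalize_array_sizes(cen_read_counts, chr1_unique_coverage):
    """Two-step normalization of CEN-mapping read counts for one sample.

    cen_read_counts      -- {chromosome: reads mapping to its CEN model}
    chr1_unique_coverage -- mean coverage of chr1 unique regions in the sample
    Returns {chromosome: relative array size}; values sum to 1 per sample.
    """
    # Step 1: scale by sequencing depth of the sample.
    by_coverage = {chrom: n / chr1_unique_coverage
                   for chrom, n in cen_read_counts.items()}
    # Step 2: divide by the sample-wide sum over all chromosomes.
    total = sum(by_coverage.values())
    return {chrom: value / total for chrom, value in by_coverage.items()}
```

Note that in this simplified form the coverage factor from the first step cancels algebraically once the second step is applied, so the result is each chromosome's share of all CEN-mapping reads in the sample; the first step is kept only to mirror the procedure as described.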

HuRef cenhaps

The CEN models incorporated into GRCh38 are based on the long Sanger reads from the HuRef genome (Levy et al., 2007). To evaluate the potential impact of reference bias on the estimation of CEN-mapping of the 1000 Genomes Illumina reads for various cenhaps, we determined the similarity of the genotype of the HuRef genome for SNPs defining the cenhaps on chrX and chr11. Since HuRef is a male genome, this involved simply reading off the genotypes at the defining SNPs for chrX. For the autosomes, the diploid genotype of the HuRef genome is needed. Fortunately, Mu et al. (2015) extended, improved and validated genome-wide genotyping of HuRef. Since the HuRef diploid SNP genotypes are not phased, we attempted to identify the two most likely chr11 cenhap genotypes in HuRef by counting the numbers of mismatches between the diploid HuRef genotypes and each of 2504 individuals in the 1000 Genomes. One individual showed the fewest mismatches in the 8816 genotyped sites (chr11:50509493–55326684): seven SNPs exhibited a two-allele mismatch, 165 SNPs a one-allele mismatch (0/0 versus 0/1 or 1/1 versus 0/1) and 8644 SNPs matched. Placed in the context of the seven most common chr11 cenhaps, this individual is a cenhap-heterozygous genotype 3_5 (see Figure 4—figure supplement 1). As a group, individuals heterozygous for these two cenhaps, 3 and 5, exhibit the lowest numbers of mismatches from HuRef. Figure 4—figure supplement 1 shows the distribution of the sums of non-matching SNPs between HuRef and individuals with different cenhap genotypes.
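
The mismatch counting lends itself to a short sketch. Here unphased diploid genotypes are coded as alternate-allele counts (0, 1 or 2), so that the absolute difference between two genotypes gives the number of mismatching alleles at a site. The function names, this coding, and ranking individuals by the total number of mismatching alleles are assumptions for illustration; the text does not state which summary was used to rank individuals.

```python
from collections import Counter

def mismatch_summary(huref, sample):
    """Count matching, one-allele and two-allele mismatch sites.

    huref, sample -- equal-length sequences of alternate-allele counts (0/1/2).
    """
    labels = {0: "match", 1: "one_allele", 2: "two_allele"}
    return Counter(labels[abs(h - s)] for h, s in zip(huref, sample))

def closest_individual(huref, panel):
    """Return the sample ID in `panel` with the fewest mismatching alleles.

    panel -- {sample_id: sequence of alternate-allele counts}
    """
    def score(genotypes):
        return sum(abs(h - g) for h, g in zip(huref, genotypes))
    return min(panel, key=lambda sample_id: score(panel[sample_id]))
```

Applied to the counts reported above, the three classes (7, 165 and 8644 sites) sum to the 8816 genotyped SNPs.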

Acknowledgements

We thank Benjamin Vernot for help accessing archaic DNA sequence data, and Graham Coop, Mikkel Schierup and Yuh Chwen Grace Lee for helpful discussions.

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Contributor Information

Charles H Langley, Email: chlangley@ucdavis.edu.

Magnus Nordborg, Austrian Academy of Sciences, Austria.

Diethard Tautz, Max-Planck Institute for Evolutionary Biology, Germany.

Funding Information

This paper was supported by the following grants:

  • National Institute of General Medical Sciences R01 GM117420 to Gary H Karpen.

  • National Institute of General Medical Sciences R01 GM119011 to Gary H Karpen.

Additional information

Competing interests

No competing interests declared.

Author contributions

Conceptualization, Software, Formal analysis, Investigation, Visualization, Methodology, Writing—original draft, Writing—review and editing.

Conceptualization, Investigation, Methodology, Writing—review and editing.

Conceptualization, Supervision, Investigation, Writing—review and editing.

Conceptualization, Resources, Formal analysis, Investigation, Visualization, Methodology, Writing—original draft, Writing—review and editing.

Additional files

Transparent reporting form
DOI: 10.7554/eLife.42989.026

Data availability

All data needed to evaluate the conclusions in the paper are present in the paper or the supplementary materials. The human population genomic variation data analyzed for linkage disequilibria and haplotypic structure in the Centromere-Proximal Regions of each chromosome were accessed from the 1000 Genomes Project (Phase 3) (ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/). The inference of Neanderthal and Denisovan ancestry in the Centromere-Proximal Regions was based on data available at https://bioinf.eva.mpg.de/jbrowse, described in Prüfer et al. (2017). The inference of haplotypes of variation in the Centromere-Proximal Regions of the HuRef reference genome used to create the CEN regions in hg38 was based on the genotyping available at http://bioinform.github.io/huref-gs/ and described in Mu et al. (2015).

References

  1. Aguadé M, Miyashita N, Langley CH. Polymorphism and divergence in the Mst26A male accessory gland gene region in Drosophila. Genetics. 1992;132:755–770. doi: 10.1093/genetics/132.3.755. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Al Olama AA, Kote-Jarai Z, Berndt SI, Conti DV, Schumacher F, Han Y, Benlloch S, Hazelett DJ, Wang Z, Saunders E, Leongamornlert D, Lindstrom S, Jugurnauth-Little S, Dadaev T, Tymrakiewicz M, Stram DO, Rand K, Wan P, Stram A, Sheng X, Pooler LC, Park K, Xia L, Tyrer J, Kolonel LN, Le Marchand L, Hoover RN, Machiela MJ, Yeager M, Burdette L, Chung CC, Hutchinson A, Yu K, Goh C, Ahmed M, Govindasami K, Guy M, Tammela TL, Auvinen A, Wahlfors T, Schleutker J, Visakorpi T, Leinonen KA, Xu J, Aly M, Donovan J, Travis RC, Key TJ, Siddiq A, Canzian F, Khaw KT, Takahashi A, Kubo M, Pharoah P, Pashayan N, Weischer M, Nordestgaard BG, Nielsen SF, Klarskov P, Røder MA, Iversen P, Thibodeau SN, McDonnell SK, Schaid DJ, Stanford JL, Kolb S, Holt S, Knudsen B, Coll AH, Gapstur SM, Diver WR, Stevens VL, Maier C, Luedeke M, Herkommer K, Rinckleb AE, Strom SS, Pettaway C, Yeboah ED, Tettey Y, Biritwum RB, Adjei AA, Tay E, Truelove A, Niwa S, Chokkalingam AP, Cannon-Albright L, Cybulski C, Wokołorczyk D, Kluźniak W, Park J, Sellers T, Lin HY, Isaacs WB, Partin AW, Brenner H, Dieffenbach AK, Stegmaier C, Chen C, Giovannucci EL, Ma J, Stampfer M, Penney KL, Mucci L, John EM, Ingles SA, Kittles RA, Murphy AB, Pandha H, Michael A, Kierzek AM, Blot W, Signorello LB, Zheng W, Albanes D, Virtamo J, Weinstein S, Nemesure B, Carpten J, Leske C, Wu SY, Hennis A, Kibel AS, Rybicki BA, Neslund-Dudas C, Hsing AW, Chu L, Goodman PJ, Klein EA, Zheng SL, Batra J, Clements J, Spurdle A, Teixeira MR, Paulo P, Maia S, Slavov C, Kaneva R, Mitev V, Witte JS, Casey G, Gillanders EM, Seminara D, Riboli E, Hamdy FC, Coetzee GA, Li Q, Freedman ML, Hunter DJ, Muir K, Gronberg H, Neal DE, Southey M, Giles GG, Severi G, Cook MB, Nakagawa H, Wiklund F, Kraft P, Chanock SJ, Henderson BE, Easton DF, Eeles RA, Haiman CA, Breast and Prostate Cancer Cohort Consortium (BPC3) PRACTICAL (Prostate Cancer Association Group to Investigate Cancer-Associated Alterations in the Genome) Consortium. 
COGS (Collaborative Oncological Gene-environment Study) Consortium. GAME-ON/ELLIPSE Consortium A meta-analysis of 87,040 individuals identifies 23 new susceptibility loci for prostate cancer. Nature Genetics. 2014;46:1103–1109. doi: 10.1038/ng.3094. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Aldrup-MacDonald ME, Kuo ME, Sullivan LL, Chew K, Sullivan BA. Genomic variation within alpha satellite DNA influences centromere location on human chromosomes with metastable epialleles. Genome Research. 2016;26:1301–1311. doi: 10.1101/gr.206706.116. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR, The 1000 Genomes Project Consortium A global reference for human genetic variation. Nature. 2015;526:68–74. doi: 10.1038/nature15393. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Bae CJ, Douka K, Petraglia MD. On the origin of modern humans: asian perspectives. Science. 2017;358:eaai9067. doi: 10.1126/science.aai9067. [DOI] [PubMed] [Google Scholar]
  6. Charlesworth B, Hartl DL. Population dynamics of the segregation distorter polymorphism of Drosophila melanogaster. Genetics. 1978;89:171–192. doi: 10.1093/genetics/89.1.171. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Chmátal L, Gabriel SI, Mitsainas GP, Martínez-Vargas J, Ventura J, Searle JB, Schultz RM, Lampson MA. Centromere strength provides the cell biological basis for meiotic drive and karyotype evolution in mice. Current Biology. 2014;24:2295–2300. doi: 10.1016/j.cub.2014.08.017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Dernburg AF, Sedat JW, Hawley RS. Direct evidence of a role for heterochromatin in meiotic chromosome segregation. Cell. 1996;86:135–146. doi: 10.1016/S0092-8674(00)80084-7. [DOI] [PubMed] [Google Scholar]
  9. Durvasula A, Sankararaman S. Recovering signals of ghost archaic introgression in african populations. bioRxiv. 2019 doi: 10.1101/285734. [DOI] [PMC free article] [PubMed]
  10. Dutheil JY, Munch K, Nam K, Mailund T, Schierup MH. Strong selective sweeps on the X chromosome in the Human-Chimpanzee ancestor explain its low divergence. PLOS Genetics. 2015;11:e1005451. doi: 10.1371/journal.pgen.1005451. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Green RE, Krause J, Briggs AW, Maricic T, Stenzel U, Kircher M, Patterson N, Li H, Zhai W, Fritz MH, Hansen NF, Durand EY, Malaspinas AS, Jensen JD, Marques-Bonet T, Alkan C, Prüfer K, Meyer M, Burbano HA, Good JM, Schultz R, Aximu-Petri A, Butthof A, Höber B, Höffner B, Siegemund M, Weihmann A, Nusbaum C, Lander ES, Russ C, Novod N, Affourtit J, Egholm M, Verna C, Rudan P, Brajkovic D, Kucan Ž, Gušic I, Doronichev VB, Golovanova LV, Lalueza-Fox C, de la Rasilla M, Fortea J, Rosas A, Schmitz RW, Johnson PLF, Eichler EE, Falush D, Birney E, Mullikin JC, Slatkin M, Nielsen R, Kelso J, Lachmann M, Reich D, Pääbo S. A draft sequence of the neandertal genome. Science. 2010;328:710–722. doi: 10.1126/science.1188021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Hagenaars SP, Hill WD, Harris SE, Ritchie SJ, Davies G, Liewald DC, Gale CR, Porteous DJ, Deary IJ, Marioni RE. Genetic prediction of male pattern baldness. PLOS Genetics. 2017;13:e1006594. doi: 10.1371/journal.pgen.1006594. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Hammer MF, Woerner AE, Mendez FL, Watkins JC, Wall JD. Genetic evidence for archaic admixture in africa. PNAS. 2011;108:15123–15128. doi: 10.1073/pnas.1109300108. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, Aken BL, Barrell D, Zadissa A, Searle S, Barnes I, Bignell A, Boychenko V, Hunt T, Kay M, Mukherjee G, Rajan J, Despacio-Reyes G, Saunders G, Steward C, Harte R, Lin M, Howald C, Tanzer A, Derrien T, Chrast J, Walters N, Balasubramanian S, Pei B, Tress M, Rodriguez JM, Ezkurdia I, van Baren J, Brent M, Haussler D, Kellis M, Valencia A, Reymond A, Gerstein M, Guigó R, Hubbard TJ. GENCODE: the reference human genome annotation for the ENCODE project. Genome Research. 2012;22:1760–1774. doi: 10.1101/gr.135350.111. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Hsieh P, Woerner AE, Wall JD, Lachance J, Tishkoff SA, Gutenkunst RN, Hammer MF. Model-based analyses of whole-genome data reveal a complex evolutionary history involving archaic introgression in central african pygmies. Genome Research. 2016;26:291–300. doi: 10.1101/gr.196634.115. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Iwata-Otsubo A, Dawicki-McKenna JM, Akera T, Falk SJ, Chmátal L, Yang K, Sullivan BA, Schultz RM, Lampson MA, Black BE. Expanded satellite repeats amplify a discrete CENP-A nucleosome assembly site on chromosomes that drive in female meiosis. Current Biology. 2017;27:2365–2373. doi: 10.1016/j.cub.2017.06.069. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Janssen A, Colmenares SU, Karpen GH. Heterochromatin: guardian of the genome. Annual Review of Cell and Developmental Biology. 2018;34:265–288. doi: 10.1146/annurev-cellbio-100617-062653. [DOI] [PubMed] [Google Scholar]
  18. Karpen GH, Le MH, Le H. Centric heterochromatin and the efficiency of achiasmate disjunction in Drosophila female meiosis. Science. 1996;273:118–122. doi: 10.1126/science.273.5271.118. [DOI] [PubMed] [Google Scholar]
  19. Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, Walenz BP, Axelrod N, Huang J, Kirkness EF, Denisov G, Lin Y, MacDonald JR, Pang AW, Shago M, Stockwell TB, Tsiamouri A, Bafna V, Bansal V, Kravitz SA, Busam DA, Beeson KY, McIntosh TC, Remington KA, Abril JF, Gill J, Borman J, Rogers YH, Frazier ME, Scherer SW, Strausberg RL, Venter JC. The diploid genome sequence of an individual human. PLOS Biology. 2007;5:e254. doi: 10.1371/journal.pbio.0050254. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Malik HS, Henikoff S. Adaptive evolution of cid, a centromere-specific histone in Drosophila. Genetics. 2001;157:1293–1298. doi: 10.1093/genetics/157.3.1293. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Malik HS, Henikoff S. Major evolutionary transitions in centromere complexity. Cell. 2009;138:1067–1082. doi: 10.1016/j.cell.2009.08.036. [DOI] [PubMed] [Google Scholar]
  22. Malnic B, Godfrey PA, Buck LB. The human olfactory receptor gene family. PNAS. 2004;101:2584–2589. doi: 10.1073/pnas.0307882100. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. McCoy RC. Mosaicism in preimplantation human embryos: when chromosomal abnormalities are the norm. Trends in Genetics. 2017;33:448–463. doi: 10.1016/j.tig.2017.04.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. McKinley KL, Cheeseman IM. The molecular basis for centromere identity and function. Nature Reviews Molecular Cell Biology. 2016;17:16–29. doi: 10.1038/nrm.2015.5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Meyer WK, Arbeithuber B, Ober C, Ebner T, Tiemann-Boege I, Hudson RR, Przeworski M. Evaluating the evidence for transmission distortion in human pedigrees. Genetics. 2012;191:215–232. doi: 10.1534/genetics.112.139576. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Miga KH, Newton Y, Jain M, Altemose N, Willard HF, Kent WJ. Centromere reference models for human chromosomes X and Y satellite arrays. Genome Research. 2014;24:697–707. doi: 10.1101/gr.159624.113. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Mu JC, Tootoonchi Afshar P, Mohiyuddin M, Chen X, Li J, Bani Asadi N, Gerstein MB, Wong WH, Lam HY. Leveraging long read sequencing from a single individual to provide a comprehensive resource for benchmarking variant calling methods. Scientific Reports. 2015;5:14493. doi: 10.1038/srep14493. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Nagaoka SI, Hassold TJ, Hunt PA. Human aneuploidy: mechanisms and new insights into an age-old problem. Nature Reviews Genetics. 2012;13:493–504. doi: 10.1038/nrg3245. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Nambiar M, Smith GR. Repression of harmful meiotic recombination in centromeric regions. Seminars in Cell & Developmental Biology. 2016;54:188–197. doi: 10.1016/j.semcdb.2016.01.042. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Naylor RM, van Deursen JM. Aneuploidy in cancer and aging. Annual Review of Genetics. 2016;50:45–66. doi: 10.1146/annurev-genet-120215-035303. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Nei M, Gojobori T. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Molecular Biology and Evolution. 1986;3:418–426. doi: 10.1093/oxfordjournals.molbev.a040410. [DOI] [PubMed] [Google Scholar]
  32. Ohta T. Associative overdominance caused by linked detrimental mutations. Genetical Research. 1971;18:277–286. doi: 10.1017/S0016672300012684. [DOI] [PubMed] [Google Scholar]
  33. Pardo-Manuel de Villena F, Sapienza C. Nonrandom segregation during meiosis: the unfairness of females. Mammalian Genome. 2001;12:331–339. doi: 10.1007/s003350040003. [DOI] [PubMed] [Google Scholar]
  34. Patterson N, Moorjani P, Luo Y, Mallick S, Rohland N, Zhan Y, Genschoreck T, Webster T, Reich D. Ancient admixture in human history. Genetics. 2012;192:1065–1093. doi: 10.1534/genetics.112.145037. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Peng JC, Karpen GH. Epigenetic regulation of heterochromatic DNA stability. Current Opinion in Genetics & Development. 2008;18:204–211. doi: 10.1016/j.gde.2008.01.021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Prüfer K, Racimo F, Patterson N, Jay F, Sankararaman S, Sawyer S, Heinze A, Renaud G, Sudmant PH, de Filippo C, Li H, Mallick S, Dannemann M, Fu Q, Kircher M, Kuhlwilm M, Lachmann M, Meyer M, Ongyerth M, Siebauer M, Theunert C, Tandon A, Moorjani P, Pickrell J, Mullikin JC, Vohr SH, Green RE, Hellmann I, Johnson PL, Blanche H, Cann H, Kitzman JO, Shendure J, Eichler EE, Lein ES, Bakken TE, Golovanova LV, Doronichev VB, Shunkov MV, Derevianko AP, Viola B, Slatkin M, Reich D, Kelso J, Pääbo S. The complete genome sequence of a neanderthal from the altai mountains. Nature. 2014;505:43–49. doi: 10.1038/nature12886. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Prüfer K, de Filippo C, Grote S, Mafessoni F, Korlević P, Hajdinjak M, Vernot B, Skov L, Hsieh P, Peyrégne S, Reher D, Hopfe C, Nagel S, Maricic T, Fu Q, Theunert C, Rogers R, Skoglund P, Chintalapati M, Dannemann M, Nelson BJ, Key FM, Rudan P, Kućan Ž, Gušić I, Golovanova LV, Doronichev VB, Patterson N, Reich D, Eichler EE, Slatkin M, Schierup MH, Andrés AM, Kelso J, Meyer M, Pääbo S. A high-coverage neandertal genome from vindija cave in Croatia. Science. 2017;358:655–658. doi: 10.1126/science.aao1887. [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Rosin LF, Mellone BG. Centromeres drive a hard bargain. Trends in Genetics. 2017;33:101–117. doi: 10.1016/j.tig.2016.12.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Rudd MK, Wray GA, Willard HF. The evolutionary dynamics of alpha-satellite. Genome Research. 2006;16:88–96. doi: 10.1101/gr.3810906. [DOI] [PMC free article] [PubMed] [Google Scholar]
  40. Sankararaman S, Patterson N, Li H, Pääbo S, Reich D. The date of interbreeding between neandertals and modern humans. PLOS Genetics. 2012;8:e1002947. doi: 10.1371/journal.pgen.1002947. [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Sankararaman S, Mallick S, Dannemann M, Prüfer K, Kelso J, Pääbo S, Patterson N, Reich D. The genomic landscape of neanderthal ancestry in present-day humans. Nature. 2014;507:354–357. doi: 10.1038/nature12961. [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Smith GP. Evolution of repeated DNA sequences by unequal crossover. Science. 1976;191:528–535. doi: 10.1126/science.1251186. [DOI] [PubMed] [Google Scholar]
  43. Song YS, Ding Z, Gusfield D, Langley CH, Wu Y. Algorithms to distinguish the role of gene-conversion from single-crossover recombination in the derivation of SNP sequences in populations. Journal of Computational Biology. 2007;14:1273–1286. doi: 10.1089/cmb.2007.0096. [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. Speidel L, Forest M, Shi S, Myers S. A method for genome-wide genealogy estimation for thousands of samples. bioRxiv. 2019 doi: 10.1101/550558. [DOI] [PMC free article] [PubMed]
  45. Sullivan LL, Boivin CD, Mravinac B, Song IY, Sullivan BA. Genomic size of CENP-A domain is proportional to total alpha satellite array size at human centromeres and expands in cancer cells. Chromosome Research. 2011;19:457–470. doi: 10.1007/s10577-011-9208-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Sullivan LL, Chew K, Sullivan BA. α satellite DNA variation and function of the human centromere. Nucleus. 2017;8:331–339. doi: 10.1080/19491034.2017.1308989. [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Talbert PB, Bryson TD, Henikoff S. Adaptive evolution of centromere proteins in plants and animals. Journal of Biology. 2004;3:18. doi: 10.1186/jbiol11. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Trimmer C, Keller A, Murphy NR, Snyder LL, Willer JR, Nagai MH, Katsanis N, Vosshall LB, Matsunami H, Mainland JD. Genetic variation across the human olfactory receptor repertoire alters odor perception. PNAS. 2019;116 doi: 10.1073/pnas.1804106115. [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Willard HF, Smith KD, Sutherland J. Isolation and characterization of a major tandem repeat family from the human X chromosome. Nucleic Acids Research. 1983;11:2017–2034. doi: 10.1093/nar/11.7.2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  50. Zheng-Bradley X, Streeter I, Fairley S, Richardson D, Clarke L, Flicek P, 1000 Genomes Project Consortium Alignment of 1000 genomes project reads to reference assembly GRCh38. GigaScience. 2017;6:1–8. doi: 10.1093/gigascience/gix038. [DOI] [PMC free article] [PubMed] [Google Scholar]
  51. Zwick ME, Salstrom JL, Langley CH. Genetic variation in rates of nondisjunction: association of two naturally occurring polymorphisms in the chromokinesin nod with increased rates of nondisjunction in Drosophila melanogaster. Genetics. 1999;152:1605–1614. doi: 10.1093/genetics/152.4.1605. [DOI] [PMC free article] [PubMed] [Google Scholar]

Decision letter

Editor: Magnus Nordborg
Reviewed by: Magnus Nordborg, Andrew G Clark

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "Haplotypes spanning centromeric regions reveal persistence of large blocks of archaic DNA" for consideration by eLife. Your article has been reviewed by two peer reviewers, including Magnus Nordborg as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Diethard Tautz as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Andrew G Clark (Reviewer #2).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

This is a fascinating and focused paper confirming that the extremely low recombination rate in centromeric regions has led to the preservation of extremely long haplotypes, some of which appear to be of archaic origin. The results are novel, potentially important, and not obvious (for example, the existence of ancient polymorphisms argues against massive segregation distortion and meiotic drive, as you note). The results also suggest interesting avenues for future research. For example, the observation that these haplotypes capture large numbers of odorant receptors suggests a possible role for selection, and the potential importance of centromere-driven meiotic drive can also now be explored. In general, it is a significant advance in our understanding of these hitherto inaccessible regions.

Essential revisions:

Although we believe you do an excellent job supporting the conclusions of the paper, we are also well aware of the technical pitfalls inherent in analyzing these kinds of polymorphism data. We're in particular worried about unpredictable interactions between the 4g_dco test and imputation, but there could be other problems that are very difficult to predict and detect. Based on your description of your methods, you are clearly aware of this, and we agree with your logic. Nonetheless, it would be nice to have an independent sanity check, and we would therefore suggest that you confirm the transmission of at least a subset of these haplotypes using publicly available trio data (1000 Genomes has a few). As you know, this is commonly done in SNP calling, and is an important safeguard against artifactual SNP calls due to misalignment. It will be exciting one day, when there is enough trio data out there, to assess segregation ratios of these cenhaps.

A second point concerns the writing and claims. Arguably, your results are entirely consistent with an entirely neutral process. Is this correct, or is there anything in your data that is incompatible with a standard neutral model? Low recombination gives you long haplotypes that look diverged, but are actually just more visible because of the high LD (as anyone who has looked at Arabidopsis data knows). Furthermore, we already know that SNPs are shared with Neanderthals – why would long haplotypes behave differently?

We feel it is important to explicitly discuss whether these haplotypes have a different history than neutral SNPs, and, if there is no evidence for this, change the writing accordingly. Not doing so will likely lead to misunderstanding and exaggerated claims ("Neanderthal centromeres invaded modern humans!").

[Editors' note: further revisions were requested prior to acceptance, as described below.]

Thank you for resubmitting your work entitled "Haplotypes spanning centromeric regions reveal persistence of large blocks of archaic DNA" for further consideration at eLife. Your revised article has been favorably evaluated by Diethard Tautz (Senior Editor) and a Reviewing Editor.

The manuscript has been improved but there are some remaining issues that need to be addressed before acceptance. We were curious why one would not include the trio analysis in the actual paper as well as in the response letter. Perhaps this could be added to the Materials and methods section?

eLife. 2019 Jun 25;8:e42989. doi: 10.7554/eLife.42989.029

Author response


Essential revisions:

Although we believe you do an excellent job supporting the conclusions of the paper, we are also well aware of the technical pitfalls inherent in analyzing these kinds of polymorphism data. We're in particular worried about unpredictable interactions between the 4g_dco test and imputation, but there could be other problems that are very difficult to predict and detect. Based on your description of your methods, you are clearly aware of this, and we agree with your logic. Nonetheless, it would be nice to have an independent sanity check, and we would therefore suggest that you confirm the transmission of at least a subset of these haplotypes using publicly available trio data (1000 Genomes has a few). As you know, this is commonly done in SNP calling, and is an important safeguard against artifactual SNP calls due to misalignment. It will be exciting one day, when there is enough trio data out there, to assess segregation ratios of these cenhaps.

This is a reasonable request. We were, of course, concerned that ShapeIt2 and its parameterization might have been tuned for the distinctly different nature of human population genomic variation in the non-Centromere-Proximal regions, which make up the vast majority of the genome and are (to most) its more interesting part. While we became convinced that our success in identifying cenhaps could only mean that ShapeIt2 phasing was robust to such changes in the genomic scale of linkage disequilibrium, we surmised that the inclusion of these very trios in the imputation must have contributed critically to the quality of the inference. In response to your request we have gladly conducted a ‘sanity check’. As documented below, the imputed centromeric haplotypes in the 1000 Genomes trio parents are, indeed, transmitted intact to the progeny.

While the parents in the 1000 Genomes trios were extensively sequenced and their entire genomes imputed, the available genotyping of the progeny in the majority of these trios is based solely on genotyping arrays (no sequencing or imputation). We focused on data from the OMNI genotyping platform, which includes 353 progeny of 1000 Genomes Phase 3 parents and incorporates light coverage of the CPRs. Thus, data for each trio include a modest number of unphased diploid SNPs in the progeny and the four imputed phased genomes in the two parents. A simple test of the ‘general’ validity of phasing (not a rigorous estimation of a very small error rate) is to consider the four possible predicted genotypes of the progeny. Since our article highlighted variation on chromosomes 8, 10, 11 and 12, we chose to examine the transmission of imputed centromeric haplotypes in these CPRs. If the phasing is correct, then the distribution of the numbers of SNP genotype mismatches between each of the four possible progeny genotypes and the observed progeny genotype should be disjunct, with one genotype ‘matching’ and the other three exhibiting a number of mismatches. Author response image 1 compares, over all 353 chr11 trios, the distribution of the mean number of mismatches (over the four possible genotypes) with that of the minimum of the four. Trios in which the mean was small (≤ 3, little power) were excluded. For the vast majority of the remaining 296 trios, the diploid genotypes for SNPs in the chr11 cenhap region were indeed transmitted as expected. “Sanity checks out”. Similar analyses and plots for SNPs in the CPRs of chr8, chr10 and chr12 yield the same conclusion.

Author response image 1.
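
The trio check described above can be illustrated with a minimal sketch. This is not the authors' code: the function names, the 0/1 allele encoding, the use of `None` for untyped progeny SNPs, and the toy haplotypes are all assumptions made for illustration. Each parent contributes one of two phased haplotypes, giving four candidate progeny genotypes; the sketch counts SNP mismatches between each candidate and the observed unphased progeny genotype, and reports the minimum and mean used in the comparison. The `min_mean=3` default mirrors the stated exclusion of trios whose mean was ≤ 3.

```python
from itertools import product


def possible_progeny_genotypes(mother, father):
    """Return the four candidate progeny genotypes.

    mother, father: pairs of phased haplotypes, each a list of 0/1 alleles
    at the same ordered SNPs (hypothetical encoding). Each genotype is a
    list of alternate-allele counts (0, 1 or 2).
    """
    return [
        [m + f for m, f in zip(mat_hap, pat_hap)]
        for mat_hap, pat_hap in product(mother, father)
    ]


def mismatch_counts(mother, father, child_genotype):
    """Count SNP mismatches between each candidate and the observed genotype.

    child_genotype: list of alternate-allele counts; None marks an untyped SNP
    and is skipped.
    """
    counts = []
    for predicted in possible_progeny_genotypes(mother, father):
        n = sum(
            1
            for pred, obs in zip(predicted, child_genotype)
            if obs is not None and pred != obs
        )
        counts.append(n)
    return counts


def summarize_trio(mother, father, child_genotype, min_mean=3):
    """Summarize one trio: per-candidate counts, their minimum and mean.

    A trio is flagged informative only if the mean exceeds min_mean,
    since a small mean leaves little power to detect phasing errors.
    """
    counts = mismatch_counts(mother, father, child_genotype)
    mean = sum(counts) / len(counts)
    return {
        "counts": counts,
        "min": min(counts),
        "mean": mean,
        "informative": mean > min_mean,
    }


if __name__ == "__main__":
    # Toy haplotypes (made up): the child inherits m2 and f1.
    m1 = [0, 0, 0, 0, 0, 0, 0, 0]
    m2 = [1, 1, 1, 1, 0, 0, 0, 0]
    f1 = [0, 0, 0, 0, 1, 1, 1, 1]
    f2 = [1, 1, 0, 0, 1, 1, 0, 0]
    child = [a + b for a, b in zip(m2, f1)]
    print(summarize_trio((m1, m2), (f1, f2), child))
```

In this toy trio exactly one candidate (the m2 and f1 combination) has zero mismatches while the other three each have several, which is the disjunct pattern expected when the parental phasing is correct. A minimum well above zero would instead point to a phasing error or genetic exchange.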

A second point concerns the writing and claims. Arguably, your results are entirely consistent with an entirely neutral process. Is this correct, or is there anything in your data that is incompatible with a standard neutral model? Low recombination gives you long haplotypes that look diverged, but are actually just more visible because of the high LD (as anyone who has looked at Arabidopsis data knows). Furthermore, we already know that SNPs are shared with Neanderthals – why would long haplotypes behave differently?

We feel it is important to explicitly discuss whether these haplotypes have a different history than neutral SNPs, and, if there is no evidence for this, change the writing accordingly. Not doing so will likely lead to misunderstanding and exaggerated claims ("Neanderthal centromeres invaded modern humans!").

We have further qualified our statements and make no claim about selection’s impact on the distribution of archaic cenhaps. We continue to favor the use of “persistence” in the title because it is correct. But for some readers it may have a broader connotation.

A rigorous analysis of how well a selectively neutral demographic model fits the genomic variation in the many centromere proximal regions (CPRs) in the 1000 Genomes data is beyond the scope of this article. The discovery of large-scale haplotypes spanning the vast α-satellite DNA arrays and the introgression of putatively archaic CPRs is in itself a significant observation. The population genetics model-based analysis of such a large and complex data set in the context of the already large and evolving literature of human demographic history is an important challenge for the future. What we present here is already a great deal of work. We feel science will advance more rapidly if we first get the existence of cenhaps in front of the community, since they can be the foundation for many other investigations in more appropriate data sets.

From a different perspective, we note the speculative, yet widely cited and repeated, literature expounding the hypothesis that directional selection via meiotic drive, and selection for modifiers thereof, explains the rapid concerted evolution of centromeric satellite DNAs and the rapid divergence of proteins that are primary components of centromere formation, structure and function. Clearly, the naïve prediction that recent selective sweeps will wipe out all population genomic sequence diversity in CPRs is not supported by our results. But this speculation is not a quantitative model. The time scale on which such sweeps occur could be much greater than the TMRCA of hominins. For those who imagined frequent recurrent meiotic drive sweeps, our results are a caution: the CPRs are as polymorphic as the non-CPR regions. For those who see a likely more complex evolutionary dynamic here, our results hopefully suggest interesting ways to test for its impacts on different time scales.

[Editors' note: further revisions were requested prior to acceptance, as described below.]

The manuscript has been improved but there are some remaining issues that need to be addressed before acceptance. We were curious why one would not include the trio analysis in the actual paper as well as in the response letter. Perhaps this could be added to the Materials and methods section?

Thank you for handling our manuscript and helping to improve it. The genomic phasing achieved in the 1000 Genomes (phase 3) explicitly utilizes the trio data as input to the ShapeIt imputation pipeline. For this reason, it would be unexpected for the requested additional analysis to discover inconsistencies. Nevertheless, we did examine the four focal chromosomes for inconsistencies between the reported imputed parental cenhaps and the observed diploid genotypes of the trio progeny, as previously reported in our "Author response". Although hundreds of trios had sufficient diversity in the cenhap regions to readily detect phasing errors (and/or genetic exchange), we found only one inconsistent cenhap-trio out of 1294 examined. Thus, we confirm that the phasing of these four chromosomes in the 1000 Genomes (phase 3) is robust in the centromere proximal regions. This conclusion is in agreement with our results: if such trio-based inconsistencies were indeed common, they would yield substantial phasing errors and we would not expect to find the clear clustering/descent of cenhaps that we report in our paper. In summary, our focused study of trio data from four chromosomes revealed no obvious discrepancies between the ShapeIt phased parental cenhap genotypes and the observed progeny genotypes. We believe that these confirmatory results of the 'sanity check' of the phasing reported in the published 1000 Genomes paper are most appropriately presented in the online correspondence ('Author response'), where they can potentially be helpful to concerned readers.

Associated Data

    Supplementary Materials

    Figure 1—source data 1. The 21 chrX coding genes in the CPR (8 left and 13 right of the centromere gap) used in the UPGMA clustering and estimation of TMRCAs.

    Gene models and alignments from Ensembl release 92 (April 2018). Numbers of sites divergent (human-chimp): div_sites. Numbers of sites polymorphic: polym_sites. Average nonsynonymous divergence: nonsyn_div. Average synonymous divergence: syn_div. Average nonsynonymous diversity: nonsyn_π. Average synonymous diversity: syn_π.

    DOI: 10.7554/eLife.42989.007
    Figure 1—source data 2. Full resolution version of Figure 1.
    DOI: 10.7554/eLife.42989.008
    Figure 1—figure supplement 1—source data 1. Full resolution version of Figure 1—figure supplement 1.
    DOI: 10.7554/eLife.42989.004
    Figure 1—figure supplement 2—source data 1. Full resolution version of Figure 1—figure supplement 2.
    DOI: 10.7554/eLife.42989.006
    Figure 2—source data 1. Centromere-Proximal Regions examined.

    The hg19 coordinates (p_begin to p_end and q_begin to q_end) of the CPRs in which SNPs in the 1000 Genomes (Phase 3) were investigated, panel b in Figure 2. Imputed haplotypes were UPGMA clustered based on filtered SNPs in a symmetrical central region immediately flanking the centromeric gap in the assembly (p_c to p_end and q_begin to p_c).

    DOI: 10.7554/eLife.42989.010
    Figure 2—source data 2. Full resolution version of Figure 2.
    DOI: 10.7554/eLife.42989.011
    Figure 3—source data 1. The 37 chr11 coding genes in the CPR (2 left and 35 right of the centromere gap) used in the UPGMA clustering and estimation of TMRCAs.

    Gene models and alignments from Ensembl release 92 (April 2018). Numbers of nonsynonymous differences in the two basal cenhaps (1, 2 and both, 1_&_2; see Figure 3) from the other cenhaps of the 5008 imputed chr11 CPR haplotypes (see Materials and methods). Numbers of sites divergent (human-chimp): div_sites. Numbers of sites polymorphic: polym_sites. Average nonsynonymous divergence: nonsyn_div. Average synonymous divergence: syn_div. Average nonsynonymous diversity: nonsyn_π. Average synonymous diversity: syn_π.

    DOI: 10.7554/eLife.42989.019
    Figure 3—source data 2. The eight chr8 coding genes in the CPR (8 left and 0 right of the centromere gap) used in the UPGMA clustering and estimation of TMRCAs.

    Gene models and alignments from Ensembl release 92 (April 2018). Numbers of sites divergent (human-chimp): div_sites. Numbers of sites polymorphic: polym_sites. Average nonsynonymous divergence: nonsyn_div. Average synonymous divergence: syn_div. Average nonsynonymous diversity: nonsyn_π. Average synonymous diversity: syn_π.

    DOI: 10.7554/eLife.42989.020
    Figure 3—source data 3. Full resolution version of Figure 3.
    DOI: 10.7554/eLife.42989.021
    Figure 3—figure supplement 1—source data 1. Full resolution version of Figure 3—figure supplement 1.
    DOI: 10.7554/eLife.42989.014
    Figure 3—figure supplement 2—source data 1. Full resolution version of Figure 3—figure supplement 2.
    DOI: 10.7554/eLife.42989.016
    Figure 3—figure supplement 3—source data 1. Full resolution version of Figure 3—figure supplement 3.
    DOI: 10.7554/eLife.42989.018
    Figure 4—source data 1. Full resolution version of Figure 4.
    DOI: 10.7554/eLife.42989.025
    Figure 4—figure supplement 1—source data 1. Full resolution version of Figure 4—figure supplement 1.
    DOI: 10.7554/eLife.42989.024
    Transparent reporting form
    DOI: 10.7554/eLife.42989.026

    Data Availability Statement

    All data needed to evaluate the conclusions in the paper are present in the paper or the supplementary materials. The human population genomic variation analyzed for linkage disequilibria and haplotypic structure in the Centromere-Proximal Regions of each chromosome was accessed from the 1000 Genomes Project (Phase 3) (ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/). The inference of Neanderthal and Denisovan ancestry in the Centromere-Proximal Regions was based on data available at https://bioinf.eva.mpg.de/jbrowse and described in Prüfer et al., 2017. The inference of haplotypes of variation in the Centromere-Proximal Regions of the HuRef reference genome used to create the CEN regions in hg38 was based on the genotyping available at http://bioinform.github.io/huref-gs/ and described in Mu et al., 2015.


    Articles from eLife are provided here courtesy of eLife Sciences Publications, Ltd
