Local adaptation and archaic introgression shape global diversity at human structural variant loci

Stephanie M Yan; Rachel M Sherman; Dylan J Taylor; Divya R Nair; Andrew N Bortvin; Michael C Schatz; Rajiv C McCoy

doi:10.7554/eLife.67615

. 2021 Sep 16;10:e67615. doi: 10.7554/eLife.67615

Local adaptation and archaic introgression shape global diversity at human structural variant loci

Stephanie M Yan ¹, Rachel M Sherman ², Dylan J Taylor ¹, Divya R Nair ¹, Andrew N Bortvin ¹, Michael C Schatz ^1,², Rajiv C McCoy ^1,^✉

Editors: George H Perry³, George H Perry⁴

PMCID: PMC8492059 PMID: 34528508

Abstract

Large genomic insertions and deletions are a potent source of functional variation, but are challenging to resolve with short-read sequencing, limiting knowledge of the role of such structural variants (SVs) in human evolution. Here, we used a graph-based method to genotype long-read-discovered SVs in short-read data from diverse human genomes. We then applied an admixture-aware method to identify 220 SVs exhibiting extreme patterns of frequency differentiation – a signature of local adaptation. The top two variants traced to the immunoglobulin heavy chain locus, tagging a haplotype that swept to near fixation in certain southeast Asian populations, but is rare in other global populations. Further investigation revealed evidence that the haplotype traces to gene flow from Neanderthals, corroborating the role of immune-related genes as prominent targets of adaptive introgression. Our study demonstrates how recent technical advances can help resolve signatures of key evolutionary events that remained obscured within technically challenging regions of the genome.

Research organism: Human

Introduction

Rapid global dispersal and cultural evolution have exposed modern humans to a striking diversity of environments, to which they have developed numerous genetic adaptations (Fan et al., 2016). The vast majority of genomic research on human adaptation has focused on single-nucleotide polymorphisms (SNPs) and short insertions and deletions (indels) due to their ease of discovery by high-throughput short-read DNA sequencing (e.g., The 1000 Genomes Project Consortium, 2015; The International HapMap Consortium, 2007). This focus overlooks the potential impacts of larger insertions, deletions, and inversions – collectively termed structural variants (SVs) – whose abundance and complexity are now being appreciated with the advent of long-read DNA sequencing (Audano et al., 2019; Chaisson et al., 2015; Sedlazeck et al., 2018a). SVs impact more total sequence per genome than SNPs and may severely disrupt genes and regulatory elements to confer large effects on gene expression (Chiang et al., 2017). The size, modularity, and functional potential of SVs may offer a rich substrate for natural selection, as supported by several specific examples of adaptive insertions and deletions in diet and pigmentation-related genes (Hsieh et al., 2019; Kothapalli et al., 2016; Perry et al., 2007; Saitou and Gokcumen, 2019), as well as genome-wide evidence from primarily short-read data (Almarri et al., 2020; Sudmant et al., 2015).

More comprehensive analysis of SV evolution has been limited by the long and repetitive nature of many SVs. Short-read approaches to SV discovery depend on abnormalities in depth of coverage or other characteristics of read alignments (Kosugi et al., 2019) but achieve sensitivity of only 10–50% (Chaisson et al., 2019; Jakubosky et al., 2020b). In contrast, the recent development of long-read sequencing methods has revolutionized SV discovery, achieving high sensitivity and specificity by spanning SV sequences and their unique flanking regions (Sedlazeck et al., 2018a). This application of long-read sequencing to human samples has dramatically expanded the catalog of known human variation, generating databases of hundreds of thousands of SVs discovered in globally diverse individuals (Audano et al., 2019; Ebert et al., 2021). Yet with rare exceptions (Beyter et al., 2021), these sequencing methods remain impractical for population-scale applications due to their high costs and low throughput.

We addressed this limitation by applying a hybrid approach to integrate SV discovery from long-read sequencing and SV genotyping in short-read sequenced datasets. Using a graph-based method, we genotyped a catalog of long-read discovered SVs in high-coverage short-read sequencing data from the 1000 Genomes Project (1KGP) (Byrska-Bishop et al., 2021). By intersecting these genotype data with existing RNA-seq data from human cell lines, we discovered 1121 significant associations between SVs and the expression of nearby genes, including several components of the human leukocyte antigen (HLA) gene cluster and its interaction partners. We then applied an admixture-aware method to identify 220 SVs that exhibit strong allele frequency differentiation across human populations. Among the top hits, we discovered an extreme signature of local adaptation tagged by separate insertion and deletion polymorphisms at the immunoglobulin heavy chain locus, which encodes key components of the adaptive immune system. Alternative alleles defining this haplotype achieve high frequencies in certain southeast Asian populations, but are rare or absent from other global populations. Searching for signatures of archaic introgression within our set of highly differentiated SVs further revealed evidence that the adaptive haplotype entered the modern human population via ancient hybridization with Neanderthals. Together, our work highlights the role of structurally complex and repetitive regions of the genome as hidden sources of functional diversity and evolutionary innovation in the hominin lineage.

Results

Graph genotyping of structural variation

To assess the role of SVs in human adaptive evolution, we sought to combine the accuracy of long-read sequencing with the scale and population diversity of short-read sequencing data. To this end, we reanalyzed long-read sequencing data of 15 individuals from five continents (Audano et al., 2019) to discover a set of 107,866 SVs (Figure 1—figure supplement 1). The diversity of the long-read sample set (sample ancestries: three African, two American, three East Asian, two European, two South Asian, two hydatidiform moles [likely European]) enables SV discovery across the five continental superpopulations of 1KGP. We found that 89,979 (83.4%) of the long-read-discovered variants, including 30,229 (72.3%) common SVs (allele frequency [AF] ≥ 0.05), are not represented in sets of SVs discovered with short-read sequencing by 1KGP or the Human Genome Diversity Project (HGDP) (Almarri et al., 2020; Sudmant et al., 2015). Despite the much smaller size of the long-read sample set, we were also able to rediscover 66.0% and 17.7% of common SVs found with short reads in the 1KGP and HGDP datasets, respectively, in agreement with results reported in previous studies (Zhao et al., 2021; Figure 1—figure supplement 2 and Figure 1—figure supplement 3; see Materials and methods). We expect that the SVs unique to the short-read datasets reflect both differences in the discovery sample set (i.e., many of the long-read sequenced individuals are in 1KGP, while none are in HGDP) and high rates of false positives in short-read SV detection (Nattestad et al., 2018). We additionally discovered 2.3× and 3.5× more insertions than 1KGP and HGDP, respectively, with an improved insertion-to-deletion ratio (1.46 in the long-read dataset; 0.40 in 1KGP; 0.37 in HGDP). The dearth of insertions in the short-read SV sets reflects the challenge of insertion calling with short reads, which have difficulty mapping sequences not present in the reference genome (Aganezov et al., 2020), while the excess of insertions among long-read SVs reflects a known variant calling bias caused by missing sequences (e.g., collapsed tandem repeats) in the GRCh38 reference (Aganezov et al., 2021).

We then used the graph genotyping software Paragraph (Chen et al., 2019) to genotype this set of long-read discovered SVs in 2504 high-coverage (30×) short-read-sequenced samples from 1KGP (Byrska-Bishop et al., 2021; Figure 1A; see Materials and methods). Results from benchmarking, using ground truth SV data from the Genome in a Bottle Consortium (Zook et al., 2020), show that among both graph-based and non-graph-based SV genotyping tools, Paragraph consistently attains the best balance of precision and recall for both insertions and deletions (Chen et al., 2019; Hickey et al., 2020). Paragraph averages precision and recall rates of 0.72 and 0.70, respectively, across all SV types and genomic regions, and an average precision and recall of 0.86 and 0.79 when repeat regions are excluded (Hickey et al., 2020), underscoring its effectiveness for SV genotyping. Paragraph achieves such accuracy by generating graph representations of SV loci, which include diverging paths for known alternative alleles such as SVs. Short reads are aligned to the graph along the path of best fit, facilitating genotyping even in structurally complex and repetitive regions (Figure 1B). Informed by a large catalog of candidate SV alleles discovered by long-read sequencing, graph genotyping thus permits the study of variants that would be difficult or impossible to discover with short-read data alone (Chen et al., 2019; Hickey et al., 2020; Sibbesen et al., 2018; Sirén et al., 2021).

Figure 1. — (A) Genotyping of SVs was performed using a graph-based approach that represents the reference and alternative alleles of known SVs as edges. The SVs used for graph construction were originally identified from long-read sequencing of 15 individuals (Audano et al., 2019). (B) At candidate SV loci, samples sequenced with short reads are aligned to the graph along the path of best fit, and individuals are genotyped as heterozygous (middle), homozygous for the reference allele (right), or homozygous for the alternative allele (not depicted). We applied this method to the 1KGP dataset to generate population-scale SV genotypes. (C) Allele frequency spectra of SVs genotyped with Paragraph. The left-most bin represents SVs where the alternative allele is absent from the 1KGP sample (AF = 0). Samples are stratified by their 1KGP superpopulation.

Figure 1—source data 1. Superpopulation-level allele frequencies of structural variants used to construct allele frequency spectra in Figure 1C.
The third column (‘sum_AC’) provides allele counts, the fourth column (‘sum_NCHROBS’) provides the number of genotyped chromosomes, and the fifth column (‘AF’) provides the alternative allele frequency.

elife-67615-fig1-data1.txt^{(18MB, txt)}

Figure 1—figure supplement 1. — (A) Genotyping of SVs was performed using a graph-based approach that represents the reference and alternative alleles of known SVs as edges. The SVs used for graph construction were originally identified from long-read sequencing of 15 individuals (Audano et al., 2019). (B) At candidate SV loci, samples sequenced with short reads are aligned to the graph along the path of best fit, and individuals are genotyped as heterozygous (middle), homozygous for the reference allele (right), or homozygous for the alternative allele (not depicted). We applied this method to the 1KGP dataset to generate population-scale SV genotypes. (C) Allele frequency spectra of SVs genotyped with Paragraph. The left-most bin represents SVs where the alternative allele is absent from the 1KGP sample (AF = 0). Samples are stratified by their 1KGP superpopulation.

Figure 1—source data 1. Superpopulation-level allele frequencies of structural variants used to construct allele frequency spectra in Figure 1C.
The third column (‘sum_AC’) provides allele counts, the fourth column (‘sum_NCHROBS’) provides the number of genotyped chromosomes, and the fifth column (‘AF’) provides the alternative allele frequency.

elife-67615-fig1-data1.txt^{(18MB, txt)}

As quality control, we filtered the resulting data based on genotyping rates and adherence to Hardy–Weinberg equilibrium, in accordance with Chen et al., 2019 (see Materials and methods). Specifically, we removed SVs that were not successfully genotyped in ≥50% of samples, as well as SVs that violated one-sided Hardy–Weinberg equilibrium expectations (excess of heterozygotes) in more than half of the 1KGP populations. The latter scenario, whereby nearly all individuals are genotyped as heterozygous, is a common artifact caused by slightly divergent repetitive sequences that are falsely interpreted as alternative alleles at a single locus (Graffelman et al., 2017). These filtering steps removed 15,580 SVs, leaving 92,286 variants for downstream analysis.

Using this remaining set, we examined the allele frequency spectrum both to provide a general description of the data, as well as to quantify the impacts of our unique ascertainment approach (Figure 1C, Figure 1—figure supplement 5). Global alternative allele frequencies were strongly correlated with the number of long-read-sequenced samples in which they were originally discovered (Audano et al., 2019), broadly supporting the accuracy of our graph genotyping results (Figure 1—figure supplement 4). A total of 25,201 (27.3%) alternative SV alleles were absent from all 1KGP samples, likely reflecting ultra-rare variation within the panel of long-read sequenced genomes, as well as the imperfect recall (i.e., true positive rate) of SV genotyping (Chen et al., 2019). This abundance of rare variation is a known feature of human populations, which have experienced rapid demographic expansion over recent history (Keinan and Clark, 2012). In contrast, a total of 1139 (1.2%) alternative SV alleles were observed to be fixed in the 1KGP sample and likely reflect a combination of assembly errors and rare variation in the reference genome. Relaxing the criterion for fixation to 90% alternative allele frequency (thus allowing for a modest rate of false negatives) resulted in a total of 4947 (5.4%) fixed or nearly fixed alternative SV alleles.

The frequency spectrum of segregating variation was meanwhile shaped by an ascertainment bias caused by variant discovery within a small sample and genotyping within a separate and larger sample. This bias, which is similar to that previously described for genotyping microarrays (Lachance and Tishkoff, 2013), is characterized by an apparent depletion of rare variation, given that such variants are not shared with the discovery sample. While important to note, our study should be largely unaffected by this bias, due to our focus on individual variants rather than the shape of the allele frequency spectrum. Moreover, positively selected variants are by definition locally common and thus enriched within a small but globally diverse sample (Audano et al., 2019).

We next sought to test whether different classes of SVs exhibited differences in allele frequencies as a potential consequence of purifying selection. Among SVs that were polymorphic within the 1KGP sample, we observed a negative correlation between SV length and minor allele frequency (Kendall’s τ = –0.140, p-value < 1 × 10^–10), consistent with selection against longer SVs (Figure 1—figure supplement 6). We similarly observed that deletions segregated at lower average minor allele frequencies than insertions (β = –0.555, p-value < 1 ×10^–10).

Quantifying linkage disequilibrium with known SNPs and short indels

One intriguing question regards the extent to which SV genotypes are correlated with those of known SNPs and indels, which has implications for the novelty of any potential discoveries based on the SV data. To address this question, we quantified linkage disequilibrium (LD) between the catalog of long-read discovered SVs and known SNPs and short indels from 1KGP. For each of the 26 1KGP populations, we computed the maximum observed LD between each SV and the nearest 100 variants within a 1 Mb window. Depending on the population subject to measurement, 36–41% of segregating SVs were in strong LD (r² > 0.8) with any known SNP or short indel, and 52–57% were in moderate LD (r² > 0.5) with one of these variants (Figure 1—figure supplement 7). These observations of maximum LD with SNPs are slightly lower than those computed by Jakubosky et al., 2020a, which may reflect our ability to discover SVs in regions that are less accessible to short reads. Levels of LD were lowest for African populations, in accordance with known patterns of haplotype structure (Conrad et al., 2006). As with SNPs, these lower levels of LD suggest that the accuracy of SV genotype imputation based on known SNPs will be lowest in African populations, but that resolution for fine-mapping SVs that are potentially causal in phenotypic associations will also be highest (Wojcik et al., 2019).

We emphasize that our LD observations are strongly affected by the challenge of small variant discovery and genotyping in repetitive regions of the genome that are enriched for SVs. Specifically, 50,917 of all SVs (47.2%) intersect at least partially with one or more genomic intervals deemed inaccessible based on the 1KGP pilot mask, and 5214 SVs (4.8%) occur within regions of the genome identified as problematic by the ENCODE Consortium (Amemiya et al., 2019). Nevertheless, LD with known SNPs and indels is a meaningful metric for our study, because it quantifies the extent to which SVs represent independent and unexplored variation relative to previous evolutionary studies. Low observed levels of LD between a substantial fraction of SVs and known variants presents an opportunity to discover novel functional associations and signatures of adaptation at loci poorly tagged by easily genotyped markers.

Expression quantitative trait locus (eQTL) mapping

Seeking to first test the functional impacts of common structural variation, we intersected the SV genotype data with RNA-seq data from an overlapping set of 447 samples from the Geuvadis Consortium, which was generated from lymphoblastoid cell lines (LCLs) derived from individuals from four European and one African population (Lappalainen et al., 2013). We tested for associations between levels of gene expression and genotypes for SVs within 1 Mb from the transcription start site (TSS). After filtering the data on genotyping call rate, minor allele frequency, and gene expression level (see Materials and methods), we identified a total of 1121 SV-gene pairs with significant gene expression associations at a 10% false discovery rate (FDR; Figure 2A), broadly consistent with expectations from previous eQTL studies when scaled to the number of tested SVs (Chiang et al., 2017). SVs with significant impacts on expression tended to occur near the genes that they regulate, with 62% of significant SV eQTLs occurring within 100 kb of the corresponding TSS (Figure 2B).

We used RNA-seq data from the Geuvadis Consortium (Lappalainen et al., 2013), obtained from LCLs derived from individuals from four European and one African population of the 1KGP dataset, to test for associations between SV genotypes and gene expression. SV-eQTL pairs that were significant at a 10% FDR are depicted in purple. (A) Q–Q plot of permutation p-values for all SV-gene pairs tested. (B) Volcano plot of eQTLs and the estimated effect of the alternative allele on expression (β). (C) Distribution of the distance of significant SV eQTLs from the transcription start site (TSS) of their associated genes. (D) Enrichment or depletion of expression associations for SVs that overlap various ChromHMM chromatin state annotations from the Roadmap Epigenomics Project (GM12878 Lymphoblastoid Cells). Chromatin states with ≤0.1% genome-wide representation were omitted for visual clarity.

Figure 2—source data 1. Summarized eQTL data underlying Figure 2.
SV IDs, gene IDs and names, distance to the TSS, slopes, p-values, and q-values are provided.

elife-67615-fig2-data1.txt^{(2.9MB, txt)}

Figure 2—figure supplement 1. — We used RNA-seq data from the Geuvadis Consortium (Lappalainen et al., 2013), obtained from LCLs derived from individuals from four European and one African population of the 1KGP dataset, to test for associations between SV genotypes and gene expression. SV-eQTL pairs that were significant at a 10% FDR are depicted in purple. (A) Q–Q plot of permutation p-values for all SV-gene pairs tested. (B) Volcano plot of eQTLs and the estimated effect of the alternative allele on expression (β). (C) Distribution of the distance of significant SV eQTLs from the transcription start site (TSS) of their associated genes. (D) Enrichment or depletion of expression associations for SVs that overlap various ChromHMM chromatin state annotations from the Roadmap Epigenomics Project (GM12878 Lymphoblastoid Cells). Chromatin states with ≤0.1% genome-wide representation were omitted for visual clarity.

Figure 2—source data 1. Summarized eQTL data underlying Figure 2.
SV IDs, gene IDs and names, distance to the TSS, slopes, p-values, and q-values are provided.

elife-67615-fig2-data1.txt^{(2.9MB, txt)}

Top gene expression associations include several genes in the HLA complex (HLA-DQB2, HLA-DQA2, HLA-DRB6, and HLA-P), as well as ERAP2, which encodes an endoplasmic reticulum aminopeptidase that processes HLA ligands to lengths suitable for their binding (Figure 2C). While gene expression data from LCLs provides a limited snapshot of the functional implications of SVs, it is notable that these immune loci also exhibit strong gene expression diversity mediated by genetic variation.

We next investigated the mechanisms through which SV eQTLs may impact gene expression. Using epigenetic state annotations generated by the Roadmap Epigenetics Consortium (Roadmap Epigenomics Consortium et al., 2015), we found a strong enrichment of significant gene expression associations for SVs that intersect annotated enhancer elements, as well as a strong depletion of expression associations for SVs that intersect ‘quiescent’ sequences that are devoid of important epigenetic marks (Figure 2D). We also further investigated the 13 significant SV eQTLs that intersected one or more exons of their associated gene. Of the nine such deletions, 8 exhibited associations consistent with direct dosage effects on expression (i.e., a linear association between expression and deletion copy number; Figure 2—figure supplement 1A). Meanwhile, the three exon-intersecting insertions exhibited associations in both directions (one positive and two negative associations; Figure 2—figure supplement 1B). Together, these intuitive results shed light on functional mechanisms through which SVs may mediate impacts on phenotypes and fitness.

Lastly, we performed fine-mapping of SV and SNP eQTLs with CAVIAR (Hormozdiari et al., 2014) with the goal of homing in on candidate causal variants at SV eQTL loci. Of the 1121 loci with a significant SV eQTL, 345 (30.8%) contained the SV within the 90% credible causal set. While the median size of the credible causal sets was 19 variants, we identified 15 loci where an SV occurred among a credible causal set of five or fewer variants (Figure 2—figure supplement 2). These include a 37 Kb deletion (27005_CHM13_del) that is among just two variants in a credible causal set for expression effects on MIF-AS1, a lncRNA partially encompassed by the deletion. The expression of this lncRNA has been previously associated with breast and gastric cancer proliferation (Ding et al., 2019; Li et al., 2018). Similarly, a 3 kb insertion (20290_HX1_ins) is among two variants in a credible causal set for expression effects on CSF2RB, the dysfunction of which has been associated with pulmonary disease (Suzuki et al., 2011). While this approach offers a roadmap for future fine-mapping studies investigating the relative impacts of different variant classes on gene expression, we caution that the present analysis may still underestimate the potential causal role of SVs, due to a modestly higher error rate for genotyping of SVs versus SNPs (Chen et al., 2019).

Admixture-aware scan for signatures of local adaptation

We next sought to use our set of population-scale SV genotypes to search for SVs with signatures of local adaptation. Such locus-specific scans for local adaptation can be broadly classified into frequency differentiation-based and LD-based approaches (Vitti et al., 2013). While powerful, LD-based methods generally require haplotype phasing and/or dense and accurate genotyping – a requirement that is infeasible for many of the complex and repetitive regions of the genome investigated in our study. In contrast, frequency differentiation-based approaches can be applied to individual loci and are based on the logic that positive selection tends to cause a particular allele to increase in frequency only in the population(s) where it became established and was advantageous. However, a limitation of these methods is the requirement that individuals are grouped into pre-defined populations that may not reflect genetic patterns or the substantial admixture exhibited by many human populations.

To overcome these limitations, we used Ohana (Cheng et al., 2017; Ilardo et al., 2018; Cheng et al., 2019), a maximum likelihood-based method that models individuals as possessing ancestry from combinations of k ancestry components, inspired by related methods (Alexander et al., 2009; Falush et al., 2003). The method then tests whether individual variants adhere to this genome-wide null model, or are better explained by an alternative model in which frequencies are allowed to vary in one or more populations by consequence of local adaptation. Following the precedent set by 1KGP (The 1000 Genomes Project Consortium, 2015), we modeled individual genomes as combinations of eight ancestry components, replicating known patterns of population structure at continental scales as well as prominent signatures of admixture within a subset of populations (e.g., African ancestry in Southwest US [ASW] and admixed American populations [AMR]; Figure 3A). The distribution of ancestry components across populations was qualitatively unaffected by the choice of k (Figure 3—figure supplement 1).

Figure 3. — (A) To conduct an admixture-aware scan for local adaptation, we used Ohana (Cheng et al., 2017; Ilardo et al., 2018) to infer genome-wide patterns of ancestry in the 1KGP samples. This method models each individual as a combination of k ancestry components and then searches for evidence of local adaptation on these component lineages. Admixture proportions (k = 8) for all samples in the 1KGP dataset, grouped by population. Vertical bars represent individual genomes. (B) Using Ohana, we searched for evidence of local adaptation by testing whether the allele frequencies of individual variants were better explained by the genome-wide covariance matrix, or by an alternative covariance matrix where allele frequencies were allowed to vary in one ancestry component. The likelihood ratio statistic (LRS) reflects the relative support for the latter selection hypothesis. For each ancestry component, SVs with LRS > 32 are colored by their maximum linkage disequilibrium (r²) with any known SNP in the corresponding 1KGP superpopulation.

Figure 3—source data 1. Ancestry component results underlying Figure 3A.
Ancestry proportions for each individual are provided, along with that individual’s 1KGP population code.

elife-67615-fig3-data1.txt^{(630.6KB, txt)}

Figure 3—figure supplement 1. — (A) To conduct an admixture-aware scan for local adaptation, we used Ohana (Cheng et al., 2017; Ilardo et al., 2018) to infer genome-wide patterns of ancestry in the 1KGP samples. This method models each individual as a combination of k ancestry components and then searches for evidence of local adaptation on these component lineages. Admixture proportions (k = 8) for all samples in the 1KGP dataset, grouped by population. Vertical bars represent individual genomes. (B) Using Ohana, we searched for evidence of local adaptation by testing whether the allele frequencies of individual variants were better explained by the genome-wide covariance matrix, or by an alternative covariance matrix where allele frequencies were allowed to vary in one ancestry component. The likelihood ratio statistic (LRS) reflects the relative support for the latter selection hypothesis. For each ancestry component, SVs with LRS > 32 are colored by their maximum linkage disequilibrium (r²) with any known SNP in the corresponding 1KGP superpopulation.

Figure 3—source data 1. Ancestry component results underlying Figure 3A.
Ancestry proportions for each individual are provided, along with that individual’s 1KGP population code.

elife-67615-fig3-data1.txt^{(630.6KB, txt)}

Across all ancestry components, we identified 220 unique SVs exhibiting significant deviation from genome-wide patterns of allele frequency differentiation (99.9th percentile of matched distribution for SNPs and short indels; see Materials and methods; Figure 3B; Supplementary file 1; Figure 3—figure supplement 1). These included 139 SVs with coordinates overlapping with annotated genes, of which 13 intersected with annotated exons. Seven SVs at these frequency-differentiated loci were also significant eQTLs based on our previous analysis of the Geuvadis LCL data (Supplementary file 2). Through comparison to results for SNPs and indels, we found that 88 of the outlier SVs we identified were among the top 10 most frequency-differentiated variants within a 1 Mb window, suggesting that these represent examples where SVs may be causal targets of selection at adaptive loci. Finally, only 119 (54.1%) highly differentiated SVs possessed strong LD with known SNPs or short indels (r² > 0.8; Figure 1—figure supplement 2), indicating that many of these loci constitute novel candidate targets of selection that may have been missed by previous scans.

Known and novel targets of local adaptation

Notable examples of highly differentiated loci included rs333, the Δ32 allele of the chemokine receptor CCR5, which is known to confer resistance to HIV infection and progression (Dean et al., 1996). Among our results, this deletion polymorphism is the 14th most frequency-differentiated SV with respect to ancestry component 3, which is highly represented in Europe. The CCR5-Δ32 allele segregates at moderate frequencies in European populations (MAF = 10.9%) and achieves its highest frequency of 15.6% in the Finnish population, but segregates at low frequencies elsewhere. The case for historical positive selection at this locus has been contentious, with initial studies citing a geographic cline in allele frequencies and strong LD with adjacent microsatellite markers (Stephens et al., 1998), potentially driven by epidemics such as the bubonic plague or smallpox (Galvani and Slatkin, 2003). However, subsequent studies argued that patterns of long-range LD (Sabeti et al., 2005) and temporal allele frequency changes based on ancient DNA (aDNA) samples (Bollback et al., 2008) could not exclude models of neutral evolution.

Our study also identified numerous novel hits, such as a 309 bp intronic insertion in TMEM131 (ancestry component 3), which segregates at a frequency of 63% in Finnish populations, but at a mean frequency of 24% in non-European populations. This gene encodes proteins involved in collagen cargo trafficking from the endoplasmic reticulum to the Golgi (Zhang et al., 2020). Collagen is the most abundant protein in the human body and the major component of human skin. Recurrent positive selection shaping skin pigmentation and other phenotypes is one of the best described examples of local adaptation across human populations (Crawford et al., 2017; Jablonski, 2004).

We also identified a 2.8 kb insertion in an intron of CLEC16A, inferred to be under selection in ancestry component 4, which corresponds to the Peruvian (PEL), Mexican (MXL), and Colombian (CLM) populations of 1KGP. The insertion, which is among the strongest frequency-differentiated variants at its respective locus and segregates in LD with several SNPs and short indels (maximum r² = 0.81; Figure 3—figure supplement 3), occurs at frequencies of 52.4%, 38.3%, and 15.4%, respectively, in these three populations, but is rare in others (AF < 0.04). CLEC16A is thought to impact susceptibility to autoimmune disorders, and SNPs in this gene have been associated with diseases such as type I diabetes, multiple sclerosis, and rheumatoid arthritis (Pandey et al., 2019). Notably, this same SV was also identified in a recent study of structural variation, which similarly reported that it achieves high frequency in Peruvian populations (Ebert et al., 2021).

In rare cases, multiple SVs in LD with one another captured the same underlying signature of frequency differentiation. One such example from South Asian populations (ancestry component 5) involved a linked intronic insertion and deletion in the cellular growth and morphogenesis related gene CSNK1G1. In this case, the reference genome carries the global minor allele, such that the signature of local adaptation presents as a lower frequency of the alternative allele in South Asian populations (60–71% for 22980_HG00514_ins; 60%–72% for 25014_HG02106_del) compared to other global populations (91% and 92%, respectively).

Extreme signatures of adaptation at the immunoglobulin locus in southeast Asian populations

The SV with the strongest evidence of local adaptation across all populations was an insertion polymorphism in an intron of IGHG4, which codes for a constant domain of the immunoglobulin heavy chain. These heavy chains pair with light chains, the latter of which include a domain composed of variable (V), diversity (D), and joining (J) segments. Complementing their substantial germline variation, V(D)J loci experience somatic recombination and hypermutation to generate vast antibody repertoires – the defining feature of the adaptive immune system (Watson et al., 2017). This insertion polymorphism identified by our scan exhibits strong allele frequency differentiation in ancestry component 2, which is highly represented in the Chinese Dai in Xishuangbanna, China (CDX) and Kinh in Ho Chi Minh City, Vietnam (KHV) populations, where it achieves frequencies of 88% and 65%, respectively, while remaining at much lower allele frequencies in other global populations (Figure 4A,B).

Figure 4. — (A) Local analysis of likelihood ratio statistics (LRS) in the region near the 33 bp insertion (red point) reveals a 325 kb haplotype encompassing 94 SNPs with strong allele frequency differentiation within ancestry component 2. Points where the alternative allele matches an allele observed in the Chagryskaya Neanderthal genome but at a frequency of 1% or less in African populations are highlighted in purple. (B) Individual haplotypes defined by the highly differentiated SNPs (LRS > 450). Four archaic hominin genomes are plotted at the top, while 30 randomly sampled haplotypes from each of 6 populations from 1KGP are plotted below. Archaic hominins samples are colored according to whether they possess more than one aligned read supporting the alternative allele at a given site. ESN refers to the Esan in Nigeria population of 1KGP. Other 1KGP population codes are provided in the main text.

Figure 4—figure supplement 1. — (A) Local analysis of likelihood ratio statistics (LRS) in the region near the 33 bp insertion (red point) reveals a 325 kb haplotype encompassing 94 SNPs with strong allele frequency differentiation within ancestry component 2. Points where the alternative allele matches an allele observed in the Chagryskaya Neanderthal genome but at a frequency of 1% or less in African populations are highlighted in purple. (B) Individual haplotypes defined by the highly differentiated SNPs (LRS > 450). Four archaic hominin genomes are plotted at the top, while 30 randomly sampled haplotypes from each of 6 populations from 1KGP are plotted below. Archaic hominins samples are colored according to whether they possess more than one aligned read supporting the alternative allele at a given site. ESN refers to the Esan in Nigeria population of 1KGP. Other 1KGP population codes are provided in the main text.

The SV was originally reported as a 34 bp insertion based on long-read sequencing of the genome of a Vietnamese individual (HG02059) (Audano et al., 2019). Based on realignment to a modified version of the reference genome that includes the alternative allele, we revised the sequence of this insertion to 33 bp, but found that it is well supported by patterns of coverage, split reads, and soft clipped alignments at the SV breakpoints (Figure 4—figure supplement 1). The sequence of the insertion itself is repetitive within the human reference genome, with two identical copies of the sequence occurring ~44 and ~117 kb downstream, respectively, also within the IGH gene cluster. This SV is included among sets of indels called by gnomAD (Karczewski et al., 2020), 1KGP, and HGDP, with a slightly altered position and insertion sequence. However, it exhibits a highly skewed allele balance (centered on 0.2, in contrast to the expectation of 0.5) – a symptom of reference bias in read mapping (Chen et al., 2019) – as well as low genotyping rates (31% for CDX in 1KGP, compared with 85% in our study). These biases result in lower estimates of the insertion’s allele frequency in East Asian populations (AF = 0.73 in CDX) and underscore the technical challenges of genotyping at this locus with traditional short-read approaches (Chin et al., 2020; Zhang et al., 2021).

Interestingly, the second strongest signature of adaptation across all populations traced to a nearby 135 bp deletion, which overlaps with two transcription factor binding sites for IGHE, another component of the constant region of immunoglobulin heavy chains. This deletion, which lies only 33 kb upstream of the IGHG4 insertion (Figure 4A), again achieves high frequencies in the CDX and KHV populations (70% and 54%, respectively) but segregates at low frequencies in most other global populations. Despite their genomic proximity and similar patterns of frequency differentiation, these two SVs exhibit only modest levels of LD (r² = 0.17), likely reflecting recombination occurring during or after the episode of selection.

Evidence of Neanderthal-introgressed origin of the high-frequency IGH haplotype

Approximately 2% of the genome of all non-African individuals traces to admixture with Neanderthals between 47 and 65 kya, while Oceanian populations (and Asian populations to a lesser extent) also possess sequences introgressed from Denisovans (Green et al., 2010; Sankararaman et al., 2012; Sankararaman et al., 2016; Vernot et al., 2016). Introgressed alleles, including some SVs (Almarri et al., 2020; Hsieh et al., 2019), are thought to have conferred both beneficial and deleterious effects on modern human populations, especially with respect to immune-related phenotypes (Rotival and Quintana-Murci, 2020). Motivated by these findings, we tested our set of highly differentiated SVs for evidence of archaic hominin introgression. Using results from the method Sprime (Browning et al., 2018), which leverages patterns of divergence and haplotype structure to classify archaic introgressed sequences, we identified SVs that segregate in LD with putative introgressed SNPs (see Materials and methods). Of the 220 highly differentiated SVs, 26 (12%) exhibited moderate LD (r² > 0.5) with putative introgressed haplotypes, while simultaneously exhibiting low allele frequencies (AF < 0.01) within African populations of 1KGP (Figure 4—figure supplement 2; Supplementary file 3). Notably, this set of candidate introgressed SVs included the IGHG4 insertion and nearby deletion, which, despite their low LD with one another, each tag multiple putative introgressed SNPs within the CDX population. Indeed, the original Sprime publication reported that putative introgressed variants of IGHA1, IGHG1, and IGHG3 achieve high frequencies in Eurasian populations (Browning et al., 2018). The variants identified by Browning et al., 2018 are located in a subregion of the broader IGH locus, downstream of the introgressed SVs, and segregate at high allele frequency in East Asian, European, and American populations (Figure 4—figure supplement 3). The southeast Asian-specific haplotype we identify, which includes the IGHG4 insertion and nearby deletion, may have been challenging to discover due to the difficulties of short-read alignment and genotyping in this region of the genome (Zhang et al., 2021). Indeed, 80.7% of the sequence in the broader IGH locus was filtered out by Browning et al., 2018 through strict masking of aDNA genotypes to remove low-coverage, poorly mapping, or repeat-associated reads (Figure 4—figure supplement 4).

Most candidate introgressed SVs fall within complex regions of the genome, such that genotyping directly from fragmented and degraded aDNA data remains infeasible. However, the IGHG4 intronic insertion is short enough to be spanned by individual sequencing reads, thus allowing us to validate our genotyping results and further investigate introgression at this locus by directly searching for an exact match to a diagnostic sequence. Specifically, we identified a 48 bp sequence that is unique to individuals with the insertion, but not observed in the reference genome (see Materials and methods), thereby facilitating rapid searches for the insertion allele in any sequencing dataset. Application of this approach to the 1KGP data broadly validated our graph genotyping results and supported the strong allele frequency differences that we had previously described (Figure 4—figure supplement 5). We then extended this scan to published short-read sequencing data from four high-coverage (~30×) archaic hominin genomes (Mafessoni et al., 2020; Meyer et al., 2012; Prüfer et al., 2017; Prüfer et al., 2014). While the diagnostic sequence was absent from the Denisovan genome, we observed one or more perfect matches in the sequences of each of three Neanderthals, which we found notable given its absence among African populations (Figure 4—figure supplement 5).

We next expanded our analysis to the genomic region around the IGHG4 insertion, investigating the archaic hominin allelic states at each of the 109 highly differentiated variants (2 SVs, 14 short indels, and 93 SNPs) defining the 325 kb LRS peak at the IGH locus. Focusing on the 93 SNPs where archaic alleles are easily compared, we observed that the Denisovan genome exhibited a modest degree of matching (39 shared alleles [42%]), while the Altai, Vindija, and Chagyrskaya Neanderthal genomes exhibited near perfect matching over the entirety of the differentiated region (77 [83%], 88 [95%], and 89 [96%] shared alleles, respectively) (Figure 4B). Additionally, of these 93 highly differentiated SNPs, 64 were called as introgressed in CDX by Sprime (Figure 4—figure supplement 6). Combined with the near absence of these alleles from other global populations and the length of the differentiated haplotype, our observations strongly support the conclusion that this sequence originated via ancient introgression from a Neanderthal population related to the Chagyrskaya and Vindija Neanderthals.

Examination of the haplotype structure at the IGH locus revealed four discrete blocks of LD within the highly differentiated region (Figure 4—figure supplement 7), consistent with substantial recombination after the original haplotype achieved high frequency. The deletion SV (22231_HG02059_del) falls within the largest LD block, while the insertion SV (22237_HG02059_ins) falls within a smaller LD block that exhibits the greatest allele frequency differences. The latter block includes the tag SNP rs150526114, where the global minor allele matches the Neanderthal genomes and segregates at 91% frequency in CDX, 73% frequency in KHV, and 59% frequency in CHS, but is rare or absent in most other populations from 1KGP. Data from HGDP (Bergström et al., 2020) and the Simons Genome Diversity Panel (SGDP) (Mallick et al., 2016) shed additional light on the geographic distribution of the putative Neanderthal-introgressed allele and confirmed its extreme pattern of frequency differentiation specific to southeast Asian populations (Figure 4—figure supplement 8). Notably, the allele is absent from the HGDP populations from the Americas, which are thought to have split from East Asian populations approximately 26 kya (Moreno-Mayar et al., 2018), further supporting the recency and geographically restricted nature of this positive selection event.

Strength and timing of adaptive introgression at the IGH locus

The challenge of dense genotyping and phasing within the broader IGH region (Zhang et al., 2021) hinders haplotype-based approaches for inferring the timing of selection. Nevertheless, the global patterns of allele frequencies we observe (Figure 5A,B) can be interpreted in the context of known population demographic histories, providing a rough estimate of such timing. For example, pairwise divergence times among East Asian populations inferred by Wang et al., 2018 constrain the plausible timing of selection at the IGH locus between approximately 60 generations (1740 years) and 400 generations (11,600 years) ago (assuming a 29-year generation time Tremblay and Vézina, 2000): after the divergence of the Chinese Dai and Vietnamese populations from the Japanese (JPT) population, but before the divergence of CDX and KHV. These divergence time estimates are roughly consistent with those inferred using an alternative approach based on the joint allele frequency spectrum by Jouganous et al., 2017.

Figure 5. — (A) Population-specific frequencies of the putative Neanderthal-introgressed insertion allele in each of the 1KGP populations, in the style of the Geography of Genetic Variants browser (Marcus and Novembre, 2017). (B) Tree representation of the best-fit selection hypothesis for Neanderthal-introgressed haplotype tagged by the *IGHG4* insertion polymorphism, as computed by Ohana. (C) Five-population demographic model used for simulation and parameter inference via approximate Bayesian computation (ABC). Population sizes and split times are further described in Materials and methods. (D) Posterior distribution of the selection coefficient (s). (E) Posterior distribution of the timing of the onset of selection (T_adaptive). (F) Posterior distribution of the initial allele frequency at the beginning of the simulation (p₀). (G) Negative relationship between s and T_adaptive for simulations retained by ABC.

Figure 5—source data 1. Parameters and summary statistics for the simulations retained in the ABC analysis, underlying Figure 5D–G.
Simulation parameters are provided in the first three columns. Population-specific allele frequencies output by the simulation are provided in the last five columns, labeled by their corresponding 1KGP population codes.

elife-67615-fig5-data1.txt^{(581.6KB, txt)}

Figure 5—figure supplement 1. — (A) Population-specific frequencies of the putative Neanderthal-introgressed insertion allele in each of the 1KGP populations, in the style of the Geography of Genetic Variants browser (Marcus and Novembre, 2017). (B) Tree representation of the best-fit selection hypothesis for Neanderthal-introgressed haplotype tagged by the *IGHG4* insertion polymorphism, as computed by Ohana. (C) Five-population demographic model used for simulation and parameter inference via approximate Bayesian computation (ABC). Population sizes and split times are further described in Materials and methods. (D) Posterior distribution of the selection coefficient (s). (E) Posterior distribution of the timing of the onset of selection (T_adaptive). (F) Posterior distribution of the initial allele frequency at the beginning of the simulation (p₀). (G) Negative relationship between s and T_adaptive for simulations retained by ABC.

Figure 5—source data 1. Parameters and summary statistics for the simulations retained in the ABC analysis, underlying Figure 5D–G.
Simulation parameters are provided in the first three columns. Population-specific allele frequencies output by the simulation are provided in the last five columns, labeled by their corresponding 1KGP population codes.

elife-67615-fig5-data1.txt^{(581.6KB, txt)}

To achieve further insight into this episode of selection, we conducted forward genetic simulations of a simplified five-population demographic model based on parameters obtained from the literature (The 1000 Genomes Project Consortium, 2015; Jouganous et al., 2017; Wang et al., 2018; Figure 5C). For simplicity, we began the simulations 46 kya, after Neanderthal introgression has occurred and the introgressed haplotype is segregating within the modern human population. Our model makes an additional simplification by not including migration, but we note that this omission should make our estimates of the selection coefficient conservative, as stronger selection is required to generate allele frequency differences between populations exchanging migrants. We first performed 100,000 neutral simulations (selection coefficient [s] = 0), finding that the magnitudes of allele frequency differentiation, especially in southeast Asia, far exceed those expected under the null (Figure 5—figure supplement 1). In order to formally estimate the parameters of the selection event, we next applied approximate Bayesian computation (ABC) to an additional 175,000 simulations in which we varied relevant parameters (see Materials and methods). In line with our previous intuition, the results of this analysis indicated a recent onset of extremely strong selection, with the selection coefficient parameter (s) inferred to be 0.06 (95% credible interval [CI] [0.02, 0.12]), the selection onset time parameter (T_adaptive) inferred to be 4400 years ago (95% CI [1700, 8400]), and the initial frequency of the adaptive allele in the ancestral Eurasian population (p₀) to be 0.18 (95% CI [0.04, 0.35]) (Figure 5D–F). We observed a strong correlation between estimates of the selection coefficient and timing of selection. Specifically, older, weaker selection could produce the same frequency differences as stronger and more recent selection (Figure 5G), though even the lower bound on our estimate of s places it among the strongest episodes of positive selection documented in humans.

Discussion

Long-read sequencing is starting to provide more comprehensive views of the landscape of human genetic variation, drawing novel links to phenotypes and diseases. However, long-read sequencing methods remain impractical for most population-scale studies due to their low throughput and high cost. We sought to overcome these limitations by applying variant graph genotyping of a large catalog of long-read-discovered SVs to short-read sequencing data from globally diverse individuals of 1KGP. By mapping eQTLs and scanning for evidence of local adaptation and adaptive introgression, we highlighted the role of SVs as largely unexplored contributors to variation in human genome function and fitness.

Despite the scale and diversity of SVs and samples used in our study, we anticipate that additional SV targets of selection remain undiscovered, either because they were not present in the set of long-read sequenced individuals or because they remain inaccessible to genotyping using graph-based approaches (e.g., tandem repeats) (Chen et al., 2019). Studying the evolutionary impacts of such SVs will require the application of long-read sequencing at population scales (Ebert et al., 2021), which is still infeasible for most genetics studies. Moreover, SV loci exhibiting expression associations or signatures of selection may not themselves be the causal targets, but rather tag nearby causal variation by consequence of LD. Methods such as fine mapping (Schaid et al., 2018) and multiplex reporter assays (Tewhey et al., 2018; van Arensbergen et al., 2019) will be invaluable for disentangling LD to reveal causal relationships and contrast the relative impacts of various forms of genetic variation.

The two strongest signatures of local adaptation in our study traced to the immunoglobulin heavy chain locus. While the precise phenotypic impacts of these variants remain unknown (Figure 5—figure supplement 2), their potential effects on adaptive immunity is intriguing given the established role of immune-related genes as common targets of local adaptation in human populations (Barreiro and Quintana-Murci, 2020). The human IGH locus is highly polymorphic (Mikocziova et al., 2020), with examples of SNPs and copy number variants exhibiting frequency differences between populations (Watson et al., 2013). In developing lymphocytes, this locus undergoes somatic V(D)J recombination and hypermutation to produce antibodies that drive the immune response—processes that may be influenced by nearby germline variation (Watson et al., 2017). The combination of these forms of variation makes the region difficult to probe with traditional sequencing methods (Chin et al., 2020; Zhang et al., 2021), in turn highlighting the power of long-read sequencing and graph genotyping.

Our observation of the IGHG4 insertion within the Neanderthal genomes allowed us to connect and build upon initial evidence of selection and introgression at the IGH locus. Specifically, 1KGP previously reported SNPs in several immunoglobulin genes as allele frequency outliers with respect to the CDX population (The 1000 Genomes Project Consortium, 2015). Browning et al., 2018 later noted that a Neanderthal-introgressed haplotype at the IGH locus achieves high frequencies in Eurasian populations, although they focused on a separate haplotype in the region, split by recombination, that is not specific to southeast Asia. Our findings further expand the evolutionary history of this region of the genome, as well as adding to a growing list of examples of adaptive introgression from archaic hominins (Gittelman et al., 2016; Hsieh et al., 2019; Huerta-Sánchez et al., 2014; Racimo et al., 2017), several of which are thought to have targeted immune-related phenotypes (Abi-Rached et al., 2011; Dannemann et al., 2016; Enard and Petrov, 2018; Gouy and Excoffier, 2020; Mendez et al., 2012a; Mendez et al., 2012b; Sams et al., 2021).

Based on forward genetic simulations, we estimated that selection on the introgressed IGH haplotype initiated between 1700 and 8400 years ago, before the divergence of the CDX and KHV populations and in line with our intuition based on patterns of allele frequencies. This recent onset of selection is intriguing given that introgression from Neanderthals into the ancestors of Eurasian modern human populations dates to 47–65 kya (Sankararaman et al., 2012). Our findings thus suggest that persisting archaic introgressed haplotypes provided a reservoir of functional variation to the ancestors of CDX and KHV that proved adaptive during a period of environmental change, for example in response to local pathogens (Rasmussen et al., 2015). Recent reports that Neanderthal-introgressed sequences mediate individual outcomes of SARS-CoV-2 infections to this day lend plausibility to this hypothesis (Zeberg and Pääbo, 2021; Zeberg and Pääbo, 2020; Zhou et al., 2021), as does polygenic evidence of adaptation in response to ancient viral epidemics, including in East Asia (Souilmi et al., 2021). Moreover, the delayed onset of selection inferred in our study is consistent with recent evidence of time lags between introgression and adaptation favoring other Neanderthal-introgressed alleles (Yair et al., 2021).

Our simulations additionally demonstrated that a selection coefficient between 0.02 and 0.12 best explains the observed frequency differences, comparable to other episodes of strong selection in humans. These include examples such as lactase persistence mutations near LCT (0.01–0.15) (Bersaglieri et al., 2004) and malaria resistance mutations affecting DARC (0.08) (Hamid et al., 2021) and HBB (0.1) (Elguero et al., 2015). We caution against overinterpretation of these parameter estimates given the uncertainty in the underlying demographic model, which relies on estimates of population size and divergence time from two studies (Jouganous et al., 2017; Wang et al., 2018) and does not incorporate admixture between populations. However, our results are broadly consistent with the observation of extreme allele frequency differences among closely related populations in East Asia. Future studies incorporating more complex evolutionary models and fully resolved IGH haplotypes (Rodriguez et al., 2020) will be essential for further refining the complex evolutionary history of selection and recombination at the immunoglobulin locus. Nevertheless, the coexistence of V(D)J recombination, somatic hypermutation, and local adaptation at this locus presents a remarkable example of diversifying selection at multiple scales of biological organization, generating allelic diversity both within individuals and across populations.

Together, our study demonstrates how new sequencing technologies and bioinformatic algorithms are facilitating understanding of complex and repetitive regions of the genome – a new frontier for human population genetics. Our study also demonstrates how the integration of short and long-read datasets allows for investigation of these regions at population scales. Combined with studies of diverse populations, these technologies are providing a more complete picture of human genomes and the evolutionary forces by which they are shaped.

Materials and methods

Structural variant calling and comparison with short-read datasets

We used published long-read sequencing data from 15 individuals to generate a set of 107,866 SVs for graph genotyping (Audano et al., 2019). Raw reads were downloaded using the accessions provided in the original publication and aligned with NGM-LR (Sedlazeck et al., 2018b) using default PacBio parameters to the main chromosomes of GRCh38. Variants were called with Sniffles (Sedlazeck et al., 2018b), requiring a minimum SV length of 30 bp and a minimum of 10 supporting reads. Resulting VCFs were refined with Iris (Alonge et al., 2020) to polish the reported SV sequences. Variants were then merged with SURVIVOR v1.0.7 (Jeffares et al., 2017), using a merge distance of 50 bp and requiring strand and type to match. For each merged variant, a representative variant was then obtained from the original pre-merged call set to improve accuracy. Such representative variants were selected by first prioritizing homozygous over heterozygous calls, and then by prioritizing variants with greater proportions of reads supporting the non-reference allele. To prepare the variants for input into Paragraph, translocations, mitochondrial DNA variants, inversions and duplications over 5 kb, and variants without a ‘PASS’ filter were removed from the VCF. This resulted in a set of 107,866 SVs.

We compared this set of long-read discovered SVs with SVs discovered from short-read sequencing in 1KGP (Sudmant et al., 2015; http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/integrated_sv_map/supporting/GRCh38_positions/) and HGDP (Almarri et al., 2020; ftp://ngs.sanger.ac.uk/production/hgdp/hgdp_structural_variation). We merged long-read and short-read SV sets using parameters from Aganezov et al., 2020 (maximum distance of 500 bp between SV start sites, maximum difference of 10% between SV lengths). We chose not to require matching on SV type, strand orientation, or insertion sequence due to inconsistencies in SV representations across the short-read and long-read variant callers (Aganezov et al., 2020). In order to match the minimum SV length used by the short-read datasets, we also removed all long-read discovered SVs shorter than 50 bp. For the purpose of broad allele frequency binning for SVs called in each of these sets, we then calculated dataset-wide allele frequencies with PLINK v1.90b6.4 (Purcell et al., 2007).

Graph genotyping of structural variation

High-coverage 30× short-read sequencing data for the core 2504 individuals in 1KGP, sequenced by the New York Genome Center (Byrska-Bishop et al., 2021), was obtained from ENA (PRJEB31736). We genotyped SVs in these samples with Paragraph v2.2 (Chen et al., 2019). In accordance with Paragraph’s recommendations, we set the maximum permitted read count for variants to 20 times the mean sample depth in order to limit runtime for repetitive regions. Genotypes from all samples were combined using bcftools v1.9 (Danecek et al., 2021).

To obtain a high-quality set of genotyped SVs, we filtered the resulting data based on dataset-wide genotyping rates and within-population Hardy–Weinberg equilibrium. We determined an SV’s overall genotyping rate with cyvcf2 (Pedersen and Quinlan, 2017) and removed variants that were not genotyped in ≥50% of samples. We additionally calculated one-sided Hardy–Weinberg equilibrium p-values (excess of heterozygotes) for variants within each of the 26 1KGP populations, using the Hardy–Weinberg package from R (Graffelman, 2015). We filtered out SVs that violated equilibrium expectations (Fisher’s exact test, p < 1 × 10^–4) in ≥13 populations. Unfolded, within-population allele frequencies were calculated with PLINK.

Quantifying linkage disequilibrium with known SNPs and short indels

To calculate linkage disequilibrium (LD) between SVs and SNPs or short indels in 1KGP samples, we used small variant genotypes produced by the 1000 Genomes Consortium. These genotypes were generated by aligning the 1KGP Phase 3 data to GRCh38 and then calling variants against the GRCh38 reference and are restricted to biallelic SNVs and indels (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/). Y chromosome genotypes, not included in the former data release, were obtained from Phase 3 variant calls lifted over to GRCh38 (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/). We combined the 1KGP SNP and indel genotypes with our SV genotypes and calculated r² within each population, using a 100-variant and 100 Mb window, with PLINK v1.90b6.4.

Genome accessibility masks from 1KGP were obtained from http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/working/20160622_genome_mask_GRCh38/, and ENCODE blacklisted regions were obtained from https://www.encodeproject.org/files/ENCFF356LFX/.

We note peaks of r² around discrete values of 1, 0.5, 0.33, 0.25, etc. (Figure 1—figure supplement 7), which occur due to the properties of the r² statistic for various allele frequency combinations. Specifically, the maximum possible value of r² for a pair of rare alleles with frequencies p_a and p_b is described by VanLiere and Rosenberg, 2008 as:

r_{m a x}^{2} (p_{a}, p_{b}) = \frac{p_{a} (1 - p_{b})}{(1 - p_{a}) p_{b}}

Assuming a sample size of 198 chromosomes (the most common sample size for a given 1KGP population), the maximum possible value of r² for a pair of singletons is therefore:

p_{a} = 1 / 198

p_{b} = 1 / 198

r_{m a x}^{2} = 1

Meanwhile, the maximum possible value of r² for a singleton-doubleton pair is:

p_{a} = 1 / 198

p_{b} = 2 / 198

r_{m a x}^{2} = 0.497

And the maximum possible value of a singleton-tripleton pair is:

p_{a} = 1 / 198

p_{b} = 3 / 198

r_{m a x}^{2} = 0.330

Given that the most common allele frequency of a SNP or SV is 1/N (i.e., singletons), followed by 2/N (doubletons), 3/N (tripletons), etc., the aforementioned allele frequency combinations occur frequently, resulting in some common discrete values of the maximum r².

eQTL mapping

To conduct eQTL analysis, we used gene expression data generated by the Geuvadis Consortium (Lappalainen et al., 2013), which includes 447 intersecting samples from the following 1KGP populations: CEU, FIN, GBR, TSI, and YRI. Using the recount3 package from R/Bioconductor (Collado-Torres et al., 2017), we extracted gene expression counts for all corresponding samples. Counts were normalized across samples using the trimmed mean of M values (TMM) method from edgeR (Robinson et al., 2010). TPM values were also computed from raw counts. In accordance with the methods employed for eQTL mapping in the GTEx Project (The GTEx Consortium, 2020), we retained all genes with TPM values greater than or equal to 0.1, as well as raw read counts greater than or equal to six in at least 20% of samples. We then applied rank normalization to the TMM values for each remaining gene. We then performed cis-eQTL mapping with a modified version of fastQTL (https://github.com/hall-lab/fastqtl, Colby, 2016, Ongen et al., 2016), which accounts for SV size when determining the appropriate cis window. We conducted nominal and permutation passes with genotype principal components and sex included as covariates. Beta-distribution-approximated permutation p-values from fastQTL were used as input to estimate q-values and control the false discovery rate (FDR) with the qvalue package (http://github.com/jdstorey/qvalue, Storey et al., 2020, Storey and Tibshirani, 2003).

Next, we used the ChromHMM epigenetic state annotations generated by the Roadmap Epigenomics Project (Roadmap Epigenomics Consortium et al., 2015) for the E116 B lymphocyte cell line to determine whether SVs eQTLs were enriched for regulatory elements. For each of the 15 ChromHMM epigenetic states, we identified the SVs that overlapped with its epigenetic state annotations and determined its enrichment for SV eQTLs using Fisher’s exact test.

To identify potential causal variants for the 1121 significant gene-SV pairs, we performed eQTL fine-mapping using CAVIAR v2.2 (Hormozdiari et al., 2014). Nominal pass output of fastQTL was combined for SNPs and SVs, and p-values were back-converted to z-scores while also accounting for the sign of the association (i.e., positive or negative). For each significant eQTL, we extracted all tested variants within a 1 Mb window, excluding any variants with a nominal p-value > 0.05. We then used PLINK to calculate the LD matrix relating this set of variants. Using these two inputs, we then ran CAVIAR for each of the 1121 significant gene-SV pairs, recording the 90% credible causal set for each eQTL.

Admixture-aware scan for signatures of local adaptation

Common examples of frequency differentiation-based metrics to search for selection include Wright’s fixation index (F_ST) (Wright, 1949), as well as tree-based extensions of this concept such as the locus-specific branch length (LSBL) (Shriver et al., 2004) and population branch statistic (PBS) (Yi et al., 2010). While useful for polarizing frequency changes on particular lineages, these tree-based tests still require the specification of (typically three) populations for comparison. The number of possible comparisons thus grows in a combinatorial manner with the number of populations in the study. Specifically, for the 26 populations of 1KGP, there are 2600 (26 choose 3) possible comparisons. A second limitation of such tests is the definition of population, which may or may not reflect genetic patterns of population structure that occur at multiple scales. Moreover, many human populations exhibit substantial admixture, which is ignored by, and may confound, some frequency differentiation-based tests.

We overcame these limitations by using the software package Ohana (Cheng et al., 2019) to scan SVs for signatures of positive selection, that is extreme frequency differentiation between populations. Ohana uses allele frequency to model individuals as an admixed combination of k ancestral populations, constructing a genome-wide covariance matrix to describe the relationship between these ancestry components. Variants are then assessed to determine whether their allele frequencies are better explained by the genome-wide ‘neutral’ matrix, or by an alternative matrix that allows allele frequencies to vary in one ancestry component.

In accordance with Ohana’s recommended workflow, we conducted admixture inference on the 1KGP dataset with a set of ~100,000 variants, downsampled from Chromosome 21 of the 1KGP SNV/indel callset used for LD calculations above. Downsampling was performed with PLINK’s variant pruning function. Inference of admixture proportions in the dataset was allowed to continue until increased iterations produced qualitatively similar results (50 iterations). The covariance matrix generated from this downsampled dataset was used as a neutral input for downstream scans for selection on SVs.

In order to generate ‘selection hypothesis’ matrices to search for selection in a specific ancestry component, we modified the neutral covariance matrix by allowing one component at a time to have a greater covariance. A scalar value of 10, representing the furthest possible deviation that a variant could have in a population under selection, was added to elements of the neutral covariance matrix depending on the population of interest. For each variant of interest, Ohana then computes the likelihood of the observed ancestry-component-specific allele frequencies under the selection and neutral models, then compares them by computing a likelihood ratio statistic (LRS), which quantifies relative support for the selection hypothesis. We note that the ‘selection hypothesis’ matrices we generated for our study enable us to perform tests for frequency differentiation in one ancestry component at a time. Testing for selection in multiple ancestry components would require modifying multiple components in the neutral covariance matrix. While some variants have outlier LRS values in multiple ancestry components, this reflects a combined signature of extremely high allele frequency in one ancestry component and extremely low allele frequency in another (e.g., the IGHG4 insertion we highlight is an outlier in both ancestry component 2, where it reaches high allele frequencies, and ancestry component 6, where it is much rarer than expected from the null model [Figure 3B, panel for ancestry component 6]).

We filtered our results to remove extreme outliers in null model log-likelihoods (global log likelihood [LLE] < −1000), which were unremarkable in their patterns of allele frequency and instead indicated a failure of the neutral model to converge for a small subset of rare variants. To calculate p-values, we then compared the LRS to a chi-square distribution with one degree of freedom (Wojcik et al., 2019) and adjusted for multiple hypothesis testing using a Bonferroni correction. To further refine the list of candidate selected loci, we also compared the ancestry component-specific LRS computed for each SV to the observed distribution of LRS computed from SNPs and short indels matched on global minor allele frequency (in 1% frequency bins). Specifically, we identified SVs with LRS exceeding the 99.9 percentile of the empirical LRS distribution for frequency-matched SNPs and short indels as calculated using identical methods. These background SNPs and indels were limited to Chromosome 1 for computational efficiency, but results were qualitatively unaffected by the choice of chromosome.

Assessment of archaic introgression

To identify candidate archaic introgressed SVs among the set of highly differentiated variants, we tested each SV for LD with putative introgressed SNPs as identified based on published results from the method Sprime, which is based on signatures of LD and divergence from an African outgroup population (Browning et al., 2018). Specifically, we computed pairwise LD between the SV and all SNPs in a 100 kb window, matching the ancestry component to a corresponding population from 1KGP. We additionally computed the allele frequencies of SVs in African populations, excluding the ASW (Americans of African Ancestry in SW USA) and ACB (African Caribbeans in Barbados) populations which exhibit substantial non-African admixture. We reported all highly differentiated SVs possessing r² > 0.5 with any putative introgressed SNP and AF < 0.01 in non-admixed African populations (Supplementary file 2).

We obtained masks for the Altai Neanderthal sequence, which were used in filtering by Browning et al., 2018, from http://cdna.eva.mpg.de/neandertal/Vindija/FilterBed/. These masked regions filter for coverage (stratified by GC content), require a minimum coverage of 10×, require all 35-mers in the region to be unique, and exclude tandem repeats and indels.

To efficiently search for the IGHG4 insertion in additional datasets, we designed a 48 bp sequence (TGGAGAGAGTGGGGGACAGCGTCAGGGACAGGTGGGGACAGCCTGGGG) that spans the insertion breakpoint, extending across the entire 33 bp of the insertion itself as well as 11 bp and 4 bp, into the respective upstream and downstream flanking regions. BLAST searches for this sequence in the hg19 and GRCh38 human reference genomes returned no exact matches, but the diagnostic sequence was sufficiently similar to the reference sequence that reads containing it still mapped to the same locus. Sequence alignments from four high-coverage archaic hominin samples (Altai Neanderthal, Vindija Neanderthal, Chagyrskaya Neanderthal, and Denisovan) were obtained from http://cdna.eva.mpg.de/neandertal and http://cdna.eva.mpg.de/denisova. Forward strand sequences of unique reads were extracted from reads aligned to the hg19 reference genome, and exact matches to the 48 bp query sequence were identified using grep.

Evidence of introgression at the IGH locus was further examined by counting observed alleles at highly differentiated SNPs from sequenced alignments for each high-coverage archaic sample (see above). Sites with two or more reads supporting the alternative allele were used to define matching and color Figure 4. further conditions on alternative allele frequency ≤ 1% within African populations of 1KGP.

Inference of selection parameters with approximate Bayesian computation

We used a sequential algorithm for approximate Bayesian computation (Lenormand et al., 2013; Pritchard et al., 1999), implemented with the R package EasyABC (Jabot et al., 2013), to infer the strength and timing of selection at the IGHG4 locus, as well as the initial frequency of the adaptive allele. This approach consisted of drawing model parameters from prior distributions as input to the forward evolutionary simulation software package SLiM (Haller and Messer, 2019), computing summary statistics from each simulation, and comparing to those observed from our data. Simulations with summary statistics most closely matching the observed data are then used to construct posterior distributions of the model parameters. The sequential algorithm automatically determines the tolerance level and uses a predetermined stopping criterion (p_accmin = 0.05), thereby reducing the necessary number of simulations and improving estimates of the posterior distribution (Lenormand et al., 2013).

We constructed a simplified five-population demographic model based on parameters obtained from Jouganous et al., 2017 and Wang et al., 2018, as well as population size estimates from 1KGP (The 1000 Genomes Project Consortium, 2015; Figure 5C). Specifically, our model consisted of an initial divergence event between the CEU and East Asian populations at 46 kya, subsequent divergence between the population ancestral to CHB/JPT and the population ancestral to KHV/CHX at 9.8 kya, and final splits between CHB/JPT and KHV/CHX at 9.0 kya and 1.7 kya, respectively. Each new population was drawn from its originating population at the same size of the originating population. The size of the ancestral population was set at 2831 (Jouganous et al., 2017) and allowed to expand exponentially at a rate of 1.25 × 10^–3 per generation, resulting in a final size of 17,883 in each subpopulation, broadly consistent with pairwise sequential Markovian coalescent (PSMC) results presented by 1KGP for these populations (The 1000 Genomes Project Consortium, 2015). We allowed the initial allele frequency of the selected variant (p₀) to vary between 0 and 1, the timing of the onset of selection (T_adaptive) to vary between 1 and 1686 generations ago (i.e., the entire simulated timespan), and the selection coefficient (s) to vary between –0.01 and 0.2 – all drawn from uniform distributions. We used the selected variant’s end-of-simulation allele frequencies as a summary statistic for the model, comparing these frequencies to the observed allele frequencies of the IGHG4 insertion in these five populations.

To obtain a background distribution of the expected frequencies of the introgressed haplotype in the absence of selection, we performed 100,000 of the simulations described above with a selection coefficient of 0. For ease of comparison to the observed data, we used the end-of-simulation population branch statistic (PBS) (Yi et al., 2010), calculated between the CDX, JPT, and CEU populations, as a summary statistic. We compared these PBS values, as well as the inter-population F_ST, to those observed for the IGHG4 insertion in these populations.

Phenotype-wide association analysis

We examined potential phenotype associations with the putative Neanderthal-introgressed haplotype at the IGH locus by extracting summary statistics from the pan-ancestry analysis of the UK Biobank (https://pan.ukbb.broadinstitute.org/). Specifically, we obtained association p-values for two SNPs, which each tag one of the two major LD blocks (rs115091999 and rs150526114). We restricted analysis to individuals of East Asian ancestry. No variants were significant after Bonferroni correction (Figure 5—figure supplement 2).

Acknowledgements

We thank Tim O’Connor, Sai Chen, John Kim, and members of the McCoy lab for feedback and helpful discussions. We also thank the staff at the Maryland Advanced Research Computing Center for computing support. This work is supported by the National Institutes of Health (NIH) grant R35GM133747 to RCM and the US National Science Foundation grant DBI-1350041 to MCS.

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Contributor Information

Rajiv C McCoy, Email: rajiv.mccoy@jhu.edu.

George H Perry, Pennsylvania State University, United States.

Funding Information

This paper was supported by the following grants:

National Institutes of Health R35GM133747 to Rajiv C McCoy.
National Science Foundation DBI-1350041 to Michael C Schatz.

Additional information

Competing interests

No competing interests declared.

Author contributions

Conceptualization, Formal analysis, Investigation, Methodology, Software, Visualization, Writing – original draft, Writing – review and editing.

Data curation, Investigation, Software, Writing – review and editing.

Formal analysis, Investigation, Methodology, Software, Visualization, Writing – review and editing.

Investigation, Software.

Investigation, Software, Visualization.

Conceptualization, Funding acquisition, Writing – review and editing.

Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Software, Visualization, Writing – original draft, Writing – review and editing.

Additional files

Supplementary file 1. Highly differentiated SV loci across ancestry components.

The top three SVs per ancestry component are reported. The IGH insertion and deletion are highlighted in bold text.

elife-67615-supp1.docx^{(16.4KB, docx)}

Supplementary file 2. Highly differentiated SVs that are also significant eQTLs.

elife-67615-supp2.docx^{(13.8KB, docx)}

Supplementary file 3. Highly differentiated SVs in LD (r² >0.5) with putative archaic introgressed haplotypes called by Sprime.

The IGH insertion and deletion are highlighted in bold text.

elife-67615-supp3.docx^{(16.2KB, docx)}

Transparent reporting form

elife-67615-transrepform1.docx^{(112.9KB, docx)}

Data availability

All code necessary for reproducing our analysis is available on GitHub (https://github.com/mccoy-lab/sv_selection; copy archivedat https://archive.softwareheritage.org/swh:1:rev:2e105bf050fd2acb9f1e0d1833566c0b1ca5b41f). SV genotypes, eQTL results, and selection scan results are available on Zenodo (https://doi.org/10.5281/zenodo.5509980).

The following dataset was generated:

Yan SM, McCoy RC. 2021. Data from: Local adaptation and archaic introgression shape global diversity at human structural variant loci. Zenodo.

The following previously published datasets were used:

Audano PA. 2019. Long-read sequencing of a Puerto Rican individual (HG00733) NCBI BioProject. PRJNA300840

Audano PA. 2019. Long-read sequencing of a Yoruban individual (NA19240) NCBI BioProject. PRJNA288807

Audano PA. 2019. Long-read sequencing of a Gambian individual (HG02818) NCBI BioProject. PRJNA339722

Audano PA. 2019. Long-read sequencing of a Luhya individual (NA19434) NCBI BioProject. PRJNA385272

Audano PA. 2019. Long-read sequencing of a Vietnamese individual (HG02059) NCBI BioProject. PRJNA339726

Audano PA. 2019. Long-read sequencing of a Utah individual with Northern and Western European ancestry (NA12878) NCBI BioProject. PRJNA323611

Audano PA. 2019. Long-read sequencing of a Telugu individual (HG04217) NCBI BioProject. PRJNA481794

Audano PA. 2019. Long-read sequencing of a Peruvian individual (HG02106) NCBI BioProject. PRJNA480858

Audano PA. 2019. Long-read sequencing of a Finnish individual (HG00268) NCBI BioProject. PRJNA480712

Audano PA. 2019. Long-read sequencing of a Han Chinese individual (HG00514) NCBI BioProject. PRJNA300843

Byrska-Bishop M. 2019. 1000 Genomes Project phase 3: 30X coverage whole genome sequencing. NCBI BioProject. PRJEB31736

Huddleston J. 2017. Long-read sequencing of a hydatidiform mole (CHM1) NCBI BioProject. PRJNA246220

Huddleston J. 2017. Long-read sequencing of a hydatidiform mole (CHM13) NCBI BioProject. PRJNA269593

Lappalainen T. 2013. RNA-sequencing of 465 lymphoblastoid cell lines from the 1000 Genomes. ArrayExpress. E-GEUV-1

Seo J-S. 2016. Long-read sequencing of a Korean individual (AK1) NCBI BioProject. PRJNA298944

Shi L. 2016. Long-read sequencing of a Chinese individual (HX1) NCBI BioProject. PRJNA301527

References

Abi-Rached L, Jobin MJ, Kulkarni S, McWhinnie A, Dalva K, Gragert L, Babrzadeh F, Gharizadeh B, Luo M, Plummer FA, Kimani J, Carrington M, Middleton D, Rajalingam R, Beksac M, Marsh SGE, Maiers M, Guethlein LA, Tavoularis S, Little AM, Green RE, Norman PJ, Parham P. The shaping of modern human immune systems by multiregional admixture with archaic humans. Science. 2011;334:89–94. doi: 10.1126/science.1209202. [DOI] [PMC free article] [PubMed] [Google Scholar]
Aganezov S, Goodwin S, Sherman RM, Sedlazeck FJ, Arun G, Bhatia S, Lee I, Kirsche M, Wappel R, Kramer M, Kostroff K, Spector DL, Timp W, McCombie WR, Schatz MC. Comprehensive analysis of structural variants in breast cancer genomes using single-molecule sequencing. Genome Research. 2020;30:1258–1273. doi: 10.1101/gr.260497.119. [DOI] [PMC free article] [PubMed] [Google Scholar]
Aganezov S, Yan SM, Soto DC, Kirsche M, Zarate S, Avdeyev P, Taylor DJ, Shafin K, Shumate A, Xiao C, Wagner J, McDaniel J, Olson ND, Sauria MEG, Vollger MR, Rhie A, Meredith M, Martin S, Lee J, Koren S, Rosenfeld JA, Paten B, Layer R, Chin CS, Sedlazeck FJ, Hansen NF, Miller DE, Phillippy AM, Miga KH, McCoy RC, Dennis MY, Zook JM, Schatz MC. A Complete Reference Genome Improves Analysis of Human Genetic Variation. bioRxiv. 2021 doi: 10.1101/2021.07.12.452063. [DOI] [PMC free article] [PubMed]
Alexander DH, Novembre J, Lange K. Fast model-based estimation of ancestry in unrelated individuals. Genome Research. 2009;19:1655–1664. doi: 10.1101/gr.094052.109. [DOI] [PMC free article] [PubMed] [Google Scholar]
Almarri MA, Bergström A, Prado-Martinez J, Yang F, Fu B, Dunham AS, Chen Y, Hurles ME, Tyler-Smith C, Xue Y. Population structure, stratification, and introgression of human structural variation. Cell. 2020;182:189–199. doi: 10.1016/j.cell.2020.05.024. [DOI] [PMC free article] [PubMed] [Google Scholar]
Alonge M, Wang X, Benoit M, Soyk S, Pereira L, Zhang L, Suresh H, Ramakrishnan S, Maumus F, Ciren D, Levy Y, Harel TH, Shalev-Schlosser G, Amsellem Z, Razifard H, Caicedo AL, Tieman DM, Klee H, Kirsche M, Aganezov S, Ranallo-Benavidez TR, Lemmon ZH, Kim J, Robitaille G, Kramer M, Goodwin S, McCombie WR, Hutton S, Van Eck J, Gillis J, Eshed Y, Sedlazeck FJ, van der Knaap E, Schatz MC, Lippman ZB. Major Impacts of Widespread Structural Variation on Gene Expression and Crop Improvement in Tomato. Cell. 2020;182:145–161. doi: 10.1016/j.cell.2020.05.021. [DOI] [PMC free article] [PubMed] [Google Scholar]
Amemiya HM, Kundaje A, Boyle AP. The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Scientific Reports. 2019;9:9354. doi: 10.1038/s41598-019-45839-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
Audano PA, Sulovari A, Graves-Lindsay TA, Cantsilieris S, Sorensen M, Welch AE, Dougherty ML, Nelson BJ, Shah A, Dutcher SK, Warren WC, Magrini V, McGrath SD, Yi L, Wilson RK, Eichler EE. Characterizing the Major Structural Variant Alleles of the Human Genome. Cell. 2019;176:663–675. doi: 10.1016/j.cell.2018.12.019. [DOI] [PMC free article] [PubMed] [Google Scholar]
Barreiro LB, Quintana-Murci L. Evolutionary and population (epi)genetics of immunity to infection. Human Genetics. 2020;139:723–732. doi: 10.1007/s00439-020-02167-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
Bergström A, McCarthy SA, Hui R, Almarri MA, Ayub Q, Danecek P, Chen Y, Felkel S, Hallast P, Kamm J, Blanché H, Deleuze JF, Cann H, Mallick S, Reich D, Sandhu MS, Skoglund P, Scally A, Xue Y, Durbin R, Tyler-Smith C. Insights into human genetic variation and population history from 929 diverse genomes. Science. 2020;367:eaay5012. doi: 10.1126/science.aay5012. [DOI] [PMC free article] [PubMed] [Google Scholar]
Bersaglieri T, Sabeti PC, Patterson N, Vanderploeg T, Schaffner SF, Drake JA, Rhodes M, Reich DE, Hirschhorn JN. Genetic Signatures of Strong Recent Positive Selection at the Lactase Gene. The American Journal of Human Genetics. 2004;74:1111–1120. doi: 10.1086/421051. [DOI] [PMC free article] [PubMed] [Google Scholar]
Beyter D, Ingimundardottir H, Oddsson A, Eggertsson HP, Bjornsson E, Jonsson H, Atlason BA, Kristmundsdottir S, Mehringer S, Hardarson MT, Gudjonsson SA, Magnusdottir DN, Aslaug J, Adalbjorg J, Kristjansson RP, Sverrisson ST, Holley G, Palsson G, Stefansson OA, Eyjolfsson G, Olafsson I, Sigurdardottir O, Torfason B, Masson G, Helgason A, Thorsteinsdottir U, Holm H, Gudbjartsson DF, Sulem P, Magnusson OT, Halldorsson BV, Stefansson K. Long-read sequencing of 3,622 Icelanders provides insight into the role of structural variants in human diseases and other traits. Nature Genetics. 2021;53:779–786. doi: 10.1038/s41588-021-00865-4. [DOI] [PubMed] [Google Scholar]
Bollback JP, York TL, Nielsen R. Estimation of 2Nes from temporal allele frequency data. Genetics. 2008;179:497–502. doi: 10.1534/genetics.107.085019. [DOI] [PMC free article] [PubMed] [Google Scholar]
Browning SR, Browning BL, Zhou Y, Tucci S, Akey JM. Analysis of Human Sequence Data Reveals Two Pulses of Archaic Denisovan Admixture. Cell. 2018;173:53–61. doi: 10.1016/j.cell.2018.02.031. [DOI] [PMC free article] [PubMed] [Google Scholar]
Byrska-Bishop M, Evani US, Zhao X, Basile AO, Abel HJ, Regier AA, Corvelo A, Clarke WE, Musunuri R, Nagulapalli K, Fairley S, Runnels A, Winterkorn L, Lowy-Gallego E, Flicek P, Germer S, Brand H, Hall IM, Talkowski ME, Narzisi G, Zody MC. High Coverage Whole Genome Sequencing of the Expanded 1000 Genomes Project Cohort Including 602 Trios. bioRxiv. 2021 doi: 10.1101/2021.02.06.430068. [DOI] [PMC free article] [PubMed]
Chaisson MJP, Huddleston J, Dennis MY, Sudmant PH, Malig M, Hormozdiari F, Antonacci F, Surti U, Sandstrom R, Boitano M, Landolin JM, Stamatoyannopoulos JA, Hunkapiller MW, Korlach J, Eichler EE. Resolving the complexity of the human genome using single-molecule sequencing. Nature. 2015;517:608–611. doi: 10.1038/nature13907. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chaisson MJP, Sanders AD, Zhao X, Malhotra A, Porubsky D, Rausch T, Gardner EJ, Rodriguez OL, Guo L, Collins RL, Fan X, Wen J, Handsaker RE, Fairley S, Kronenberg ZN, Kong X, Hormozdiari F, Lee D, Wenger AM, Hastie AR, Antaki D, Anantharaman T, Audano PA, Brand H, Cantsilieris S, Cao H, Cerveira E, Chen C, Chen X, Chin CS, Chong Z, Chuang NT, Lambert CC, Church DM, Clarke L, Farrell A, Flores J, Galeev T, Gorkin DU, Gujral M, Guryev V, Heaton WH, Korlach J, Kumar S, Kwon JY, Lam ET, Lee JE, Lee J, Lee WP, Lee SP, Li S, Marks P, Viaud-Martinez K, Meiers S, Munson KM, Navarro FCP, Nelson BJ, Nodzak C, Noor A, Kyriazopoulou-Panagiotopoulou S, Pang AWC, Qiu Y, Rosanio G, Ryan M, Stütz A, Spierings DCJ, Ward A, Welch AE, Xiao M, Xu W, Zhang C, Zhu Q, Zheng-Bradley X, Lowy E, Yakneen S, McCarroll S, Jun G, Ding L, Koh CL, Ren B, Flicek P, Chen K, Gerstein MB, Kwok PY, Lansdorp PM, Marth GT, Sebat J, Shi X, Bashir A, Ye K, Devine SE, Talkowski ME, Mills RE, Marschall T, Korbel JO, Eichler EE, Lee C. Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nature Communications. 2019;10:1784. doi: 10.1038/s41467-018-08148-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chen S, Krusche P, Dolzhenko E, Sherman RM, Petrovski R, Schlesinger F, Kirsche M, Bentley DR, Schatz MC, Sedlazeck FJ, Eberle MA. Paragraph: a graph-based structural variant genotyper for short-read sequence data. Genome Biology. 2019;20:291. doi: 10.1186/s13059-019-1909-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
Cheng JY, Mailund T, Nielsen R. Fast admixture analysis and population tree estimation for SNP and NGS data. Bioinformatics. 2017;33:2148–2155. doi: 10.1093/bioinformatics/btx098. [DOI] [PMC free article] [PubMed] [Google Scholar]
Cheng JY, Racimo F, Nielsen R. Ohana: Detecting Selection in Multiple Populations by Modelling Ancestral Admixture Components. bioRxiv. 2019 doi: 10.1101/546408. [DOI] [PMC free article] [PubMed]
Chiang C, Scott AJ, Davis JR, Tsang EK, Li X, Kim Y, Hadzic T, Damani FN, Ganel L, GTEx Consortium. Montgomery SB, Battle A, Conrad DF, Hall IM. The impact of structural variation on human gene expression. Nature Genetics. 2017;49:692–699. doi: 10.1038/ng.3834. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chin CS, Wagner J, Zeng Q, Garrison E, Garg S, Fungtammasan A, Rautiainen M, Aganezov S, Kirsche M, Zarate S, Schatz MC, Xiao C, Rowell WJ, Markello C, Farek J, Sedlazeck FJ, Bansal V, Yoo B, Miller N, Zhou X, Carroll A, Barrio AM, Salit M, Marschall T, Dilthey AT, Zook JM. A diploid assembly-based benchmark for variants in the major histocompatibility complex. Nature Communications. 2020;11:4794. doi: 10.1038/s41467-020-18564-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
Colby C. FastQTL. 4d03819GitHub. 2016 https://github.com/hall-lab/fastqtl
Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT. Reproducible RNA-seq analysis using recount2. Nature Biotechnology. 2017;35:319–321. doi: 10.1038/nbt.3838. [DOI] [PMC free article] [PubMed] [Google Scholar]
Conrad DF, Jakobsson M, Coop G, Wen X, Wall JD, Rosenberg NA, Pritchard JK. A worldwide survey of haplotype variation and linkage disequilibrium in the human genome. Nature Genetics. 2006;38:1251–1260. doi: 10.1038/ng1911. [DOI] [PubMed] [Google Scholar]
Crawford NG, Kelly DE, Hansen MEB, Beltrame MH, Fan S, Bowman SL, Jewett E, Ranciaro A, Thompson S, Lo Y, Pfeifer SP, Jensen JD, Campbell MC, Beggs W, Hormozdiari F, Mpoloka SW, Mokone GG, Nyambo T, Meskel DW, Belay G, Haut J, Rothschild H, Zon L, Zhou Y, Kovacs MA, Xu M, Zhang T, Bishop K, Sinclair J, Rivas C, Elliot E, Choi J, Sa L, Hicks B, Burgess S, Abnet C, Watkins-Chow DE, Oceana E, Song YS, Eskin E, Brown KM, Marks MS, Loftus SK, Pavan WJ, Yeager M, Chanock S, Tishkoff SA. Loci associated with skin pigmentation identified in African populations. Science. 2017;358:eaan8433. doi: 10.1126/science.aan8433. [DOI] [PMC free article] [PubMed] [Google Scholar]
Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, Li H. Twelve years of SAMtools and BCFtools. GigaScience. 2021;10:giab008. doi: 10.1093/gigascience/giab008. [DOI] [PMC free article] [PubMed] [Google Scholar]
Dannemann M, Andrés AM, Kelso J. Introgression of Neandertal- and Denisovan-like Haplotypes Contributes to Adaptive Variation in Human Toll-like Receptors. American Journal of Human Genetics. 2016;98:22–23. doi: 10.1016/j.ajhg.2015.11.015. [DOI] [PMC free article] [PubMed] [Google Scholar]
Dean M, Carrington M, Winkler C, Huttley GA, Smith MW, Allikmets R, Goedert JJ, Buchbinder SP, Vittinghoff E, Gomperts E, Donfield S, Vlahov D, Kaslow R, Saah A, Rinaldo C, Detels R, O’Brien SJ. Genetic restriction of HIV-1 infection and progression to AIDS by a deletion allele of the CKR5 structural gene Hemophilia Growth and Development Study, Multicenter AIDS Cohort Study, Multicenter Hemophilia Cohort Study, San Francisco City Cohort, ALIVE Study. Science. 1996;273:1856–1862. doi: 10.1126/science.273.5283.1856. [DOI] [PubMed] [Google Scholar]
Ding J, Wu W, Yang J, Wu M. Long non-coding RNA MIF-AS1 promotes breast cancer cell proliferation, migration and EMT process through regulating miR-1249-3p/HOXB8 axis. Pathology, Research and Practice. 2019;215:005. doi: 10.1016/j.prp.2019.03.005. [DOI] [PubMed] [Google Scholar]
Ebert P, Audano PA, Zhu Q, Rodriguez-Martin B, Porubsky D, Bonder MJ, Sulovari A, Ebler J, Zhou W, Serra Mari R, Yilmaz F, Zhao X, Hsieh P, Lee J, Kumar S, Lin J, Rausch T, Chen Y, Ren J, Santamarina M, Höps W, Ashraf H, Chuang NT, Yang X, Munson KM, Lewis AP, Fairley S, Tallon LJ, Clarke WE, Basile AO, Byrska-Bishop M, Corvelo A, Evani US, T-Y L, Chaisson MJP, Chen J, Li C, Brand H, Wenger AM, Ghareghani M, Harvey WT, Raeder B, Hasenfeld P, Regier AA, Abel HJ, Hall IM, Flicek P, Stegle O, Gerstein MB, Tubio JMC, Mu Z, Yi L, Shi X, Hastie AR, Ye K, Chong Z, Sanders AD, Zody MC, Talkowski ME, Mills RE, Devine SE, Lee C, Korbel JO, Marschall T, Eichler EE. Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science. 2021;372:eabf7117. doi: 10.1126/science.abf7117. [DOI] [PMC free article] [PubMed] [Google Scholar]
Elguero E, Délicat-Loembet LM, Rougeron V, Arnathau C, Roche B, Becquart P, Gonzalez JP, Nkoghe D, Sica L, Leroy EM, Durand P, Ayala FJ, Ollomo B, Renaud F, Prugnolle F. Malaria continues to select for sickle cell trait in Central Africa. PNAS. 2015;112:7051–7054. doi: 10.1073/pnas.1505665112. [DOI] [PMC free article] [PubMed] [Google Scholar]
Enard D, Petrov DA. Evidence that RNA Viruses Drove Adaptive Introgression between Neanderthals and Modern Humans. Cell. 2018;175:360–371. doi: 10.1016/j.cell.2018.08.034. [DOI] [PMC free article] [PubMed] [Google Scholar]
Falush D, Stephens M, Pritchard JK. Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics. 2003;164:1567–1587. doi: 10.1093/genetics/164.4.1567. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fan S, Hansen MEB, Lo Y, Tishkoff SA. Going global by adapting local: A review of recent human adaptation. Science. 2016;354:54–59. doi: 10.1126/science.aaf5098. [DOI] [PMC free article] [PubMed] [Google Scholar]
Galvani AP, Slatkin M. Evaluating plague and smallpox as historical selective pressures for the CCR5-Delta 32 HIV-resistance allele. PNAS. 2003;100:15276–15279. doi: 10.1073/pnas.2435085100. [DOI] [PMC free article] [PubMed] [Google Scholar]
Gittelman RM, Schraiber JG, Vernot B, Mikacenic C, Wurfel MM, Akey JM. Archaic Hominin Admixture Facilitated Adaptation to Out-of-Africa Environments. Current Biology. 2016;26:3375–3382. doi: 10.1016/j.cub.2016.10.041. [DOI] [PMC free article] [PubMed] [Google Scholar]
Gouy A, Excoffier L. Polygenic Patterns of Adaptive Introgression in Modern Humans Are Mainly Shaped by Response to Pathogens. Molecular Biology and Evolution. 2020;37:1420–1433. doi: 10.1093/molbev/msz306. [DOI] [PubMed] [Google Scholar]
Graffelman J. Exploring Diallelic Genetic Markers: The HardyWeinberg Package. Journal of Statistical Software. 2015;64:i03. doi: 10.18637/jss.v064.i03. [DOI] [Google Scholar]
Graffelman J, Jain D, Weir B. A genome-wide study of Hardy-Weinberg equilibrium with next generation sequence data. Human Genetics. 2017;136:727–741. doi: 10.1007/s00439-017-1786-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
Green RE, Krause J, Briggs AW, Maricic T, Stenzel U, Kircher M, Patterson N, Li H, Zhai W, Fritz MH-Y, Hansen NF, Durand EY, Malaspinas A-S, Jensen JD, Marques-Bonet T, Alkan C, Prüfer K, Meyer M, Burbano HA, Good JM, Schultz R, Aximu-Petri A, Butthof A, Höber B, Höffner B, Siegemund M, Weihmann A, Nusbaum C, Lander ES, Russ C, Novod N, Affourtit J, Egholm M, Verna C, Rudan P, Brajkovic D, Kucan Ž, Gušic I, Doronichev VB, Golovanova LV, Lalueza-Fox C, de la Rasilla M, Fortea J, Rosas A, Schmitz RW, Johnson PLF, Eichler EE, Falush D, Birney E, Mullikin JC, Slatkin M, Nielsen R, Kelso J, Lachmann M, Reich D, Pääbo S. A draft sequence of the Neandertal genome. Science. 2010;328:710–722. doi: 10.1126/science.1188021. [DOI] [PMC free article] [PubMed] [Google Scholar]
Haller BC, Messer PW. SLiM 3: Forward Genetic Simulations Beyond the Wright–Fisher Model. Molecular Biology and Evolution. 2019;36:632–637. doi: 10.1093/molbev/msy228. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hamid I, Korunes KL, Beleza S, Goldberg A. Rapid adaptation to malaria facilitated by admixture in the human population of Cabo Verde. eLife. 2021;10:e63177. doi: 10.7554/eLife.63177. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hickey G, Heller D, Monlong J, Sibbesen JA, Sirén J, Eizenga J, Dawson ET, Garrison E, Novak AM, Paten B. Genotyping structural variants in pangenome graphs using the vg toolkit. Genome Biology. 2020;21:35. doi: 10.1186/s13059-020-1941-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hormozdiari F, Kostem E, Kang EY, Pasaniuc B, Eskin E. Identifying Causal Variants at Loci with Multiple Signals of Association. Genetics. 2014;198:497–508. doi: 10.1534/genetics.114.167908. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hsieh P, Vollger MR, Dang V, Porubsky D, Baker C, Cantsilieris S, Hoekzema K, Lewis AP, Munson KM, Sorensen M, Kronenberg ZN, Murali S, Nelson BJ, Chiatante G, Maggiolini FAM, Blanché H, Underwood JG, Antonacci F, Deleuze JF, Eichler EE. Adaptive archaic introgression of copy number variants and the discovery of previously unknown human genes. Science. 2019;366:eaax2083. doi: 10.1126/science.aax2083. [DOI] [PMC free article] [PubMed] [Google Scholar]
Huerta-Sánchez E, Jin X, Asan B, Peter Z, Vinckenbosch BM, Liang N, Yi Y, He X, Somel M, Ni M, Wang P, Ou B, Huasang X, Luosang J, Zxp C, Li K, Gao G, Yin Y, Wang W, Zhang X, Xu X, Yang H, Li Y, Jian W, Jun W, Nielsen R. Altitude adaptation in Tibetans caused by introgression of Denisovan-like DNA. Nature. 2014;512:194–197. doi: 10.1038/nature13408. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ilardo MA, Moltke I, Korneliussen TS, Cheng J, Stern AJ, Racimo F, de Barros Damgaard P, Sikora M, Seguin-Orlando A, Rasmussen S, van den Munckhof ICL, Ter Horst R, Joosten LAB, Netea MG, Salingkat S, Nielsen R, Willerslev E. Physiological and Genetic Adaptations to Diving in Sea Nomads. Cell. 2018;173:569–580. doi: 10.1016/j.cell.2018.03.054. [DOI] [PubMed] [Google Scholar]
Jablonski NG. The Evolution of Human Skin and Skin Color. Annual Review of Anthropology. 2004;33:585–623. doi: 10.1146/annurev.anthro.33.070203.143955. [DOI] [Google Scholar]
Jabot F, Faure T, Dumoulin N. EasyABC: Performing efficient approximate Bayesian computation sampling schemes using R. Methods in Ecology and Evolution. 2013;10:12050. doi: 10.1111/2041-210x.12050. [DOI] [Google Scholar]
Jakubosky D, D’Antonio M, Bonder MJ, Smail C, Donovan MKR, Greenwald WWY, Matsui H, i2QTL Consortium. D’Antonio-Chronowska A, Stegle O, Smith EN, Montgomery SB, DeBoever C, Frazer KA. Properties of structural variants and short tandem repeats associated with gene expression and complex traits. Nature Communications. 2020a;11:2927. doi: 10.1038/s41467-020-16482-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
Jakubosky D, Smith EN, D’Antonio M, Bonder MJ, Young Greenwald WW, D’Antonio-Chronowska A, Matsui H, i2QTL Consortium. Stegle O, Montgomery SB, DeBoever C, Frazer KA. Discovery and quality analysis of a comprehensive set of structural variants and short tandem repeats. Nature Communications. 2020b;11:2928. doi: 10.1038/s41467-020-16481-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
Jeffares DC, Jolly C, Hoti M, Speed D, Shaw L, Rallis C, Balloux F, Dessimoz C, Bähler J, Sedlazeck FJ. Transient structural variations have strong effects on quantitative traits and reproductive isolation in fission yeast. Nature Communications. 2017;8:14061. doi: 10.1038/ncomms14061. [DOI] [PMC free article] [PubMed] [Google Scholar]
Jouganous J, Long W, Ragsdale AP, Gravel S. Inferring the Joint Demographic History of Multiple Populations: Beyond the Diffusion Approximation. Genetics. 2017;206:1549–1567. doi: 10.1534/genetics.117.200493. [DOI] [PMC free article] [PubMed] [Google Scholar]
Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alföldi J, Wang Q, Collins RL, Laricchia KM, Ganna A, Birnbaum DP, Gauthier LD, Brand H, Solomonson M, Watts NA, Rhodes D, Singer-Berk M, England EM, Seaby EG, Kosmicki JA, Walters RK, Tashman K, Farjoun Y, Banks E, Poterba T, Wang A, Seed C, Whiffin N, Chong JX, Samocha KE, Pierce-Hoffman E, Zappala Z, O’Donnell-Luria AH, Minikel EV, Weisburd B, Lek M, Ware JS, Vittal C, Armean IM, Bergelson L, Cibulskis K, Connolly KM, Covarrubias M, Donnelly S, Ferriera S, Gabriel S, Gentry J, Gupta N, Jeandet T, Kaplan D, Llanwarne C, Munshi R, Novod S, Petrillo N, Roazen D, Ruano-Rubio V, Saltzman A, Schleicher M, Soto J, Tibbetts K, Tolonen C, Wade G, Talkowski ME, Neale BM, Daly MJ, MacArthur DG. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581:434–443. doi: 10.1038/s41586-020-2308-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
Keinan A, Clark AG. Recent Explosive Human Population Growth Has Resulted in an Excess of Rare Genetic Variants. Science. 2012;336:740–743. doi: 10.1126/science.1217283. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kosugi S, Momozawa Y, Liu X, Terao C, Kubo M, Kamatani Y. Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing. Genome Biology. 2019;20:117. doi: 10.1186/s13059-019-1720-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kothapalli KSD, Ye K, Gadgil MS, Carlson SE, O’Brien KO, Zhang JY, Park HG, Ojukwu K, Zou J, Hyon SS, Joshi KS, Gu Z, Keinan A, Brenna JT. Positive Selection on a Regulatory Insertion-Deletion Polymorphism in FADS2 Influences Apparent Endogenous Synthesis of Arachidonic Acid. Molecular Biology and Evolution. 2016;33:1726–1739. doi: 10.1093/molbev/msw049. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lachance J, Tishkoff SA. SNP ascertainment bias in population genetic analyses: Why it is important, and how to correct it. BioEssays. 2013;35:780–786. doi: 10.1002/bies.201300014. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lappalainen T, Sammeth M, Friedländer MR, ’t Hoen PAC, Monlong J, Rivas MA, Gonzàlez-Porta M, Kurbatova N, Griebel T, Ferreira PG, Barann M, Wieland T, Greger L, van Iterson M, Almlöf J, Ribeca P, Pulyakhina I, Esser D, Giger T, Tikhonov A, Sultan M, Bertier G, MacArthur DG, Lek M, Lizano E, Buermans HPJ, Padioleau I, Schwarzmayr T, Karlberg O, Ongen H, Kilpinen H, Beltran S, Gut M, Kahlem K, Amstislavskiy V, Stegle O, Pirinen M, Montgomery SB, Donnelly P, McCarthy MI, Flicek P, Strom TM, Geuvadis Consortium. Lehrach H, Schreiber S, Sudbrak R, Carracedo A, Antonarakis SE, Häsler R, Syvänen A-C, van Ommen G-J, Brazma A, Meitinger T, Rosenstiel P, Guigó R, Gut IG, Estivill X, Dermitzakis ET. Transcriptome and genome sequencing uncovers functional variation in humans. Nature. 2013;501:506–511. doi: 10.1038/nature12531. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lenormand M, Jabot F, Deffuant G. Adaptive approximate Bayesian computation for complex models. Computational Statistics. 2013;28:2777–2796. doi: 10.1007/s00180-013-0428-3. [DOI] [Google Scholar]
Li L, Li Y, Huang Y, Ouyang Y, Zhu Y, Wang Y, Guo X, Yuan Y, Gong K. Long non‐coding RNA MIF‐AS1 promotes gastric cancer cell proliferation and reduces apoptosis to upregulate NDUFA4. Cancer Science. 2018;109:3714–3725. doi: 10.1111/cas.13801. [DOI] [PMC free article] [PubMed] [Google Scholar]
Mafessoni F, Grote S, de Filippo C, Slon V, Kolobova KA, Viola B, Markin SV, Chintalapati M, Peyrégne S, Skov L, Skoglund P, Krivoshapkin AI, Derevianko AP, Meyer M, Kelso J, Peter B, Prüfer K, Pääbo S. A high-coverage Neandertal genome from Chagyrskaya Cave. PNAS. 2020;117:15132–15136. doi: 10.1073/pnas.2004944117. [DOI] [PMC free article] [PubMed] [Google Scholar]
Mallick S, Li H, Lipson M, Mathieson I, Gymrek M, Racimo F, Zhao M, Chennagiri N, Nordenfelt S, Tandon A, Skoglund P, Lazaridis I, Sankararaman S, Fu Q, Rohland N, Renaud G, Erlich Y, Willems T, Gallo C, Spence JP, Song YS, Poletti G, Balloux F, van Driem G, de Knijff P, Romero IG, Jha AR, Behar DM, Bravi CM, Capelli C, Hervig T, Moreno-Estrada A, Posukh OL, Balanovska E, Balanovsky O, Karachanak-Yankova S, Sahakyan H, Toncheva D, Yepiskoposyan L, Tyler-Smith C, Xue Y, Abdullah MS, Ruiz-Linares A, Beall CM, Di Rienzo A, Jeong C, Starikovskaya EB, Metspalu E, Parik J, Villems R, Henn BM, Hodoglugil U, Mahley R, Sajantila A, Stamatoyannopoulos G, Jts W, Khusainova R, Khusnutdinova E, Litvinov S, Ayodo G, Comas D, Hammer MF, Kivisild T, Klitz W, Winkler CA, Labuda D, Bamshad M, Jorde LB, Tishkoff SA, Watkins WS, Metspalu M, Dryomov S, Sukernik R, Singh L, Thangaraj K, Pääbo S, Kelso J, Patterson N, Reich D. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature. 2016;538:201–206. doi: 10.1038/nature18964. [DOI] [PMC free article] [PubMed] [Google Scholar]
Marcus JH, Novembre J. Visualizing the geography of genetic variants. Bioinformatics. 2017;33:594–595. doi: 10.1093/bioinformatics/btw643. [DOI] [PMC free article] [PubMed] [Google Scholar]
Mendez FL, Watkins JC, Hammer MF. A haplotype at STAT2 Introgressed from neanderthals and serves as a candidate of positive selection in Papua New Guinea. American Journal of Human Genetics. 2012a;91:265–274. doi: 10.1016/j.ajhg.2012.06.015. [DOI] [PMC free article] [PubMed] [Google Scholar]
Mendez FL, Watkins JC, Hammer MF. Global genetic variation at OAS1 provides evidence of archaic admixture in Melanesian populations. Molecular Biology and Evolution. 2012b;29:1513–1520. doi: 10.1093/molbev/msr301. [DOI] [PubMed] [Google Scholar]
Meyer M, Kircher M, Gansauge MT, Li H, Racimo F, Mallick S, Schraiber JG, Jay F, Prüfer K, de Filippo C, Sudmant PH, Alkan C, Fu Q, Do R, Rohland N, Tandon A, Siebauer M, Green RE, Bryc K, Briggs AW, Stenzel U, Dabney J, Shendure J, Kitzman J, Hammer MF, Shunkov MV, Derevianko AP, Patterson N, Andrés AM, Eichler EE, Slatkin M, Reich D, Kelso J, Pääbo S. A high-coverage genome sequence from an archaic Denisovan individual. Science. 2012;338:222–226. doi: 10.1126/science.1224344. [DOI] [PMC free article] [PubMed] [Google Scholar]
Mikocziova I, Gidoni M, Lindeman I, Peres A, Snir O, Yaari G, Sollid LM. Polymorphisms in human immunoglobulin heavy chain variable genes and their upstream regions. Nucleic Acids Research. 2020;48:5499–5510. doi: 10.1093/nar/gkaa310. [DOI] [PMC free article] [PubMed] [Google Scholar]
Moreno-Mayar JV, Potter BA, Vinner L, Steinrücken M, Rasmussen S, Terhorst J, Kamm JA, Albrechtsen A, Malaspinas AS, Sikora M, Reuther JD, Irish JD, Malhi RS, Orlando L, Song YS, Nielsen R, Meltzer DJ, Willerslev E. Terminal Pleistocene Alaskan genome reveals first founding population of Native Americans. Nature. 2018;553:203–207. doi: 10.1038/nature25173. [DOI] [PubMed] [Google Scholar]
Nattestad M, Goodwin S, Ng K, Baslan T, Sedlazeck FJ, Rescheneder P, Garvin T, Fang H, Gurtowski J, Hutton E, Tseng E, Chin CS, Beck T, Sundaravadanam Y, Kramer M, Antoniou E, McPherson JD, Hicks J, McCombie WR, Schatz MC. Complex rearrangements and oncogene amplifications revealed by long-read DNA and RNA sequencing of a breast cancer cell line. Genome Research. 2018;28:1126–1135. doi: 10.1101/gr.231100.117. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ongen H, Buil A, Brown AA, Dermitzakis ET, Delaneau O. Fast and efficient QTL mapper for thousands of molecular phenotypes. Bioinformatics. 2016;32:1479–1485. doi: 10.1093/bioinformatics/btv722. [DOI] [PMC free article] [PubMed] [Google Scholar]
Pandey R, Bakay M, Hain HS, Strenkowski B, Yermakova A, Kushner JA, Orange JS, Hakonarson H. The Autoimmune Disorder Susceptibility Gene Restrains NK Cell Function in YTS NK Cell Line and Knockout Mice. Frontiers in Immunology. 2019;10:68. doi: 10.3389/fimmu.2019.00068. [DOI] [PMC free article] [PubMed] [Google Scholar]
Pedersen BS, Quinlan AR. cyvcf2: fast, flexible variant analysis with Python. Bioinformatics. 2017;33:1867–1869. doi: 10.1093/bioinformatics/btx057. [DOI] [PMC free article] [PubMed] [Google Scholar]
Perry GH, Dominy NJ, Claw KG, Lee AS, Fiegler H, Redon R, Werner J, Villanea FA, Mountain JL, Misra R, Carter NP, Lee C, Stone AC. Diet and the evolution of human amylase gene copy number variation. Nature Genetics. 2007;39:1256–1260. doi: 10.1038/ng2123. [DOI] [PMC free article] [PubMed] [Google Scholar]
Pritchard JK, Seielstad MT, Perez-Lezaun A, Feldman MW. Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Molecular Biology and Evolution. 1999;16:1791–1798. doi: 10.1093/oxfordjournals.molbev.a026091. [DOI] [PubMed] [Google Scholar]
Prüfer K, Racimo F, Patterson N, Jay F, Sankararaman S, Sawyer S, Heinze A, Renaud G, Sudmant PH, de Filippo C, Li H, Mallick S, Dannemann M, Fu Q, Kircher M, Kuhlwilm M, Lachmann M, Meyer M, Ongyerth M, Siebauer M, Theunert C, Tandon A, Moorjani P, Pickrell J, Mullikin JC, Vohr SH, Green RE, Hellmann I, Johnson PLF, Blanche H, Cann H, Kitzman JO, Shendure J, Eichler EE, Lein ES, Bakken TE, Golovanova LV, Doronichev VB, Shunkov MV, Derevianko AP, Viola B, Slatkin M, Reich D, Kelso J, Pääbo S. The complete genome sequence of a Neanderthal from the Altai Mountains. Nature. 2014;505:43–49. doi: 10.1038/nature12886. [DOI] [PMC free article] [PubMed] [Google Scholar]
Prüfer K, de Filippo C, Grote S, Mafessoni F, Korlević P, Hajdinjak M, Vernot B, Skov L, Hsieh P, Peyrégne S, Reher D, Hopfe C, Nagel S, Maricic T, Fu Q, Theunert C, Rogers R, Skoglund P, Chintalapati M, Dannemann M, Nelson BJ, Key FM, Rudan P, Kućan Ž, Gušić I, Golovanova LV, Doronichev VB, Patterson N, Reich D, Eichler EE, Slatkin M, Schierup MH, Andrés AM, Kelso J, Meyer M, Pääbo S. A high-coverage Neandertal genome from Vindija cave in Croatia. Science. 2017;358:655–658. doi: 10.1126/science.aao1887. [DOI] [PMC free article] [PubMed] [Google Scholar]
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, Sham PC. PLINK: a tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics. 2007;81:559–575. doi: 10.1086/519795. [DOI] [PMC free article] [PubMed] [Google Scholar]
Racimo F, Gokhman D, Fumagalli M, Ko A, Hansen T, Moltke I, Albrechtsen A, Carmel L, Huerta-Sánchez E, Nielsen R. Archaic adaptive introgression in TBX15/WARS2. Molecular Biology and Evolution. 2017;34:509–524. doi: 10.1093/molbev/msw283. [DOI] [PMC free article] [PubMed] [Google Scholar]
Rasmussen S, Allentoft ME, Nielsen K, Orlando L, Sikora M, Sjögren KG, Pedersen AG, Schubert M, Van Dam A, Kapel CMO, Nielsen HB, Brunak S, Avetisyan P, Epimakhov A, Khalyapin MV, Gnuni A, Kriiska A, Lasak I, Metspalu M, Moiseyev V, Gromov A, Pokutta D, Saag L, Varul L, Yepiskoposyan L, Sicheritz-Pontén T, Foley RA, Lahr MM, Nielsen R, Kristiansen K, Willerslev E. Early divergent strains of Yersinia pestis in Eurasia 5,000 years ago. Cell. 2015;163:571–582. doi: 10.1016/j.cell.2015.10.009. [DOI] [PMC free article] [PubMed] [Google Scholar]
Roadmap Epigenomics Consortium. Kundaje A, Meuleman W, Ernst J, Bilenky M, Yen A, Heravi-Moussavi A, Kheradpour P, Zhang Z, Wang J, Ziller MJ, Amin V, Whitaker JW, Schultz MD, Ward LD, Sarkar A, Quon G, Sandstrom RS, Eaton ML, Wu Y-C, Pfenning AR, Wang X, Claussnitzer M, Liu Y, Coarfa C, Harris RA, Shoresh N, Epstein CB, Gjoneska E, Leung D, Xie W, Hawkins RD, Lister R, Hong C, Gascard P, Mungall AJ, Moore R, Chuah E, Tam A, Canfield TK, Hansen RS, Kaul R, Sabo PJ, Bansal MS, Carles A, Dixon JR, Farh K-H, Feizi S, Karlic R, Kim A-R, Kulkarni A, Li D, Lowdon R, Elliott G, Mercer TR, Neph SJ, Onuchic V, Polak P, Rajagopal N, Ray P, Sallari RC, Siebenthall KT, Sinnott-Armstrong NA, Stevens M, Thurman RE, Wu J, Zhang B, Zhou X, Beaudet AE, Boyer LA, De Jager PL, Farnham PJ, Fisher SJ, Haussler D, Jones SJM, Li W, Marra MA, McManus MT, Sunyaev S, Thomson JA, Tlsty TD, Tsai L-H, Wang W, Waterland RA, Zhang MQ, Chadwick LH, Bernstein BE, Costello JF, Ecker JR, Hirst M, Meissner A, Milosavljevic A, Ren B, Stamatoyannopoulos JA, Wang T, Kellis M. Integrative analysis of 111 reference human epigenomes. Nature. 2015;518:317–330. doi: 10.1038/nature14248. [DOI] [PMC free article] [PubMed] [Google Scholar]
Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26:139–140. doi: 10.1093/bioinformatics/btp616. [DOI] [PMC free article] [PubMed] [Google Scholar]
Rodriguez OL, Gibson WS, Parks T, Emery M, Powell J, Strahl M, Deikus G, Auckland K, Eichler EE, Marasco WA, Sebra R, Sharp AJ, Smith ML, Bashir A, Watson CT. A Novel Framework for Characterizing Genomic Haplotype Diversity in the Human Immunoglobulin Heavy Chain Locus. Frontiers in Immunology. 2020;11:2136. doi: 10.3389/fimmu.2020.02136. [DOI] [PMC free article] [PubMed] [Google Scholar]
Rotival M, Quintana-Murci L. Functional consequences of archaic introgression and their impact on fitness. Genome Biology. 2020;21:3. doi: 10.1186/s13059-019-1920-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
Sabeti PC, Walsh E, Schaffner SF, Varilly P, Fry B, Hutcheson HB, Cullen M, Mikkelsen TS, Roy J, Patterson N, Cooper R, Reich D, Altshuler D, O’Brien S, Lander ES. The case for selection at CCR5-Delta32. PLOS Biology. 2005;3:e378. doi: 10.1371/journal.pbio.0030378. [DOI] [PMC free article] [PubMed] [Google Scholar]
Saitou M, Gokcumen O. Resolving the Insertion Sites of Polymorphic Duplications Reveals a HERC2 Haplotype under Selection. Genome Biology and Evolution. 2019;11:1679–1690. doi: 10.1093/gbe/evz107. [DOI] [PMC free article] [PubMed] [Google Scholar]
Sams AJ, Dumaine A, Nédélec Y, Yotova V, Alfieri C, Tanner JE, Messer PW, Barreiro LB. Adaptively introgressed Neandertal haplotype at the OAS locus functionally impacts innate immune responses in humans. Genome Biology. 2021;17:51466. doi: 10.1101/051466. [DOI] [PMC free article] [PubMed] [Google Scholar]
Sankararaman S, Patterson N, Li H, Pääbo S, Reich D. The date of interbreeding between Neandertals and modern humans. PLOS Genetics. 2012;8:e1002947. doi: 10.1371/journal.pgen.1002947. [DOI] [PMC free article] [PubMed] [Google Scholar]
Sankararaman S, Mallick S, Patterson N, Reich D. The Combined Landscape of Denisovan and Neanderthal Ancestry in Present-Day Humans. Current Biology. 2016;26:1241–1247. doi: 10.1016/j.cub.2016.03.037. [DOI] [PMC free article] [PubMed] [Google Scholar]
Schaid DJ, Chen W, Larson NB. From genome-wide associations to candidate causal variants by statistical fine-mapping. Nature Reviews Genetics. 2018;19:491–504. doi: 10.1038/s41576-018-0016-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
Sedlazeck FJ, Lee H, Darby CA, Schatz MC. Piercing the dark matter: bioinformatics of long-range sequencing and mapping. Nature Reviews Genetics. 2018a;19:329–346. doi: 10.1038/s41576-018-0003-4. [DOI] [PubMed] [Google Scholar]
Sedlazeck FJ, Rescheneder P, Smolka M, Fang H, Nattestad M, von Haeseler A, Schatz MC. Accurate detection of complex structural variations using single-molecule sequencing. Nature Methods. 2018b;15:461–468. doi: 10.1038/s41592-018-0001-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
Shriver MD, Kennedy GC, Parra EJ, Lawson HA, Sonpar V, Huang J, Akey JM, Jones KW. The genomic distribution of population substructure in four populations using 8,525 autosomal SNPs. Human Genomics. 2004;1:274–286. doi: 10.1186/1479-7364-1-4-274. [DOI] [PMC free article] [PubMed] [Google Scholar]
Sibbesen JA, Maretty L, Krogh A. Accurate genotyping across variant classes and lengths using variant graphs. Nature Genetics. 2018;50:1054–1059. doi: 10.1038/s41588-018-0145-5. [DOI] [PubMed] [Google Scholar]
Sirén J, Monlong J, Chang X, Novak AM, Eizenga JM, Markello C, Sibbesen JA, Hickey G, Chang PC, Carroll A, Gupta N, Gabriel S, Blackwell TW, Ratan A, Taylor KD, Rich SS, Rotter JI, Haussler D, Garrison E, Paten B. Genotyping Common, Large Structural Variations in 5,202 Genomes Using Pangenomes, the Giraffe Mapper, and the vg Toolkit. bioRxiv. 2021 doi: 10.1101/2020.12.04.412486. [DOI]
Souilmi Y, Lauterbur ME, Tobler R, Huber CD, Johar AS, Moradi SV, Johnston WA, Krogan NJ, Alexandrov K, Enard D. An ancient viral epidemic involving host coronavirus interacting genes more than 20,000 years ago in East Asia. Current Biology. 2021;31:3704. doi: 10.1016/j.cub.2021.07.052. [DOI] [PMC free article] [PubMed] [Google Scholar]
Stephens JC, Reich DE, Goldstein DB, Shin HD, Smith MW, Carrington M, Winkler C, Huttley GA, Allikmets R, Schriml L, Gerrard B, Malasky M, Ramos MD, Morlot S, Tzetis M, Oddoux C, di Giovine FS, Nasioulas G, Chandler D, Aseev M, Hanson M, Kalaydjieva L, Glavac D, Gasparini P, Kanavakis E, Claustres M, Kambouris M, Ostrer H, Duff G, Baranov V, Sibul H, Metspalu A, Goldman D, Martin N, Duffy D, Schmidtke J, Estivill X, O’Brien SJ, Dean M. Dating the origin of the CCR5-Delta32 AIDS-resistance allele by the coalescence of haplotypes. American Journal of Human Genetics. 1998;62:1507–1515. doi: 10.1086/301867. [DOI] [PMC free article] [PubMed] [Google Scholar]
Storey JD, Tibshirani R. Statistical significance for genomewide studies. PNAS. 2003;100:9440–9445. doi: 10.1073/pnas.1530509100. [DOI] [PMC free article] [PubMed] [Google Scholar]
Storey JD, Bass AJ, Dabney A, Robinson D, Warnes G. qvalue: Q-value estimation for false discovery rate control. 2.21.0GitHub. 2020 http://github.com/jdstorey/qvalue
Sudmant PH, Mallick S, Nelson BJ, Hormozdiari F, Krumm N, Huddleston J, Coe BP, Baker C, Nordenfelt S, Bamshad M, Jorde LB, Posukh OL, Sahakyan H, Watkins WS, Yepiskoposyan L, Abdullah MS, Bravi CM, Capelli C, Hervig T, Jts W, Tyler-Smith C, van Driem G, Romero IG, Jha AR, Karachanak-Yankova S, Toncheva D, Comas D, Henn B, Kivisild T, Ruiz-Linares A, Sajantila A, Metspalu E, Parik J, Villems R, Starikovskaya EB, Ayodo G, Beall CM, Di Rienzo A, Hammer MF, Khusainova R, Khusnutdinova E, Klitz W, Winkler C, Labuda D, Metspalu M, Tishkoff SA, Dryomov S, Sukernik R, Patterson N, Reich D, Eichler EE. Global diversity, population stratification, and selection of human copy-number variation. Science. 2015;349:aab3761. doi: 10.1126/science.aab3761. [DOI] [PMC free article] [PubMed] [Google Scholar]
Suzuki T, Maranda B, Sakagami T, Catellier P, Couture CY, Carey BC, Chalk C, Trapnell BC. Hereditary pulmonary alveolar proteinosis caused by recessive CSF2RB mutations. The European Respiratory Journal. 2011;37:201–204. doi: 10.1183/09031936.00090610. [DOI] [PubMed] [Google Scholar]
Tewhey R, Kotliar D, Park DS, Liu B, Winnicki S, Reilly SK, Andersen KG, Mikkelsen TS, Lander ES, Schaffner SF, Sabeti PC. Direct Identification of Hundreds of Expression-Modulating Variants using a Multiplexed Reporter Assay. Cell. 2018;172:1132–1134. doi: 10.1016/j.cell.2018.02.021. [DOI] [PubMed] [Google Scholar]
The 1000 Genomes Project Consortium A global reference for human genetic variation. Nature. 2015;526:68–74. doi: 10.1038/nature15393. [DOI] [PMC free article] [PubMed] [Google Scholar]
The GTEx Consortium The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science. 2020;369:1318–1330. doi: 10.1126/science.aaz1776. [DOI] [PMC free article] [PubMed] [Google Scholar]
The International HapMap Consortium A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007;449:851–861. doi: 10.1038/nature06258. [DOI] [PMC free article] [PubMed] [Google Scholar]
Tremblay M, Vézina H. New estimates of intergenerational time intervals for the calculation of age and origins of mutations. American Journal of Human Genetics. 2000;66:651–658. doi: 10.1086/302770. [DOI] [PMC free article] [PubMed] [Google Scholar]
van Arensbergen J, Pagie L, FitzPatrick VD, de Haas M, Baltissen MP, Comoglio F, van der Weide RH, Teunissen H, Võsa U, Franke L, de Wit E, Vermeulen M, Bussemaker HJ, van Steensel B. High-throughput identification of human SNPs affecting regulatory element activity. Nature Genetics. 2019;51:1160–1169. doi: 10.1038/s41588-019-0455-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
VanLiere JM, Rosenberg NA. Mathematical properties of the r2 measure of linkage disequilibrium. Theoretical Population Biology. 2008;74:130–137. doi: 10.1016/j.tpb.2008.05.006. [DOI] [PMC free article] [PubMed] [Google Scholar]
Vernot B, Tucci S, Kelso J, Schraiber JG, Wolf AB, Gittelman RM, Dannemann M, Grote S, McCoy RC, Norton H, Scheinfeldt LB, Merriwether DA, Koki G, Friedlaender JS, Wakefield J, Pääbo S, Akey JM. Excavating Neandertal and Denisovan DNA from the genomes of Melanesian individuals. Science. 2016;352:235–239. doi: 10.1126/science.aad9416. [DOI] [PMC free article] [PubMed] [Google Scholar]
Vitti JJ, Grossman SR, Sabeti PC. Detecting natural selection in genomic data. Annual Review of Genetics. 2013;47:97–120. doi: 10.1146/annurev-genet-111212-133526. [DOI] [PubMed] [Google Scholar]
Wang Y, Lu D, Chung YJ, Xu S. Genetic structure, divergence and admixture of Han Chinese, Japanese and Korean populations. Hereditas. 2018;155:19. doi: 10.1186/s41065-018-0057-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
Watson CT, Steinberg KM, Huddleston J, Warren RL, Malig M, Schein J, Willsey AJ, Joy JB, Scott JK, Graves TA, Wilson RK, Holt RA, Eichler EE, Breden F. Complete haplotype sequence of the human immunoglobulin heavy-chain variable, diversity, and joining genes and characterization of allelic and copy-number variation. American Journal of Human Genetics. 2013;92:530–546. doi: 10.1016/j.ajhg.2013.03.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
Watson CT, Glanville J, Marasco WA. The Individual and Population Genetics of Antibody Immunity. Trends in Immunology. 2017;38:459–470. doi: 10.1016/j.it.2017.04.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wojcik GL, Graff M, Nishimura KK, Tao R, Haessler J, Gignoux CR, Highland HM, Patel YM, Sorokin EP, Avery CL, Belbin GM, Bien SA, Cheng I, Cullina S, Hodonsky CJ, Hu Y, Huckins LM, Jeff J, Justice AE, Kocarnik JM, Lim U, Lin BM, Lu Y, Nelson SC, Park SSL, Poisner H, Preuss MH, Richard MA, Schurmann C, Setiawan VW, Sockell A, Vahi K, Verbanck M, Vishnu A, Walker RW, Young KL, Zubair N, Acuña-Alonso V, Ambite JL, Barnes KC, Boerwinkle E, Bottinger EP, Bustamante CD, Caberto C, Canizales-Quinteros S, Conomos MP, Deelman E, Do R, Doheny K, Fernández-Rhodes L, Fornage M, Hailu B, Heiss G, Henn BM, Hindorff LA, Jackson RD, Laurie CA, Laurie CC, Li Y, Lin DY, Moreno-Estrada A, Nadkarni G, Norman PJ, Pooler LC, Reiner AP, Romm J, Sabatti C, Sandoval K, Sheng X, Stahl EA, Stram DO, Thornton TA, Wassel CL, Wilkens LR, Winkler CA, Yoneyama S, Buyske S, Haiman CA, Kooperberg C, Le Marchand L, Loos RJF, Matise TC, North KE, Peters U, Kenny EE, Carlson CS. Genetic analyses of diverse populations improves discovery for complex traits. Nature. 2019;570:514–518. doi: 10.1038/s41586-019-1310-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wright S. The genetical structure of populations. Annals of Eugenics. 1949;15:323–354. doi: 10.1111/j.1469-1809.1949.tb02451.x. [DOI] [PubMed] [Google Scholar]
Yair S, Lee KM, Coop G. The timing of human adaptation from Neanderthal introgression. Genetics. 2021;218:iyab052. doi: 10.1093/genetics/iyab052. [DOI] [PMC free article] [PubMed] [Google Scholar]
Yi X, Liang Y, Huerta-Sanchez E, Jin X, Zxp C, Pool JE, Xu X, Jiang H, Vinckenbosch N, Korneliussen TS, Hancheng Z, Liu T, He W, Li K, Luo R, Nie X, Wu H, Zhao M, Cao H, Zou J, Shan Y, Shuzheng L, Yang Q, Asan N, Tian P, Xu G, Liu J, Jiang X, Wu T, Zhou R, Tang G, Qin M, Wang J, Feng T, Li S, Huasang G, Luosang J, Wang W, Chen F, Wang Y, Zheng X, Li Z, Bianba Z, Yang G, Wang X, Tang S, Gao G, Chen Y, Luo Z, Gusang L, Cao Z, Zhang Q, Ouyang W, Ren X, Liang H, Huisong Z, Huang Y, Li J, Bolund L, Kristiansen K, Li Y, Zhang Y, Zhang X, Li R, Songgang L, Yang H, Nielsen R, Jun W, Jian W. Sequencing of 50 human exomes reveals adaptation to high altitude. Science. 2010;329:75–78. doi: 10.1126/science.1190371. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zeberg H, Pääbo S. The major genetic risk factor for severe COVID-19 is inherited from Neanderthals. Nature. 2020;587:610–612. doi: 10.1038/s41586-020-2818-3. [DOI] [PubMed] [Google Scholar]
Zeberg H, Pääbo S. A genomic region associated with protection against severe COVID-19 is inherited from Neandertals. PNAS. 2021;118:e2026309118. doi: 10.1073/pnas.2026309118. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zhang Z, Bai M, Barbosa GO, Chen A, Wei Y, Luo S, Wang X, Wang B, Tsukui T, Li H, Sheppard D, Kornberg TB, Dk M. Broadly conserved roles of TMEM131 family proteins in intracellular collagen assembly and secretory cargo trafficking. Science Advances. 2020;6:eaay7667. doi: 10.1126/sciadv.aay7667. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zhang JY, Roberts H, Flores DSC, Cutler AJ, Brown AC, Whalley JP, Mielczarek O, Buck D, Lockstone H, Xella B, Oliver K, Corton C, Betteridge E, Bashford-Rogers R, Knight JC, Todd JA, Band G. Using de novo assembly to identify structural variation of eight complex immune system gene regions. PLOS Computational Biology. 2021;17:e1009254. doi: 10.1371/journal.pcbi.1009254. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zhao X, Collins RL, Lee WP, Weber AM, Jun Y, Zhu Q, Weisburd B, Huang Y, Audano PA, Wang H, Walker M, Lowther C, Fu J, Gerstein MB, Devine SE, Marschall T, Korbel JO, Eichler EE, Chaisson MJP, Lee C, Mills RE, Brand H, Talkowski ME. Expectations and blind spots for structural variation detection from long-read assemblies and short-read genome sequencing technologies. American Journal of Human Genetics. 2021;108:919–928. doi: 10.1016/j.ajhg.2021.03.014. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zhou S, Butler-Laporte G, Nakanishi T, Morrison DR. A Neanderthal OAS1 isoform protects individuals of European ancestry against COVID-19 susceptibility and severity. Nature Medicine. 2021;27:659–667. doi: 10.1038/s41591-021-01281-1. [DOI] [PubMed] [Google Scholar]
Zook JM, Hansen NF, Olson ND. A robust benchmark for detection of germline large deletions and insertions. Nature Biotechnology. 2020;38:1347–1355. doi: 10.1038/s41587-020-0538-8. [DOI] [PMC free article] [PubMed] [Google Scholar]

eLife. doi: 10.7554/eLife.67615.sa1

Decision letter

Editor: George H Perry¹

Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Acceptance summary:

The technical challenges of identifying and quantifying the frequency of structural variants (SV) on a population scale has been a major limitation to the study of recent human adaptation. This manuscript applies a recent graph-based genotyping method that leverages a library of SVs identified by long-read sequencing to identify SVs in large short-read based cohorts. This is a sensible and powerful approach that highlights several examples of likely adaptive SV evolution in different human populations.

Decision letter after peer review:

Thank you for submitting your article "Local adaptation and archaic introgression shape global diversity at human structural variant loci" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by George Perry as the Senior Editor. The reviewers have opted to remain anonymous.

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

1) Clearer connection and contextualization of the contribution of this work given previous studies, both for the SV catalog of results and the highlighted IGH, including analyses of overlap with previous datasets and a clearer set up to enable this paper to serve a larger purpose as a roadmap for future studies that aim to link datasets for SVs. Related to this, the language in a number of sections is unclear or too broad to scientifically interpret, e.g. suggesting analyses are broadly consistent with previous work; specific quantitative comparisons are needed to make or interpret these claims.

2) More depth to the biological interpretation of the study, with specific ideas presented below, including further support/interpretation of the IGH locus and potential tests of SV patterns more broadly.

Reviewer #1 (Recommendations for the authors):

More minor or specific comments:

1) The populations and ancestries are sometimes hard to follow with so many population abbreviations and the 8 ancestries in Ohana figures not quite linked to a populations. e.g. What ancestry backgrounds are the SVs of interest on pg 14?

2) In general, while clear to specialists, early in the paper the writing is difficult, with many stats tests before the biological question you are attempting to test.

3) ABC demography model is quite complex; does this matter for the inference? Importantly, how does it mix with Neanderthal introgression scenario? It is also unclear what summary statistics are being used in the ABC model, is it the Ohana/PBS-style only? If other measures, are they admixture-aware? The point estimate of "strong" selection is hard to trust with so much model uncertainty and identifiability issues with timing.

4) Further support for the claim of selection would strengthen the argument, i.e. other summary statistics that show signatures of selection.

Reviewer #2 (Recommendations for the authors):

1. The authors do a good job motivating the importance of SVs in human evolution and the previous technical limitations that have limited comprehensive analysis. They also clearly motivate their work using recently generated population-level SV datasets. However, clarification of how this paper's results compare to and expand these previous findings is needed. Previous analyses of SV evolution were not "comprehensive" due to the limitations of short-read approaches. However, as noted by the authors, this and other long-read-based studies also currently cannot "comprehensively" study human SV. I do not see a problem with the "focus on individual variants" obtained in a different way, but more description of how their set of SVs is similar or different from previous work would aid interpretation. A few more supplementary figures describing characteristics of the 92,286 SVs used for downstream analysis (length, type of SV, genome distribution, SV overlap, frequency, etc) are needed. Even though they are already cited, some papers the authors could consider further contextualizing their results in include: Hsieh et al., 2019., Sudmant et al., 2015, Almarri et al., 2020, Ebert et al., 2020, Audano et al., 2019 (there are also others that consider SV with direct reference to introgression events).

2. Related to the above point, the use of the graph genotyping software, Paragraph, is s sensible approach, and I suspect that this paper may provide a template for future analyses merging long-read and short-read SV data. Given this potential contribution, more should be done to discuss the benefits and limitations of this approach. There is brief discussion of "ascertainment bias" based on the 15 individuals with long-read data used from Audano et al., 2019. How does the ancestry of these 15 individuals influence the ability to comprehensively identify SVs in thousands of individuals with different ancestries? A few more details in the main text regarding the sensitivity and specificity of Paragraph would also help to aid result interpretation (rather than just "recent benchmarking…support the accuracy of Paragraph"). Does the Paragraph approach better identify longer SVs, deletions, duplications, inversions, SVs in more repetitive regions of the genome?

3. The authors identify 1121 significant associations between SVs and the expression of nearby genes in LCLs. This analysis is reasonable and clear, but the implications are not clear. Prior to this section, the authors discuss LD with previously investigated SNPs/indels, yet they do not report this information for the eQTL 1121 SVs. How many of these 1121 are in high LD with common SNPs/indels? And how many are not linked to previously investigated variation? Even if most are linked, this is still potentially interesting, but it would be helpful to know if they are identified as eQTL in other datasets (GTEx?). Do these potentially SV mediated eQTL have greater effect sizes than eQTL caused by SNVs? Is there a way to fine map one or two (potentially one related to the subsequently immunoglobulin story or the ones highlighted in Figure 2C) to show that the SV is likely driving the association, rather than other linked variants?

4. The authors use Ohana to test for local adaptation. This method seems appropriate for the challenges of the SV data; the results are interesting; and I appreciate their new individual examples. However, these results are largely descriptive. Testing larger hypotheses about whether SVs are targets of selection more often than SNVs or if these patterns differ between populations or parts of the genome would substantially increase the impact of these results. Figure 4-S1 may be relevant to this kind of comparison; however, it only shows a few loci. While differences in power and ascertainment may make direct comparison challenging, I encourage the authors to think about how to move beyond a list of examples to more general conclusions. Additionally, the y-axis range on Figure 4 across the different ancestries are quite different. It is hard to interpret to the significance and magnitude of the LRS. Can the multiple testing control be integrated here?

5. Finally, the authors explore one example, the immunoglobulin heavy chain locus, in detail. The example and analysis are compelling; however, it is complicated and sometimes hard to follow given the multiple SVs in this region. Readers would benefit from a cartoon schematic of this locus, perhaps expanding on the supplemental figure. I would also appreciate more discussion of why this region might have been filtered out in Browning et al., 2018.

Reviewer #3 (Recommendations for the authors):

The abstract and introduction focus on the size of SVs and the relative "ease" of analysis of SNVs and indels, however, the highlighted variants are indels by the 1000 genomes definition. They also seem to have been discovered in other short-read datasets, (e.g. gnomAD).

Paragraph has been previously shown to work well for genotyping SVs, however it is not clear what the accuracy in the low-coverage 1000 genomes samples is. After filtering, while 92,286 variants remain, 25,201 (27%) of these are absent from any of the 1000 genomes samples. It is unclear how the authors distinguish "true negatives" from "false negatives."

I don't think the LD section adds anything new. This is well reported.

And LD cutoff of 0.5 R² is used for introgressed haplotypes versus 0.8² for tag haplotypes in the original motivation. Suggest picking one to be consistent throughout.

eLife. 2021 Sep 16;10:e67615. doi: 10.7554/eLife.67615.sa2

Author response

Essential revisions:

1) Clearer connection and contextualization of the contribution of this work given previous studies, both for the SV catalog of results and the highlighted IGH, including analyses of overlap with previous datasets and a clearer set up to enable this paper to serve a larger purpose as a roadmap for future studies that aim to link datasets for SVs. Related to this, the language in a number of sections is unclear or too broad to scientifically interpret, e.g. suggesting analyses are broadly consistent with previous work; specific quantitative comparisons are needed to make or interpret these claims.

Thank you very much for the thorough and thoughtful reviews, which were invaluable for improving our manuscript. We agree that the previous draft placed too much responsibility on readers to extract and compile information from cited studies. We made several changes with the goals of placing our work in context and offering clear, quantitative comparisons:

1. One of the main claims of our manuscript is the distinction between short-read- and long-read-discovered SVs, with the statement that the latter set is more accurate and comprehensive. Our set of SVs was carefully curated based on PacBio long-read sequencing data from 15 diverse samples from Audano et al., (2019). We now provide a detailed comparison of these curated SVs to two sets of SVs discovered from short-read sequencing of thousands of globally diverse human samples from the 1000 Genomes Project and Human Genome Diversity Project (HGDP) (Almarri et al., 2020; Sudmant et al., 2015) (lines 80-97; Figure 1 – S2, Figure 1 – S3). We show that we are able to discover significantly more insertions with long-read sequencing and re-discover a large proportion of common SVs despite the small size of the long-read-sequenced discovery sample.

2. Another claim of our study regards the accuracy of graph genotyping with Paragraph. Our focus on eQTL mapping and selection scans offers some inherent robustness to genotyping errors, because in order for false positive associations to arise, erroneous genotypes would have to be unequally distributed with respect to gene expression phenotypes or ancestry. This is implausible unless driven by stratified genetic variation elsewhere in the genome (i.e., by true positives, albeit mis-attributed). False negatives, meanwhile, would make our eQTL mapping and selection scans conservative. We now provide additional results supporting the accuracy of graph genotyping on lines 100-107, including important details of previous benchmarking studies.

3. We also include a more detailed comparison of our introgression results to those of Browning et al., on lines 407-417. The Browning et al., study highlights a subregion of the IGH locus, which reaches high frequencies in European and Asian populations and is downstream of the region we investigate. This is displayed in Figure 4 – S3, where the Browning et al., haplotype is represented in the EUR, EAS, and AMR populations, while the one we investigate is only in the EAS populations (with the exception of JPT). The precise reason why Browning et al., focused on the Eurasian haplotype, rather than the one at high frequency in southeast Asia alone, is unclear. We hypothesize that this may involve their filtering process, which used a mask for the Altai Neanderthal sequence, generated by (Prüfer et al., 2017), that removes sites that have low coverage or mapping quality or are located within tandem repeats or indels (Browning et al., 2018). We now add a supplementary figure showing the masking of the broader IGH region that we highlight (Figure 4 – S4), where only 19.3% of the sequence remains after masking (in contrast to the background rate of 62.6% across the remainder of the q-arm of chromosome 14).

2) More depth to the biological interpretation of the study, with specific ideas presented below, including further support/interpretation of the IGH locus and potential tests of SV patterns more broadly.

1. We added more biological context to our eQTL analysis by intersecting SVs with annotated regulatory elements (lines 215-220). We observed a strong enrichment of significant gene expression associations for SVs that intersect annotated enhancer elements, as well as a strong depletion of expression associations for SVs that intersect “quiescent” sequences that are devoid of important epigenetic marks (Figure 2D). The intuitive nature of these results supports the robustness of our analysis, while also shedding light on functional mechanisms through which SVs mediate impacts on phenotypes and fitness.

2. We investigated direct dosage effects of exon-spanning SVs, showing that the magnitude and direction of these effects are broadly consistent with expectations (lines 220-225). We also performed a fine-mapping analysis of SV and SNP eQTLs (lines 228-244), homing in on candidate causal variants at SV eQTL loci. While noting important technical caveats, both of these analyses clarify the causal link between a subset of SVs and the observed gene expression changes.

3. We additionally identified seven frequency-differentiated SVs from our study that were also significant eQTLs (Supplementary Table 2). One major limitation of this analysis is the lack of diversity in the Geuvadis dataset, which restricts us to SVs that are common in European or Yoruban populations. Consequently, we were unable to directly test gene expression associations for highlighted SVs at the IGH locus in East Asian samples, because this ancestry group is not represented in Geuvadis.

4. We performed extensive neutral simulations to construct null distributions of test statistics and strongly support our conclusion of positive selection at the IGH locus in the ancestors of certain southeast Asian populations. Specifically, observed magnitudes of allele frequency differentiation far exceed those expected under the neutral model (lines 494-496; Figure 5 – S1).

Please find our point-by-point response to the reviewers appended below.

Reviewer #1 (Recommendations for the authors):

More minor or specific comments:

1) The populations and ancestries are sometimes hard to follow with so many population abbreviations and the 8 ancestries in Ohana figures not quite linked to a populations. e.g. What ancestry backgrounds are the SVs of interest on pg 14?

To address this comment, which was shared with that of another reviewer, we have combined Figures 3A and 4 into a revised Figure 3, such that ancestry patterns for the 1000 Genomes populations are now shown along with selection results for each of the 8 ancestry components.

2) In general, while clear to specialists, early in the paper the writing is difficult, with many stats tests before the biological question you are attempting to test.

Thank you for this suggestion. We have attempted to address this by adding clear topic sentences and restructuring several of these initial paragraphs to describe the motivating questions before describing the results.

3) ABC demography model is quite complex; does this matter for the inference? Importantly, how does it mix with Neanderthal introgression scenario? It is also unclear what summary statistics are being used in the ABC model, is it the Ohana/PBS-style only? If other measures, are they admixture-aware? The point estimate of "strong" selection is hard to trust with so much model uncertainty and identifiability issues with timing.

Thank you for pointing out these concerns. We agree that our model is fairly complex and relies on estimates of demographic parameters from the literature. We chose to incorporate five populations in order to have the resolution for capturing differences in the allele frequency of the IGHG4 insertion in East Asia (i.e., near fixation in the CDX and KHV Southeast Asian populations, intermediate frequency in CHB, almost absent in JPT, and CEU as an outgroup). Because the estimates of population divergence time and size used in our model come from two sources (Jouganous et al., 2017; Wang et al., 2018), and are broadly consistent between these two methods, we believe that the model is sufficiently accurate to provide a coarse estimate of selection parameters at this locus. We now include a more in-depth discussion of these caveats in the discussion (lines 583-586).

We apologize that the previous description of ABC summary statistics was unclear. We used the end-of-simulation allele frequencies of the simulated variant in each of the five populations, and compared them to the known allele frequencies of the IGHG4 insertion in these populations. We clarify this choice in the Methods (lines 803-805).

4) Further support for the claim of selection would strengthen the argument, i.e. other summary statistics that show signatures of selection.

We currently only evaluate selection at the IGH locus using methods based on allele frequencies. We are hesitant to incorporate haplotype-based summary statistics due to the known challenge of genotyping and phasing within the broader immunoglobulin region (outside of the IGH genes), which exhibits high levels of genetic variation exacerbated by the phenomena of somatic recombination and hypermutation that occurs in developing B-cells (Chin et al., 2020; Zhang et al., 2021).

We now provide further support for our conclusion of selection at this locus by performing 100,000 neutral simulations using the 5-population demographic model based on (Jouganous et al., 2017; Wang et al., 2018). We show that the magnitudes of allele frequency differences, quantified as pairwise frequency differences as well as with the population branch statistic, greatly exceed those expected based on genetic drift alone (lines 494-496; Fig. 5 - S1). This approach is similar to that employed in previous literature (e.g., Ilardo et al., 2018). The omission of migration from the simulations makes our conclusion conservative, in that gene flow would act to reduce allele frequency differences between populations.

Reviewer #2 (Recommendations for the authors):

1. The authors do a good job motivating the importance of SVs in human evolution and the previous technical limitations that have limited comprehensive analysis. They also clearly motivate their work using recently generated population-level SV datasets. However, clarification of how this paper's results compare to and expand these previous findings is needed. Previous analyses of SV evolution were not "comprehensive" due to the limitations of short-read approaches. However, as noted by the authors, this and other long-read-based studies also currently cannot "comprehensively" study human SV. I do not see a problem with the "focus on individual variants" obtained in a different way, but more description of how their set of SVs is similar or different from previous work would aid interpretation. A few more supplementary figures describing characteristics of the 92,286 SVs used for downstream analysis (length, type of SV, genome distribution, SV overlap, frequency, etc) are needed. Even though they are already cited, some papers the authors could consider further contextualizing their results in include: Hsieh et al., 2019., Sudmant et al., 2015, Almarri et al., 2020, Ebert et al., 2020, Audano et al., 2019 (there are also others that consider SV with direct reference to introgression events).

Thank you very much for these suggestions, which are closely related to comments by Reviewer 1. We now provide a detailed comparison of these curated SVs to two sets of SVs discovered from short-read sequencing of diverse human samples (Almarri et al., 2020; Sudmant et al., 2015) (lines 80-97). We find that this long-read-discovered SV set includes 89,979 variants (83.4% of long-read SVs) that are not represented in the 1000 Genomes Project (1KGP) or the Human Genome Diversity Project (HGDP). These long-read specific variants include 30,229 that are “common” (AF ≥ 0.05), representing 72.3% of all common SVs. We were also able to rediscover a large proportion of the short-read-discovered SVs in these two datasets, including 66.0% and 17.7% of common SVs in 1KGP and HGDP, respectively (Figure 1 – S2 and Figure 1 – S3). These results are consistent with reports from previous studies (Zhao et al., 2021).

The overlap we describe above is notable given that the much smaller size of the long-read sample set (15 individuals vs. 2,504 for 1KGP and 911 for HGDP), and that the sample sets do not overlap completely (i.e., we expect that many rare or singleton SVs should not be represented in both datasets). We expect that the SVs unique to the short-read datasets reflect both differences in the discovery sample set (i.e., many of the long-read sequenced individuals are also in 1KGP, while none are in HGDP) and false positives in short-read SV detection (Nattestad et al., 2018).

We have also added supplementary figures further characterizing SV lengths and allele frequencies (Figure 1 – S5, Figure 1 – S6), as well as their distribution throughout the genome (Figure 1 – S1). We have expanded our discussion of these findings in the text of the paper (lines 147-153).

2. Related to the above point, the use of the graph genotyping software, Paragraph, is s sensible approach, and I suspect that this paper may provide a template for future analyses merging long-read and short-read SV data. Given this potential contribution, more should be done to discuss the benefits and limitations of this approach. There is brief discussion of "ascertainment bias" based on the 15 individuals with long-read data used from Audano et al., 2019. How does the ancestry of these 15 individuals influence the ability to comprehensively identify SVs in thousands of individuals with different ancestries? A few more details in the main text regarding the sensitivity and specificity of Paragraph would also help to aid result interpretation (rather than just "recent benchmarking…support the accuracy of Paragraph"). Does the Paragraph approach better identify longer SVs, deletions, duplications, inversions, SVs in more repetitive regions of the genome?

Thank you for these questions. Regarding ascertainment bias, the diversity of the long-read discovery set we used (sample ancestries: 3 African, 2 American, 3 East Asian, 2 European, 2 South Asian, 2 hydatidiform moles [likely European]) allows us to discover approximately equivalent numbers of SVs across the five continental superpopulations of 1KGP. We appreciate the importance of including this information in the paper and now discuss the ancestries of the long-read sequenced individuals in lines 77-80.

Thank you also for the questions about the accuracy of Paragraph, which are again closely related to questions from Reviewer 1. We now provide additional results supporting the accuracy of graph genotyping on lines 100-107, including important details of previous benchmarking studies. Briefly, we find that among graph-based and non-graph based genotyping tools, Paragraph consistently attains the best balance of precision and recall for both insertions and deletions. Paragraph averages precision and recall rates of 0.72 and 0.70, respectively, across all SV types and genomic regions, and an average precision and recall of 0.86 and 0.79 when repeat regions are excluded. These metrics support its use for genotyping SVs even in repetitive regions of the genome that are typically difficult to map with short reads (Chen et al., 2019; Hickey et al., 2020).

3. The authors identify 1121 significant associations between SVs and the expression of nearby genes in LCLs. This analysis is reasonable and clear, but the implications are not clear. Prior to this section, the authors discuss LD with previously investigated SNPs/indels, yet they do not report this information for the eQTL 1121 SVs. How many of these 1121 are in high LD with common SNPs/indels? And how many are not linked to previously investigated variation? Even if most are linked, this is still potentially interesting, but it would be helpful to know if they are identified as eQTL in other datasets (GTEx?). Do these potentially SV mediated eQTL have greater effect sizes than eQTL caused by SNVs? Is there a way to fine map one or two (potentially one related to the subsequently immunoglobulin story or the ones highlighted in Figure 2C) to show that the SV is likely driving the association, rather than other linked variants?

In response to these questions, we have added more biological context to our eQTL analysis by intersecting SVs with annotated regulatory elements (lines 215-220). We observed a strong enrichment of significant gene expression associations for SVs that intersect annotated enhancer elements, as well as a strong depletion of expression associations for SVs that intersect “quiescent” sequences that are devoid of important epigenetic marks (Figure 2D). The intuitive nature of these results supports the robustness of our analysis, while also shedding light on functional mechanisms through which SVs mediate impacts on phenotypes and fitness.

We also investigated direct dosage effects of exon-spanning SVs, showing that the magnitude and direction of these effects are broadly consistent with expectations (lines 220-225). We also performed a fine-mapping analysis of SV and SNP eQTLs (lines 228-244). While noting important technical caveats, both of these analyses clarify the causal link between a subset of SVs and the observed gene expression changes.

Finally, we checked whether any frequency-differentiated SVs identified in our study were also identified as eQTLs, and found seven such SVs (Supplementary Table 2). One major limitation of this analysis is the lack of diversity in the Geuvadis dataset, which restricts us to SVs that are common in European or Yoruban populations. Consequently, we were unable to directly test gene expression associations for SVs in the IGH locus in East Asian samples, because this ancestry group is not represented in Geuvadis.

4. The authors use Ohana to test for local adaptation. This method seems appropriate for the challenges of the SV data; the results are interesting; and I appreciate their new individual examples. However, these results are largely descriptive. Testing larger hypotheses about whether SVs are targets of selection more often than SNVs or if these patterns differ between populations or parts of the genome would substantially increase the impact of these results. Figure 4-S1 may be relevant to this kind of comparison; however, it only shows a few loci. While differences in power and ascertainment may make direct comparison challenging, I encourage the authors to think about how to move beyond a list of examples to more general conclusions.

Thank you for this suggestion. We explored the idea of “positive selection fine-mapping” during our initial investigation of the Ohana results, including attempting to adapt CAVIAR (Hormozdiari et al., 2014) toward this goal. However, we were ultimately unsatisfied with the reliability and interpretability of this analysis, which extends beyond the intended use of CAVIAR (fine-mapping of genotype-phenotype associations) and was also sensitive to even low rates of genotyping error, which remains higher for SVs than SNPs. We believe that such selection fine-mapping and enrichment analyses are an interesting area for methodological development that lies outside the scope of our current study, but we agree with the broader goal of moving beyond lists of examples to more general conclusions.

In this spirit, we have identified 133 of the 220 outlier SVs where the SV is among the top 0.1% of variants within a 1 Mb window (lines 290-293). Several other aspects of our revision also extend beyond specific loci (e.g., quantitative comparisons of SV catalogs, eQTL functional enrichments) and should help address this point.

Additionally, the y-axis range on Figure 4 across the different ancestries are quite different. It is hard to interpret to the significance and magnitude of the LRS. Can the multiple testing control be integrated here?

Stratifying the analysis by ancestry component is important, as the test has different power for detecting allele frequency changes on different branches of the tree. A combined comparison of LRS would cause certain ancestry components to dominate the results. See (Cheng et al., 2021) for a description of the method and a demonstration on stratified ancestry groups, which we followed in our work.

Prior to our initial submission, we very carefully considered multiple testing correction and the interpretation of LRS p-values, which could naively be adjusted to control the family-wise error rate or false discovery rate using standard methods. Even using very stringent thresholds, we found this approach to identify implausible numbers of selected loci. However, we note that the very large dataset (thousands of samples) means that even very small improvements in the fit of the more complex model cause it to be favored over the reduced model. In this context, the p-value is not particularly meaningful, except for ranking of results for a given ancestry component.

Drawing a dividing line between positively selected and neutrally evolving loci is a general challenge in the field, which we believe is important but beyond the scope of our work. While evolutionary simulation offers one useful approach, it requires us to specify the parameters of the demographic model including changes in population sizes and migration, which remain uncertain even for individual populations, let alone for a global sample.

Given these challenges, we ultimately settled on an outlier-based approach, which we still find valuable given the extreme signatures observed at loci such as IGH, especially given its known role in adaptive immunity.

Nevertheless, the reviewer’s point is very well taken, and we have bolstered our selection results at IGH by performing 100,000 neutral simulations under the simplified demographic model of East Asia described by (Jouganous et al., 2017; Wang et al., 2018) (lines 494-496; Figure 5 – S1), demonstrating that the observed magnitudes of allele frequency differentiation drastically exceed those expected under genetic drift alone.

5. Finally, the authors explore one example, the immunoglobulin heavy chain locus, in detail. The example and analysis are compelling; however, it is complicated and sometimes hard to follow given the multiple SVs in this region. Readers would benefit from a cartoon schematic of this locus, perhaps expanding on the supplemental figure. I would also appreciate more discussion of why this region might have been filtered out in Browning et al., 2018.

We thank the reviewer for pointing out sources of confusion in our discussion of the IGH locus. We have clarified the structure of the locus by adding a schematic in Figure 4A.

We also further investigated the filtering of this region by Browning et al., Their filtering involved a mask for the Altai Neanderthal sequence, generated by (Prüfer et al., 2017), that removes sites that have low coverage or mapping quality, or are located within tandem repeats or indels (Browning et al., 2018). We now add a supplementary figure showing the masking of the broader IGH region that we highlight (Figure 4 – S4), where 19.3% of the sequence remains after masking (in comparison with 62.6% in the remainder of the q-arm of chromosome 14). We also describe the Browning et al., filtering in more detail in the text (lines 415-417).

The Browning et al., study highlighted a subregion of the IGH locus that is downstream of the region we focus on. The haplotype they discuss reaches high frequencies in European and Asian populations, and is distinct from the haplotype we investigate, which reaches high frequency only in southeast Asia. This is displayed in Figure 4 – S3, where the Browning et al., haplotype is represented in the EUR, EAS, and AMR populations, while the one we investigate is only in the EAS populations (with the exception of JPT). The precise reason why Browning et al., focused on this Eurasian haplotype, rather than the one at high frequency in southeast Asia alone, is unclear. However, we hypothesize that this may be due to differences in how our studies chose populations to search for frequency differences between, or the exclusion of variants on the southeast Asian haplotype by the Altai Neanderthal mask (for example, variants within 10 bp of the insertion SV we highlight or its most frequency differentiated tag SNP are removed by masking). While we believe that an in-depth discussion of the multiple haplotypes at the IGH locus would overly complicate the paper, we provide a more detailed description of the distinction between our result and the Browning et al., paper in lines 407-417.

Reviewer #3 (Recommendations for the authors):

The abstract and introduction focus on the size of SVs and the relative "ease" of analysis of SNVs and indels, however, the highlighted variants are indels by the 1000 genomes definition. They also seem to have been discovered in other short-read datasets, (e.g. gnomAD).

Thank you for raising this point, which we agree is vital to address in the paper. While the highlighted SV is present in the gnomAD database and sets of small indels called by 1000 Genomes and HGDP (https://gnomad.broadinstitute.org/variant/14-105622991-G-GGGGGACAGCGTCAGGGACAGGTGGGGACAGCCT?dataset=gnomad_r3), it exhibits a highly skewed allele balance (centered on 0.2, in contrast to the expectation of 0.5). The insertion also has low genotyping rates (31% for CDX in 1000 Genomes, compared with 85% in our study) that lead to lower estimates of its allele frequency in East Asian populations (AF = 0.73 in CDX). These caveats to the SV’s discovery in other datasets underscore the technical challenges of genotyping at this locus with traditional short-read approaches. We have incorporated this information into the text of the paper in lines 374-381.

Paragraph has been previously shown to work well for genotyping SVs, however it is not clear what the accuracy in the low-coverage 1000 genomes samples is. After filtering, while 92,286 variants remain, 25,201 (27%) of these are absent from any of the 1000 genomes samples. It is unclear how the authors distinguish "true negatives" from "false negatives."

Our graph genotyping analysis was based on the more recent high coverage (30x) sequencing of 1000 Genomes samples by the New York Genome Center (Byrska-Bishop et al., n.d.) and is therefore similar to the data on which Paragraph was benchmarked. The description of the 25,201 SVs that are absent from any 1000 Genomes sample is simply a description of the genotyping results (the “0” bin of the allele frequency spectrum) and does not attempt to separate false negatives from true negatives. This large number of “0-frequency” variants is expected, however, due to the abundance of rare variation in human populations and the ascertainment scheme of identifying variants in a distinct sample from that in which they are genotyped. We thank the reviewer for raising this point of confusion and have rephrased the text to clarify (lines 128-131).

In support of Paragraph’s genotyping accuracy, we (1) direct readers to the Paragraph and subsequent benchmarking papers (Hickey et al., 2020), where the method was thoroughly evaluated and shown to be more accurate and faster than other SV genotyping methods (lines 100-107); and (2) demonstrate strong concordance in allele frequencies of graph genotyped variants and frequencies of the same variants detected via long-read sequencing (Figure 1 – S4).

I don't think the LD section adds anything new. This is well reported.

And LD cutoff of 0.5 R² is used for introgressed haplotypes versus 0.8² for tag haplotypes in the original motivation. Suggest picking one to be consistent throughout.

While we appreciate that this is not the first study to measure LD between SVs and SNPs, we still think it is important to perform this analysis on this specific dataset, both as a general description of the data and to motivate the goal of identifying novel phenotype associations and positive selection targets that were not tagged by known SNPs. However, we recognize that we could better inform readers of the previous context for these LD results, and have discussed other papers with a similar analysis in lines 177-179.

We thank the reviewer for noting the differences in LD thresholds used in different analyses. We have revised the text (lines 174-177) to also include the lower LD threshold (r² = 0.5) in the tag SNP analysis. Use of this threshold in the introgression analysis serves to capture a broad set of candidate introgressed SVs that can then be verified through further analysis. For the tag SNP analysis, we additionally report results for the more stringent threshold (r² = 0.8).

References

1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. 2015. A global reference for human genetic variation. Nature 526:68–74.

Almarri MA, Bergström A, Prado-Martinez J, Yang F, Fu B, Dunham AS, Chen Y, Hurles ME, Tyler-Smith C, Xue Y. 2020. Population Structure, Stratification, and Introgression of Human Structural Variation. Cell 182:189–199.e15.

Audano PA, Sulovari A, Graves-Lindsay TA, Cantsilieris S, Sorensen M, Welch AE, Dougherty ML, Nelson BJ, Shah A, Dutcher SK, Warren WC, Magrini V, McGrath SD, Li YI, Wilson RK, Eichler EE. 2019. Characterizing the Major Structural Variant Alleles of the Human Genome. Cell 176:663–675.e19.

Browning SR, Browning BL, Zhou Y, Tucci S, Akey JM. 2018. Analysis of Human Sequence Data Reveals Two Pulses of Archaic Denisovan Admixture. Cell 173:53–61.e9.

Byrska-Bishop M, Evani US, Zhao X, Basile AO, Abel HJ, Regier AA, Corvelo A, Clarke WE, Musunuri R, Nagulapalli K, Fairley S, Runnels A, Winterkorn L, Lowy-Gallego E, Flicek P, Germer S, Brand H, Hall IM, Talkowski ME, Narzisi G, Zody MC, The Human Genome Structural Variation Consortium. n.d. High coverage whole genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. doi:10.1101/2021.02.06.430068

Cheng JY, Racimo F, Nielsen R. n.d. Ohana: detecting selection in multiple populations by modelling ancestral admixture components. doi:10.1101/546408

Chen L, Wolf AB, Fu W, Li L, Akey JM. 2020. Identifying and Interpreting Apparent Neanderthal Ancestry in African Individuals. Cell 180:677–687.e16.

Chen S, Krusche P, Dolzhenko E, Sherman RM, Petrovski R, Schlesinger F, Kirsche M, Bentley DR, Schatz MC, Sedlazeck FJ, Eberle MA. 2019. Paragraph: a graph-based structural variant genotyper for short-read sequence data. Genome Biol 20:291.

Chin C-S, Wagner J, Zeng Q, Garrison E, Garg S, Fungtammasan A, Rautiainen M, Aganezov S, Kirsche M, Zarate S, Schatz MC, Xiao C, Rowell WJ, Markello C, Farek J, Sedlazeck FJ, Bansal V, Yoo B, Miller N, Zhou X, Carroll A, Barrio AM, Salit M, Marschall T, Dilthey AT, Zook JM. 2020. A diploid assembly-based benchmark for variants in the major histocompatibility complex. Nat Commun 11:4794.

Hickey G, Heller D, Monlong J, Sibbesen JA, Sirén J, Eizenga J, Dawson ET, Garrison E, Novak AM, Paten B. 2020. Genotyping structural variants in pangenome graphs using the vg toolkit. Genome Biol 21:35.

Hormozdiari F, Kostem E, Kang EY, Pasaniuc B, Eskin E. 2014. Identifying causal variants at loci with multiple signals of association. Genetics 198:497–508.

Ilardo MA, Moltke I, Korneliussen TS, Cheng J, Stern AJ, Racimo F, de Barros Damgaard P, Sikora M, Seguin-Orlando A, Rasmussen S, van den Munckhof ICL, Ter Horst R, Joosten LAB, Netea MG, Salingkat S, Nielsen R, Willerslev E. 2018. Physiological and Genetic Adaptations to Diving in Sea Nomads. Cell 173:569–580.e15.

Jouganous J, Long W, Ragsdale AP, Gravel S. 2017. Inferring the Joint Demographic History of Multiple Populations: Beyond the Diffusion Approximation. Genetics 206:1549–1567.

Nattestad M, Goodwin S, Ng K, Baslan T, Sedlazeck FJ, Rescheneder P, Garvin T, Fang H, Gurtowski J, Hutton E, Tseng E, Chin C-S, Beck T, Sundaravadanam Y, Kramer M, Antoniou E, McPherson JD, Hicks J, McCombie WR, Schatz MC. 2018. Complex rearrangements and oncogene amplifications revealed by long-read DNA and RNA sequencing of a breast cancer cell line. Genome Res 28:1126–1135.

Prüfer K, de Filippo C, Grote S, Mafessoni F, Korlević P, Hajdinjak M, Vernot B, Skov L, Hsieh P, Peyrégne S, Reher D, Hopfe C, Nagel S, Maricic T, Fu Q, Theunert C, Rogers R, Skoglund P, Chintalapati M, Dannemann M, Nelson BJ, Key FM, Rudan P, Kućan Ž, Gušić I, Golovanova LV, Doronichev VB, Patterson N, Reich D, Eichler EE, Slatkin M, Schierup MH, Andrés AM, Kelso J, Meyer M, Pääbo S. 2017. A high-coverage Neandertal genome from Vindija Cave in Croatia. Science 358:655–658.

Schaefer NK, Shapiro B, Green RE. 2021. An ancestral recombination graph of human, Neanderthal, and Denisovan genomes. Sci Adv 7. doi:10.1126/sciadv.abc0776

Sudmant PH, Rausch T, Gardner EJ, Handsaker RE, Abyzov A, Huddleston J, Zhang Y, Ye K, Jun G, Fritz MH-Y, Konkel MK, Malhotra A, Stütz AM, Shi X, Casale FP, Chen J, Hormozdiari F, Dayama G, Chen K, Malig M, Chaisson MJP, Walter K, Meiers S, Kashin S, Garrison E, Auton A, Lam HYK, Mu XJ, Alkan C, Antaki D, Bae T, Cerveira E, Chines P, Chong Z, Clarke L, Dal E, Ding L, Emery S, Fan X, Gujral M, Kahveci F, Kidd JM, Kong Y, Lameijer E-W, McCarthy S, Flicek P, Gibbs RA, Marth G, Mason CE, Menelaou A, Muzny DM, Nelson BJ, Noor A, Parrish NF, Pendleton M, Quitadamo A, Raeder B, Schadt EE, Romanovitch M, Schlattl A, Sebra R, Shabalin AA, Untergasser A, Walker JA, Wang M, Yu F, Zhang C, Zhang J, Zheng-Bradley X, Zhou W, Zichner T, Sebat J, Batzer MA, McCarroll SA, 1000 Genomes Project Consortium, Mills RE, Gerstein MB, Bashir A, Stegle O, Devine SE, Lee C, Eichler EE, Korbel JO. 2015. An integrated map of structural variation in 2,504 human genomes. Nature 526:75–81.

VanLiere JM, Rosenberg NA. 2008. Mathematical properties of the r2 measure of linkage disequilibrium. Theor Popul Biol 74:130–137.

Wang Y, Lu D, Chung Y-J, Xu S. 2018. Genetic structure, divergence and admixture of Han Chinese, Japanese and Korean populations. Hereditas 155:19.

Yair S, Lee KM, Coop G. 2021. The timing of human adaptation from Neanderthal introgression. Genetics 218. doi:10.1093/genetics/iyab052

Zhang J-Y, Roberts H, Flores DSC, Cutler AJ, Brown AC, Whalley JP, Mielczarek O, Buck D, Lockstone H, Xella B, Oliver K, Corton C, Betteridge E, Bashford-Rogers R, Knight JC, Todd JA, Band G. 2021. Using de novo assembly to identify structural variation of eight complex immune system gene regions. PLoS Comput Biol 17:e1009254.

Zhao X, Collins RL, Lee W-P, Weber AM, Jun Y, Zhu Q, Weisburd B, Huang Y, Audano PA, Wang H, Walker M, Lowther C, Fu J, Human Genome Structural Variation Consortium, Gerstein MB, Devine SE, Marschall T, Korbel JO, Eichler EE, Chaisson MJP, Lee C, Mills RE, Brand H, Talkowski ME. 2021. Expectations and blind spots for structural variation detection from long-read assemblies and short-read genome sequencing technologies. Am J Hum Genet 108:919–928.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Citations

Yan SM, McCoy RC. 2021. Data from: Local adaptation and archaic introgression shape global diversity at human structural variant loci. Zenodo. [DOI] [PMC free article] [PubMed]
Audano PA. 2019. Long-read sequencing of a Puerto Rican individual (HG00733) NCBI BioProject. PRJNA300840
Audano PA. 2019. Long-read sequencing of a Yoruban individual (NA19240) NCBI BioProject. PRJNA288807
Audano PA. 2019. Long-read sequencing of a Gambian individual (HG02818) NCBI BioProject. PRJNA339722
Audano PA. 2019. Long-read sequencing of a Luhya individual (NA19434) NCBI BioProject. PRJNA385272
Audano PA. 2019. Long-read sequencing of a Vietnamese individual (HG02059) NCBI BioProject. PRJNA339726
Audano PA. 2019. Long-read sequencing of a Utah individual with Northern and Western European ancestry (NA12878) NCBI BioProject. PRJNA323611
Audano PA. 2019. Long-read sequencing of a Telugu individual (HG04217) NCBI BioProject. PRJNA481794
Audano PA. 2019. Long-read sequencing of a Peruvian individual (HG02106) NCBI BioProject. PRJNA480858
Audano PA. 2019. Long-read sequencing of a Finnish individual (HG00268) NCBI BioProject. PRJNA480712
Audano PA. 2019. Long-read sequencing of a Han Chinese individual (HG00514) NCBI BioProject. PRJNA300843
Byrska-Bishop M. 2019. 1000 Genomes Project phase 3: 30X coverage whole genome sequencing. NCBI BioProject. PRJEB31736
Huddleston J. 2017. Long-read sequencing of a hydatidiform mole (CHM1) NCBI BioProject. PRJNA246220
Huddleston J. 2017. Long-read sequencing of a hydatidiform mole (CHM13) NCBI BioProject. PRJNA269593
Lappalainen T. 2013. RNA-sequencing of 465 lymphoblastoid cell lines from the 1000 Genomes. ArrayExpress. E-GEUV-1
Seo J-S. 2016. Long-read sequencing of a Korean individual (AK1) NCBI BioProject. PRJNA298944
Shi L. 2016. Long-read sequencing of a Chinese individual (HX1) NCBI BioProject. PRJNA301527

Supplementary Materials

Figure 1—source data 1. Superpopulation-level allele frequencies of structural variants used to construct allele frequency spectra in Figure 1C.

The third column (‘sum_AC’) provides allele counts, the fourth column (‘sum_NCHROBS’) provides the number of genotyped chromosomes, and the fifth column (‘AF’) provides the alternative allele frequency.

elife-67615-fig1-data1.txt^{(18MB, txt)}

Figure 2—source data 1. Summarized eQTL data underlying Figure 2.

SV IDs, gene IDs and names, distance to the TSS, slopes, p-values, and q-values are provided.

elife-67615-fig2-data1.txt^{(2.9MB, txt)}

Figure 3—source data 1. Ancestry component results underlying Figure 3A.

Ancestry proportions for each individual are provided, along with that individual’s 1KGP population code.

elife-67615-fig3-data1.txt^{(630.6KB, txt)}

Figure 5—source data 1. Parameters and summary statistics for the simulations retained in the ABC analysis, underlying Figure 5D–G.

Simulation parameters are provided in the first three columns. Population-specific allele frequencies output by the simulation are provided in the last five columns, labeled by their corresponding 1KGP population codes.

elife-67615-fig5-data1.txt^{(581.6KB, txt)}