Abstract
India has been underrepresented in genomic surveys. We generated whole-genome sequences from 2,762 individuals in India, capturing the genetic diversity across most geographic regions, linguistic groups, and historically underrepresented communities. We find most Indians harbor ancestry primarily from three ancestral groups: South Asian hunter-gatherers, Eurasian Steppe pastoralists, and Neolithic farmers related to Iranian and Central Asian cultures. The extensive homozygosity and identity-by-descent sharing among individuals reflects strong founder events due to a recent shift toward endogamy. We uncover that most of the genetic variation in Indians stems from a single major migration out of Africa that occurred around 50,000 years ago, followed by 1%–2% gene flow from Neanderthals and Denisovans. Notably, Indians exhibit the largest variation and possess the highest amount of population-specific Neanderthal ancestry segments among worldwide groups. Finally, we discuss how this complex evolutionary history has shaped the functional and disease variation on the subcontinent.
INTRODUCTION
With more than 1.4 billion people and approximately 5,000 anthropologically well-defined ethno-linguistic and religious communities, India is a region of extraordinary diversity.1 Yet, Indian populations remain underrepresented in genomic studies. Recent sequencing studies, such as the 1000 Genomes Project (1000G),2 UK Biobank,3 TopMed,4 Simons Genome Diversity Panel,5 Human Genome Diversity Panel (HGDP),6 and GenomeAsia,7,8 have incorporated Indian populations. However, with the exception of GenomeAsia,7,8 these efforts have either included very few individuals or primarily sampled expatriate communities outside of India, leading to a limited (and biased) representation of the genetic variation seen in India. As a result, many questions about India’s population history remain open. When did people first migrate to India from Africa: as part of the major migration out of Africa or at earlier times along the southern coastal migration route?9,10 What is the contribution and legacy of gene flow from archaic hominins (Neanderthals and Denisovans) to Indian populations? How have recent technological innovations, like Neolithic farming and the spread of languages, impacted genetic variation and disease in India?
Understanding population history—past migrations, population bottlenecks, and admixture events—is useful for tracing population origins and foundational for effective disease mapping. Recent studies have shown that inadequately accounting for population structure can lead to false positives in genome-wide association studies11,12. In turn, leveraging population history enhances the power to detect true associations and provides insights into the source and dynamics of disease-causing variants over time13,14. For instance, gene flow from archaic groups has significantly influenced human adaptation and fitness, affecting numerous traits from high-altitude adaptation to risk of diabetes and infectious diseases15–17. Thus, a comprehensive understanding of population history is an essential first step in effective disease mapping.
To obtain a detailed picture of genetic diversity in India, we generated deep coverage genome sequences of 2,762 individuals. Our data are part of the Longitudinal Aging Study in India Diagnostic Assessment of Dementia (LASI-DAD) 18, a population-based prospective genomic cohort study of individuals aged 60 years or older. LASI-DAD comprises nationally representative data of individuals from 18 different states and union territories across India (Figure 1A), with a median sample size of 157 individuals per state (see STAR Methods and Data S1, Section 1). It includes diverse geographic regions (including rural and urban areas), speakers of various language families (e.g., Indo-European, Dravidian, and Tibeto-Burman languages) and historically underrepresented communities (e.g., Scheduled Tribes [ST], Scheduled Castes [SC] and Other Backward Class [OBC]), providing the most comprehensive snapshot of genetic variation in India.
Figure 1. Population structure and admixture in India.

(A) Sampling locations of LASI-DAD individuals in India, colored by region (North, North-east, Central, South, East and West) used for analysis. (B) Principal component analysis (PCA) for Indians in LASI-DAD and 1000G individuals of West Eurasian (EUR), East Asian (EAS) and South Asian (SAS) ancestry. We show the projection of the first two principal components, colored by region of birth. We find similar results with West Eurasian and East Asian populations from HGDP (Figure S4.5). (C) Ancestry proportions for each individual on the ‘Indian cline’ using Sarazm_EN (in orange) as a proxy for Iranian farmer-related, Central_Steppe_MLBA (in green) as a proxy for Steppe pastoralist-related and AHG (Onge) (in blue) as a proxy for AASI-related ancestry. Ancestry proportions are compared by region (left), language family (middle), and social group (right) of each individual. Boxplot box limits represent the 25th and 75th percentiles; the center line, the median; and the whiskers, the minimum/maximum value in the data or the 25th/75th percentile + 1.5x interquartile range.
RESULTS
Data and catalog of genetic variants
A total of 2,762 LASI-DAD participants, including 22 trios (mother-father-child), were sequenced to an average read depth of 30× at MedGenome (Bangalore, India). The raw whole genome sequences were sent to the Genome Center for Alzheimer’s Disease (GCAD) at the University of Pennsylvania for joint calling and quality control (see STAR Methods; Data S1, Section 2). A total of 2,679 individuals and 73.2 million autosomal bi-allelic variants passed quality control filters, including 67.1 million single-nucleotide variants (SNVs) and 6.04 million insertion-deletions (indels). We identified 24 million SNVs and 2.2 million indels absent from existing human genomic variation databases, such as the 1000G and the Genome Aggregation Database (gnomAD)19, highlighting the limitations of these databases in representing diverse populations. The vast majority (>99%) of these variants are rare (with a frequency of less than 1%), including 68% singletons (Data S1, Table S2.1). Genome phasing was conducted using SHAPEIT420 (without a reference panel), and we estimated a low phase switch error rate of less than 1.13% in trios (Data S1, Table S3.1).
Our dataset is representative of the population diversity in India. It includes individuals born in 23 states and union territories, from both rural (63%) and urban (37%) areas. It consists of speakers of approximately 26 languages, with 74% of individuals reporting as Indo-European language speakers and 25% as Dravidian language speakers. It comprises individuals from diverse communities including tribal (ST: 4%) and caste (SC: 18% and OBC: 44%) groups.21,22 Nearly equal numbers of males and females were recruited, with females comprising 52% of our dataset. For many analyses, we grouped individuals based on their location of birth into six major geographic regions: North (n = 555), West (n = 385), Central (n = 373), South (n = 715), Northeast (n = 73), and East (n = 530) (Data S1, Section 1). After performing quality control checks and excluding first-degree relatives, we used a sample of 2,620 individuals for most of our analyses described below, unless specified otherwise (Data S1, Section 2).
Population structure and admixture
To study the population relationships of Indians with other worldwide populations, we combined the LASI-DAD dataset with the 1000G23 and HGDP6 datasets and performed principal-component analysis (PCA)24 and ADMIXTURE25. Consistent with previous reports26,27, we find that the population structure of India is related to individuals of West Eurasian-related (1000G EUR) and East Asian-related (1000G EAS) ancestries (Figure 1B and Data S1, Figure S4.1). We obtain qualitatively similar results when including Central Asians and Middle Easterners from HGDP, highlighting that the patterns observed in PCA reflect West Eurasian-relatedness rather than European-relatedness alone (Data S1, Figure S4.5). Unlike previous studies that mainly sampled isolated endogamous communities in India26,28,29, we used a grid-based or population density-based sampling approach. Notably, we observed more continuous variation in ancestry along PC1 and PC2. There are three main clusters in PCA. The first includes the majority of the individuals from the North and South of India, who exhibit varying relatedness to West Eurasians; it is referred to as the “Indian cline” (Figure 1B and Data S1, Figures S4.2 and S4.3). The Indian cline has previously been shown to reflect varying proportions of ancestry from two ancestral groups: the “Ancestral North Indians” (ANI), who harbor large proportions of ancestry related to West Eurasians, and the “Ancestral South Indians” (ASI), who are only distantly related to West Eurasians26,27. Recent ancient DNA analyses have shown that (unsampled) Indigenous South Asians (“Ancient Ancestral South Indians” [AASI]) distantly related to Andamanese hunter-gatherers (AHG) mixed with ancient Iranian farmers to form ASI, who in turn mixed with ancient Eurasian Steppe pastoralist-related groups to form ANI. Thus, both the ASI and ANI were admixed groups30 (Box 1).
Box 1. Description of source populations used in the study.
| AASI | Ancient Ancestral South Indians represent an Indigenous, unsampled South Asian population that is one of the most ancient lineages in South Asia |
| ASI | Ancestral South Indians represent a hypothetical group that has ancestry related to AASI and ancient Iranian farmers |
| ANI | Ancestral North Indians represent a hypothetical group that has ancestry related to ASI and Eurasian Steppe pastoralists from Central Steppe MLBA |
| AHG | AHG refers to present-day Indigenous Andaman Islanders, the Onge, who are related to the unsampled Indigenous South Asians30 (n = 15) |
| Indus Periphery Cline | Indus Periphery Cline is a heterogeneous group of 11 outlier samples from Bronze Age cultures of Shahr-i-Sokhta and Bactria Margiana Archaeological Complex that has been shown to have Iranian farmer-related and some AHG-related ancestry in an earlier study.30 Indus Periphery West (I8726) is a single individual with the highest Iranian farmer-related ancestry among the Indus Periphery Cline, dated to ~3,100–3,000 BCE |
| Central Steppe MLBA | The individuals from the Central Steppe Middle to late Bronze Age are considered as the source for Yamnaya Steppe pastoralist-derived ancestry in South Asia30 (n = 34) dated to ~2,000–900 BCE |
| Sarazm EN | 4th millennium BCE farmers and herders from Sarazm, Tajikistan, dated to ~3,600–3,500 BCE (n = 2) |
| Parkhai Anau EN | Eneolithic individuals from Tepe Anau and Parkhai in Turkmenistan dated to ~3,500–2,900 BCE (n = 9) |
| Namazga CA | Chalcolithic individuals from Namazga in Turkmenistan dated to ~3,300–3,200 BCE (n = 2) |
Beyond the Indian cline, we find two primary clusters of individuals (n = 494): one cluster close to the ASI-end of the cline and another located off the center but exhibiting clear relatedness to East Asian-related groups (1000G EAS) (Figure 1B). The former mainly includes individuals from Central and East India, with the majority from the state of Odisha where predominantly Indo-European and Austroasiatic languages are spoken. The second cluster includes individuals from East and Northeast regions of India. West Bengal is the most representative state in this cluster, with individuals deriving ~5% ancestry related to East Asians (Data S1, Section 4). By measuring the admixture-related linkage disequilibrium using ALDER31, we inferred that the East Asian-related gene flow occurred 50 generations ago or around 520 CE (Data S1, Figure S4.15), consistent with estimates from an earlier study of individuals from Bangladesh32. This timing is concordant with the collapse of the Gupta Empire, though some mixture could have also occurred earlier with the spread of rice farming from East Asia33,34. The East Asian-related cluster also includes individuals from Assam who speak Tibeto-Burman languages. PCA shows significant heterogeneity in East Asian relatedness among individuals in this group, indicative of recent gene flow (Figure 1B). Our ADMIXTURE25 analysis mirrors the patterns observed in PCA (Data S1, Figure S4.8).
To model ancestry in India, we used qpAdm, which compares allele frequency correlations between a population of interest and a set of reference and outgroup populations35,36. First, we examined how well the three-way model with ancient Iranian farmer-related, Eurasian Steppe pastoralist-related, and AHG-related groups explains the genetic variation of individuals on the Indian cline (Figure 1B). Following Narasimhan et al. 30, we used Indus Periphery West, which is part of the Indus Periphery Cline (a heterogeneous group of 11 outlier samples from Bronze Age cultures of Shahr-i-Sokhta and Bactria Margiana Archaeological Complex), as the proxy for Iranian farmer-related ancestry, Central Steppe Middle to late Bronze Age (Central_Steppe_MLBA) as the source for Steppe pastoralist-related ancestry and AHG-related individuals to represent AASI ancestry30 (Box 1). We find the three-way model provides a good fit for the majority (>90%) of the individuals on the Indian cline, with some exceptions (we define “good fit” as models with qpAdm p value > 0.01, see STAR Methods). Notably, we find 22 individuals that can be fitted as a two-way mixture between Iranian farmer-related and AHG-related ancestries without Steppe pastoralist-related ancestry (referred to as ASI).
The archaeological context of the Indus Periphery Cline and their relationship to ancient Indian civilizations (e.g., Indus Valley civilization) is unclear as these were migrant individuals from Bronze Age Central Asian cultures. Moreover, like present-day Indians, they also derive some ancestry from AHG-related groups30. To identify the closest proxy of Iranian farmer-related ancestry in 22 ASI individuals and Indus Periphery West, we used qpAdm and examined 14 ancient Iranian-related groups from the Neolithic to Iron Age. We obtain good fits for all 22 ASI individuals when the Iranian-related ancestry derives from early Neolithic and Copper Age individuals from Central Asian cultures: 4th millennium BCE farmers and herders from Tajikistan (Sarazm_EN, ~3,600–3,500 BCE) or Copper Age individuals from Turkmenistan (Namazga_CA, ~3,300–3,200 BCE) or a combination of Sarazm_EN and Parkhai_Anau_EN (~3,500–2,900 BCE, Turkmenistan) that were previously suggested as the sources for Indus Periphery Cline30 (Data S1, Table S4.2). For individuals on the Indian cline (n = 2,126), we find the model with Sarazm_EN provides the best fit for the vast majority (>95%) of individuals (p value > 0.01 for the three-way model with AHG-related and Central_Steppe_MLBA). In contrast, models with Namazga_CA or Parkhai_Anau_EN fail or yield negative coefficients for a substantial fraction (>15%) of the individuals (Data S1, Table S4.3).
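The fit criterion described above can be illustrated with a minimal Python sketch. It assumes qpAdm output has already been parsed into per-individual records; the `QpAdmResult` class, its field names, and the toy values are hypothetical and not part of qpAdm or of this study's pipeline. The 0.01 cutoff comes from the text, while treating coefficients outside [0, 1] as a failed model is our reading of "fail or yield negative coefficients".

```python
from dataclasses import dataclass

P_VALUE_THRESHOLD = 0.01  # "good fit" cutoff stated in the text


@dataclass
class QpAdmResult:
    """Hypothetical container for one parsed qpAdm model (not a qpAdm API)."""
    individual: str
    p_value: float
    coefficients: tuple  # estimated ancestry proportions, one per source


def is_good_fit(result, threshold=P_VALUE_THRESHOLD):
    """Model is not rejected and every ancestry proportion lies in [0, 1]."""
    feasible = all(0.0 <= c <= 1.0 for c in result.coefficients)
    return result.p_value > threshold and feasible


def fraction_good_fit(results):
    """Share of individuals whose model passes the criterion."""
    return sum(is_good_fit(r) for r in results) / len(results)


# Toy values for illustration only; these are not results from the study.
toy_results = [
    QpAdmResult("ind1", 0.20, (0.40, 0.40, 0.20)),
    QpAdmResult("ind2", 0.005, (0.50, 0.30, 0.20)),   # rejected: p value too low
    QpAdmResult("ind3", 0.30, (0.70, 0.40, -0.10)),   # rejected: negative coefficient
    QpAdmResult("ind4", 0.05, (0.30, 0.50, 0.20)),
]
toy_fraction = fraction_good_fit(toy_results)
```

Applied to a table of per-individual models for each candidate source (for example Sarazm_EN versus Namazga_CA), a summary of this kind would give the percentages of well-fitted individuals that the paragraph compares.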
Turning to the individuals beyond the Indian cline (n = 494), we tried three models including Sarazm_EN, AHG-related, and either (a) Steppe pastoralist-related (as for the Indian cline), (b) Austroasiatic-related (using Nicobarese), or (c) East Asian-related (using Han Chinese) ancestries. We also tested four-way models by adding Steppe pastoralist-related ancestry if models (b) and (c) failed. We obtain good fits for 91% of the 494 individuals (Data S1, Table S4.4). Among these, 91 individuals can be modeled without Steppe pastoralist-related ancestry, including nearly all (~96%) individuals in the Austroasiatic-related cluster (using model b). This indicates that the closest proxy (among the sampled populations) of Iranian-farmer ancestry in ANI, ASI, Austroasiatic-related, and East Asian-related individuals in India is Sarazm_EN. Indeed, one of the two Sarazm_EN individuals has tentative evidence for some AHG-related ancestry as suggested previously37 (Data S1, Table S4.5 and Figure S4.12).
Using AHG-related, Sarazm_EN, and Central_Steppe_MLBA as reference populations, we inferred the ancestry proportions of individuals on the Indian cline. We find marked variation in genetic composition across India, with AHG-related ancestry varying between ~19% and 69%, Sarazm_EN between ~27% and 68%, and Central_Steppe_MLBA between ~0% and 45%. Among the three ancestry components, variation in AHG-related shows the strongest correlation to the ANI-ASI cline in PCA (Data S1, Figure S4.14). AHG-related ancestry proportion is significantly associated with geography (e.g., highest in the South and lowest in the North of India), language (i.e., higher in Dravidian vs. Indo-European language speakers) and social group (highest in tribal groups compared with others), though there is large variation within each group (Figure 1C). This highlights that these ancient admixture events have had a formative impact on the genetic diversity in India.
Founder events increase homozygosity in India
Previous studies have shown that many Indian groups have a history of strong founder events due to endogamy (marriages within the community) and consanguinity (marriages between close relatives)8,28,38. Such events reduce genetic variation, decrease the efficacy of selection in removing deleterious variants, and increase the risk of recessive diseases. At a genomic level, founder events increase the sharing of chromosomal regions inherited identical-by-descent (IBD) from a few common ancestors39. Descendants of consanguineous marriages are more likely to inherit IBD segments from both parents, resulting in segments that are homozygous-by-descent (HBD). A founder event results in many short HBD segments, while recent consanguinity results in fewer but longer HBD segments.
We identified IBD and HBD segments in LASI-DAD and 1000G datasets using hap-IBD40, a haplotype-based IBD detection method. To differentiate the effects of founder events and recent consanguineous marriages, we stratified the HBD segments by length: long (>8 cM), indicative of consanguinity, and short (<8 cM), mostly reflecting founder events41. Indians, on average, have a larger fraction of their genome in HBD segments (~29 cM) compared with 1000G EAS (~6 cM), EUR (~6 cM), and AFR (~4 cM) (Figure 2A). Within India, individuals from the South have significantly higher homozygosity, both in terms of the total amount of their genome in HBD segments (on average, ~56 cM in the South compared with ~19 cM in other regions, p value < 10^−16) and the fraction of long HBD segments (8.4% vs. 4.3%, p value < 10^−6). This reflects a higher prevalence of consanguineous marriages in southern India42 (Figure 2A; Data S1, Figures S5.1 and S5.2). A majority (>90%) of the homozygosity stems from short HBD segments (rather than long HBD segments), suggesting a primary role of historical founder events rather than recent consanguinity as the source of homozygosity (Figure 2A and Data S1, Figure S5.2). Similar results are obtained when we use a threshold of 20 cM to define long HBD segments (Data S1, Figures S5.1 and S5.2B).
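The length stratification can be sketched as follows. This is an illustrative Python example, not the study's code: it assumes the HBD segments of one individual are available as a list of lengths in cM (for instance, parsed from hap-IBD output), and it measures the "long" fraction as a share of total HBD length, which is one possible reading of the text. The function name and the toy segment lengths are hypothetical.

```python
LONG_HBD_CM = 8.0  # cutoff from the text; 20 cM is the alternative threshold checked


def summarize_hbd(segment_lengths_cm, long_cutoff_cm=LONG_HBD_CM):
    """Split one individual's HBD segments into short and long classes.

    Segments strictly longer than the cutoff count as long (suggesting recent
    consanguinity); all others count as short (mostly founder events).
    """
    total = sum(segment_lengths_cm)
    long_total = sum(x for x in segment_lengths_cm if x > long_cutoff_cm)
    return {
        "total_cm": total,
        "short_cm": total - long_total,
        "long_cm": long_total,
        # Fraction of HBD length in long segments (assumed definition).
        "fraction_long": long_total / total if total > 0 else 0.0,
    }


# Toy segment lengths in cM, for illustration only.
toy_segments = [2.5, 3.0, 4.5, 6.0, 12.0]
toy_summary = summarize_hbd(toy_segments)
```

Passing `long_cutoff_cm=20.0` gives the analogue of the sensitivity analysis with the 20 cM threshold.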
Figure 2. Founder events and consanguinity lead to high rates of homozygosity and relatedness in Indians.

(A) Genome-wide homozygosity in LASI-DAD samples by region, compared to 1000G individuals from East Asia, Europe, and South Asia. Black lines show homozygous segments >8 cM, colored lines include shorter segments. (B) For each individual in LASI-DAD and 1000G, we identified the “closest related individual” as the individual who shares the highest total amount of identity-by-descent (IBD) segments, measured in centimorgans (cM). The Y-axis shows the percentage of individuals sharing ≥ x cM (X-axis). For LASI-DAD individuals, we inferred the mean and the standard error by resampling 500 individuals (dashed lines represent the mean and 95% CI). The vertical dashed lines indicate the expected value of the total IBD sharing for kth degree cousins. This figure was adapted from 72.
Next, we investigated genome-wide IBD sharing across individuals. We computed the fraction of individuals who have at least one close genetic relative within LASI-DAD and compared this proportion across worldwide populations in 1000G (see STAR Methods and Data S1, Figure S5.3). We infer that ~51.0% (38.4%–59.2% across regions) of individuals in LASI-DAD find at least one genetic relative with expected IBD sharing equivalent to a 3rd-degree cousin or closer relationship (~53 cM), which is markedly higher than 17.2% in AFR, 14.2% in SAS, 8.8% in EAS, and 8.8% in EUR from 1000G (Figure 2B and Data S1, Table S5.1). Note that a previous study identified ~5%–10% of individuals as first- and second-degree relatives among Gambians from Mandinka (GWD) and Esan in Nigeria (ESN), contributing to higher relatedness in AFR44. The higher IBD sharing in LASI-DAD, especially compared with 1000G SAS, may stem from: (1) larger sample size of LASI-DAD, or (2) ascertainment bias in selecting individuals in either study. We examined each of these hypotheses in turn. We performed bootstrap resampling of equal numbers of individuals (n = 500) from LASI-DAD and 1000G SAS and inferred that the fraction of individuals with a 3rd-degree cousin or closer relative decreased to 24.2% (95% confidence interval [CI]: 19.4%–28.6%) yet remained significantly higher than in 1000G SAS (Figure 2B and Data S1, Table S5.1). In LASI-DAD, individuals were recruited using a stratified sampling scheme; first, secondary sampling units (SSU; villages or urban census blocks) were selected within each state, followed by a random selection of individuals within each SSU (see STAR Methods). To control for the impact of our sample selection approach, we compared pairs of individuals from different SSUs in LASI-DAD (Data S1, Section 5). Despite controlling for sampling location and sample size, we still find higher relatedness in LASI-DAD (~16.4%–35.0%) compared with 1000G SAS (14.2%), though the differences are more modest (Data S1, Figure S5.4). This comparison highlights the limitations of the sampling of 1000G groups for representing the genetic variation of India (with mainly a few expatriate groups from the subcontinent). Overall, we find that all individuals in LASI-DAD have at least one putative 4th-degree cousin or closer relative (with IBD > 10 cM) in our dataset.
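The "closest related individual" summary used in this analysis can be illustrated with a short Python sketch. It assumes that total pairwise IBD sharing (in cM) has already been summed per pair of individuals; the input format, function names, and toy values are hypothetical, and the sketch omits the bootstrap resampling and the restriction to pairs from different SSUs described above.

```python
from collections import defaultdict


def closest_relative_sharing(pair_ibd_cm):
    """Map each individual to the largest total IBD (cM) shared with anyone else."""
    best = defaultdict(float)
    for (a, b), total_cm in pair_ibd_cm.items():
        best[a] = max(best[a], total_cm)
        best[b] = max(best[b], total_cm)
    return dict(best)


def fraction_with_relative(pair_ibd_cm, individuals, min_cm):
    """Fraction of individuals whose closest relative shares at least min_cm."""
    best = closest_relative_sharing(pair_ibd_cm)
    hits = sum(1 for ind in individuals if best.get(ind, 0.0) >= min_cm)
    return hits / len(individuals)


# Toy pairwise totals in cM, for illustration only.
toy_individuals = ["A", "B", "C", "D"]
toy_pairs = {("A", "B"): 60.0, ("A", "C"): 12.0, ("C", "D"): 15.0}

# ~53 cM is the expected sharing for a 3rd-degree cousin quoted in the text.
toy_third_cousin = fraction_with_relative(toy_pairs, toy_individuals, 53.0)
toy_distant = fraction_with_relative(toy_pairs, toy_individuals, 10.0)
```

Evaluating `fraction_with_relative` over a grid of thresholds would trace out a curve of the kind shown in Figure 2B.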
Gene flow from archaic hominins in India
Most non-Africans, including Indians, derive ~1%–2% of their ancestry from gene flow from archaic hominins, Neanderthals and Denisovans5,7,45. The functional impact and regional variation in archaic ancestry in India, however, remains unclear. We applied a reference-free hidden Markov model, called hmmix33, to 2,679 phased individuals from India (to maximize our sample size, we retained first-degree relatives [except offspring of trios]). hmmix classifies genomic fragments into two states, “modern human” or “archaic,” by comparing the density of derived alleles that are not found in the outgroup (here, we use 490 sub-Saharan Africans who have a negligible amount of archaic ancestry25) (see STAR Methods). To compare archaic ancestry patterns in Indians to other worldwide populations, we also applied hmmix to phased data from 2,309 individuals from 1000G and 825 individuals from HGDP and used the published results for hmmix for 27,566 Icelanders from deCODE genetics.26 Unless stated otherwise, we retained archaic ancestry segments with a posterior probability greater than 0.8 that translates to a <4% false positive rate in simulations26.
We inferred that Indians have an average of 102.98 Mb or 2.02% (95% percentile range: 1.79%–2.29%) of the callable genome of archaic ancestry. By comparing the putative archaic segments to sequenced Neanderthal and Denisovan genomes46–49, we inferred the source of the archaic ancestry based on measuring the number of shared derived archaic variants (DAVs) present on archaic segments. We find that each individual has ~1.43% (1.26%–1.65%) Neanderthal and ~0.10% (0.03%–0.17%) Denisovan ancestry. The Neanderthal ancestry proportion in India is similar to Europeans (~1.2%) and Americans (~1.3%), though significantly lower than East Asians (~1.7%, Wilcoxon ranked test p value < 10^−15). The highest Denisovan ancestry is inferred in Oceanians (~2.0%), while Americans, East Asians, and South Asians have similar amounts (~0.1%) (Table S6).
By assembling non-overlapping archaic ancestry segments extracted from individuals in LASI-DAD, we reconstructed 1,524 Mb of the introgressing Neanderthal and 591 Mb of the introgressing Denisovan genome (Figures 3A and 3B). Notably, using individuals from all worldwide regions (from 1000G, HGDP, and LASI-DAD), we reconstructed 1,679 Mb of the introgressed Neanderthal genome that is similar in size to the directly sequenced Neanderthal genomes (after filtering, ~1,650 Mb; Data S1 Section 6, Figure S6.8). Despite higher per individual Neanderthal ancestry in East Asians, we recover more Neanderthal sequences from Indians than East Asians even after controlling for sample size (as also seen in Witt et al.45; Table S6). The largest study of archaic ancestry in 27,566 Icelanders recovered 978 Mb of the introgressed Neanderthal and 112 Mb of the introgressed Denisovan genome (using more-stringent filtering and posterior probability > 0.9 in hmmix50). Even with these more-stringent thresholds, we recovered >50% more Neanderthal ancestry segments from Indians (LASI-DAD) than from Icelanders (Figure 3A). Using all worldwide regions, we reconstructed 1,080 Mb of the introgressing Denisovan genome. The largest amount of Denisovan ancestry is recovered from Indians, though this is not significant after downsampling to the sample size of Oceanians in HGDP (n = 28) (Data S1, Figure S6.8).
Figure 3. Archaic gene flow in India and worldwide populations.

Upset plot and cumulative amount of archaic ancestry sequences (in Gb) in modern humans for (A) Neanderthal and (B) Denisovan ancestries. For comparison with deCODE49, we used the stringent posterior probability cutoff (>0.9) and removed any SNPs in repetitive regions. (C) Each dot represents the minimum coalescence time with sub-Saharan Africans estimated by using the emission parameters of the modern human state in hmmix. The X-axis shows each population colored by the region, and the gray area marks the 95% CI of the coalescence time of that population to sub-Saharan Africans. The dotted line represents the time of the Toba eruption (74,000 years ago53) reflective of the minimum estimate of the Southern Dispersal out of Africa.
Next, we calculated the amount of archaic sequence that is shared (i.e., overlapping in the same genomic regions) between Indians and other worldwide populations from 1000G and HGDP datasets. We find that 81.2% of Neanderthal ancestry is shared between at least two global regions (Figure 3A), and around 11.7% (or 195.9 Mb out of 1,679 Mb) is only seen in India. Overall, ~90.8% of worldwide Neanderthal sequences are observed among Indians (Figure S1). Even after downsampling to match the minimum sample sizes in 1000G (n = 490) and HGDP (n = 28), we find the largest proportion of Neanderthal ancestry is present in India (84.5% with n = 490 [Data S1, Figure S6.9] and 57.3% with n = 29 [Data S1, Figure S6.10]). Moreover, Oceanians and Indians have substantial amounts of Denisovan ancestry sequences that are not shared with other worldwide populations (Data S1, Figure S6.6). Around 51% of Denisovan sequences (301.6 Mb out of 591 Mb) are observed only in Indians (Data S1, Figure S6.6), which remains significant even after downsampling (Data S1, Figure S6.8).
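The bookkeeping behind these numbers (the total sequence recovered after collapsing overlapping segments, and the sequence seen in only one region) can be illustrated with a simple interval-merging sketch in Python. This is not the authors' implementation: it assumes segments lie on a single chromosome and are given as half-open (start, end) coordinates in base pairs, and the function names and toy coordinates are hypothetical.

```python
def merge_intervals(intervals):
    """Collapse overlapping or touching (start, end) intervals into a disjoint set."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def total_length(intervals):
    """Number of bases covered by at least one interval."""
    return sum(end - start for start, end in merge_intervals(intervals))


def private_length(target, others):
    """Bases covered by `target` but by none of the `others` intervals.

    Uses |target only| = |target union others| - |others|.
    """
    return total_length(list(target) + list(others)) - total_length(others)


# Toy segments (bp) pooled across individuals of two regions; illustration only.
toy_india = [(0, 100), (50, 150), (300, 400)]
toy_east_asia = [(100, 200), (350, 450)]

toy_india_total = total_length(toy_india)
toy_india_private = private_length(toy_india, toy_east_asia)
toy_shared = toy_india_total - toy_india_private
```

With real data the same calculation would be run per chromosome and summed, with `others` pooling the segments from every other region.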
To infer the relationship of the introgressed archaic population to the sequenced archaic genomes, we estimated match rates for the DAV SNPs in each introgressed segment to three high coverage Neanderthals46–48 and one Denisovan genome49. Using a similar approach as Browning et al. 52, we find that on average the introgressed Neanderthal segments share 83% of the DAVs with one of the three sequenced Neanderthal genomes (with highest sharing with Vindija Neanderthal), replicating the finding of a single pulse of Neanderthal gene flow in India (Data S1, Section 6). Most groups in India derive Denisovan-related ancestry from a single population that was distantly related to the sequenced Altai Denisovan genome (with 46%–50% of shared DAV SNPs). A small fraction of Denisovan-related ancestry in individuals from the Northeast and South of India stems from a Denisovan group that is closely related to the sequenced Denisovan genome (segments share on average 84% of DAV SNPs) (Data S1, Figure S6.11). Individuals in Northeast India harbor recent ancestry from East Asian-related groups (Figure 1B) that have previously been shown to harbor two pulses of Denisovan ancestry52. Beyond Neanderthal and Denisovan ancestry, we inferred 0.42% (95% percentile range: 0.37%–0.48%) of archaic ancestry from an unknown source in Indians (Table S6). This proportion is similar across all non-Africans and potentially related to the difference between the sequenced archaic genomes and the introgressing archaic individuals (Data S1, Figure S6.3). Thus, there is no clear evidence for additional contribution from other unknown archaic hominins to Indians (at least not more than other worldwide populations), contrary to previous claims53.
Archaic ancestry varies across regions in India, with the highest archaic ancestry in the Northeast and East of India and the lowest in North India (Data S1, Figure S6.3; and Tables S6.4 and S6.6). To investigate how recent gene flow events have shaped the distribution of archaic ancestry in India, we examined the relationship between Neanderthal and Denisovan ancestry as functions of the three main ancestry components in India. Focusing on individuals on the Indian cline (n = 2,126), we find the AHG-related ancestry is positively correlated with both Denisovan (r = 0.46, p value < 10^−15) and Neanderthal (r = 0.24, p value < 10^−15) ancestries (Data S1, Table S6.4). These results are robust, using more-stringent criteria for assigning archaic ancestry segments to Neanderthal and Denisovan origin by using sites where only one archaic group has a derived allele that matches modern humans (Data S1, Table S6.4). This pattern, particularly the correlation between Denisovan and AHG-related ancestry, is not unexpected considering that West Eurasian-related groups have minimal Denisovan ancestry54.
Timing of the out-of-Africa migration to the Indian subcontinent
A central question in the peopling of India is when modern humans first arrived in the subcontinent from Africa. Archaeological evidence suggests occupation in India before and after the Toba eruption that occurred around 74,000 years ago51. It is unclear, however, whether the pre-Toba group contributed to the ancestry of present-day people in India. To infer the separation time of Indians from sub-Saharan African groups (i.e., the time of out of Africa migration), we used the rate of emissions inferred by hmmix (for the “modern human” state). Theoretically, this parameter reflects the number of mutations that an individual has accrued since their divergence from sub-Saharan Africans. For a given mutation rate, we can thus infer the minimum coalescence time to sub-Saharan Africans (see STAR Methods). Moreover, we examined whether the inferred coalescent time for Indians is similar to other non-African groups such as East Asians, Europeans, and Americans (using hmmix emission rates for each population). We controlled for technical factors across populations and datasets, including phasing errors and the removal of triallelic sites, and excluded any individuals with more than 1% sub-Saharan African-related ancestry (as each of these factors can bias the emission rates). We infer the minimum coalescence time between Indians and sub-Saharan Africans is 53,932 years ago (95% percentile range: 53,190–54,644), assuming a human mutation rate of 0.45 × 10^−9 per base pair per year55 (Data S1, Table S9.2; Figure 3C; Data S1, Figure S9.3). We obtain qualitatively similar results for Europeans, East Asians, and South Asians in the HGDP dataset. Moreover, by performing simulations, we show that the observed emission parameter in India is consistent with variation stemming from 0% to 3% of ancestry from an earlier migration that occurred around 74,000 years ago (Data S1, Figure S9.5). Our results demonstrate that the majority of the ancestry of present-day Indians derives from a major migration event out of Africa that occurred around 50,000 years ago.
Impact of evolutionary history on disease and functional variation
Population history, including gene flow, founder events, and natural selection, plays a crucial role in shaping genetic variation including disease susceptibility. To study the impact of evolutionary history on genetic variation in India, we characterized the functional effects of variants, including those that alter protein structure, such as putative loss of function (pLoF) or missense variants (see STAR Methods). We identified 385,985 missense variants and 20,319 pLoF variants (Data S1, Table S5.2). Each individual carries ~10,344 (range: 9,911–10,761) derived missense variants and ~67 (46–96) pLoF variants on the autosomes, similar to estimates observed in other worldwide populations23. Most (>90%) of these variants are rare (frequency below 1%) or singletons (~62%). Among 18,451 protein-coding autosomal genes in the human genome (RefSeq database56), we find missense variants and pLoFs in 89.5% of the genes (pLoFs in 48%), ranging from 1 to 1,265 variants per gene (1–52 pLoFs). The top three genes with the highest number of pLoF variants are mucin genes: MUC3A, MUC16, and MUC17, with 52, 42, and 41 pLoFs, respectively, including homozygous pLoFs in MUC17. As there is partial redundancy in the function of mucin genes, there may be greater tolerance for loss-of-function variants57.
The history of founder events predicts a high burden of deleterious variants and an increased risk of recessive diseases, as seen in Finns and Ashkenazi Jews.39,58 We examined the variation in the prevalence of homozygous deleterious mutation burden (measured as the sum of homozygous missense and pLoFs) across individuals in India. We find that individuals with higher AHG-related ancestry carry a greater homozygous deleterious mutation burden compared with other ancestries (Figure 4A). Notably, the homozygous deleterious mutation burden is strongly correlated with the sum of HBD segments per individual. In turn, this implies that the higher mutational burden in individuals with higher AHG-related ancestry is driven by higher HBD, which is a consequence of recent founder events and consanguinity (Figure 4B). Among the 406,304 missense variants and pLoFs, we find that ~40% are not registered in gnomAD or 1000G, with the vast majority of those variants being extremely rare (<0.1%) (Data S1, Table S5.2 and Figure S5.5B). We find that ~7% of the non-singleton missense/pLoF variants are present in the ClinVar database59, including 214 variants classified as “pathogenic” or “likely pathogenic” (Data S1, Table S5.2). Many of the ClinVar pathogenic variants are in genes associated with rare recessive disorders such as HBB related to blood disorders60, GJB2 associated with congenital hearing loss61, CFTR linked to cystic fibrosis62, and PAH that plays a role in phenylketonuria63. Notably, we find a pathogenic variant (L307P) in the butyrylcholinesterase (BCHE) gene that is present in 15 individuals (0.28%) in LASI-DAD but is not seen outside South Asia64 (Data S1, Table S5.2). Patients with BCHE deficiency have higher risk for muscle paralysis in response to the use of certain muscle relaxants used commonly during anesthesia. 
This variant is enriched in individuals from the Vysya community that live in Andhra Pradesh and Telangana28,64,65, consistent with our observation that 8 of the 15 carriers in LASI-DAD are from Telangana. These findings underscore the value of population-specific genetic screening programs to reduce the disease burden in India.
Figure 4. Impact of demographic history on disease risk.

(A) Relationship between the number of homozygous derived missense/pLoFs and the ancestry coefficient for individuals on the Indian cline. The y-axis is truncated to better illustrate the trends. (B) Relationship between the number of homozygous derived missense/pLoFs and the total sum of HBD segments per individual. Individuals are colored according to their AHG-related ancestry coefficient; individuals not on the cline are shown in grey. We fit a regression using a generalized linear model (glm) and obtain the following fit: y = 2576 + 0.916*x. (C) Distribution of archaic ancestry regions across the genome. We computed the mean archaic frequency along the genome of LASI-DAD individuals and considered segments with an archaic frequency higher than the mean (μ) + two standard deviations (σ) as enriched (blue for Neanderthal, green for Denisovan). Archaic ancestry deserts (<0.1% archaic ancestry over 10 Mb) are shown as striped rectangles in the same colors.
To characterize the functional impact of archaic ancestry in India, we examined the genome-wide distribution of archaic ancestry and identified regions of “high-frequency archaic ancestry” (defined as regions where the frequency of archaic ancestry across individuals is two standard deviations above the genome-wide average) (Figure 4C). We identified 1,590 and 818 candidate regions with high frequencies of Neanderthal and Denisovan ancestry, respectively. For Neanderthals, we replicated previously identified genes54,66,67 such as FBP2 and FYCO1 and identified PCAT7 and CXCR6 as additional candidates. For Denisovans, we replicated signals66 in WDFY2, CHD1L, and HELZ2 and identified several additional candidate genes, including LINC00708 and CDKN2B (Data S1, Section 7; Table S3). Performing Gene Ontology (GO) enrichment analysis, we find 14 pathways enriched for Neanderthal and 22 pathways for Denisovan ancestry primarily related to immune function (Table S4).
Next, we searched for regions that have a high number of derived alleles that are shared between modern humans and archaic groups, a signature previously observed for EPAS1 and Denisovan ancestry in Tibetans17. Interestingly, we find that one region of the genome has a disproportionately elevated number of Denisovan-derived variants only shared with Indians; however, no similar enrichment is seen for Neanderthal-derived variants (Data S1, Section 7). This region harboring the BTNL2 gene, part of the major histocompatibility complex (MHC), contains 78 Denisovan-specific derived variants within a 13.2-kilobase (kb) region. It also has exceptionally high (~10%) Denisovan ancestry in Indians (>99.9th percentile). Denisovan haplotypes in this region cluster by length: a short haplotype of 55–65 kb and a long haplotype of ~150 kb, containing 116.1 and 126.7 Denisovan-specific derived variants, respectively. The haplotype length and number of shared derived alleles between Indians and Denisovans suggest this region is likely a product of gene flow from Denisovans or Denisovan-related populations, rather than ancestral lineage sorting (p value < 10−6 for the long haplotype; p value = 0.027 for the short haplotype). Across worldwide populations, these Denisovan haplotypes are also present at high frequency in East Asians (~11.8%, >99.8th percentile), but they are rare in Europeans (~0.4%) and, notably, absent in Oceanians (Data S1, Table S7.2). The MHC contains many genes associated with immune function and infectious diseases, which are likely subject to balancing selection. Indeed, previous studies have identified BTNL2 as a candidate for balancing selection in East Asians10. However, simulations show that genetic drift generated by strong founder events alone can rapidly shift frequencies of archaic ancestry (even in the absence of selection); thus, regions of high-frequency archaic ancestry should be interpreted with caution (Data S1, Section 8).
To identify archaic ancestry regions that are enriched in Indians (compared with East Asians and Europeans), we computed the population branch statistic (PBS).68 The PBS measures the increase in frequency in a population (India) since its divergence from the two reference groups. For each population, we measured the frequency of Neanderthal and Denisovan ancestry across genomic windows (see STAR Methods). We identified ~10.7 Mb (or 235 genes) enriched for Neanderthal and ~5.5 Mb (or 84 genes) for Denisovan ancestry in Indians (Data S1, Figure S7.5C; Table S3). The Denisovan-enriched regions contain genes related to innate immune response, including several TRIM genes (TRIM26, TRIM31, TRIM15, TRIM10, and TRIM40) implicated in cellular processes related to viral entry (or exit) into host cells.69 One of the most significant Neanderthal-enriched regions includes a gene cluster on chromosome 3, previously linked to increased risk of respiratory failure from SARS-CoV-2 infection16,70 (PBSNeanderthal > 0.118, in the 99.99th percentile of genome-wide PBS scores). There are two main haplotypes of Neanderthal origin in this region: a core haplotype of 49.4 kb and a long haplotype of 333.8 kb (Data S1, Figures S7.6A and S7.7). Across India, the frequency of the core haplotype ranges from 20.5% (in the Northeast) to 34.8% (in East India). The frequency of both the core and long haplotypes is significantly higher in the East of India compared with other regions (core, 34.8%, Z = 2.68; long, 23.2%, Z = 2.34, Data S1, Table S7.3; Data S1, Figure S7.6B). We also find 39 very long haplotypes that are >1 Mb.
We also examined regions of the genome devoid of archaic ancestry in modern humans, referred to as “archaic deserts.”50,54,71,72 We identified six Neanderthal deserts spanning a total of 87.1 Mb, including five that were previously reported (Figure 4C; Data S1, Figure S7.8 and Table S7.4).71 We also identified 13 Denisovan deserts in Indians, including two that overlap with previously reported Neanderthal deserts (Figure 4C; Data S1, Figure S7.9 and Table S7.5). Considering the low genome-wide proportion of Denisovan ancestry in Indians, additional sampling will likely uncover Denisovan ancestry in some of these deserts. Previous studies have shown a stronger depletion of archaic ancestry near genes, particularly in functionally important regions of the genome, suggesting that introgressed variants were deleterious and removed by purifying selection.54,73 We thus examined the frequency of archaic ancestry across 18,451 protein-coding genes, discovering that Neanderthal ancestry varies between 0% and 36.7% and Denisovan ancestry between 0% and 21.9% among genes. Many genes (~35%) contain no archaic ancestry segments (Data S1, Figure S7.10), within the limits of resolution of our method. Among these genes is the apolipoprotein E (APOE) gene, which harbors the APOE ε4 variant, a major risk factor for late-onset Alzheimer’s disease (AD). However, with our current sample size, these results are not significant (one-sided empirical p value = 0.16) and should be replicated in larger cohorts (Data S1, Figure S7.10).
DISCUSSION
India is a region of extraordinary genetic diversity. We have generated the most comprehensive survey of genetic variation in India, including individuals from most geographic regions, speakers of all major language families, and tribal and historically underrepresented communities. We show that most of the genetic variation in Indians stems from a single major migration out of Africa that occurred around 50,000 years ago, followed by gene flow from Neanderthals and Denisovans. Most Indians today harbor ancestry from three main sources related to AHG, ancient Iranian-farmer and Steppe-pastoralist groups. By examining data from 14 ancient Iranian groups from the Neolithic to the Iron Age, we uncovered a common source of Iranian-farmer ancestry related to 4th-millennium BCE farmers and herders from Tajikistan (Sarazm_EN, ~3,600–3,500 BCE) into the ancestors of ASI, ANI, Austroasiatic-related, and East Asian-related groups in India. Archaeological studies have also documented trade connections between Sarazm and South Asia, including connections with agricultural sites of Mehrgarh and the early Indus Valley civilization.74 Indeed, one of the two Sarazm_EN individuals was discovered with shell bangles that are identical to ones found at sites in Pakistan and India such as Shahi-Tump, Makran, and Surkotada, Gujarat.75 Following these admixtures, India experienced a major demographic shift toward endogamy, resulting in extensive homozygosity and IBD sharing among individuals.
The high level of relatedness among Indians is notable: with a sample size of ~2,700, we find that each individual has at least one putative 4th-degree cousin or closer relative (with an amount of IBD shared > 10 cM) in our study.76 Moreover, Indians have elevated levels of homozygosity (regional averages range from ~12 to 56 cM per individual), two to nine times higher than East Asians and Europeans. These findings underscore the extensive familial connections among Indians, reflecting historical, cultural, or social patterns such as endogamy. Homozygosity increases the prevalence of deleterious variants and the risk of recessive diseases. We generated a catalog of missense and putative loss-of-function variants, discovering over 160,000 variants not registered in previous genomic surveys. Many of these variants are annotated in ClinVar and are associated with congenital and blood disorders, metabolic diseases and drug response, and complex conditions such as cognitive decline and dementia, as reported in our recent publication.77 Notably, these variants are not seen outside India and are present at low frequency across India, but are fairly common in some communities, as exemplified by the distribution of the pathogenic missense variant (L307P) leading to BCHE deficiency. The identification and mapping of such variants has immense potential to advance our understanding of disease etiologies and reduce disease burden, as previously shown in studies of Ashkenazi Jews and Finns.39,58
At deep timescales, Indians harbor 1%–2% of ancestry from archaic hominins, Neanderthals and Denisovans. Indians exhibit the largest variation in archaic ancestry among modern humans. Notably, the majority of Neanderthal ancestry that exists today in present-day individuals is found in India, while other worldwide populations retain only a subset of this variation. We find that variants introgressed from Neanderthals and Denisovans contribute to adaptation and disease. Several archaic variants are enriched in genes and pathways that play a role in immunity, including the Denisovan-inherited haplotypes enriched at the MHC complex, TRIM family genes associated with innate immune responses, and a Neanderthal-inherited gene cluster on chromosome 3 that harbors a major risk factor for severe symptoms after SARS-CoV-2 infection. Leveraging this knowledge could help develop novel therapeutics tailored to Indian populations, particularly for diseases with immune related components, such as autoimmune disorders and infectious diseases. We identified six Neanderthal and 13 Denisovan ancestry deserts, including four regions that lack both Neanderthal and Denisovan ancestry. Interestingly, one of these regions includes the FOXP2 gene that is associated with language development in humans.71 Functional studies of these deserts could uncover previously uncharacterized genetic variants contributing to modern human-specific traits and diseases.
In sum, these findings provide a comprehensive view of the genetic landscape of India, highlighting the deep evolutionary history, demographic shifts, and the impact of archaic and recent gene flow events on genetic variation in the subcontinent. The unique genetic structure of Indians underscores the importance of incorporating ancestry and homozygosity in future medical and functional genomics research.
Limitations of the study
In this study, we compared whole-genome sequences from ~2,700 individuals from India to other worldwide populations, including ancient and contemporary individuals. Using these data, we characterized the ancestry of diverse individuals and investigated the impact of demographic history on genetic variation and disease in India. Our results rely on the availability of reference individuals from diverse populations and timescales. For instance, we identified that the closest proxy (among the sampled individuals) for the Iranian farmer-related ancestry in India is 4th millennium BCE farmers and herders from Central Asian cultures. However, we have very sparse genetic data from this region and timescale, and as more ancient DNA data becomes available from India and Central Asia, it will be important to revisit these analyses to confirm the source of the Iranian-farmer ancestry in India. Moreover, identifying genomic regions of hunter-gatherer, Iranian farmers, and Steppe pastoralist-related ancestries in modern Indians using local ancestry inference methods will reveal the source and dynamics of environmental adaptations and disease susceptibility on the subcontinent.
RESOURCE AVAILABILITY
Lead contact
Requests for further information should be directed to the lead contact, Priya Moorjani (moorjani@berkeley.edu).
Materials availability
This study did not generate any new reagents.
Data and code availability
Whole-genome sequences are available through the National Institute on Aging Genetics of Alzheimer’s Disease Data Storage Site (NIAGADS) under the accession NIAGADS: ng00067. Individual-level summary statistics are also available through NIAGADS under the accession NIAGADS: ng00171. These data can be obtained through the GCAD at the University of Pennsylvania by following the data request instructions: https://dss.niagads.org/documentation/data-application-and-submission/application-instructions/. This study also uses previously published genomic datasets and publicly available reference databases. Whole-genome sequence data were obtained from the 1000 Genomes Project and the Human Genome Diversity Project (HGDP) via the International Genome Sample Resource (IGSR): https://www.internationalgenome.org/data. Additional genome sequence data from the GenomeAsia Pilot Project were accessed through the European Genome-phenome Archive (EGA), accession number EGA: EGAS00001002921. Ancient DNA data were retrieved from the Allen Ancient DNA Resource (version 54): https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/FFIDCW. Functional annotation and variant interpretation utilized the RefSeq database (https://www.ncbi.nlm.nih.gov/refseq/) and the ClinVar database (https://www.ncbi.nlm.nih.gov/clinvar/), based on the data release of December 17, 2023. Scripts and pipelines to reproduce the analysis are available through GitHub: https://github.com/MoorjaniLab/LASI-DAD.
STAR★METHODS
EXPERIMENTAL MODEL AND STUDY PARTICIPANT DETAILS
Samples
The Longitudinal Aging Study in India (LASI, https://lasi-india.org) is a nationally representative survey of aging including more than 70,000 individuals that are older than 45 years living in India.18 LASI-DAD includes a random sample of 4,096 participants that are 60 years and older from the LASI study, with an oversampling of respondents who might be at high or low risk of cognitive impairment. The current rate of dementia in the LASI-DAD study is ~7%, which is comparable to other population-based epidemiological estimates in India.78 Among these, 2,762 individuals consented for detailed cognitive assessment, informational interviews and genomics analysis including whole genome sequencing. Participants were recruited from 18 states and union territories in India, with sample sizes approximately proportional to the census size of those regions. Sample selection was performed in three stages79: first, sub-districts (Tehsils/Taluks) were stratified and sampled as primary sampling units in each state and union territory. Next, villages were sampled from rural areas, and wards were sampled from urban areas. From each selected urban ward, one Census Enumeration Block (CEB) was randomly selected. These sampled villages and CEBs are referred to as secondary sampling units (SSUs). Within each SSU, individuals were selected randomly.
In compliance with local regulations, the research plan and informed consent form were presented to the Institute Ethics Committee at AIIMS, New Delhi and the Institutional Review Board (IRB) at the University of Southern California (USC), Los Angeles, CA. After obtaining ethical approval from both institutions, the research plan and other documents were submitted to the Health Ministry Screening Committee (HMSC) governed by the Department of Health Research/Indian Council of Medical Research (Approval number: 54/3/GER/Indo-Foreign/17-NCD-II). These committees reviewed the data collection and analysis plans, which were (1) to perform whole genome sequencing (WGS) and population genetic analyses of ~2,700 participants from 18 regions of India surveyed in the LASI-DAD, (2) to evaluate the association between known Alzheimer’s disease (AD) and dementia gene/genetic variants and cognitive function in sequenced LASI-DAD participants, and (3) to disseminate the WGS data as a reference panel for South Asian variants and mutations. De-identified and anonymized genomic data were shared with researchers at the University of Pennsylvania, the University of Michigan, Institute for Molecular Medicine Finland, Helsinki, Finland and the University of California, Berkeley. Each institution obtained IRB approval or relied on the USC IRB review through the inter-institutional IRB authorization agreement.
Our dataset includes individuals born in 23 states and union territories and speakers of approximately 26 languages. Nearly equal numbers of males (1,293) and females (1,443) were recruited (26 participants were excluded due to lack of sex or other phenotype information). Sex was used for performing quality control of whole genome sequences. For most analyses, we used autosomal data and we did not treat males and females differently, as sex should not impact inferences with autosomal data. We provide detailed description of the sample collection, informed consent, field work and sample size by group (i.e., location of birth, sex, language, social group etc.) in Data S1, Section 1.
METHOD DETAILS
Sequencing, variant calling and filtering
Whole genome sequencing at high coverage (~30x) was performed for 2,762 LASI-DAD individuals, including 22 trios, using Illumina HiSeq X Ten machines at Medgenome, Bangalore, India. All samples were processed using a PCR-free library preparation using 100 base pair paired-end sequencing. The raw sequence reads (fastq) from Medgenome were sent to the Genome Center for Alzheimer’s Disease (GCAD) at the University of Pennsylvania for genome mapping to the human reference genome (build GRCh38/hg38). We used the Variant Calling Pipeline and data management tool (VCPA) developed by GCAD in collaboration with the Alzheimer’s Disease Sequencing Project (ADSP) to call variants in a uniform way across other studies that are part of ADSP.80 The pipeline performs genome mapping and variant calling as recommended by the Genome Analysis Toolkit (GATK) best practices. Details of the data processing are described in Data S1, Section 2. Briefly, we excluded 61 samples from the analysis based on several quality control criteria: (i) 2 samples were excluded due to low concordance between genotypes generated on the Illumina Infinium Global Screening Array-24 v2.0 BeadChip81 and whole genome sequences, (ii) 28 samples were removed due to discrepancies between self-reported and genetically inferred sex, (iii) 3 samples were excluded due to potential contamination, (iv) 5 unexpected duplicates were excluded, (v) 17 samples were removed due to low coverage (with genome-wide mean coverage below 20x) or other sequencing quality issues (e.g. Mendelian inconsistencies), and (vi) 6 individuals with missing phenotypic data (e.g., sampling information) were removed. We also excluded 22 offspring from trios from most analyses, unless specified otherwise. Overall, a total of 2,679 LASI-DAD samples passed sequencing metrics and quality control checks. Details of quality checks are described in Data S1, Section 2.
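The sample accounting above can be checked with a few lines of arithmetic. The sketch below only restates the counts given in the text; the dictionary keys are shorthand labels chosen for illustration, not names used by the study.

```python
# Exclusion counts as stated in the text; the keys are illustrative labels only.
qc_exclusions = {
    "array_concordance": 2,
    "sex_mismatch": 28,
    "contamination": 3,
    "unexpected_duplicates": 5,
    "low_coverage_or_quality": 17,
    "missing_phenotype": 6,
}
sequenced = 2762       # individuals sequenced
trio_offspring = 22    # offspring from trios, excluded from most analyses

n_qc_excluded = sum(qc_exclusions.values())
n_passing = sequenced - n_qc_excluded - trio_offspring
print(n_qc_excluded, n_passing)  # 61 2679
```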
Identification of first-degree relative pairs
We applied KING (v2.3.0)82 with the “--ibdseg” option to identify first-degree relatives. Following software guidelines, we applied the following filters: sample pairs without any long IBD segments (>10Mb) were excluded and short IBD segments (< 3Mb) were not utilized to estimate the proportion of IBD sharing between two individuals. Parent-offspring pairs share 50% of their genomes and siblings may share between 38% and 65% of their genomes IBD.83 Thus, we used a minimum cutoff of 38% to identify first-degree relatives and, consequently, flagged 64 pairs of individuals. To avoid overestimating the genetic relatedness and distorting the representation of genetic diversity across populations in clustering analysis, for each pair of first-degree relatives, we removed the individual with the larger amount of missing data. In total, we removed 59 individuals, leaving 2,620 individuals that were used for founder event and population structure analyses.
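As a minimal sketch of the pruning rule described above, the following code flags pairs at or above the 38% cutoff and removes the member of each pair with more missing data. The input layout, the sample IDs, and the choice to skip a pair once one member is already removed are assumptions made for illustration; they are not taken from the KING output format or from the study's scripts.

```python
def flag_first_degree(pairs, min_ibd_prop=0.38):
    """Return (a, b) for pairs whose IBD proportion meets the first-degree cutoff."""
    return [(a, b) for a, b, prop in pairs if prop >= min_ibd_prop]


def individuals_to_remove(flagged, missing_rate):
    """From each flagged pair, remove the member with the larger missing rate."""
    removed = set()
    for a, b in flagged:
        if a in removed or b in removed:
            continue  # assumption: the pair is already broken by an earlier removal
        removed.add(a if missing_rate[a] > missing_rate[b] else b)
    return removed


# Hypothetical input: (sample_a, sample_b, proportion of genome shared IBD).
pairs = [("s1", "s2", 0.50), ("s2", "s3", 0.45), ("s4", "s5", 0.12)]
missing_rate = {"s1": 0.01, "s2": 0.03, "s3": 0.02, "s4": 0.01, "s5": 0.01}

flagged = flag_first_degree(pairs)
removed = individuals_to_remove(flagged, missing_rate)
print(flagged, removed)
```

Because one individual can appear in several flagged pairs, the number of removed individuals can be smaller than the number of flagged pairs, which is consistent with 64 pairs yielding 59 removals.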
Phasing
We used SHAPEIT420 (without reference data) with the HapMap recombination map84 to phase the genomes from LASI-DAD. By comparing trio-phasing and computational phasing in 22 trios (mother, father, offspring), we evaluated the phasing accuracy. We compared the average phasing switch error and number of variants that could be phased when (a) using no reference panel, (b) using the HGDP reference panel, and (c) using the 1000G reference panel. A detailed description of this analysis is provided in Data S1, Section 3.
Published genomic datasets
To learn about the population history of India and compare it to worldwide populations, we combined the LASI-DAD dataset with other published genomic datasets of present-day individuals including 1000G23, HGDP6 and GenomeAsia7 and ancient DNA samples from the Allen Ancient DNA Resource (AADR) v54.85 GenomeAsia and AADR are available in hg19/GRCh37 and we updated the genome positions to hg38/GRCh38 using liftOver (https://liftover.broadinstitute.org/). Then, we merged the datasets using mergeit (with ‘strandcheck: YES’) from the EIGENSOFT package (v7.2.1),86,87 which generates an intersection of the SNPs in the different datasets, keeping only variants present in all datasets. The number of individuals and variants for each merged dataset and the analyses they are used in are reported in Data S1, Table S4.1.
QUANTIFICATION AND STATISTICAL ANALYSIS
Principal component analysis and ADMIXTURE
To perform principal component analysis (PCA) and ADMIXTURE, we excluded SNPs in linkage disequilibrium (LD) using PLINK (v2.00a2.3)88 with the option ‘--indep-pairwise 50 10 0.5’, which removes one variant from each pair of SNPs in a window of 50 SNPs if the LD is greater than 0.5. We further excluded variants with a MAF<0.05. We performed PCA using smartpca from the EIGENSOFT package (v7.2.1).86,87 We also applied unsupervised hierarchical clustering of individuals using the maximum likelihood method implemented in the ADMIXTURE software (v1.3.0).25 Following program documentation, we varied the number of clusters (K) between 2 and 6 and performed 10-fold cross-validation (option: --cv=10). We stopped the algorithm when the change in log-likelihood between iterations was less than 0.1 (option: -C 0.1).
qpAdm
We used the qpAdm35,36 package in ADMIXTOOLS (v7.0.2) to identify the best fitting model and estimate ancestry proportions in a population of interest that is modeled as a mixture of n ‘reference’ populations using a set of ‘Outgroup’ populations. The outgroup populations for the Indian cline model were: Ethiopia_4500BP.SG, Western European hunter-gatherers (WEHG), Eastern European hunter-gatherers (EEHG), Iran_GanjDareh_N, Anatolia_N, Western Siberia hunter-gatherers (WSHG), East Siberian hunter-gatherers (ESHG) and Dai.DG; and for beyond the cline models: Ethiopia_4500BP.SG, WEHG, EEHG, Iran_GanjDareh_N, Anatolia_N, ESHG, Dai.DG, Russia_Samara_EBA_Yamnaya and Vietnam_BA (for population details see Box S1 in Data S1, Section 4). To explore the closest proxy to the Iran farmer-related ancestry in South Asia, we used the following outgroups: Ethiopia_4500BP.SG, WEHG, EEHG, Anatolia_N, ESHG, Dai.DG, Iran_GanjDareh_N and Russia_Samara_EBA_Yamnaya; except when Iran_GanjDareh_N or Anatolia_N were in the reference populations (Data S1, Table S4.3). We set the parameters as ‘allsnps: NO’ and ‘details: YES’, which reports a normally distributed Z score for the fitted model. We computed coefficient estimates, standard deviations and p-values through block jackknife resampling. We considered a model to be a good fit if p-value > 0.01 and all coefficients are positive.
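The acceptance rule in the last sentence can be expressed as a small filter. This is an illustrative sketch only: the model names and numbers are invented, and the function is not part of ADMIXTOOLS.

```python
def is_good_fit(p_value, coefficients, p_threshold=0.01):
    """Accept a model if p-value > threshold and every ancestry coefficient is positive."""
    return p_value > p_threshold and all(c > 0 for c in coefficients)


# Hypothetical fitted models: name -> (p-value, ancestry coefficients).
models = {
    "model_A": (0.21, [0.45, 0.35, 0.20]),    # passes both criteria
    "model_B": (0.004, [0.50, 0.30, 0.20]),   # rejected: p-value too low
    "model_C": (0.35, [0.70, 0.45, -0.15]),   # rejected: negative coefficient
}

accepted = [name for name, (p, coefs) in models.items() if is_good_fit(p, coefs)]
print(accepted)  # ['model_A']
```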
ALDER
To infer the date of East Asian admixture and ancestry proportion in Bengalis (East of India), we used ALDER (v1.04).31 We used the ‘one-reference’ model (runmode: 1) with East Asians (CHB.DG from AADR v54) as the reference population with the following parameters: binsize: 0.001 Morgans; maximum distance: 1.0 Morgans; zdipcorrmode: YES; jackknife: YES. To convert the dates of admixture from generations to years, we assume the mean human generation time was 28 years.89
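The conversion from generations to years is a simple rescaling by the assumed generation time. The sketch below applies the 28-year value from the text to a made-up admixture date and standard error; the input numbers are placeholders, not results from the study.

```python
GENERATION_TIME_YEARS = 28  # mean human generation time assumed in the text


def generations_to_years(generations, std_err=0.0, gen_time=GENERATION_TIME_YEARS):
    """Scale an admixture date and its standard error from generations to years."""
    return generations * gen_time, std_err * gen_time


# Hypothetical ALDER estimate: 50 generations with a standard error of 5.
years, years_se = generations_to_years(50.0, 5.0)
print(years, years_se)  # 1400.0 140.0
```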
IBD and HBD sharing
We identified IBD and HBD segments using hap-IBD40 with the following parameters: min-seed: 0.5; max-gap: 1000; min-extend: 0.5; min-output: 1.0; min-markers: 100; min-mac: 2; nthreads: 1. We used the HapMap genetic maps. To minimize false positives, we only considered shared segments with length greater than 2 cM. Then, we filtered out segments that overlapped centromeres (using the GRCh38/hg38 annotation from genome.ucsc.edu/cgi-bin/hgTables). To differentiate between the effects of founder events and recent consanguineous marriages, we stratified the HBD segments by length: long (> 8 cM), indicative of consanguinity, and short (< 8 cM), mostly reflecting founder events. The threshold of 8 cM is motivated by the insight that the expected number of HBD segments > 8 cM in the offspring of fifth-degree cousins or more distant relatives is zero.41 We also used 20 cM as the threshold to define long HBD segments and obtained qualitatively similar results (Data S1, Figures S5.1 and S5.2B). To infer the putative degree of relatedness between two individuals, we computed the total IBD sharing for kth degree cousins using $2G(1/2)^{2(k+1)}$, where G = 6,782 cM is the total diploid autosomal genome size90 and k represents the degree of cousin relationship.91 We note, however, that the expected values assume a random mating population and that a history of founder events could lead to increased genomic sharing; thus, these values should be interpreted with caution.
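To make the two quantities above concrete, here is a minimal sketch of the expected IBD sharing for kth degree cousins and of the length-based stratification of HBD segments. The function names and the example segment lengths are hypothetical, and treating a segment of exactly 8 cM as short is an assumption, since the text does not specify the boundary case.

```python
G_CM = 6782.0  # total diploid autosomal genome size in cM, as given in the text


def expected_ibd_cm(k, genome_cm=G_CM):
    """Expected total IBD (cM) shared by k-th degree cousins: 2G(1/2)^(2(k+1))."""
    return 2.0 * genome_cm * 0.5 ** (2 * (k + 1))


def stratify_hbd(segment_lengths_cm, min_len=2.0, long_cutoff=8.0):
    """Sum HBD segment lengths into (short, long) classes after the 2 cM filter."""
    kept = [x for x in segment_lengths_cm if x > min_len]
    long_total = sum(x for x in kept if x > long_cutoff)
    short_total = sum(x for x in kept if x <= long_cutoff)  # boundary: assumption
    return short_total, long_total


for k in range(1, 5):
    print(k, round(expected_ibd_cm(k), 2))

# Hypothetical HBD segment lengths (cM) for one individual.
short_cm, long_cm = stratify_hbd([1.5, 3.0, 7.9, 12.0, 25.0])
print(short_cm, long_cm)
```

Under this formula, fourth-degree cousins are expected to share roughly 13 cM, which is in line with the > 10 cM criterion used in the Discussion for a putative 4th-degree cousin.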
Loss of function/missense variants
To quantify the mutational burden in India, we used the Variant Effect Predictor (VEP; version 105)92 and LOFTEE (v1.0.3)19 to identify missense and predicted loss-of-function (pLoF) single nucleotide variants (SNVs). VEP annotates each SNV according to its functional effect on gene transcripts. We used GENCODE93 as the transcript annotation reference and focused our analysis on the most severe functional effect per SNV across different transcripts. Besides the functional annotations directly obtained from VEP, we identified pLoF SNVs by coupling VEP with LOFTEE.19 LOFTEE further assesses stop-gained, splice-site-disrupting, or frameshift SNVs identified by VEP and implements a set of filters to infer if a SNV should be considered a pLoF. We intersect the list of pLoF/ missense variants with the RefSeq database56 and the ClinVar database59 (data release of 2023-12-17) to infer the nearest gene and any disease associations respectively. We consider ClinVar status for variants with a review of at least two stars. Information for each of the pLoF/missense variants is available in Table S1.
Inference of archaic ancestry
To learn about the genomic landscape and regional variation in archaic ancestry in Indians and compare it to worldwide populations, we applied hmmix94 to 2,679 phased individuals from India (we retain first-degree relatives (except offspring of trios) as they may have archaic ancestry in different positions). This method uses an outgroup that has a negligible amount of archaic ancestry. We used 426 individuals from the 1000G23, including Yoruba in Ibadan, Nigeria (YRI), Mende in Sierra Leone (MSL), and Esan in Nigeria (ESN), and 64 Africans from HGDP,6 who have less than 1% West Eurasian admixture, including Bantu South Africa, Biaka Pygmy, Mbuti Pygmy, San and Yoruba. We estimated the number of callable sites, the single-nucleotide polymorphism density (as a proxy for per-window mutation rate) and the number of private variants with respect to the outgroup individuals in 1-kb windows across the genome. We obtained regions identified as ‘archaic’ and compared them to the four published high-coverage archaic genomes (Altai Neanderthal,46 Chagyrskaya Neanderthal,47 Vindija Neanderthal48 and Altai Denisovan49) to identify the source of the archaic ancestry (see details in Data S1, Section 6). We further compared archaic segments previously published for 27,566 individuals from Iceland50 that were also inferred using hmmix. The datasets and number of individuals per population used for the analysis of archaic ancestry in non-Africans are reported in Data S1, Table S6.1. Per-individual and genome-wide archaic frequencies are computed using the inferred introgressed segments with a posterior probability ≥ 0.8, considering only the callable genome.
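As an illustration of the final step, the sketch below computes a per-individual archaic fraction from segments that pass the posterior threshold, normalized by the callable genome. The tuple layout (callable base pairs in the segment, posterior probability) and all numbers are hypothetical; this is not the hmmix output format.

```python
def archaic_fraction(segments, callable_genome_bp, min_posterior=0.8):
    """Fraction of the callable genome covered by confidently archaic segments.

    segments: iterable of (callable_bp_in_segment, posterior_probability).
    """
    archaic_bp = sum(bp for bp, post in segments if post >= min_posterior)
    return archaic_bp / callable_genome_bp


# Hypothetical segments for one individual and a toy callable genome size.
segments = [(50_000, 0.95), (30_000, 0.60), (20_000, 0.80)]
fraction = archaic_fraction(segments, callable_genome_bp=5_000_000)
print(f"{fraction:.3%}")
```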
Enrichment analysis
To detect regions that are enriched for archaic ancestry in Indians, we searched for regions with ‘significantly high’ archaic ancestry, defined as regions with an inferred archaic frequency greater than the genome-wide mean plus two standard deviations. For Neanderthals, this threshold corresponds to regions with frequency >7.32% (average genome-wide ancestry = 1.4%, standard deviation = 2.96%); for Denisovans, the threshold is >1.52% (average genome-wide ancestry = 0.135%, standard deviation = 0.693%) (Table S6). We then computed the mean archaic frequency in windows of 40,000 bp (40 kb, as in ref. 58) with a mean callability of >50% per window (i.e., where at least 50% of the bases are included in the accessible genome; Data S1, Section 6). In the 2.56 Gb considered in the study, we found a total of 117.28 Mb enriched for Neanderthal ancestry (4.6% of the 2.56 Gb studied) and 61.52 Mb enriched for Denisovan ancestry (2.4% of the 2.56 Gb studied).
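The thresholds quoted above follow directly from the stated means and standard deviations, as the short check below shows. The window filter is a sketch only: the `(window_id, frequency, callability)` tuple layout and the function names are hypothetical.

```python
def high_frequency_threshold(mean_pct: float, sd_pct: float, n_sd: float = 2.0) -> float:
    """Cutoff for 'significantly high' archaic ancestry: mean + n_sd * SD (in %)."""
    return mean_pct + n_sd * sd_pct


# Values quoted in the text (percent).
neanderthal_cutoff = high_frequency_threshold(1.4, 2.96)    # about 7.32
denisovan_cutoff = high_frequency_threshold(0.135, 0.693)   # about 1.52


def enriched_windows(windows, cutoff_pct, min_callability=0.5):
    """Keep window ids whose callability and mean archaic frequency pass the filters.

    `windows` is an iterable of (window_id, mean_archaic_freq_pct, callability).
    """
    return [wid for wid, freq, call in windows
            if call > min_callability and freq > cutoff_pct]


# Hypothetical 40-kb windows: only "w1" is both callable enough and above the cutoff.
toy_windows = [("w1", 9.0, 0.9), ("w2", 9.0, 0.3), ("w3", 2.0, 0.95)]
toy_hits = enriched_windows(toy_windows, neanderthal_cutoff)

# Share of the 2.56-Gb studied genome that is enriched, in percent.
neanderthal_share = 117.28 / 2560 * 100
denisovan_share = 61.52 / 2560 * 100
```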
Population Branch Statistics
To identify archaic ancestry regions that are enriched in Indians (compared to East Asians and Europeans), we computed the population branch statistic (PBS),68 which measures allele frequency increases in a target population relative to two reference populations. Following Witt et al.,67 we used archaic frequencies instead of genotype frequencies. We estimated archaic ancestry frequencies from the inferred introgressed segments in 1,000-bp windows in Indians (LASI-DAD), East Asians (EAS), and Europeans (EUR) from 1000G, excluding regions with <50% callability.
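For each pair of populations, we computed the population frequency differentiation ($F_{ST}$) using archaic frequencies as:
$$F_{ST} = \frac{\frac{1}{2}\sum_{i=1}^{2} (p_i - \bar{p})^2}{\bar{p}\,(1 - \bar{p})}$$
with $p_i$ the frequency in population $i$ and $\bar{p}$ the mean frequency between the two studied populations.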
To obtain estimates of the population divergence time in units scaled by population size, we used the transformation (as in ref. 68):
$$T = -\log(1 - F_{ST})$$
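Finally, we computed PBS scores as:
$$PBS_{IND} = \frac{T_{IND,EAS} + T_{IND,EUR} - T_{EAS,EUR}}{2}$$
where the subscripts denote the population pair (IND, Indians; EAS, East Asians; EUR, Europeans).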
To assess the significance of the observed PBS scores, we retained regions that fall in the 99th-percentile tail of the distribution of PBS scores across the genome. We considered Neanderthal and Denisovan ancestry separately and restricted the analysis to sites with non-zero archaic ancestry in at least one of the three populations (Indians, EUR, or EAS) (Data S1, Figure S7.5). Based on these criteria, a significant PBS score reflects either regions of high archaic frequency in Indians or regions that are depleted for archaic ancestry in East Asians or Europeans. To focus on the Indian-enriched, high-frequency archaic regions, we applied two additional filters: (a) we retained windows of significantly high-frequency archaic ancestry in India (those detected as enriched in archaic ancestry in our previous analysis), and (b) we retained windows where the archaic allele frequency in Indians is greater than in both EAS and EUR. After filtering, the Indian-enriched archaic DNA comprises 10.7 Mb of Neanderthal ancestry (9.1% of the total high-frequency Neanderthal regions) and 5.5 Mb of Denisovan ancestry (8.9% of the total high-frequency Denisovan regions).
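The PBS computation described in this section can be sketched per window as below. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, and the pairwise $F_{ST}$ uses a variance-based two-population form, which is an assumption here. The log transformation and the three-branch combination follow the standard PBS definition.

```python
import math


def fst_two_pop(p1: float, p2: float) -> float:
    """Variance-based F_ST between two populations (assumed form)."""
    p_bar = (p1 + p2) / 2
    if p_bar <= 0.0 or p_bar >= 1.0:
        return 0.0  # both populations fixed for the same state: no differentiation
    variance = ((p1 - p_bar) ** 2 + (p2 - p_bar) ** 2) / 2
    return variance / (p_bar * (1 - p_bar))


def branch_length(fst: float) -> float:
    """Divergence time scaled by population size: T = -log(1 - F_ST)."""
    fst = min(fst, 1 - 1e-12)  # guard against log(0) when F_ST reaches 1
    return -math.log(1 - fst)


def pbs(p_target: float, p_ref1: float, p_ref2: float) -> float:
    """Population branch statistic for the target population."""
    t_target_ref1 = branch_length(fst_two_pop(p_target, p_ref1))
    t_target_ref2 = branch_length(fst_two_pop(p_target, p_ref2))
    t_ref1_ref2 = branch_length(fst_two_pop(p_ref1, p_ref2))
    return (t_target_ref1 + t_target_ref2 - t_ref1_ref2) / 2


# Hypothetical archaic frequencies in one window: Indians, EAS, EUR.
pbs_enriched = pbs(0.30, 0.05, 0.05)   # elevated in the target population
pbs_flat = pbs(0.10, 0.10, 0.10)       # identical frequencies everywhere
```

A window with a large PBS still has to pass the two additional filters in the text (high frequency in India, and a higher frequency in Indians than in both reference groups) before it is counted as Indian-enriched.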
Gene Ontology enrichment analysis
To assess the functional impact of archaic introgression, we mapped the genes included in the regions of interest using RefSeq gene annotations. We then conducted pathway enrichment analysis using GeneSCF,95 covering all GO categories (biological process, molecular function, and cellular component) with the option ‘-db=GO_all’ (GO database release date of 2023-01-05). We applied the Benjamini-Hochberg method96 with a 5% FDR to correct for multiple testing.
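For readers unfamiliar with the multiple-testing step, the Benjamini-Hochberg step-up rule can be written in a few lines. This is a generic illustration of the procedure, not GeneSCF's internal implementation; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues, fdr=0.05):
    """Return a list of booleans marking which hypotheses are rejected at the given FDR."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p-value

    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * fdr.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * fdr:
            max_rank = rank

    # Reject every hypothesis ranked at or below max_rank.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical enrichment p-values for eight GO terms.
toy_pvalues = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
toy_rejected = benjamini_hochberg(toy_pvalues)
```

Because the rule is step-up, a p-value that misses its own rank threshold is still rejected when a larger p-value further down the sorted list passes its threshold.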
Timing of Out-of-Africa migration
We inferred the minimum coalescence time between non-African individuals and the sub-Saharan African-related individuals in the outgroup (n = 490). hmmix classifies the genome into ‘modern human’ and ‘archaic’ states. The emission parameter for the modern human state is informative about the minimum coalescence time between non-African individuals and sub-Saharan African-related individuals. It is given by $\text{emission}_{\text{human}} = \mu \cdot L \cdot T_{\text{outgroup}}$, where $L$ is the window size (1 kb), $\mu$ is the mutation rate ($0.45 \times 10^{-9}$ per base pair per year55), and $T_{\text{outgroup}}$ is the time over which mutations not seen in the outgroup have accumulated in the test individual, thus reflecting the minimum coalescence time between the test individual and the outgroup. Any systematic difference between datasets can affect the emission parameter. We therefore applied additional filters and corrections to compare data across populations and datasets: (i) we removed individuals with >1% sub-Saharan African-related ancestry to minimize the effect of recent gene flow on the minimum coalescence time estimates. To infer sub-Saharan African-related ancestry in non-Africans, we used ADMIXTURE (v1.3.0)25 in unsupervised mode (k = 2) with data from a merge of HGDP, 1000G, and LASI-DAD, focusing on the SNPs found in the 1240K array;85 (ii) to minimize the effect of archaic ancestry spillover on the emission parameter for the modern human state, we corrected for the amount of high-confidence archaic segments (posterior probability > 0.9); and (iii) we corrected for the phasing drop-out rate and the removal of multi-allelic sites in HGDP and LASI-DAD. A detailed description of this analysis is provided in Data S1, Section 9.
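Rearranging the emission relationship gives the coalescence time directly, which the following sketch makes explicit. The function name is hypothetical and the input emission value is an arbitrary illustrative number, not a result from the study; the corrections (i) to (iii) described above are not modelled here.

```python
MU_PER_BP_PER_YEAR = 0.45e-9   # mutation rate quoted in the text
WINDOW_BP = 1_000              # window size L (1 kb)


def min_coalescence_years(emission_human: float,
                          mu: float = MU_PER_BP_PER_YEAR,
                          window_bp: int = WINDOW_BP) -> float:
    """Invert emission_human = mu * L * T_outgroup to recover T_outgroup in years."""
    return emission_human / (mu * window_bp)


# Illustrative input only: 0.0225 private variants per 1-kb window.
example_years = min_coalescence_years(0.0225)
```

The relationship is linear, so any bias that shifts the emission parameter by a given percentage shifts the inferred time by the same percentage, which is why the dataset-level corrections matter.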
Supplementary Material
Figure S1. Sharing of Neanderthal and Denisovan ancestry sequences found in worldwide individuals in LASI-DAD, 1000G and HGDP. (A) Amount of Neanderthal sequence found at a posterior probability cutoff of 0.8 (y-axis) that is unique to any region, found in multiple regions (at least two) where one includes India (LASI-DAD dataset), or found in multiple regions (at least two) where India is not included. (B) Same as (A) but for Denisovan sequence. Horizontal lines indicate the total length of the assembled Neanderthal (1,679 Mb) and Denisovan (1,080 Mb) genomes using the LASI-DAD, HGDP and 1000G datasets. Numbers above bars are percentages of the total assembled Neanderthal or Denisovan sequence.
Data S1: Supplementary information: Sections 1–9, Description of sample collection, data filtering and quality control, supplementary methods and additional statistical analyses, related to STAR Methods.
Table S1: Summary of loss-of-function (LoF) and missense variants in LASI-DAD, related to Figure 4. (A) For each variant, we report its position, its frequency in LASI-DAD, the number of homozygous genotypes, the gene it is located in, and the medical relevance of that gene, if available. (B) We report, per gene, the number of LoF and missense variants overlapping the gene.
Table S2: Total amount of reconstructed Neanderthal and Denisovan genome, related to Figure 3. Amount of reconstructed Neanderthal and Denisovan sequence for each population and dataset. We report the number of archaic overlapping clusters of segments and how much sequence they span. We report the numbers for phased and unphased data with posterior probability cutoff at 0.5, 0.8 and 0.9.
Table S3: List of Genes enriched in Neanderthal and Denisovan ancestry (High frequency and PBS), related to Figure 4. We show genes enriched in Neanderthal and Denisovan ancestry in South Asians, based on high frequency (A-B) or PBS analysis (C-D).
Table S4: GO enrichment analysis for archaic ancestry segments in India, related to Figure 4. We show high-frequency segments of Neanderthal (A) and Denisovan (B) ancestries and archaic-enriched segments in Indians compared to Europeans and East Asians (PBS analysis) for Neanderthal (C) and Denisovan (D) ancestries.
Table S5: Positions of archaic deserts in HGDP, 1000G and LASI-DAD, related to Figure 4. Chromosome positions (in hg38) for Neanderthal ancestry deserts (A) and Denisovan ancestry deserts (B). For all populations except Oceania and the Middle East, we combine the individuals from 1000G and HGDP.
Table S6: Average amount of archaic ancestry in Mb using different approaches in LASI-DAD and HGDP/1000G, related to Data S1. (A) LASI-DAD using Approach 1, (B) HGDP/1000G using Approach 1, (C) LASI-DAD using Approach 2, and (D) HGDP/1000G using Approach 2, for phased and unphased data using different posterior probability cutoffs. We assumed an accessible genome size of 2 × 2,550,757,832 bp.
SUPPLEMENTAL INFORMATION
Supplemental information can be found online at https://doi.org/10.1016/j.cell.2025.04.027.
ACKNOWLEDGMENTS
We thank the participants of the LASI and LASI-DAD studies for being part of our research project and donating their samples and time (for detailed interviews). We thank Vagheesh Narasimhan, Michael Frachetti, and J. Mark Kenoyer for helpful discussion about the archaeological connections between Sarazm, Iran, and South Asia. We thank Kumarasamy Thangaraj for helpful discussion about founder events and regional distribution of the Vysya community in India. We also thank members of the LASI-DAD advising committee and Moorjani lab for helpful feedback on the analysis and results throughout the project. We thank Yulin Zhang, Elena Zavala, Monty Slatkin, Ben Peter, David Reich, Rasmus Nielsen, Vagheesh Narasimhan, and Shai Carmi for helpful comments on the manuscript. The Longitudinal Aging Study in India-Diagnostic Assessment of Dementia data (https://doi.org/10.25549/5hhx-s820) is sponsored by the National Institute on Aging (grant number R01AG051125, RF1AG055273, and U01AG064948) and is conducted by the University of Southern California.
P.M. and E.K. were supported by NIA U01AG064948 and NIH R35GM142978. L. Skov was supported by NovoNordisk Hallas-Møller Emerging Investigator NNF23OC0081723 and U01AG064948.
DECLARATION OF INTERESTS
The authors declare no competing interests.
REFERENCES
- 1. Mastana SS (2014). Unity in diversity: an overview of the genomic anthropology of India. Ann. Hum. Biol. 41, 287–299. 10.3109/03014460.2014.922615.
- 2. Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, et al.; 1000 Genomes Project Consortium (2015). A global reference for human genetic variation. Nature 526, 68–74. 10.1038/nature15393.
- 3. Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, Motyer A, Vukcevic D, Delaneau O, O’Connell J, et al. (2018). The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209. 10.1038/s41586-018-0579-z.
- 4. Taliun D, Harris DN, Kessler MD, Carlson J, Szpiech ZA, Torres R, Taliun SAG, Corvelo A, Gogarten SM, Kang HM, et al. (2021). Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299. 10.1038/s41586-021-03205-y.
- 5. Mallick S, Li H, Lipson M, Mathieson I, Gymrek M, Racimo F, Zhao M, Chennagiri N, Nordenfelt S, Tandon A, et al. (2016). The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 538, 201–206. 10.1038/nature18964.
- 6. Bergström A, McCarthy SA, Hui R, Almarri MA, Ayub Q, Danecek P, Chen Y, Felkel S, Hallast P, Kamm J, et al. (2020). Insights into human genetic variation and population history from 929 diverse genomes. Science 367, eaay5012. 10.1126/science.aay5012.
- 7. GenomeAsia100K Consortium (2019). The GenomeAsia 100K Project enables genetic discoveries across Asia. Nature 576, 106–111. 10.1038/s41586-019-1793-z.
- 8. Wall JD, Sathirapongsasuti JF, Gupta R, Rasheed A, Venkatesan R, Belsare S, Menon R, Phalke S, Mittal A, Fang J, et al. (2023). South Asian medical cohorts reveal strong founder effects and high rates of homozygosity. Nat. Commun. 14, 3377. 10.1038/s41467-023-38766-1.
- 9. Rasmussen M, Guo X, Wang Y, Lohmueller KE, Rasmussen S, Albrechtsen A, Skotte L, Lindgreen S, Metspalu M, Jombart T, et al. (2011). An Aboriginal Australian genome reveals separate human dispersals into Asia. Science 334, 94–98. 10.1126/science.1211177.
- 10. Pagani L, Lawson DJ, Jagoda E, Mörseburg A, Eriksson A, Mitt M, Clemente F, Hudjashov G, DeGiorgio M, Saag L, et al. (2016). Genomic analyses inform on migration events during the peopling of Eurasia. Nature 538, 238–242. 10.1038/nature19792.
- 11. Sohail M, Maier RM, Ganna A, Bloemendal A, Martin AR, Turchin MC, Chiang CW, Hirschhorn J, Daly MJ, Patterson N, et al. (2019). Polygenic adaptation on height is overestimated due to uncorrected stratification in genome-wide association studies. eLife 8, e39702. 10.7554/eLife.39702.
- 12. Berg JJ, Harpak A, Sinnott-Armstrong N, Joergensen AM, Mostafavi H, Field Y, Boyle EA, Zhang X, Racimo F, Pritchard JK, et al. (2019). Reduced signal for polygenic adaptation of height in UK Biobank. eLife 8, e39725. 10.7554/eLife.39725.
- 13. Pasaniuc B, Zaitlen N, Lettre G, Chen GK, Tandon A, Kao WHL, Ruczinski I, Fornage M, Siscovick DS, Zhu X, et al. (2011). Enhanced statistical tests for GWAS in admixed populations: assessment using African Americans from CARe and a Breast Cancer Consortium. PLoS Genet. 7, e1001371. 10.1371/journal.pgen.1001371.
- 14. Atkinson EG, Maihofer AX, Kanai M, Martin AR, Karczewski KJ, Santoro ML, Ulirsch JC, Kamatani Y, Okada Y, Finucane HK, et al. (2021). Tractor uses local ancestry to enable the inclusion of admixed individuals in GWAS and to boost power. Nat. Genet. 53, 195–204. 10.1038/s41588-020-00766-y.
- 15. Simonti CN, Vernot B, Bastarache L, Bottinger E, Carrell DS, Chisholm RL, Crosslin DR, Hebbring SJ, Jarvik GP, Kullo IJ, et al. (2016). The phenotypic legacy of admixture between modern humans and Neandertals. Science 351, 737–741. 10.1126/science.aad2149.
- 16. Zeberg H, and Pääbo S (2021). A genomic region associated with protection against severe COVID-19 is inherited from Neandertals. Proc. Natl. Acad. Sci. USA 118, e2026309118. 10.1073/pnas.2026309118.
- 17. Huerta-Sánchez E, Jin X, Asan, Bianba Z, Peter BM, Vinckenbosch N, Liang Y, Yi X, He M, Somel M, et al. (2014). Altitude adaptation in Tibetans caused by introgression of Denisovan-like DNA. Nature 512, 194–197. 10.1038/nature13408.
- 18. Lee J, Banerjee J, Khobragade PY, Angrisani M, and Dey AB (2019). LASI-DAD study: a protocol for a prospective cohort study of late-life cognition and dementia in India. BMJ Open 9, e030300. 10.1136/bmjopen-2019-030300.
- 19. Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alföldi J, Wang Q, Collins RL, Laricchia KM, Ganna A, Birnbaum DP, et al. (2020). The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443. 10.1038/s41586-020-2308-7.
- 20. Delaneau O, Zagury J-F, Robinson MR, Marchini JL, and Dermitzakis ET (2019). Accurate, scalable and integrative haplotype estimation. Nat. Commun. 10, 5436. 10.1038/s41467-019-13225-y.
- 21. Singh KS (1994). The Scheduled Tribes. Vol. III (Oxford University Press).
- 22. Singh KS (1993). The Scheduled Castes. Vol. II (Oxford University Press).
- 23. Byrska-Bishop M, Evani US, Zhao X, Basile AO, Abel HJ, Regier AA, Corvelo A, Clarke WE, Musunuri R, Nagulapalli K, et al. (2022). High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. Cell 185, 3426–3440.e19. 10.1016/j.cell.2022.08.004.
- 24. Reich D, Price AL, and Patterson N (2008). Principal component analysis of genetic data. Nat. Genet. 40, 491–492. 10.1038/ng0508-491.
- 25. Alexander DH, Novembre J, and Lange K (2009). Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655–1664. 10.1101/gr.094052.109.
- 26. Reich D, Thangaraj K, Patterson N, Price AL, and Singh L (2009). Reconstructing Indian population history. Nature 461, 489–494. 10.1038/nature08365.
- 27. Moorjani P, Thangaraj K, Patterson N, Lipson M, Loh PR, Govindaraj P, Berger B, Reich D, and Singh L (2013). Genetic evidence for recent population mixture in India. Am. J. Hum. Genet. 93, 422–438. 10.1016/j.ajhg.2013.07.006.
- 28. Nakatsuka N, Moorjani P, Rai N, Sarkar B, Tandon A, Patterson N, Bhavani GS, Girisha KM, Mustak MS, Srinivasan S, et al. (2017). The promise of discovering population-specific disease-associated genes in South Asia. Nat. Genet. 49, 1403–1407. 10.1038/ng.3917.
- 29. Basu A, Sarkar-Roy N, and Majumder PP (2016). Genomic reconstruction of the history of extant populations of India reveals five distinct ancestral components and a complex structure. Proc. Natl. Acad. Sci. USA 113, 1594–1599. 10.1073/pnas.1513197113.
- 30. Narasimhan VM, Patterson N, Moorjani P, Rohland N, Bernardos R, Mallick S, Lazaridis I, Nakatsuka N, Olalde I, Lipson M, et al. (2019). The formation of human populations in South and Central Asia. Science 365, eaat7487. 10.1126/science.aat7487.
- 31. Loh P-R, Lipson M, Patterson N, Moorjani P, Pickrell JK, Reich D, and Berger B (2013). Inferring Admixture Histories of Human Populations Using Linkage Disequilibrium. Genetics 193, 1233–1254. 10.1534/genetics.112.147330.
- 32. Karlsson EK, Harris JB, Tabrizi S, Rahman A, Shlyakhter I, Patterson N, O’Dushlaine C, Schaffner SF, Gupta S, Chowdhury F, et al. (2013). Natural selection in a Bangladeshi population from the cholera-endemic Ganges River Delta. Sci. Transl. Med. 5, 192ra86. 10.1126/scitranslmed.3006338.
- 33. Fuller DQ (2006). Agricultural origins and frontiers in south Asia: A working synthesis. J. World Prehist. 20, 1–86. 10.1007/s10963-006-9006-8.
- 34.dia Partners, LLC).
- 35. Haak W, Lazaridis I, Patterson N, Rohland N, Mallick S, Llamas B, Brandt G, Nordenfelt S, Harney E, Stewardson K, et al. (2015). Massive migration from the steppe was a source for Indo-European languages in Europe. Nature 522, 207–211. 10.1038/nature14317.
- 36. Patterson N, Moorjani P, Luo Y, Mallick S, Rohland N, Zhan Y, Genschoreck T, Webster T, and Reich D (2012). Ancient admixture in human history. Genetics 192, 1065–1093. 10.1534/genetics.112.145037.
- 37. IIT Gandhinagar (2022). Ornaments in Prehistoric and Early Historic South Asia | Prof JM Kenoyer | ASC IITGN. YouTube. https://www.youtube.com/watch?v=kl7R3Nm-Fbs.
- 38. Tournebize R, Chu G, and Moorjani P (2022). Reconstructing the history of founder events using genome-wide patterns of allele sharing across individuals. PLoS Genet. 18, e1010243. 10.1371/journal.pgen.1010243.
- 39. Slatkin M (2004). A population-genetic test of founder effects and implications for Ashkenazi Jewish diseases. Am. J. Hum. Genet. 75, 282–293. 10.1086/423146.
- 40. Zhou Y, Browning SR, and Browning BL (2020). A Fast and Simple Method for Detecting Identity-by-Descent Segments in Large-Scale Data. Am. J. Hum. Genet. 106, 426–437. 10.1016/j.ajhg.2020.02.010.
- 41. Ringbauer H, Novembre J, and Steinrücken M (2021). Parental relatedness through time revealed by runs of homozygosity in ancient DNA. Nat. Commun. 12, 5425. 10.1038/s41467-021-25289-w.
- 42. Acharya S, and Sahoo H (2021). Consanguineous Marriages in India: Prevalence and Determinants. J. Health Manag. 23, 631–648. 10.1177/09720634211050458.
- 43. Nait Saada J, Kalantzis G, Shyr D, Cooper F, Robinson M, Gusev A, and Palamara PF (2020). Identity-by-descent detection across 487,409 British samples reveals fine scale population structure and ultra-rare variant associations. Nat. Commun. 11, 6130. 10.1038/s41467-020-19588-x.
- 44. Gazal S, Sahbatou M, Babron M-C, Génin E, and Leutenegger A-L (2015). High level of inbreeding in final phase of 1000 Genomes Project. Sci. Rep. 5, 17453. 10.1038/srep17453.
- 45. Witt KE, Villanea F, Loughran E, Zhang X, and Huerta-Sanchez E (2022). Apportioning archaic variants among modern populations. Philos. Trans. R. Soc. Lond. B Biol. Sci. 377, 20200411. 10.1098/rstb.2020.0411.
- 46. Prüfer K, Racimo F, Patterson N, Jay F, Sankararaman S, Sawyer S, Heinze A, Renaud G, Sudmant PH, de Filippo C, et al. (2014). The complete genome sequence of a Neanderthal from the Altai Mountains. Nature 505, 43–49. 10.1038/nature12886.
- 47. Mafessoni F, Grote S, de Filippo C, Slon V, Kolobova KA, Viola B, Markin SV, Chintalapati M, Peyrégne S, Skov L, et al. (2020). A high-coverage Neandertal genome from Chagyrskaya Cave. Proc. Natl. Acad. Sci. USA 117, 15132–15136. 10.1073/pnas.2004944117.
- 48. Prüfer K, de Filippo C, Grote S, Mafessoni F, Korlević P, Hajdinjak M, Vernot B, Skov L, Hsieh P, Peyrégne S, et al. (2017). A high-coverage Neandertal genome from Vindija Cave in Croatia. Science 358, 655–658. 10.1126/science.aao1887.
- 49. Meyer M, Kircher M, Gansauge MT, Li H, Racimo F, Mallick S, Schraiber JG, Jay F, Prüfer K, de Filippo C, et al. (2012). A high-coverage genome sequence from an archaic Denisovan individual. Science 338, 222–226. 10.1126/science.1224344.
- 50. Skov L, Coll Macià M, Sveinbjörnsson G, Mafessoni F, Lucotte EA, Einarsdóttir MS, Jonsson H, Halldorsson B, Gudbjartsson DF, Helgason A, et al. (2020). The nature of Neanderthal introgression revealed by 27,566 Icelandic genomes. Nature 582, 78–83. 10.1038/s41586-020-2225-9.
- 51. Clarkson C, Harris C, Li B, Neudorf CM, Roberts RG, Lane C, Norman K, Pal J, Jones S, Shipton C, et al. (2020). Human occupation of northern India spans the Toba super-eruption ~74,000 years ago. Nat. Commun. 11, 961. 10.1038/s41467-020-14668-4.
- 52. Browning SR, Browning BL, Zhou Y, Tucci S, and Akey JM (2018). Analysis of Human Sequence Data Reveals Two Pulses of Archaic Denisovan Admixture. Cell 173, 53–61.e9. 10.1016/j.cell.2018.02.031.
- 53. Mondal M, Casals F, Xu T, Dall’Olio GM, Pybus M, Netea MG, Comas D, Laayouni H, Li Q, Majumder PP, et al. (2016). Genomic analysis of Andamanese provides insights into ancient human migration into Asia and adaptation. Nat. Genet. 48, 1066–1070. 10.1038/ng.3621.
- 54. Sankararaman S, Mallick S, Patterson N, and Reich D (2016). The Combined Landscape of Denisovan and Neanderthal Ancestry in Present-Day Humans. Curr. Biol. 26, 1241–1247. 10.1016/j.cub.2016.03.037.
- 55. Jónsson H, Sulem P, Kehr B, Kristmundsdottir S, Zink F, Hjartarson E, Hardarson MT, Hjorleifsson KE, Eggertsson HP, Gudjonsson SA, et al. (2017). Parental influence on human germline de novo mutations in 1,548 trios from Iceland. Nature 549, 519–522. 10.1038/nature24018.
- 56. O’Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, Rajput B, Robbertse B, Smith-White B, Ako-Adjei D, et al. (2016). Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 44, D733–D745. 10.1093/nar/gkv1189.
- 57. MacArthur DG, Balasubramanian S, Frankish A, Huang N, Morris J, Walter K, Jostins L, Habegger L, Pickrell JK, Montgomery SB, et al. (2012). A systematic survey of loss-of-function variants in human protein-coding genes. Science 335, 823–828. 10.1126/science.1215040.
- 58. Lim ET, Würtz P, Havulinna AS, Palta P, Tukiainen T, Rehnström K, Esko T, Mägi R, Inouye M, Lappalainen T, et al. (2014). Distribution and medical impact of loss-of-function variants in the Finnish founder population. PLoS Genet. 10, e1004494. 10.1371/journal.pgen.1004494.
- 59. Landrum MJ, Lee JM, Benson M, Brown GR, Chao C, Chitipiralla S, Gu B, Hart J, Hoffman D, Jang W, et al. (2018). ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 46, D1062–D1067. 10.1093/nar/gkx1153.
- 60. Thein SL (2013). The molecular basis of β-thalassemia. Cold Spring Harb. Perspect. Med. 3, a011700. 10.1101/cshperspect.a011700.
- 61. Kelsell DP, Dunlop J, Stevens HP, Lench NJ, Liang JN, Parry G, Mueller RF, and Leigh IM (1997). Connexin 26 mutations in hereditary non-syndromic sensorineural deafness. Nature 387, 80–83. 10.1038/387080a0.
- 62. Riordan JR, Rommens JM, Kerem B, Alon N, Rozmahel R, Grzelczak Z, Zielenski J, Lok S, Plavsic N, Chou JL, et al. (1989). Identification of the cystic fibrosis gene: cloning and characterization of complementary DNA. Science 245, 1066–1073. 10.1126/science.2475911.
- 63.bolic and Molecular Bases of Inherited Disease, Valle DL, et al., eds. (McGraw-Hill Education).
- 64. Manoharan I, Wieseler S, Layer PG, Lockridge O, and Boopathy R (2006). Naturally occurring mutation Leu307Pro of human butyrylcholinesterase in the Vysya community of India. Pharmacogenet. Genomics 16, 461–468. 10.1097/01.fpc.0000197464.37211.77.
- 65. Pandey B, and Seto KC (2015). Urbanization and agricultural land loss in India: comparing satellite estimates with census data. J. Environ. Manage. 148, 53–66. 10.1016/j.jenvman.2014.05.014.
- 66. Racimo F, Marnetto D, and Huerta-Sánchez E (2017). Signatures of Archaic Adaptive Introgression in Present-Day Human Populations. Mol. Biol. Evol. 34, 296–317. 10.1093/molbev/msw216.
- 67. Witt KE, Funk A, Añorve-Garibay V, Fang LL, and Huerta-Sánchez E (2023). The Impact of Modern Admixture on Archaic Human Ancestry in Human Populations. Genome Biol. Evol. 15, evad066. 10.1093/gbe/evad066.
- 68. Yi X, Liang Y, Huerta-Sanchez E, Jin X, Cuo ZXP, Pool JE, Xu X, Jiang H, Vinckenbosch N, Korneliussen TS, et al. (2010). Sequencing of 50 human exomes reveals adaptation to high altitude. Science 329, 75–78. 10.1126/science.1190371.
- 69. Ozato K, Shin DM, Chang TH, and Morse HC 3rd (2008). TRIM family proteins and their emerging roles in innate immunity. Nat. Rev. Immunol. 8, 849–860. 10.1038/nri2413.
- 70. COVID-19 Host Genetics Initiative (2020). The COVID-19 Host Genetics Initiative, a global initiative to elucidate the role of host genetic factors in susceptibility and severity of the SARS-CoV-2 virus pandemic. Eur. J. Hum. Genet. 28, 715–718.
- 71. Vernot B, Tucci S, Kelso J, Schraiber JG, Wolf AB, Gittelman RM, Dannemann M, Grote S, McCoy RC, Norton H, et al. (2016). Excavating Neandertal and Denisovan DNA from the genomes of Melanesian individuals. Science 352, 235–239. 10.1126/science.aad9416.
- 72. Chen L, Wolf AB, Fu W, Li L, and Akey JM (2020). Identifying and Interpreting Apparent Neanderthal Ancestry in African Individuals. Cell 180, 677–687.e16. 10.1016/j.cell.2020.01.012.
- 73. Juric I, Aeschbacher S, and Coop G (2016). The Strength of Selection against Neanderthal Introgression. PLoS Genet. 12, e1006340. 10.1371/journal.pgen.1006340.
- 74. Spengler RN, and Willcox G (2013). Archaeobotanical results from Sarazm, Tajikistan, an Early Bronze Age Settlement on the edge: Agriculture and exchange. Environ. Archaeol. 18, 211–221. 10.1179/1749631413Y.0000000008.
- 75. Frachetti MD (2012). Multiregional Emergence of Mobile Pastoralism and Nonuniform Institutional Complexity across Eurasia. Curr. Anthropol. 53, 2–38. 10.1086/663692.
- 76. Machha P, Gopalan A, Elangovan Y, Veeravalli SCM, Sowpati DT, and Thangaraj K (2025). Endogamy and high prevalence of deleterious mutations in India: evidence from strong founder events. J. Genet. Genomics 52, 570–582. 10.1101/2024.08.21.24312342.
- 77. Abu-Amara H, Zhao W, Li Z, Leung YY, Schellenberg GD, Wang LS, Moorjani P, Dey AB, Dey S, Zhou X, et al. (2024). Region-based analysis with functional annotation identifies genes associated with cognitive function in South Asians from India. Preprint at medRxiv. 10.1101/2024.01.18.24301482.
- 78. Lee J, Meijer E, Langa KM, Ganguli M, Varghese M, Banerjee J, Khobragade P, Angrisani M, Kurup R, Chakrabarti SS, et al. (2023). Prevalence of dementia in India: National and state estimates from a nationwide study. Alzheimers Dement. 19, 2898–2912. 10.1002/alz.12928.
- 79. Perianayagam A, Bloom DE, Lee J, Sekher TV, Mohanty SK, and Agarwal A (2019). Longitudinal Aging Study in India. In Encyclopedia of Gerontology and Population Aging, Gu D and Dupre ME, eds. (Springer International Publishing), pp. 1–5. 10.1007/978-3-319-69892-2_343-1.
- 80. Leung YY, Valladares O, Chou YF, Lin HJ, Kuzma AB, Cantwell L, Qu L, Gangadharan P, Salerno WJ, Schellenberg GD, et al. (2019). VCPA: genomic variant calling pipeline and data management tool for Alzheimer’s Disease Sequencing Project. Bioinformatics 35, 1768–1770. 10.1093/bioinformatics/bty894.
- 81. NIAGADS (2021). NG00106 LASI-DAD GWAS and imputation. 10.60859/qfhc-rv47.
- 82. Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, and Chen WM (2010). Robust relationship inference in genome-wide association studies. Bioinformatics 26, 2867–2873. 10.1093/bioinformatics/btq559.
- 83.Visscher PM, Medland SE, Ferreira MAR, Morley KI, Zhu G, Cornes BK, Montgomery GW, and Martin NG (2006). Assumptionfree estimation of heritability from genome-wide identity-by-descent sharing between full siblings. PLoS Genet. 2, e41. 10.1371/journal.pgen.0020041. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 84.Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, Gibbs RA, Belmont JW, Boudreau A, Hardenbol P, et al. ; International; HapMap Consortium (2007). A second generation human haplotype map of over 3.1 million SNPs. Nature 449, 851–861. 10.1038/nature06258. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 85.Mallick S, Micco A, Mah M, Ringbauer H, Lazaridis I, Olalde I, Patterson N, and Reich D (2024). The Allen Ancient DNA Resource (AADR) a curated compendium of ancient human genomes. Sci. Data 11, 182. 10.1038/s41597-024-03031-7.
- 86.Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, and Reich D (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909. 10.1038/ng1847.
- 87.Patterson N, Price AL, and Reich D (2006). Population Structure and Eigenanalysis. PLoS Genet. 2, e190. 10.1371/journal.pgen.0020190.
- 88.Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, and Lee JJ (2015). Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience 4, 7. 10.1186/s13742-015-0047-8.
- 89.Moorjani P, Sankararaman S, Fu Q, Przeworski M, Patterson N, and Reich D (2016). A genetic method for dating ancient genomes provides a direct estimate of human generation interval in the last 45,000 years. Proc. Natl. Acad. Sci. USA 113, 5652–5657. 10.1073/pnas.1514696113.
- 90.Halldorsson BV, Palsson G, Stefansson OA, Jonsson H, Hardarson MT, Eggertsson HP, Gunnarsson B, Oddsson A, Halldorsson GH, Zink F, et al. (2019). Characterizing mutagenic effects of recombination through a sequence-level genetic map. Science 363, eaau1043. 10.1126/science.aau1043.
- 91.Gusev A, Palamara PF, Aponte G, Zhuang Z, Darvasi A, Gregersen P, and Pe’er I (2012). The architecture of long-range haplotypes shared within and across populations. Mol. Biol. Evol. 29, 473–486. 10.1093/molbev/msr133.
- 92.McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GRS, Thormann A, Flicek P, and Cunningham F (2016). The Ensembl Variant Effect Predictor. Genome Biol. 17, 122. 10.1186/s13059-016-0974-4.
- 93.Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, Aken BL, Barrell D, Zadissa A, Searle S, et al. (2012). GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 22, 1760–1774. 10.1101/gr.135350.111.
- 94.Skov L, Hui R, Shchur V, Hobolth A, Scally A, Schierup MH, and Durbin R (2018). Detecting archaic introgression using an unadmixed outgroup. PLoS Genet. 14, e1007641. 10.1371/journal.pgen.1007641.
- 95.Subhash S, and Kanduri C (2016). GeneSCF: a real-time based functional enrichment tool with support for multiple organisms. BMC Bioinformatics 17, 365. 10.1186/s12859-016-1250-z.
- 96.Benjamini Y, and Hochberg Y (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. R. Stat. Soc. Series B Stat. Methodol. 57, 289–300. 10.1111/j.2517-6161.1995.tb02031.x.
Associated Data
Supplementary Materials
Figure S1. Sharing of Neanderthal and Denisovan ancestry sequences found in worldwide individuals in LASI-DAD, 1000G, and HGDP. (A) Amount of Neanderthal sequence found at a posterior probability cutoff of 0.8 (y axis) that is unique to a single region, shared across at least two regions including India (LASI-DAD dataset), or shared across at least two regions not including India. (B) Same as (A) but for Denisovan sequence. Horizontal lines indicate the total length of the Neanderthal (1,679 Mb) and Denisovan (1,080 Mb) genome assembled using the LASI-DAD, HGDP, and 1000G datasets. Numbers above bars are percentages of the total assembled Neanderthal or Denisovan sequence.
Data S1: Supplementary information: Sections 1–9, description of sample collection, data filtering and quality control, supplementary methods, and additional statistical analyses, related to STAR Methods.
Table S1: Summary of loss-of-function (LoF) and missense variants in LASI-DAD, related to Figure 4. (A) For each variant, we report its position, its frequency in LASI-DAD, the number of homozygous genotypes, the genes in which it is located, and the medical relevance of those genes, if available. (B) For each gene, we report the number of LoF and missense variants overlapping the gene.
Table S2: Total amount of reconstructed Neanderthal and Denisovan genome, related to Figure 3. Amount of reconstructed Neanderthal and Denisovan sequence for each population and dataset. We report the number of overlapping clusters of archaic segments and how much sequence they span. We report the numbers for phased and unphased data at posterior probability cutoffs of 0.5, 0.8, and 0.9.
Table S3: List of genes enriched in Neanderthal and Denisovan ancestry (high frequency and PBS), related to Figure 4. We show genes enriched in Neanderthal and Denisovan ancestry in South Asians, based on high frequency (A and B) or PBS analysis (C and D).
Table S4: GO enrichment analysis for archaic ancestry segments in India, related to Figure 4. We show high frequency segments of Neanderthal (A) and Denisovan (B) ancestries and archaic enriched segments in Indians compared to European and East Asians (PBS analysis) for Neanderthal (C) and Denisovan (D) ancestries.
Table S5: Positions of archaic deserts in HGDP, 1000G, and LASI-DAD, related to Figure 4. Chromosome positions (in hg38) for Neanderthal ancestry deserts (A) and Denisovan ancestry deserts (B). For all populations except Oceania and the Middle East, we combine the individuals from 1000G and HGDP.
Table S6: Average amount of archaic ancestry in Mb using different approaches in LASI-DAD and HGDP/1000G, related to Data S1. (A) LASI-DAD using Approach 1, (B) HGDP/1000G using Approach 1, (C) LASI-DAD using Approach 2, and (D) HGDP/1000G using Approach 2. Results are reported for phased and unphased data at different posterior probability cutoffs, assuming an accessible genome size of 2 × 2,550,757,832 bp.
Data Availability Statement
Whole-genome sequences are available through the National Institute on Aging Genetics of Alzheimer’s Disease Data Storage Site (NIAGADS) under the accession NIAGADS: ng00067. Individual-level summary statistics are also available through NIAGADS under the accession NIAGADS: ng00171. These data can be obtained through the GCAD at the University of Pennsylvania by following the data request instructions: https://dss.niagads.org/documentation/data-application-and-submission/application-instructions/. This study also uses previously published genomic datasets and publicly available reference databases. Whole-genome sequence data were obtained from the 1000 Genomes Project and the Human Genome Diversity Project (HGDP) via the International Genome Sample Resource (IGSR): https://www.internationalgenome.org/data. Additional genome sequence data from the GenomeAsia Pilot Project were accessed through the European Genome-Phenome Archive (EGA), accession number EGA: EGAS00001002921. Ancient DNA data were retrieved from the Allen Ancient DNA Resource (version 54): https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/FFIDCW. Functional annotation and variant interpretation used the RefSeq database (https://www.ncbi.nlm.nih.gov/refseq/) and the ClinVar database (https://www.ncbi.nlm.nih.gov/clinvar/), based on the data release of December 17, 2023. Scripts and pipelines to reproduce the analysis are available through GitHub: https://github.com/MoorjaniLab/LASI-DAD.
