Abstract
Large-scale genomic projects and ancient DNA innovations have ushered in a new paradigm for exploring human evolutionary history. However, the genetic legacy of spatiotemporally diverse ancient Eurasians within Chinese paternal lineages remains unresolved. Here, we report an integrated Y-chromosome genomic database encompassing 15,563 individuals from both modern and ancient Eurasians, including 919 newly reported individuals, to investigate the Chinese paternal genomic diversity. The high-resolution, time-stamped phylogeny reveals multiple diversification events and extensive expansions in the early and middle Neolithic. We identify four major ancient population movements, each associated with technological innovations that have shaped the Chinese paternal landscape. First, the expansion of early East Asians and millet farmers from the Yellow River Basin predominantly carrying O2/D subclades significantly influenced the formation of the Sino-Tibetan people and facilitated the permanent settlement of the Tibetan Plateau. Second, the dispersal of rice farmers from the Yangtze River Valley carrying O1 and certain O2 sublineages reshapes the genetic makeup of southern Han Chinese, as well as the Tai-Kadai, Austronesian, Hmong-Mien, and Austroasiatic people. Third, the Neolithic Siberian Q/C paternal lineages originated and proliferated among hunter-gatherers on the Mongolian Plateau and the Amur River Basin, leaving a significant imprint on the gene pools of northern China. Fourth, the J/G/R paternal lineages derived from western Eurasia, which were initially spread by Yamnaya-related steppe pastoralists, maintain their presence primarily in northwestern China. Overall, our research provides comprehensive genetic evidence elucidating the significant impact of interactions with culturally distinct ancient Eurasians on the patterns of paternal diversity in modern Chinese populations.
Keywords: YanHuang cohort, Y-chromosome phylogeny, evolutionary history, founding lineage
Introduction
Population genomics and human pangenome projects aim to comprehensively document the genetic landscapes of globally diverse populations, elucidate their demographic histories, and uncover the genetic underpinnings of complex traits and diseases (Bergstrom et al. 2020; Byrska-Bishop et al. 2022). East Asia is one of the earliest cradles of civilization and a crossroads for the peopling of Oceania, Siberia, and America, yet its genetic landscape remains poorly characterized in the era of population genomics. China harbors extensive genetic, physical, cultural, and ethnolinguistic diversities, positioning it uniquely for studying the intricate demographic histories of diverse populations, including human divergence, migration, and admixture, and the interplay between genetics and culture (Wang et al. 2021b; Kumar et al. 2022). Numerous studies have sought to bridge the knowledge gap regarding the genetic diversity of Chinese populations by examining their evolutionary histories and the genetics of complex traits and diseases. Recent research utilized genome-wide SNP microarrays to analyze the genomic diversity and population history of various Sino-Tibetan, Mongolic, Tungusic, Turkic, Tai-Kadai, and Hmong-Mien groups (Feng et al. 2017; He et al. 2022; Wang et al. 2022; He et al. 2023b; Sun et al. 2023; Li et al. 2024). Additionally, whole-genome sequencing studies have expanded, with projects such as the Westlake BioBank for Chinese, the NyuWa genome resource, the China Metabolic Analytics Project, and the 10K Chinese People Genomic Diversity Project (10K_CPGDP; Cao et al. 2020; Zhang et al. 2021a; Cong et al. 2022; Cheng et al. 2023; He et al. 2023c). 
These efforts have enhanced our understanding of the genetic diversity, demographic history, and genetic architecture of complex traits and diseases in ethnolinguistically distinct Chinese populations from an autosomal perspective, and they call for further exploration of the fine-scale genetic structure of these populations from uniparental and population-scale perspectives.
The nonrecombining portion of the Y-chromosome has become pivotal in studying human evolutionary history across various time scales (Poznik et al. 2016). Recent advancements in sequencing technologies and computational methods for genome assembly, read mapping, variant calling, and benchmarking have significantly improved the generation of complete Y-chromosome sequences, enriching our understanding of Y-chromosome variations (Olson et al. 2023). These developments have facilitated the construction of a robust phylogenetic tree, with branch lengths indicating mutation counts (Poznik et al. 2016; Zhabagin et al. 2022). Over the past two decades, studies on targeted Y-SNPs have traced ancestral lines through paternal lineages, providing crucial phylogenetic data for research on human origins, migrations, and admixture (Su et al. 1999; Zerjal et al. 2003). Resequencing the entire Y-chromosome region using advanced next-generation sequencing and computational techniques has transformed research paradigms. For instance, Wei et al. identified 6,662 high-confidence variants across 36 diverse Y-chromosome sequences, refining existing Y-chromosome phylogenies (Wei et al. 2013). Similarly, Poznik et al. (2016) analyzed 1,244 complete Y-chromosome genomes from the 1000 Genomes Project (1KGP), uncovering over 65,000 variants and identifying recent expansions within specific paternal lineages. Studies on single populations or specific lineages have also been conducted. The O1a-M119 lineage, which is shared among the Sinitic, Tai-Kadai, and Austronesian groups, and key paternal lineages like C2a-F5484 and Q1a1a-M120 have been examined to trace their origins, diffusion, and contributions to the gene pools of Chinese ethnolinguistically diverse groups (Sun et al. 2019; Wu et al. 2020; Sun et al. 2021). 
However, the availability of large-scale Y-chromosome genomic databases for China remains limited, underscoring the need for more comprehensive databases to explore the paternal genetic landscape and its historical influences on diverse populations.
Recent increases in genomic resources from Chinese populations have highlighted the gap in our understanding of the paternal genetic diversity among ethnic minorities, which lags significantly behind that of Han Chinese and other global populations (Karmin et al. 2022). To address this issue, we launched the 10K_CPGDP by employing anthropologically informed sampling strategies (He et al. 2023c). Additionally, we introduce the YanHuang cohort (YHC) genomic resource that includes new Y-chromosome sequences from ethnolinguistically diverse ethnic minorities and integrates data from the 10K_CPGDP. The YHC aims to provide a high-quality population-specific Y-chromosome database, delineate the fine-scale paternal demographic history of underrepresented groups, construct a high-resolution, time-stamped phylogenetic tree, and develop novel East Asian-specific next-generation sequencing panels covering SNPs, STRs, InDels, and other variants for medical and forensic use. We also developed the “YHSeqY3000”, the highest-resolution Y-specific targeted resequencing panel designed from whole-genome and genome-wide SNP data of Y-chromosomes within the YHC. We genotyped 2,999 panel-related Y-SNPs in 919 males from 57 diverse ethnic minorities who were also genotyped by whole Y-chromosome sequencing. Our efforts culminated in a comprehensive Y-chromosome database encompassing 15,563 individuals from modern and ancient Eurasian backgrounds, allowing us to construct the first fully resolved phylogeny incorporating ancient DNA sequences. This phylogeny helps estimate the coalescence dates of dominant lineages, trace the origins of Chinese paternal lineages, and elucidate the impacts of historical migrations, admixture, and shifts in subsistence strategies on the genetic architecture of these diverse groups.
Results and Discussion
Genetic Diversity of YHC Paternal Lineages Inferred from Y-Chromosome Sequences and the YHSeqY3000 Panel
We performed whole Y-chromosome sequencing on 919 participants from 57 populations of 39 ethnic minorities (Fig. 1a; supplementary table S1, Supplementary Material online), integrated the genetic data of nearly 15,000 modern and ancient Eurasian people (supplementary tables S2 and S3, Supplementary Material online), and developed a high-resolution YHSeqY3000 panel, including Y-SNPs not present in existing phylogenetic databases (ISOGG, Yfull). The predominant paternal lineages identified, namely, C-M130, N-M231, O-M175, and R-M207, demonstrated haplogroup frequencies greater than 5% (supplementary fig. S1 and table S4, Supplementary Material online). Additional sublineages, such as D1-M174 and E1-P147, were also noted among these minorities (supplementary fig. S1 and table S4, Supplementary Material online). For the haplogroup classification, three methods were used, namely, in-house scripts, Y-LineageTracker, and HaploGrouper, to simultaneously infer haplogroups from the YHSeqY3000 panel data. The discrepancies in the classification results highlighted the need for improved accuracy in the haplogroup determination, especially the 40 significant discrepancies involving major subclades like C-M130 and J-M304 based on the Y-LineageTracker classification (supplementary table S4, Supplementary Material online). In contrast, the haplogroup differences obtained based on HaploGrouper were minimal (supplementary table S4, Supplementary Material online). The analysis revealed 564 distinct paternal lineages, with 384 subhaplogroups observed only once (supplementary fig. S2 and table S4, Supplementary Material online). This underpins the necessity for a continuous refinement of Y-chromosome phylogenetic trees to accommodate newly identified Y-SNPs and update the haplogroup classification tool (Chen et al. 2021; Jagadeesan et al. 2021). The upcoming version of the YHC phylogenetic topology aims to address these gaps. 
Overall, the resolution and coverage of the YHSeqY3000 panel confirmed by the minimal differences in the haplogroup classification compared to ∼10 Mb Y-chromosome sequences establish it as the most refined system to date for high-resolution paternal lineage analysis in Chinese populations (supplementary table S4, Supplementary Material online). This system exceeds the capabilities of previous methods, ensuring a more precise haplogroup classification at a finer scale (Wang et al. 2019; He et al. 2023a).
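The comparison of haplogroup calls across classification tools described above can be illustrated with a minimal sketch. It is not the authors' in-house script. It assumes that each call is a non-empty ISOGG-style name, optionally followed by a marker (for example `C2a1a1b1-F1756`), and that two calls are compatible when one name is a prefix of the other. The function names, the four outcome labels, and the toy sample calls are all hypothetical.

```python
from collections import Counter


def strip_marker(call: str) -> str:
    """Drop a trailing marker name, e.g. 'C2a1a1b1-F1756' -> 'C2a1a1b1'."""
    return call.split("-", 1)[0]


def compare_calls(call_a: str, call_b: str) -> str:
    """Classify the agreement between two haplogroup calls for one sample.

    Assumption (not from the paper): names are non-empty and follow the
    nested letter/number convention, so a shared prefix means one call is
    simply a coarser version of the other.
    """
    a, b = strip_marker(call_a), strip_marker(call_b)
    if a == b:
        return "identical"
    if a.startswith(b) or b.startswith(a):
        return "nested"  # same lineage, different resolution
    if a[0] == b[0]:
        return "same_major_clade"  # e.g. R1a vs R1b
    return "major_discrepancy"  # e.g. J vs C


def summarize_concordance(calls_a: dict, calls_b: dict) -> Counter:
    """Count agreement categories over the samples present in both call sets."""
    shared = calls_a.keys() & calls_b.keys()
    return Counter(compare_calls(calls_a[s], calls_b[s]) for s in shared)


# Toy data, invented for illustration only.
panel_calls = {"s1": "C2a1a1b1-F1756", "s2": "O2a2b1", "s3": "J2a-M410", "s4": "R1a1a"}
sequence_calls = {"s1": "C2a1a1b1", "s2": "O2a", "s3": "C2b-F1067", "s4": "R1b-M343"}
print(summarize_concordance(panel_calls, sequence_calls))
```

Treating nested calls separately from true conflicts keeps differences in resolution between a targeted panel and whole Y-chromosome sequences from being counted as classification errors.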
Genetic Connections and Population Stratification among Modern and Ancient Eurasians
We explored the population differentiation among spatiotemporally diverse Eurasian populations based on the clustering patterns identified via the principal component analysis (PCA), multidimensional scaling analysis (MDS), and other population genetic analyses. The PCA distinctly separated ancient western Eurasians from East Asians, with each group exhibiting unique patterns of dominant paternal lineages and clustering branches on the phylogenetic tree (supplementary fig. S3a to j, Supplementary Material online). Modern population clustering aligns with their geographic and linguistic attributes, showing a clear separation among most Austronesian and Tibeto-Burman groups, while other populations demonstrate a considerable overlap in their clustering positions (supplementary fig. S4, Supplementary Material online). Iron Age (IA) Hanben individuals show close genetic ties with Austronesian groups, and northern Chinese individuals are closely aligned with Sino-Tibetan groups. Notably, there is a marked stratification between northern and southern East Asians, with further substructures among linguistically similar, but geographically distinct groups (supplementary figs. S5 to S7, Supplementary Material online). For instance, IA Hanben populations align closely with modern Han populations from Guangxi and Taiwan, whereas Yellow River Basin farmers form distinct clusters from other Han groups (supplementary fig. S7a, Supplementary Material online). Diverse Tibeto-Burman groups exhibited genetic distinctions between their northern and southern divisions (supplementary fig. S7b, Supplementary Material online). A significant differentiation was also evident among the Transeurasian-speaking groups, with the Koreanic and Japonic groups forming separate clades, the Mongolic and some Tungusic groups clustering together, and the Turkic groups sharing close affinity with certain Tungusic populations (supplementary fig. S5, Supplementary Material online). 
In South China and Southeast Asia (SEA), fine-scale clustering among the Austroasiatic, Austronesian, Hmong-Mien, and Tai-Kadai groups suggests an extensive gene flow, as evidenced by their overlapping genetic patterns (supplementary fig. S6, Supplementary Material online). Phylogenetic relationships and haplogroup frequency spectra highlighted genetic disparities between northern and southern Han groups and between northern and southern Tibeto-Burman speakers, while the gene flow was apparent between geographically proximate groups, such as between Austronesian and southern Han populations and between Transeurasian and northern Han populations (supplementary fig. S8, Supplementary Material online). This comprehensive analysis elucidates the complex genetic landscape and interactions among Eurasian populations.
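To make the link between haplogroup frequency spectra and population differentiation concrete, the sketch below computes a simple Nei-style pairwise Fst from haplogroup assignments. The estimator used in the study is not specified in this section, so this is an illustrative stand-in rather than the authors' method. It weights both populations equally, applies no sample-size correction, and expects non-empty populations. The function names are hypothetical.

```python
from collections import Counter


def frequencies(haplogroups):
    """Return the haplogroup frequency spectrum of one population."""
    counts = Counter(haplogroups)
    total = sum(counts.values())
    return {hg: n / total for hg, n in counts.items()}


def gene_diversity(freqs):
    """Haplogroup diversity: 1 minus the sum of squared frequencies."""
    return 1.0 - sum(p * p for p in freqs.values())


def pairwise_fst(pop_a, pop_b):
    """Nei-style Fst = (H_T - H_S) / H_T for two populations.

    Assumptions: equal population weights and no sample-size correction.
    Returns 0.0 when both populations are fixed for the same haplogroup.
    """
    fa, fb = frequencies(pop_a), frequencies(pop_b)
    pooled = {
        hg: (fa.get(hg, 0.0) + fb.get(hg, 0.0)) / 2
        for hg in fa.keys() | fb.keys()
    }
    h_s = (gene_diversity(fa) + gene_diversity(fb)) / 2  # mean within-population diversity
    h_t = gene_diversity(pooled)  # diversity of the pooled spectrum
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t
```

A matrix of such pairwise values over all populations is the kind of input that MDS and neighbor-joining analyses typically take.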
We grouped populations by linguistic and ethnic traits to investigate genetic affinities within language- or ethnicity-based metapopulations (supplementary fig. S7c to h, Supplementary Material online). Geographically close populations, including the Austronesian-speaking Saisiyat, Thao, Taroko, Atayal, and Tsou from Taiwan Province, clustered distinctly, separating early from other reference groups (supplementary fig. S7c, Supplementary Material online). Distinct branches primarily comprised Tai-Kadai, nearby Austronesian groups like Ede and Giarai, and southern Tibeto-Burman speakers, such as Sila and Lolo. The genetic closeness between the Austronesian-related and Tai-Kadai-dominant clusters supports the hypothesis of a shared origin for Austronesian and Tai-Kadai speakers, as demonstrated by phylogenetic analyses based on neighbor-joining methods and clustering inferred from the haplogroup frequency spectra, PCA, and MDS (supplementary fig. S7d to f, Supplementary Material online). These analyses also revealed fine-scale genetic differences between Han Chinese and Tibeto-Burman populations and among linguistically diverse groups, underscoring frequent massive population movements and gene flow events in historical contexts. To determine whether paternal lineages corroborate current language family classifications and further explore genetic relationships within linguistically defined metapopulations, we merged all groups based on linguistic affinities for a comprehensive population genetic analysis (supplementary fig. S7g and h, Supplementary Material online). Notably, a close genetic clustering between the Tai-Kadai and Austroasiatic groups and between the Mongolic/Tungusic groups and the Amur River Basin ancient populations was observed. 
The neighbor-joining tree also indicated close genetic relationships between the Turkic and ancient Xinjiang populations, between the Koreanic and Japonic populations, and between the Austronesian and ancient Hanben populations (supplementary fig. S7h, Supplementary Material online). This study provides robust paternal genetic evidence supporting complex admixture and interactions among modern Chinese populations and ancient Eurasians. However, caution is advised regarding potential biases from low-coverage sampling and the simplistic grouping of linguistically similar, yet geographically disparate populations.
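The merging of populations into language-based metapopulations can be sketched as a pooling step followed by a frequency calculation. The snippet below is a hedged illustration: the `pool_by_group` helper, the population-to-family mapping, and the haplogroup assignments are invented toy inputs, not data from the study.

```python
from collections import Counter, defaultdict


def pool_by_group(samples, group_of):
    """Pool (population, haplogroup) records into metapopulations.

    samples  : iterable of (population, haplogroup) pairs
    group_of : mapping from population name to metapopulation label
    Returns {metapopulation: {haplogroup: frequency}}.
    """
    pooled = defaultdict(Counter)
    for population, haplogroup in samples:
        pooled[group_of[population]][haplogroup] += 1
    return {
        group: {hg: n / sum(counts.values()) for hg, n in counts.items()}
        for group, counts in pooled.items()
    }


# Toy records, invented for illustration only.
samples = [
    ("Atayal", "O1a"), ("Tsou", "O1a"), ("Ede", "O1a"), ("Ede", "O2a"),
    ("Yi", "N1b"), ("Yi", "D1a"),
]
group_of = {
    "Atayal": "Austronesian", "Tsou": "Austronesian",
    "Ede": "Austronesian", "Yi": "Tibeto-Burman",
}
spectra = pool_by_group(samples, group_of)
```

Pooling raw counts gives larger populations more weight in the metapopulation spectrum, which is one reason the caution about simplistic grouping noted above applies to such summaries.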
Complex Population Migration and Admixture Events Inferred from the Y-Chromosome Diversity Landscape
The observed paternal genetic structure indicated that multiple complex ancient migration and admixture events significantly shaped the gene pool of Chinese populations. A time-stamped phylogenetic tree revealed multiple lineage diversifications after the last glacial maximum (20 kya), with these lineages dispersing at varying times (Fig. 1b; supplementary fig. S1, Supplementary Material online). Analysis using a maximum likelihood (ML) tree incorporating ancient DNA sequences revealed diverse founding populations contributing to the Chinese paternal gene pool that likely originated from ancient migrations of descendants from indigenous rice or millet farmers, Siberian hunter-gatherers, or western Eurasian steppe pastoralists (Fig. 2; supplementary figs. S9 to S12, Supplementary Material online). The extent to which ancestral sources affected the paternal genetic makeup of Chinese ethnic minorities was systematically investigated, along with the geographical spread of identified lineages and their associations with expansions related to ancient farmers, hunter-gatherers, and pastoralists. Additionally, to determine the origins and distribution patterns of dominant paternal lineages in China, the participants were grouped into geographically defined metapopulations, and general geographical distribution patterns were estimated (Figs. 3 to 5; supplementary figs. S17 and S18, Supplementary Material online). Finally, we systematically assessed how ancient technological innovations and human migration events have influenced the paternal genetic landscape of Chinese populations, revealing a complex interplay of genetic inputs from various ancient populations.
Gene Flow from Ancient Pastoralists and Barley Farmers in West Eurasia and Central/South Asia to East Asia
Prehistoric and historical cultural exchanges along the southern Bactria–Margiana Archaeological Complex oasis farming route, Inner Asian Mountain Corridor, and northern Yamnaya/Afanasievo steppe pastoralist migration routes have significantly shaped the autosomal gene pool of ancient populations in the Altai Mountains and surrounding areas of northwestern and northern East Asia (Zhang et al. 2021b). Haplogroups J/G/R and their major sublineages, which are prevalent among ancient western Eurasians, exhibit the highest frequencies in Northwest China (Figs. 2 and 3a and b; supplementary fig. S9a, Supplementary Material online). Specifically, most J haplogroup carriers in China belong to the J2-M172 sublineage, particularly J2a-M410. The origins of J2a in ancient populations can likely be traced back to the northern Fertile Crescent, and its current distribution primarily reflects expansions and admixture events related to ancient barley farmers (Figs. 2 and 3a). Similarly, individuals carrying G-M201 in Northwest China were predominantly classified under sublineage G2a (Fig. 3b). An optimized hot spot analysis revealed diffusion centers for J2a and G2a in the Xinjiang and Gansu–Qinghai regions, suggesting a correlation with these areas (supplementary fig. S17a, Supplementary Material online). Generally, the introduction of J/G-derived lineages into China is attributed to the eastward migration of barley farmer-related ancestral populations likely facilitated by gene flow events along the ancient Silk Road (Zhabagin et al. 2022; He et al. 2023c).
R-M207 is predominantly found among ancient western Eurasians and modern populations in North China, particularly in Northwest China (Figs. 2 and 3a and b). The basal haplogroup R was identified in a ∼24,000-year-old individual from the Mal’ta site near Lake Baikal in Siberia (Raghavan et al. 2014). In China, approximately 90% of R carriers are categorized as R1-M173, which bifurcates into R1a-L146 and R1b-M343 approximately 23 kya. The frequency of R1a-L146 notably exceeded that of R1b-M343 (Figs. 1b and 3b; supplementary fig. S1, Supplementary Material online). Furthermore, all individuals within R1a were classified into R1a1a sublineages, with R1a1a1b diverging approximately 5 kya and being the most prevalent (Fig. 1b; supplementary fig. S1 and table S5, Supplementary Material online). The spatiotemporal distribution of R1 subclades is closely linked to the movements of ancient steppe pastoralists, underscoring a significant genetic flow into China (Figs. 2 and 3a). Conversely, R2-M479 appears in East Asia at low frequencies (supplementary fig. S17a, Supplementary Material online) and is primarily concentrated in Central/South Asia, having recently extended from South Asia to North China via the northern route. Analysis combining ancient and modern population phylogenies revealed that samples from Mongolia with substantial West Eurasian ancestry, such as Mongolia_EIA_Sagly_4 and Mongolia_LBA_MongunTaiga_3, fall within the R1a1a sublineage. Nearly half of the ancient Xinjiang individuals are categorized within sublineages R1a or R1b, reflecting the historical impact of the Yamnaya/Afanasievo-related pastoralists on the genetic makeup of the northwestern Chinese populations (Figs. 2 and 3a). Additionally, the sporadic presence of other rare haplogroups like H-L901 and I-M170 in China suggests a broad and recent gene flow from Central/South Asian and West Eurasian ancestors into the region.
To confirm that migrations related to pastoralist populations have reshaped the distribution of western Eurasian-related lineages in Chinese populations, we estimated the correlation between haplogroup frequencies and both geographical (longitude and latitude) and genetic features (PC1-2, haplogroup frequency, Fst matrix, and autosomal-based admixture proportions). The frequencies of R-related sublineages correlate with latitude, and these sublineages reach high frequencies in modern northwestern Chinese populations (supplementary fig. S14a and b, Supplementary Material online). Furthermore, the distribution patterns of R and its sublineages were significantly correlated (supplementary fig. S14c, Supplementary Material online). To elucidate the direct genetic contributions from ancient sources to modern Chinese populations, we constructed a six-source admixture model, revealing a gradual decrease in ancestral proportions from their archeologically confirmed origins or earliest emergence areas in China (Fig. 4b). If ancient migration events directly influence the lineage frequency patterns in Chinese populations, a strong positive correlation would be expected between the proportion of autosomal-based admixture from presumed ancestral sources and the frequency of founding lineages. Intriguingly, a significant correlation was observed between the Afanasievo-related ancestral proportions and the haplogroup frequencies of multiple H, J, and R sublineages (Fig. 5a and g). These findings, derived from the haplogroup frequency spectra of modern and ancient Eurasians, phylogeographic origin inferences, and multiple factor correlations, suggest that migrations of western Eurasian barley farmer- and pastoralist-related populations likely facilitated the development of these Chinese founding lineages.
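The expectation stated above, that lineage frequency should rise with the autosomal ancestry proportion from the presumed source, can be checked with a rank correlation across populations. The section does not say which correlation coefficient was used, so the Spearman coefficient below is an assumption chosen because it does not require a linear relationship. The helper names are hypothetical, and the sketch reports only the coefficient, not a P value.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

In use, `x` would hold one value per population for a lineage frequency (for example R1a) and `y` the matching ancestry proportions from the admixture model, in the same population order.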
Siberian Hunter-Gatherer-Dominant Paternal Lineages Are Widely Distributed in China
Ancient DNA studies have identified an ancestral component, termed Ancient Northeast Asian (ANA) ancestry, related to Neolithic hunter-gatherers from the Russian Far East, Mongolian Plateau, and Baikal region (Jeong et al. 2020; Fig. 4a and c). This ANA-related ancestry has contributed variably to distinct ancient populations in these regions, which are characterized by high proportions of C/N/Q/R sublineages (Figs. 2 and 3a). The frequencies of the C2/N1/R1 sublineages were significantly positively correlated with the ANA-related ancestry (P < 0.05, Fig. 5b and g). The haplogroup Q-M242 appears in China at very low frequencies (<3%, supplementary table S5, Supplementary Material online) and displays varied distribution patterns between North and South China (Fig. 3c; supplementary table S5, Supplementary Material online). This lineage, which might have originated in Central Asia and southern Siberia approximately 31 kya (Fig. 1b), includes the Q1a1a-M120 subclade. This subclade, unique to East Asians, is relatively prevalent among Han Chinese individuals (∼81% of all Q lineages, supplementary table S5, Supplementary Material online) and likely underwent a local expansion in Northwest China between 5 and 3 kya (Sun et al. 2019). Furthermore, the Q1a1a1-F1626 subclade, a derivative of Q1a1a-M120, diversified approximately 4.3 kya (Fig. 1b). The ML phylogenetic topology indicated that ancient Mongolian individuals with minimal West Eurasian-related ancestry (<20%) belonged to Q1a1a or its sublineages (Figs. 2 and 3a). Venn diagrams illustrating shared ancestry-correlated lineages also show that the Q and R lineages are common among the Yamnaya and ANA-associated lineages (supplementary fig. S15, Supplementary Material online). 
Moreover, ancient individuals from the middle Neolithic (MN) Yangshao culture and approximately 3,000-year-old Hengbei residents from Shanxi, who carried the Q1a1a-M120 lineage, indicate that this haplogroup influenced the Han Chinese gene pool at least 6 kya. Q1b-M346, although rare in China, is concentrated at the intersection of Siberia and North China (supplementary fig. S17b and table S5, Supplementary Material online), with some Bronze Age (BA) and IA individuals from the Mongolian Plateau and Xinjiang regions genotyped for Q1b or its subclades (Figs. 2 and 3a).
Haplogroup N-M231, particularly its subclade N1-CTS3750, is prevalent among Chinese populations, diverging into N1a-F1206 and N1b-F2930 around 19 kya (Fig. 1b; supplementary fig. S1, Supplementary Material online). Y-chromosome analyses of ancient individuals from the West Liao River Basin, dating from 6,500 to 2,700 BP, indicated that N-M231 was the dominant paternal lineage in Northeast China during the Neolithic, with its frequency gradually declining over time (Cui et al. 2013). The frequencies of the N1a-F1206 and N1b-F2930 subclades were high in North China and low-altitude Southwest China, respectively (supplementary fig. S17b, Supplementary Material online). These findings suggest a north‒south differentiation of N1 subclades in North China, with N1a-F1206 migrating northward beyond East Asia and N1b-F2930 moving southward to become a major paternal lineage among Tibeto-Burman groups, notably the Yi people (supplementary table S5, Supplementary Material online). N1a1-M46/Tat, a dominant subclade, likely originated in Northeast Asia. An individual from the Houtaomuga site, dated to 7 kya, carried N1a1a1a1a-M2117, which was genetically linked to early Neolithic (EN) Amur River Basin individuals (Ning et al. 2020a). Further analysis revealed that the IA Xinjiang individuals, BA West Liao River individuals, and several southern Siberian ancient individuals belonged to N1a or its sublineages (Fig. 2), which correlated significantly with ANA-related ancestral components (Fig. 5b). N1a2-F1008/L666 comprised approximately 67% of the N1a sublineages and bifurcated into N1a2a and N1a2b approximately 9.5 kya; N1a2a1, which made up the largest proportion of N1a2a (∼82%), diversified approximately 4.4 kya, and N1a2b diverged approximately 4.0 kya (Fig. 1b; supplementary fig. S1, Supplementary Material online). Ancient DNA data revealed that EN Shamanka individuals from Cis-Baikal and several southern Siberian ancients belonged to N1a2a (Fig. 2). 
Early diffusion centers for N1a2 were identified in North China and the southeastern part of Northeast China (supplementary fig. S17b, Supplementary Material online). N1b-F2930 is primarily found in Tibeto-Burman-speaking populations in low-altitude Southwest China (∼24%) and less frequently in other Chinese populations (supplementary table S5, Supplementary Material online). Notably, ancient East Asians belonging to sublineage N1b, specifically N1b2 or its derivatives, are mainly distributed on the Tibetan Plateau (Fig. 2). To better understand the phylogeographic origins of N-M231 and its N1a/N1b subclades and the factors influencing their distribution patterns, further collection and whole Y-chromosome sequencing of spatiotemporally distinct ancient and modern Eurasians belonging to N sublineages are essential.
Haplogroup C-M130, one of the primary paternal lineages in East Asia and likely carried by early settlers, diverged approximately 50 kya (Fig. 1b). Its subclade C2-M217, which is particularly widespread in North China, exhibited a notable frequency across multiple regions (Figs. 2 and 3a and c; supplementary fig. S17b, Supplementary Material online). The earliest known individual carrying C2-M217, designated as AR19K, dates from 19,587 to 19,175 cal BP in the Amur River Basin (Fig. 2). Distinct patterns are observed for the C2a-L1373 and C2b-F1067 subclades. C2a-L1373, sometimes referred to as the “northern branch,” shows the highest frequencies in Inner Mongolia, whereas C2b-F1067, the “southern branch,” is most prevalent in Central, North, and Northeast China (Fig. 3c). The C2a subclade, particularly C2a1a, is predominant among Transeurasian-speaking populations, with C2a1a1b1-F1756, C2a1a2a-M86, and C2a1a3a-F3796 identified as major subclades within China (Fig. 2; supplementary table S5, Supplementary Material online). BEAST-based phylogenetic analysis revealed that C2a1a1b1 diversified into C2a1a1b1a and C2a1a1b1b approximately 5.4 kya (Fig. 1b; supplementary fig. S1, Supplementary Material online), which are widely found in the northern Han, Mongolic and Tungusic people (supplementary table S5, Supplementary Material online). Historical dispersal of these subclades is evidenced by their presence in BA West Liao River, IA Amur River Basin, and several BA to Historical Era (HE) individuals from the Mongolian Plateau (Fig. 2), suggesting links to early expansions of Mongolic/Tungusic ancestors. Furthermore, C2a1a2a sublineages are common in Transeurasian groups across East Asia and North Asia (Fig. 2; supplementary table S5, Supplementary Material online). 
A Mesolithic Amur River Basin individual (AR11K), an ANA-representative Boisman_MN, and two HE Mongolian Plateau individuals carried C2a1a2 or C2a1a2a (Fig. 2), indicating that migration from the Amur River Basin to the Mongolian Plateau contributed to the genetic makeup of the current Transeurasians, particularly Mongolic/Turkic speakers (supplementary table S5, Supplementary Material online). C2a1a3a-F3796, also known as the C2*-Star Cluster, diverged from C2a1a3 approximately 3.7 kya, predating previous estimates (Wei et al. 2018). This may be due to sampling bias and differences in TMRCA estimation methods based on Y-STRs and Y-chromosome sequences. This sublineage is foundational among Mongolic-speaking populations. One Neolithic Amur River Basin individual, one MN Boisman individual, several HE Mongolian Plateau ancients, and IA Xinjiang samples are classified under the C2a1a3 sublineages (Fig. 2). Additionally, C2a sublineages are also identified in central and southern Chinese populations (supplementary table S5, Supplementary Material online), suggesting their southward migration from North Asia, likely driven by the expansion of the Mongol Empire.
The phylogenetic analysis of C2b-F1067 indicated that ancient populations carrying its sublineages significantly enriched the gene pool of modern eastern Eurasians. Observations suggest that Inner Mongolia and Northeast China were likely initial dispersal centers for C2b, exhibiting distinct geographical distribution patterns (supplementary fig. S17b, Supplementary Material online). For instance, C2b1a1-CTS2657 is found at high frequencies in North and Northeast China, while C2b1a2-F3880 predominates in Northeast and North China, as well as in East China, notably in Shandong, Jiangsu, and Shanghai. Conversely, C2b1b-F845 had the highest frequencies in Central China, Southwest China (mainly Guizhou), and SEA (supplementary fig. S17b, Supplementary Material online). The distribution patterns identified for the C2b sublineages, which partly diverge from previous studies (Wu et al. 2020), may result from sampling biases and differences in reference populations. Our analysis confirmed the southern origin of C2b1b-F845 (supplementary fig. S17b, Supplementary Material online) and identified two ancient individuals from Shigatse on the Tibetan Plateau with C2b1 mutations, one late Neolithic (LN) Shimao individual belonging to C2b1a2b1, and only two Neolithic Yellow River Basin farmers and one HE Tibetan Plateau individual associated with C2b1b sublineages (Fig. 2). To comprehensively explore the phylogeographic origin and dispersal of C2b1 sublineages, further analysis of spatiotemporally diverse ancient southern East Asians (ASEA), particularly from low-altitude regions, is needed. Statistically significant negative correlations between pairwise Fst genetic distances and the frequency of western Eurasian/Siberian-related lineages underscore their contribution to the genetic differentiation between northern and southern East Asians (supplementary fig. S14b, Supplementary Material online). 
Overall, genetic analyses incorporating the haplogroup frequency spectra of modern and ancient East Asians revealed a robust genetic connection between the descendants of Neolithic southern Siberian hunter-gatherers and modern East Asians. The geographical distribution patterns and TMRCA estimates of C2a1a/C2b1a/N1a1/Q1a1a-derived sublineages support the hypothesis that ancient migrations of West Liao River millet farmers have shaped the current distribution patterns in Chinese populations, particularly among Transeurasian speakers. These results are consistent with earlier conclusions triangulated from linguistic, archaeological, and genetic evidence (Robbeets et al. 2021).
Traces of the Early Asian and Ancient Northern East Asian Millet Farmer-Related Lineages in China
The ancient genetic connections among Andamanese, Jomon-related indigenous Japanese, and highland Tibetans are evidenced by shared Paleolithic ancestral components and the uniparental D lineage. Analysis of the phylogeographic origins of D subclades revealed that D1-M174, a major paternal haplogroup in East Asians, is prevalent in our YHC (supplementary fig. S16, Supplementary Material online). Haplogroup D1a, which is particularly frequent in the Tibetan Plateau, is predominantly subdivided into D1a1a-M15 and D1a1b-P99, with these divisions occurring approximately 46 kya (Fig. 1b; supplementary fig. S1, Supplementary Material online). D1a1a sublineages are commonly found (>54%) among Tibeto-Burman-speaking populations in Southwest China and are less frequent in other Chinese populations, while D1a1b sublineages are most prevalent on the Tibetan Plateau (>36%, supplementary table S5, Supplementary Material online). D1a1a sublineages are frequently found in the Mongolian and Tibetan plateaus and Yellow River Basin ancients, and D1a1b sublineages are mainly found in the Tibetan Plateau ancients (Fig. 2). The distribution patterns of these sublineages in both modern and ancient East Asians provide direct evidence of their migration paths: D1a1a-M15 likely migrated northward through western Sichuan to the Gansu–Qinghai region and possibly into the Himalayan area along the Tibetan-Yi corridor; D1a1b-P99, particularly its subclade D1a1b1-P47, originated on the Tibetan Plateau. These D1a sublineages are predominantly found in Tibetan populations, supported by genetic contributions from northern Chinese millet farmers via a revised Y-chromosome phylogeny and correlations with O2 sublineages and Lubrak-related Tibetan Plateau ancestry (Fig. 4d; supplementary fig. S14c, Supplementary Material online). Gene flow events and the presence of Lubrak-related D sublineages significantly influenced the genetic diversity patterns. 
Notably, the frequencies of four lineages (O2a2b1, O2a2b1a, O2a2b1a1, and O2a2b1a1a) strongly correlate with the Lubrak-related ancestry, confirming that Neolithic expansions from the Yellow River Basin contributed to the peopling of the Tibetan Plateau (Fig. 5c). Ancient DNA evidence from autosomal variations and maternal lineages further underscores the substantial impact of Neolithic millet farmers on the permanent settlement of the Tibetan Plateau (Wang et al. 2023).
Archaeological evidence indicates that millet-based agriculture independently emerged in the Yellow River Basin and West Liao River at approximately 6,000 BCE, fostering the development of foxtail (Setaria italica)-prevalent Yangshao and broomcorn (Panicum miliaceum)-prevalent Xinglongwa cultures, respectively (Miller et al. 2016; Leipe et al. 2019). Leipe et al. noted that shifts in agricultural practices from approximately 6,000 to 2,000 BCE led to quasi-exponential population growth in North China, aligning with the major dispersal of Sino-Tibetan-speaking populations from the Yellow River Basin during the fourth millennium BCE (Leipe et al. 2019). Ancient DNA analyses of millet farmers from the Yangshao and Longshan cultures suggested that the Sino-Tibetan people originated in North China (Ning et al. 2020b). The Haojiatai-related ancestry dominant in Chinese populations correlated strongly with the O/Q/C/N lineages (Figs. 4e and 5d). O-M175, which is prevalent in East and Southeast Asians, includes the significant O1-F265 and O2-M122 subclades, whose expansions are associated with the spread of millet and rice agriculture from domestication centers in the Yellow River Basin, West Liao River, and Yangtze River Basin (Fig. 5d to f). The influence of Ancient Northern East Asians (ANEA) on modern East Asian paternal genetic diversity requires further comprehensive assessment. O-related sublineages, with O2 lineages diversifying approximately 29 kya (Fig. 1b), are broadly distributed in North China and the Tibetan Plateau (Figs. 2 and 3a). O2-M122, particularly subclade O2a-M324, is a major paternal lineage in East and Southeast Asians, showing a strong correlation in distribution patterns (Fig. 3d; supplementary figs. S17 and S18c, Supplementary Material online). O2a-M324 is found at high frequencies along China's coast and surrounding areas (>52%), suggesting ancestral migration routes along the coast extending into SEA (supplementary fig. S17b, Supplementary Material online). An ancient individual from the MN West Liao River Hongshan culture identified as belonging to O2a-M324 supports this lineage's association with early cultural developments in Northeast China. Systematic evidence further corroborated that O2a-M324 originated in Northeast China, particularly in Heilongjiang Province, where it remains highly prevalent (supplementary fig. S17b, Supplementary Material online). However, the high frequencies also observed in eastern coastal provinces like Shandong, Shanghai, Fujian, and Guangdong may reflect sampling biases and historical migrations, notably during the Chuangguandong movement. Additionally, the optimized hot spot analysis results suggest that the middle and lower reaches of the Yellow River Basin were early diffusion centers for O2a-M324 (supplementary fig. S17b, Supplementary Material online).
Distinct distribution patterns were observed for the O2a1-L127.1 and O2a2-JST021354/P201 sublineages (supplementary fig. S17, Supplementary Material online). O2a1 is most prevalent in Southeast China, with its frequency decreasing in adjacent regions (supplementary fig. S17c, Supplementary Material online). Most of the O2a1 subclades show similar distribution patterns (supplementary fig. S17c to d, Supplementary Material online). O2a2 has the highest frequency in the Tibetan Plateau, Southeast China, and SEA (supplementary fig. S17f, Supplementary Material online). The sublineage O2a2a-M188 is notably frequent in SEA, decreasing in frequency from south to north across East Asia (supplementary fig. S17f, Supplementary Material online); O2a2b-P164 is widespread in China, with the highest occurrence on the Tibetan Plateau (supplementary fig. S17h, Supplementary Material online). The majority of O2a1 individuals (∼87%) are O2a1b-JST002611, which is widespread across Chinese populations, particularly among Han populations (supplementary fig. S17d and table S5, Supplementary Material online). However, O2a1b and its sublineages appear infrequently among Tibeto-Burman groups, suggesting a minimal impact on these populations. The initial diffusion centers for O2a1b sublineages are identified in the middle and lower reaches of the Yellow River Basin (supplementary fig. S17d, Supplementary Material online). Two main sublineages, O2a1b1a1a1a-F11 and O2a1b1a2a-F238, are found with differing frequencies; O2a1b1a1a1a-F11 is more common, especially in diverse Han populations (supplementary table S5, Supplementary Material online). O2a1b1a1a1a expanded approximately 8.9 kya, and O2a1b1a2a diverged approximately 9.0 kya (Fig. 1b; supplementary fig. S1, Supplementary Material online), both of which considerably predate earlier TMRCA estimates. 
This discrepancy highlights differences between the Y-STR and Y-SNP-based TMRCA estimations and the influence of the Y-chromosome sequence coverage. O2a1b1a1a, the upstream lineage of O2a1b1a1a1a, appears most frequently in Southeast China and Guizhou in Southwest China, and its initial diffusion center is likely to be the middle and lower portions of the Yellow River Basin (supplementary fig. S17e, Supplementary Material online). O2a1b1a1a1a-F11 was identified in a Banpo site sample (Zhang et al. 2018), linking its emergence to Yangshao millet farmers. Furthermore, historical individuals from Shigatse on the southern Tibetan Plateau and Mongolian Plateau also carried this sublineage (Fig. 2), indicating the significant influence of Neolithic millet farmers. The O2a1b lineage was also detected in the 500-year-old GaoHuaHua (Fig. 2), establishing a connection with Yangshao millet and ASEA rice farmers (Miao et al. 2021). Linguistic evidence points to the initial divergence of Sino-Tibetan languages during the Yangshao period, with their dispersal likely occurring in the upper Yellow River Basin (Zhang et al. 2019). The estimated expansion of O2a1b1a1a1a and the divergence of Sino-Tibetan languages, in addition to paleogenomic evidence, suggest significant genetic contributions from ANEA millet farmers to modern Sino-Tibetan groups in China. Notably, the diffusion center for Sino-Tibetan-related ancestors with O2a1b1a1a1a does not align with the dispersal center of Sino-Tibetan languages, highlighting potential discrepancies due to real differences, sampling bias, or limitations in computational biology algorithms. Thus, further extensive sampling of modern and ancient East Asians is recommended to refine these findings.
High frequencies of most O2a2a subclades are observed in South China and SEA (supplementary fig. S17f and g, Supplementary Material online). Among these sublineages, the O2a2a1a2-M7 sublineages constitute the largest proportion (∼43.8%, supplementary table S5, Supplementary Material online) and are primarily found in the Hmong-Mien people and southern Han Chinese (supplementary table S5, Supplementary Material online). Only one IA Hanben individual from Taiwan Island was identified within O2a2a1a2a2 (Fig. 2). A recent rapid expansion of O2a2a1a2a1a1a2a1a1a1 around 2.9 kya was noted (Fig. 1b). Moreover, the O2a2b sublineages are widely distributed across China (supplementary fig. S17h and table S5, Supplementary Material online). O2a2b1-M134, a major subclade of O2a2b, appears predominantly among Sino-Tibetan speakers (∼85%), with the highest occurrence in the Tibetan Plateau (supplementary fig. S17h and table S5, Supplementary Material online). Two star-like expansions have been linked to O2a2b1a1a1-F8 (∼7.3 kya) and O2a2b1a2a1a-F46 (∼9 kya) (Fig. 1b; supplementary fig. S1, Supplementary Material online). The upstream lineage of O2a2b1a1a1, O2a2b1a1a, is prevalent in Southwest/Southeast China and the Circum-Bohai-Sea region (supplementary fig. S17h, Supplementary Material online). The frequencies of O2a2b1a2 and the upstream lineage of O2a2b1a2a1a are greater in Northeast, North, and East China than in other areas (supplementary fig. S17h, Supplementary Material online). The optimized hot spot analysis suggests that the early diffusion center for O2a2b1a2 is likely the Circum-Bohai-Sea region (supplementary fig. S17h, Supplementary Material online). Several LN to IA ANEA millet farmers, HE Mongolian Plateau ancients, IA to HE Xinjiang individuals, and multiple IA to HE Tibetan Plateau individuals, particularly those in the southern Tibetan Plateau, are assigned to the sublineages of O2a2b1a1. 
Additionally, some ancient individuals from the Yellow River Basin and Northeast/Southeast Tibetan Plateau are linked to O2a2b1a2a1a or its sublineages (Fig. 2). Star-like expansions noted in O2a1b1a1a1a-F11 (∼8.9 kya), O2a2b1a1a1-F8 (∼7.3 kya), and O2a2b1a2a1a-F46 (∼9 kya) represent approximately 27% of the newly reported paternal lineages and 31% of the paternal lineages in China (supplementary fig. S1 and tables S4 and S5, Supplementary Material online), highlighting significant contributions from the Neolithic expansions of ANEA millet farmers to modern Chinese gene pools. Consequently, the development of millet agriculture, migration of ancient millet farmers, and admixture with diverse indigenous populations have shaped the present distribution of the O2a-M324 sublineages, particularly O2a1b-JST002611 and O2a2b1-M134.
ASEA Rice Farmer-Related Founding Lineages from Yangtze River Basin Left a Massive Genetic Legacy in China and SEA
Southern East Asia, an origin center for rice domestication, is considered the ancestral homeland of the Hmong-Mien, Tai-Kadai, Austroasiatic, and Austronesian people. ADMIXTURE models suggested that Hmong/Hanben-related ancestral components prevalent in southern Chinese populations are associated with most O1 subclades (Figs. 4f and g and 5e). Recent studies have shown that ancient Yangtze River Basin rice farmers influenced the genetic makeup of ancient Yellow River Basin millet farmers and populations in SEA and Oceania (Yang et al. 2020; Wang et al. 2021a). The exploration of the phylogeographic features of the O1 sublineages revealed a high prevalence of O1-F265 across Southeast, South, and Southwest China, SEA, and the Japanese archipelago. The subclade O1a-M119 is common in Southeast China, while O1b-M268 predominates in Southwest China and SEA (supplementary fig. S18 and table S5, Supplementary Material online). The O1a sublineages are primarily found among Austronesian-, Tai-Kadai-, and Sinitic-speaking populations in Southeast, South, and Southwest China (supplementary table S5, Supplementary Material online), suggesting a shared patrilineal origin among these groups and a significant gene flow with the Han Chinese (Chen et al. 2022; Wang et al. 2022; Liu et al. 2023). A Neolithic expansion linked to subclade O1a1a1 (∼7.6 kya, supplementary fig. S1, Supplementary Material online) is identified, with O1a1a1a being more prevalent than O1a1a1b (supplementary fig. S18b and table S5, Supplementary Material online). O1a1a1a and its sublineages, which are found predominantly in Southeast China (∼51%), diversified approximately 5.7 kya, with early dispersal centers likely in the middle and lower portions of the Yangtze River Basin and the southeast coast (supplementary fig. S18b, Supplementary Material online). 
O1a1a1b, which has the highest frequency in Hainan among the Li people, decreases from south to north, with initial dispersal centers in South/Southwest China. This lineage, which is possibly ancestral to the Baiyue, significantly contributed to other Chinese populations (supplementary fig. S18b and table S5, Supplementary Material online). Another Neolithic expansion associated with O1a1a2 (∼8.4 kya, supplementary fig. S1, Supplementary Material online) shows high frequencies along the southeastern coast, South/Southwest China, and Vietnam, with likely initial dispersal centers in South/Southwest China (supplementary fig. S18b, Supplementary Material online). The primary sublineage O1a1a2a diverged approximately 6.4 kya (supplementary fig. S1, Supplementary Material online). The geographical distribution patterns and divergence times of O1a1a1b- and O1a1a2a-related lineages align with inferred migration routes from coastal to inland Southwest China and from Southwest China to mainland SEA according to the phylogenetic reconstructions of the Tai-Kadai languages (Tao et al. 2023). Several Taiwanese Hanben individuals are found within O1a1a1a1 sublineages (Fig. 2), and evidence from the Liangzhu culture in the Yangtze River Delta indicates that rice farmers carrying O1a-M119 in the Yangtze River Basin were likely the direct ancestors of the modern Tai-Kadai and Austronesian people, profoundly influencing southern Han Chinese. This migration proceeded southward along China's southeastern coast or inland routes to Southeast/Southwest China and mainland SEA.
Haplogroup O1b-M268, predominantly found in Southwest/South China, SEA, and the Japanese archipelago, is divided into three major subclades, namely, O1b1a1-PK4, O1b1a2-Page59, and O1b2-P49, each displaying distinct distribution patterns (supplementary fig. S18d to f, Supplementary Material online). O1b1a1 and its sublineages, which are mainly located in Southwest/South China and SEA, constitute key paternal lineages among the Tai-Kadai-speaking populations (supplementary table S5, Supplementary Material online). However, O1b1a1a-M95, primarily found in Austroasiatic groups, suggests an ancient gene flow between the proto-Austroasiatic and proto-Tai-Kadai populations, highlighting the impact of limited Austroasiatic sample sizes in our data set (Zhang et al. 2014; Kutanan et al. 2019; Macholdt et al. 2020). Ancient DNA analysis revealed that individuals from ∼3,000 years ago at the Wucheng site in Jiangsu along the Yangtze River Basin and from the Hengbei site in Shanxi, as well as several ∼1,500-year-old Guangxi individuals from southern East Asia (Li et al. 2007; Zhao et al. 2014), carried O1b1a1a-M95 or related sublineages (Fig. 2). Additionally, recent expansion events associated with O1b1a1a1a1b1a1a1 (∼3 kya) and O1b1a1a1a1b2a1a1a (∼2.5 kya) have been identified. O1b1a2 and its sublineages, which are relatively rare in East Asia, are primarily found in East China, the southeastern part of Northeast China, and Vietnam, especially among Han Chinese individuals (supplementary fig. S18e and table S5, Supplementary Material online). An MN individual from the Wanggou site, which is part of the Yangshao culture, was identified as belonging to O1b1a2-Page59 (Fig. 2). O1b2-P49 is most frequent in Japan, followed by Northeast China, but its detailed phylogenetic structure has yet to be fully elucidated (supplementary fig. S18f, Supplementary Material online). 
The genetic diversity patterns of the O1 lineages indicate a significant influence of ancient rice farmers on the gene pools of populations in South China and SEA. The complex movements and admixture events associated with these ancient agriculturists have profoundly shaped the genetic landscape of modern and ancient East Asians. To clarify the origins of crucial Chinese-dominant subclades and the demographic processes influencing modern Chinese populations, we should design a systematic sampling strategy. This approach should include comprehensive Y-chromosome sequences and the collection of spatiotemporally distinct ancient samples for more detailed analyses.
Conclusion
Genetic evidence from autosomal DNA studies has profoundly transformed our understanding of the genetic histories of diverse human populations. However, research into the ancient genetic legacy reflected in modern Chinese populations via Y-chromosome analysis remains sparse. To address this gap, we used the YHC to analyze the Y-chromosome diversity in ethnolinguistically diverse Chinese populations through whole Y-chromosome sequencing and our newly developed high-resolution YHSeqY3000 panel. This project reconstructs demographic events, such as isolation, expansion, and admixture, using various computational models. The new data were integrated with a Y-chromosome genomic database of 14,644 individuals, creating a comprehensive database that includes 1,786 ancient Eurasians and 115 modern Chinese populations from 47 ethnic groups. This integration facilitates an in-depth exploration of the paternal genetic diversity of Chinese populations. Our findings indicate that multiple founding lineages associated with millet/rice farmers from the Yellow River Basin and the Yangtze River Basin, Siberian hunter-gatherers, and ancient western Eurasian pastoralists and farmers significantly influence the geographical patterns of paternal genetic stratification in Chinese populations. There is a strong correlation between the frequency of subsistence model-related founding lineages and the proportion of autosomal-based admixture from presumed ancestral sources, as well as between the latitude and a differentiated north-to-south genetic matrix. These correlations suggest that ancient migrations and extensive admixtures with indigenous populations primarily shaped the paternal genetic landscape of Chinese populations. To further elucidate the paternal evolutionary history of East Asians, we emphasize the importance of combining high-depth whole-genome sequencing data from both modern and spatiotemporally diverse ancient populations. 
This comprehensive approach will enhance our understanding of the dynamic interplay between migration, admixture, and cultural development in this region.
Materials and Methods
Sampling, Sequencing, Genotyping, and Phylogenetic Construction
Study Participants
To comprehensively characterize the paternal diversity across China, saliva samples were collected from 919 participants representing 39 ethnolinguistic groups (supplementary table S1, Supplementary Material online). All participants were descendants of self-identified members of their ethnic group, and their families had resided in the respective sampling districts for at least three generations. The study received approval from the Medical Ethics Committee of West China Hospital of Sichuan University (2023-306) and was conducted following the Helsinki Declaration of 2013 (World Medical Association 2013). Informed consent was obtained from each participant before sample collection.
DNA Extraction, Whole-Genome Sequencing, and Genotyping
Genomic DNA was extracted using the QIAamp DNA Mini Kit (QIAGEN, Germany). DNA concentrations were quantified with the Qubit dsDNA HS Assay Kit, following the standard protocol on a Qubit 3.0 fluorometer (Thermo Fisher Scientific). Sequencing was conducted on the Illumina platform (Illumina, San Diego, CA, USA), achieving 80× genome-wide coverage. The raw sequencing reads were mapped to the human reference genome GRCh37 using BWA v0.7.13 (Li and Durbin 2009). Duplicate reads were removed with Picard v3.0.0, followed by base quality score recalibration via GATK v4.2.6.1. Joint variant calling was executed using the GATK HaplotypeCaller, CombineGVCFs, and GenotypeGVCFs modules (McKenna et al. 2010). High-quality variant calls within a 10 Mb region were obtained through a sequence mask (Poznik et al. 2013). Variants exhibiting a missing call rate greater than 5%, a base quality below 20, or a heterogeneity rate above 15% were filtered out using BCFtools v1.8 (Li 2011). Samples with missing call rates exceeding 5% were removed via VCFtools v0.1.16 (Danecek et al. 2011). Ultimately, 914 samples meeting quality standards were selected for the downstream analysis, including the reconstruction of a time-scaled phylogenetic tree. Additionally, Y-specific target sequences with 100× coverage were generated using the custom-designed YHSeqY3000 panel on the MGI sequencing platform to validate the sequencing performance.
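The quality-control thresholds above can be illustrated with a minimal sketch. It is written in plain Python and does not reproduce the BCFtools or VCFtools commands used in the study; the `VariantStats` container, the function names, the toy values, and the reading of "heterogeneity rate" as a per-variant fraction are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class VariantStats:
    """Per-variant summary statistics (hypothetical container for illustration)."""
    missing_rate: float        # fraction of samples without a call
    base_quality: float        # Phred-scaled base quality
    heterogeneity_rate: float  # assumed: fraction of samples with inconsistent calls


def variant_passes(v, max_missing=0.05, min_quality=20.0, max_heterogeneity=0.15):
    """Keep a variant only if it clears all three thresholds quoted in the text."""
    return (
        v.missing_rate <= max_missing
        and v.base_quality >= min_quality
        and v.heterogeneity_rate <= max_heterogeneity
    )


def samples_passing(sample_missing_rates, max_missing=0.05):
    """Return IDs of samples whose missing call rate does not exceed the cutoff."""
    return [sid for sid, rate in sample_missing_rates.items() if rate <= max_missing]


# Toy input: one clean variant and three that each fail a different filter.
variants = [
    VariantStats(0.01, 35.0, 0.02),  # passes
    VariantStats(0.08, 40.0, 0.01),  # too many missing calls
    VariantStats(0.00, 15.0, 0.00),  # base quality too low
    VariantStats(0.02, 30.0, 0.20),  # heterogeneity rate too high
]
kept = [v for v in variants if variant_passes(v)]
samples = samples_passing({"S1": 0.01, "S2": 0.12, "S3": 0.05})
```

In this toy input, one of four variants and two of three samples are retained, mirroring in miniature the reduction from 919 to 914 samples reported above.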
Haplogroup Classification and Phylogenetic Relationship Construction
The initial classification of the Y-chromosome haplogroups was performed using in-house scripts based on a newly reconstructed phylogenetic tree supplemented by classifications from HaploGrouper (Jagadeesan et al. 2021) and Y-LineageTracker (Chen et al. 2021), referencing the Y-DNA Haplogroup Tree 2019–2020 (https://isogg.org/tree/index.html). BEAST v1.10.4 (Suchard et al. 2018) facilitated the construction of a phylogenetic tree and the estimation of the TMRCA for various nodes using approximately 10 Mb of Y-chromosome sequences. B-related haplotypes served as an outgroup (Mallick et al. 2016). The optimal substitution model was selected via jModelTest v2.1.10 (Darriba et al. 2012). Markov chain Monte Carlo sampling was executed over 100 million iterations, with samples logged every 1,000 iterations and the initial 10 million iterations discarded as burn-in. An exponential growth coalescent tree prior was used alongside the GTR (general time reversible) substitution model and a strict molecular clock. The substitution rate was set at 7.6 × 10⁻¹⁰ mutations per base pair per year (95% confidence interval: 6.7 × 10⁻¹⁰ to 8.6 × 10⁻¹⁰), as estimated by Fu et al. (2014). Three independent runs were combined using LogCombiner, with the quality of the combined output manually verified using Tracer v1.7.1 (Rambaut et al. 2018). The maximum clade credibility tree was then generated with TreeAnnotator v1.10 and visualized using FigTree. To further investigate the ancient influences on the paternal landscape of the recently genotyped Chinese ethnic minorities, an ML phylogenetic tree was constructed using RAxML (Stamatakis 2014) from the 914 Y-chromosome sequences of ∼10 Mb each. Ancient genomes were integrated into this modern ML phylogeny using pathPhynder (Martiniano et al. 2022), and the tree was refined with iTOL (Letunic and Bork 2021). 
For the complete data set of Y-chromosome target sequences from 919 samples, a network-based analysis of shared haplotypes was conducted using PopART (Leigh and Bryant 2015), providing a comprehensive view of haplogroup relationships.
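As a worked illustration of the sampling scheme and molecular clock described above, the following sketch derives the approximate number of posterior samples retained and the expected mutation interval implied by the stated rate. The variable names are illustrative, the region length is rounded to exactly 10 Mb, and the state logged at iteration 0 is ignored; none of this reproduces the BEAST configuration itself.

```python
# Values quoted in the text above.
CHAIN_LENGTH = 100_000_000   # MCMC iterations per run
LOG_EVERY = 1_000            # sampling interval
BURN_IN = 10_000_000         # iterations discarded as burn-in
N_RUNS = 3                   # independent runs that were combined
RATE = 7.6e-10               # substitutions per base pair per year
REGION_BP = 10_000_000       # assumed: "approximately 10 Mb" rounded to 10 Mb


def retained_samples(chain_length, log_every, burn_in):
    """Approximate number of logged states kept after discarding burn-in."""
    return (chain_length - burn_in) // log_every


per_run = retained_samples(CHAIN_LENGTH, LOG_EVERY, BURN_IN)
combined = per_run * N_RUNS

# Expected substitutions per year across the callable region, and its inverse.
mutations_per_year = RATE * REGION_BP
years_per_mutation = 1.0 / mutations_per_year

print(f"retained per run: {per_run}, combined: {combined}")
print(f"about one mutation every {years_per_mutation:.1f} years per lineage")
```

Under these assumptions, each run contributes about 90,000 post-burn-in states, and a single lineage is expected to accumulate roughly one mutation every 130 years across the region, which bounds the temporal resolution of the TMRCA estimates.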
Haplogroup Frequency Spectra Estimation and Clustering Analysis
Data Set Composition
We integrated previously published haplogroup data from 11,979 East Asian individuals across 79 populations drawn from key studies, the 1KGP, and the Human Genome Diversity Project (Poznik et al. 2016; Bergstrom et al. 2020). Additionally, data from 879 individuals across 27 SEA populations; 252 ancient East Asians from regions including the Tibetan Plateau, Xinjiang, Amur River Basin, Yellow River Basin, West Liao River, and South China; and 1,534 ancient western Eurasians from the Allen Ancient DNA Resource were included (supplementary tables S2 and S3, Supplementary Material online; Mallick et al. 2024). A total of 13,777 modern individuals from 12 linguistically distinct groups were included, spanning 22 provinces, five autonomous regions, and four municipalities in China, as well as Thailand and Vietnam. These comprised 135 Austroasiatic-, 693 Austronesian-, 285 Hmong-Mien-, 75 Japonic-, 35 Koreanic-, 994 Mongolic-, 863 Tai-Kadai-, 1,338 Tibeto-Burman-, 260 Tungusic-, 1 Indo-European-, and 291 Turkic-speaking individuals, as well as 805 Sinitic-speaking Hui, 3,248 northern Han Chinese, and 4,754 southern Han Chinese individuals (supplementary tables S1 and S2, Supplementary Material online). The haplogroups were manually revised according to variant information and the Y-DNA Haplogroup Tree 2019–2020. To facilitate the estimation of the spatial distributions of the paternal lineages, we aggregated haplogroup data to create metapopulations based on geographical region, ethnicity, and language family. The haplogroup frequencies were estimated at various levels of terminal haplogroups. Population genetic analyses were conducted on individual populations with sample sizes exceeding 10 and metapopulations exceeding 30.
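The sample counts reported in this section can be cross-checked with simple arithmetic. The sketch below only re-adds figures quoted in the text and in the Abstract; the dictionary layout and variable names are illustrative, and treating the 11,979 published East Asian, 879 SEA, and 919 newly reported individuals as the modern subtotal is an interpretation of the text rather than a statement from it.

```python
# Modern individuals by linguistic group, as listed in the text.
modern_by_group = {
    "Austroasiatic": 135,
    "Austronesian": 693,
    "Hmong-Mien": 285,
    "Japonic": 75,
    "Koreanic": 35,
    "Mongolic": 994,
    "Tai-Kadai": 863,
    "Tibeto-Burman": 1338,
    "Tungusic": 260,
    "Indo-European": 1,
    "Turkic": 291,
    "Sinitic (Hui)": 805,
    "Sinitic (northern Han)": 3248,
    "Sinitic (southern Han)": 4754,
}

total_modern = sum(modern_by_group.values())

# The same total reached from the data sources (interpretation, see lead-in).
published_east_asian = 11_979
published_sea = 879
newly_reported = 919
modern_from_sources = published_east_asian + published_sea + newly_reported

# Ancient individuals and the full integrated database.
ancient_total = 252 + 1_534
database_total = total_modern + ancient_total

print(total_modern, modern_from_sources, ancient_total, database_total)
```

Both routes give 13,777 modern individuals, and adding the 1,786 ancient individuals yields the 15,563 individuals stated in the Abstract. The 14 rows correspond to 12 language groups because Sinitic speakers are reported as three subgroups.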
Population Structure Inference
Pairwise Fst genetic distances were calculated from the haplogroup frequency spectra using Y-LineageTracker. MDS analyses were conducted based on these genetic distances utilizing the “cmdscale” function in R. Additionally, PCA was performed on the haplogroup frequency spectra using Y-LineageTracker.
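To make the notion of a frequency-based genetic distance concrete, the sketch below computes a simple Nei-style fixation index (G_ST) between two populations from their haplogroup frequencies. This is a swapped-in textbook estimator chosen for illustration and is not claimed to be the formula implemented in Y-LineageTracker; the function names and toy frequencies are hypothetical.

```python
def gene_diversity(freqs):
    """Expected diversity 1 - sum(p^2) for one set of haplogroup frequencies."""
    return 1.0 - sum(p * p for p in freqs.values())


def pairwise_gst(pop_a, pop_b):
    """Nei-style G_ST between two populations with equal weighting.

    pop_a and pop_b map haplogroup names to frequencies that sum to 1.
    """
    haplogroups = set(pop_a) | set(pop_b)
    # Mean within-population diversity.
    h_s = (gene_diversity(pop_a) + gene_diversity(pop_b)) / 2.0
    # Diversity of the pooled population (mean frequency per haplogroup).
    pooled = {h: (pop_a.get(h, 0.0) + pop_b.get(h, 0.0)) / 2.0 for h in haplogroups}
    h_t = gene_diversity(pooled)
    if h_t == 0.0:
        return 0.0  # both populations fixed for the same haplogroup
    return (h_t - h_s) / h_t


# Toy frequency spectra (hypothetical values, not taken from the study).
north = {"C2a": 0.5, "O2a": 0.5}
south = {"O1a": 0.5, "O2a": 0.5}

distance_same = pairwise_gst(north, north)
distance_diff = pairwise_gst(north, south)
distance_fixed = pairwise_gst({"C2a": 1.0}, {"O1a": 1.0})
```

A matrix of such pairwise values is the kind of input that a classical MDS routine, such as `cmdscale` in R, projects into two dimensions.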
Spatial Statistics Correlated with the Phylogeographic Origin of Founding Lineages
The frequency of specific haplogroups within a province-defined population at various terminal haplogroup levels was computed using Y-LineageTracker, with level parameters adjusted from 0 to 6. The Chinese populations were grouped according to provincial administrative boundaries, while populations from the island and mainland SEA were aggregated by country. The spatial distribution patterns of the dominant haplogroups in China were examined using ArcMap. This included the application of the Getis-Ord General G method for optimized hot spot analysis and spatial autocorrelation analysis using Moran's I. The clusters identified through optimized hot spot analysis, referred to as hot and cold spots, approximated the potential geographical origins or diffusion centers of specific haplogroups, and the mirroring regions illustrated the general distribution trends of these haplogroups.
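The spatial autocorrelation statistic mentioned above, Moran's I, can be sketched as follows. The study used ArcMap; this plain-Python version is only an illustration of the statistic, and the binary adjacency weights, the four-region layout, and the frequency values are hypothetical.

```python
def morans_i(values, weights):
    """Global Moran's I for region values and a spatial weight matrix.

    values:  list of haplogroup frequencies, one per region.
    weights: square list of lists; weights[i][j] > 0 if regions i and j
             are neighbours (diagonal entries are expected to be 0).
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    denominator = sum(d * d for d in dev)
    if total_weight == 0 or denominator == 0:
        raise ValueError("Moran's I is undefined for this input")
    numerator = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    return (n / total_weight) * (numerator / denominator)


# Four regions arranged in a line: each region neighbours the next one.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = morans_i([1.0, 1.0, 0.0, 0.0], chain)    # similar values adjacent
alternating = morans_i([1.0, 0.0, 1.0, 0.0], chain)  # dissimilar values adjacent
```

Positive values indicate that neighbouring regions carry similar haplogroup frequencies, as expected around a diffusion center, whereas negative values indicate a checkerboard-like pattern.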
Autosomal-Based ADMIXTURE Estimation
A data set was constructed from 445 ancient individuals across 88 Eurasian populations and 1,325 modern individuals from 62 geographically diverse populations, sourced from our integrated 10K_CPGDP database. Admixture proportions of Chinese populations were estimated using model-based ADMIXTURE. The autosomal data set was pruned for linkage disequilibrium using PLINK (Chang et al. 2015) with the parameters “--indep-pairwise 200 25 0.4” and “--allow-no-sex”. Subsequently, ADMIXTURE was run with the number of predefined ancestral sources (K) ranging from 2 to 15 (Alexander et al. 2009). The optimal admixture model was determined based on the lowest cross-validation error, and correlations between the haplogroup frequencies and autosomal-based admixture proportions of modern Chinese populations were estimated.
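Selecting the model with the lowest cross-validation error can be sketched as below. The sketch assumes that the ADMIXTURE logs contain lines of the form `CV error (K=3): 0.52310`, as written when cross-validation is enabled; the function name and the error values in the toy log are illustrative and are not results from the study.

```python
import re

# Matches ADMIXTURE cross-validation summary lines such as "CV error (K=3): 0.52310".
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9]*\.?[0-9]+)")


def lowest_cv_error(log_text):
    """Return (best_k, errors) parsed from concatenated ADMIXTURE log text."""
    errors = {int(k): float(err) for k, err in CV_LINE.findall(log_text)}
    if not errors:
        raise ValueError("no cross-validation lines found")
    best_k = min(errors, key=errors.get)
    return best_k, errors


# Toy log excerpt with made-up error values for three values of K.
toy_log = """
CV error (K=2): 0.53120
CV error (K=3): 0.52310
CV error (K=4): 0.52695
"""

best_k, cv_errors = lowest_cv_error(toy_log)
print(f"lowest cross-validation error at K={best_k}")
```

In practice the logs from all runs with K between 2 and 15 would be concatenated before parsing, and the K with the minimum error would be carried forward to the correlation analysis.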
Correlation between Haplogroup Frequency and ADMIXTURE-Based Ancestral Proportion
The haplogroup frequencies of geographically defined metapopulations were initially calculated. The Chinese populations distinguished by geographic and ethnolinguistic characteristics were grouped by provincial administrative region. All examined lineages were truncated at the ninth level, identifying 139 common lineages with a frequency exceeding 0.05 in at least one population, 177 low-frequency lineages, and 165 rare lineages. Pearson's correlation coefficients between haplogroup frequencies and geographic coordinates (longitude and latitude), along with their intercorrelations and statistical significance, were estimated using the “corrplot” R package. Subsequently, all Chinese populations were consolidated into a single subpopulation, defining common lineages with frequencies above 0.01 or 0.05. The “corrplot” R package was also utilized to assess the correlation between admixture proportions and haplogroup frequencies.
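The two steps described above, flagging common lineages and correlating their frequencies with geography, can be sketched as follows. The study used the “corrplot” R package; this plain-Python version is only an illustration, and the function names, province latitudes, and frequency values are hypothetical.

```python
from math import sqrt


def is_common(freqs_by_population, threshold=0.05):
    """A lineage is common if its frequency exceeds the threshold in any population."""
    return any(f > threshold for f in freqs_by_population)


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Toy data: latitude of four provinces and the frequency of one lineage in each.
latitudes = [23.0, 30.0, 37.0, 44.0]
northern_lineage = [0.02, 0.04, 0.06, 0.08]   # rises with latitude
southern_lineage = [0.30, 0.20, 0.10, 0.00]   # falls with latitude

r_north = pearson_r(latitudes, northern_lineage)
r_south = pearson_r(latitudes, southern_lineage)
common_flag = is_common(northern_lineage)
```

The same coefficient can be computed between lineage frequencies and ADMIXTURE-derived ancestry proportions; significance testing, which corrplot reports alongside the coefficients, is omitted here.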
Declarations
Ethics Approval and Consent to Participate
This study received approval from the Medical Ethics Committee of West China Hospital of Sichuan University and was conducted following the principles outlined in the Helsinki Declaration.
Consent for Publication
Not applicable.
Supplementary Material
Acknowledgments
We thank all the volunteers who participated in this project.
Appendix
Full Author Lists of the 10K_CPGDP Consortium
Guanglin He¹, Chao Liu², Mengge Wang², Renkuan Tang³, Libing Yun⁴, Junbao Yang⁵, Chuan-Chao Wang⁶, Jiangwei Yan⁷, Bofeng Zhu⁸, Liping Hu⁹, Shengjie Nie⁹, Hongbing Yao¹⁰
¹Institute of Rare Diseases, West China Hospital of Sichuan University, Sichuan University, Chengdu, 610000, China
²Anti-Drug Technology Center of Guangdong Province, Guangzhou, 510220, China
³Department of Forensic Medicine, College of Basic Medicine, Chongqing Medical University, Chongqing, 400331, China
⁴West China School of Basic Science & Forensic Medicine, Sichuan University, Chengdu, 610041, China
⁵School of Basic Medicine and Forensic Medicine, North Sichuan Medical College, Nanchong, Sichuan, 637007, China
⁶State Key Laboratory of Cellular Stress Biology, School of Life Sciences, Xiamen University, Xiamen, 361005, China
⁷School of Forensic Medicine, Shanxi Medical University, Jinzhong, 030001, China
⁸Guangzhou Key Laboratory of Forensic Multi-Omics for Precision Identification, School of Forensic Medicine, Southern Medical University, Guangzhou, 510220, China
⁹School of Forensic Medicine, Kunming Medical University, Kunming, 650500, China
¹⁰Belt and Road Research Center for Forensic Molecular Anthropology, Gansu University of Political Science and Law, Lanzhou, 730000, China
Contributor Information
Mengge Wang, Institute of Rare Diseases, West China Hospital of Sichuan University, Sichuan University, Chengdu 610000, China; Center for Archaeological Science, Sichuan University, Chengdu 610000, China; Faculty of Forensic Medicine, Zhongshan School of Medicine, Sun Yat-sen University, Guangzhou 510275, China.
Yuguo Huang, Institute of Rare Diseases, West China Hospital of Sichuan University, Sichuan University, Chengdu 610000, China.
Kaijun Liu, School of International Tourism and Culture, Guizhou Normal University, Guiyang 550025, China; MoFang Human Genome Research Institute, Tianfu Software Park, Chengdu, Sichuan 610042, China.
Zhiyong Wang, Institute of Rare Diseases, West China Hospital of Sichuan University, Sichuan University, Chengdu 610000, China; School of Forensic Medicine, Kunming Medical University, Kunming 650500, China.
Menghan Zhang, Institute of Modern Languages and Linguistics, Fudan University, Shanghai 200433, China; Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China.
Haibing Yuan, Center for Archaeological Science, Sichuan University, Chengdu 610000, China.
Shuhan Duan, Institute of Rare Diseases, West China Hospital of Sichuan University, Sichuan University, Chengdu 610000, China; School of Basic Medical Sciences, North Sichuan Medical College, Nanchong 637100, China.
Lanhai Wei, School of Ethnology and Anthropology, Institute of Humanities and Human Sciences, Inner Mongolia Normal University, Hohhot 010022, China.
Hongbing Yao, Belt and Road Research Center for Forensic Molecular Anthropology, Gansu University of Political Science and Law, Lanzhou 730000, China.
Qiuxia Sun, Institute of Rare Diseases, West China Hospital of Sichuan University, Sichuan University, Chengdu 610000, China; Department of Forensic Medicine, College of Basic Medicine, Chongqing Medical University, Chongqing 400331, China.
Jie Zhong, Institute of Rare Diseases, West China Hospital of Sichuan University, Sichuan University, Chengdu 610000, China.
Renkuan Tang, Department of Forensic Medicine, College of Basic Medicine, Chongqing Medical University, Chongqing 400331, China.
Jing Chen, Institute of Rare Diseases, West China Hospital of Sichuan University, Sichuan University, Chengdu 610000, China; School of Forensic Medicine, Shanxi Medical University, Jinzhong 030001, China.
Yuntao Sun, Institute of Rare Diseases, West China Hospital of Sichuan University, Sichuan University, Chengdu 610000, China; Institute of Forensic Medicine, West China School of Basic Medical Sciences & Forensic Medicine, Sichuan University, Chengdu 610041, China.
Xiangping Li, Institute of Rare Diseases, West China Hospital of Sichuan University, Sichuan University, Chengdu 610000, China; School of Forensic Medicine, Kunming Medical University, Kunming 650500, China.
Haoran Su, Institute of Rare Diseases, West China Hospital of Sichuan University, Sichuan University, Chengdu 610000, China; School of Laboratory Medicine and Center for Genetics and Prenatal Diagnosis, Affiliated Hospital of North Sichuan Medical College, Nanchong, Sichuan 637007, China.
Qingxin Yang, Institute of Rare Diseases, West China Hospital of Sichuan University, Sichuan University, Chengdu 610000, China; School of Forensic Medicine, Kunming Medical University, Kunming 650500, China.
Liping Hu, School of Forensic Medicine, Kunming Medical University, Kunming 650500, China.
Libing Yun, Institute of Forensic Medicine, West China School of Basic Medical Sciences & Forensic Medicine, Sichuan University, Chengdu 610041, China.
Junbao Yang, Institute of Basic Medicine and Forensic Medicine, North Sichuan Medical College and Center for Genetics and Prenatal Diagnosis, Affiliated Hospital of North Sichuan Medical College, Nanchong, Sichuan 637007, China.
Shengjie Nie, School of Forensic Medicine, Kunming Medical University, Kunming 650500, China.
Yan Cai, School of Laboratory Medicine and Center for Genetics and Prenatal Diagnosis, Affiliated Hospital of North Sichuan Medical College, Nanchong, Sichuan 637007, China.
Jiangwei Yan, School of Forensic Medicine, Shanxi Medical University, Jinzhong 030001, China.
Kun Zhou, MoFang Human Genome Research Institute, Tianfu Software Park, Chengdu, Sichuan 610042, China.
Chuanchao Wang, State Key Laboratory of Cellular Stress Biology, School of Life Sciences, Xiamen University, Xiamen 361005, China.
Bofeng Zhu, Guangzhou Key Laboratory of Forensic Multi-Omics for Precision Identification, School of Forensic Medicine, Southern Medical University, Guangzhou 510515, China; Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, Guangdong 510515, China.
Chao Liu, Guangzhou Key Laboratory of Forensic Multi-Omics for Precision Identification, School of Forensic Medicine, Southern Medical University, Guangzhou 510515, China; Anti-Drug Technology Center of Guangdong Province, Guangzhou 510230, China.
Guanglin He, Institute of Rare Diseases, West China Hospital of Sichuan University, Sichuan University, Chengdu 610000, China; Center for Archaeological Science, Sichuan University, Chengdu 610000, China.
10K_CPGDP Consortium:
Guanglin He, Chao Liu, Mengge Wang, Renkuan Tang, Libing Yun, Junbao Yang, Chuan-Chao Wang, Jiangwei Yan, Bofeng Zhu, Liping Hu, Shengjie Nie, and Hongbing Yao
Supplementary Material
Supplementary material is available at Molecular Biology and Evolution online.
Author Contributions
G.H., M.W., B.Z., and C.L. conceived and supervised the project. G.H. and M.W. collected the samples. K.L., K.Z., Y.H., G.H., and M.W. extracted the genomic DNA and performed the genome sequencing. G.H., M.W., and K.L. performed variant calling. M.Z. provided first-hand language documents from previous linguistic fieldwork. Ha.Y., L.W., and C.W. collected the archaeological and ancient DNA data. M.W., Y.H., K.L., Z.W., S.D., Ho.Y., Q.S., J.Z., R.T., J.C., Y.S., X.L., H.S., Q.Y., L.H., L.Y., Ju.Y., S.N., Y.C., Ji.Y., K.Z., B.Z., C.L., and G.H. performed the population genetic analysis. G.H. and M.W. drafted the manuscript. G.H., M.W., B.Z., and C.L. revised the manuscript.
Funding
This work was supported by grants from the National Natural Science Foundation of China (82202078), the Major Project of the National Social Science Foundation of China (23&ZD203), the Open Project of the Key Laboratory of Forensic Genetics of the Ministry of Public Security (2022FGKFKT05), the Center for Archaeological Science of Sichuan University (23SASA01), the 1·3·5 Project for Disciplines of Excellence, West China Hospital, Sichuan University (ZYJC20002), and the Sichuan Science and Technology Program (2024NSFSC1518).
Data Availability
All haplogroup information is provided in the Supplementary Material. In compliance with the regulations of the Ministry of Science and Technology of the People's Republic of China, the raw genotype data are available under controlled access. Requests for access to the raw data can be sent to Guanglin He (Guanglinhescu@163.com) and Mengge Wang (Menggewang2021@163.com).
References
- Alexander DH, Novembre J, Lange K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 2009:19(9):1655–1664. 10.1101/gr.094052.109. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bergstrom A, McCarthy SA, Hui R, Almarri MA, Ayub Q, Danecek P, Chen Y, Felkel S, Hallast P, Kamm J, et al. Insights into human genetic variation and population history from 929 diverse genomes. Science. 2020:367(6484):eaay5012. 10.1126/science.aay5012. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Byrska-Bishop M, Evani US, Zhao X, Basile AO, Abel HJ, Regier AA, Corvelo A, Clarke WE, Musunuri R, Nagulapalli K, et al. High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. Cell. 2022:185(18):3426–3440 e3419. 10.1016/j.cell.2022.08.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cao Y, Li L, Xu M, Feng Z, Sun X, Lu J, Xu Y, Du P, Wang T, Hu R, et al. The ChinaMAP analytics of deep whole genome sequences in 10,588 individuals. Cell Res. 2020:30(9):717–731. 10.1038/s41422-020-0322-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 2015:4(1):7. 10.1186/s13742-015-0047-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chen H, Lin R, Lu Y, Zhang R, Gao Y, He Y, Xu S. Tracing Bai-Yue ancestry in aboriginal Li people on Hainan Island. Mol Biol Evol. 2022:39(10):msac210. 10.1093/molbev/msac210. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chen H, Lu Y, Lu D, Xu S. Y-LineageTracker: a high-throughput analysis framework for Y-chromosomal next-generation sequencing data. BMC Bioinformatics. 2021:22(1):114. 10.1186/s12859-021-04057-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cheng S, Xu Z, Bian S, Chen X, Shi Y, Li Y, Duan Y, Liu Y, Lin J, Jiang Y, et al. The STROMICS genome study: deep whole-genome sequencing and analysis of 10K Chinese patients with ischemic stroke reveal complex genetic and phenotypic interplay. Cell Discov. 2023:9(1):75. 10.1038/s41421-023-00582-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cong P-K, Bai W-Y, Li J-C, Yang M-Y, Khederzadeh S, Gai S-R, Li N, Liu Y-H, Yu S-H, Zhao W-W, et al. Genomic analyses of 10,376 individuals in the Westlake BioBank for Chinese (WBBC) pilot project. Nat Commun. 2022:13(1):2939. 10.1038/s41467-022-30526-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cui Y, Li H, Ning C, Zhang Y, Chen L, Zhao X, Hagelberg E, Zhou H. Y chromosome analysis of prehistoric human populations in the West Liao River Valley, Northeast China. BMC Evol Biol. 2013:13(1):216. 10.1186/1471-2148-13-216. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, et al. The variant call format and VCFtools. Bioinformatics. 2011:27(15):2156–2158. 10.1093/bioinformatics/btr330. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Darriba D, Taboada GL, Doallo R, Posada D. jModelTest 2: more models, new heuristics and parallel computing. Nat Methods. 2012:9(8):772. 10.1038/nmeth.2109. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Feng Q, Lu Y, Ni X, Yuan K, Yang Y, Yang X, Liu C, Lou H, Ning Z, Wang Y, et al. Genetic history of Xinjiang’s Uyghurs suggests Bronze Age multiple-way contacts in Eurasia. Mol Biol Evol. 2017:34(10):2572–2582. 10.1093/molbev/msx177. [DOI] [PubMed] [Google Scholar]
- Fu Q, Li H, Moorjani P, Jay F, Slepchenko SM, Bondarev AA, Johnson PLF, Aximu-Petri A, Prüfer K, de Filippo C, et al. Genome sequence of a 45,000-year-old modern human from western Siberia. Nature. 2014:514(7523):445–449. 10.1038/nature13810. [DOI] [PMC free article] [PubMed] [Google Scholar]
- He G, Fan Z, Zou X, Deng X, Yeh H, Wang Z, Liu J, Xu Q, Chen L, Deng X, et al. Demographic model and biological adaptation inferred from the genome-wide single nucleotide polymorphism data reveal tripartite origins of southernmost Chinese Huis. Am J Biol Anthropol. 2022:180(3):488–505. 10.1002/ajpa.24672. [DOI] [Google Scholar]
- He G, Wang M, Miao L, Chen J, Zhao J, Sun Q, Duan S, Wang Z, Xu X, Sun Y, et al. Multiple founding paternal lineages inferred from the newly-developed 639-plex Y-SNP panel suggested the complex admixture and migration history of Chinese people. Hum Genomics. 2023a:17(1):29. 10.1186/s40246-023-00476-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- He G, Wang J, Yang L, Duan S, Sun Q, Li Y, Wu J, Wu W, Wang Z, Liu Y, et al. Genome-wide allele and haplotype-sharing patterns suggested one unique Hmong-Mein-related lineage and biological adaptation history in Southwest China. Hum Genomics. 2023b:17(1):3. 10.1186/s40246-023-00452-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- He G, Yao H, Sun Q, Duan S, Tang R, Chen J, Wang Z, Sun Y, Li X, Wang S, et al. Whole-genome sequencing of ethnolinguistic diverse northwestern Chinese Hexi Corridor people from the 10K_CPGDP project suggested the differentiated East-West genetic admixture along the Silk Road and their biological adaptations. bioRxiv. 2023c. 2023.2002. 2026.530053. [Google Scholar]
- Jagadeesan A, Ebenesersdóttir SS, Guðmundsdóttir VB, Thordardottir EL, Moore KHS, Helgason A. HaploGrouper: a generalized approach to haplogroup classification. Bioinformatics. 2021:37(4):570–572. 10.1093/bioinformatics/btaa729. [DOI] [PubMed] [Google Scholar]
- Jeong C, Wang K, Wilkin S, Taylor WTT, Miller BK, Bemmann JH, Stahl R, Chiovelli C, Knolle F, Ulziibayar S, et al. A dynamic 6,000-year genetic history of Eurasia’s eastern steppe. Cell. 2020:183(4):890–904 e829. 10.1016/j.cell.2020.10.015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Karmin M, Flores R, Saag L, Hudjashov G, Brucato N, Crenna-Darusallam C, Larena M, Endicott PL, Jakobsson M, Lansing JS, et al. Episodes of diversification and isolation in Island Southeast Asian and Near Oceanian male lineages. Mol Biol Evol. 2022:39(3):msac045. 10.1093/molbev/msac045. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kumar V, Wang W, Zhang J, Wang Y, Ruan Q, Yu J, Wu X, Hu X, Wu X, Guo W, et al. Bronze and Iron Age population movements underlie Xinjiang population history. Science. 2022:376(6588):62–69. 10.1126/science.abk1534. [DOI] [PubMed] [Google Scholar]
- Kutanan W, Kampuansai J, Srikummool M, Brunelli A, Ghirotto S, Arias L, Macholdt E, Hübner A, Schröder R, Stoneking M. Contrasting paternal and maternal genetic histories of Thai and Lao populations. Mol Biol Evol. 2019:36(7):1490–1506. 10.1093/molbev/msz083. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Leigh JW, Bryant D. Popart: full-feature software for haplotype network construction. Methods Ecol Evol. 2015:6(9):1110–1116. 10.1111/2041-210X.12410. [DOI] [Google Scholar]
- Leipe C, Long T, Sergusheva EA, Wagner M, Tarasov PE. Discontinuous spread of millet agriculture in Eastern Asia and prehistoric population dynamics. Sci Adv. 2019:5(9):eaax6225. 10.1126/sciadv.aax6225. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Letunic I, Bork P. Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res. 2021:49(W1):W293–W296. 10.1093/nar/gkab301. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011:27(21):2987–2993. 10.1093/bioinformatics/btr509. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009:25(14):1754–1760. 10.1093/bioinformatics/btp324. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li H, Huang Y, Mustavich LF, Zhang F, Tan J-Z, Wang L-E, Qian J, Gao M-H, Jin L. Y chromosomes of prehistoric people along the Yangtze River. Hum Genet. 2007:122(3-4):383–388. 10.1007/s00439-007-0407-2. [DOI] [PubMed] [Google Scholar]
- Li X, Wang M, Su H, Duan S, Sun Y, Chen H, Wang Z, Sun Q, Yang Q, Chen J, et al. Evolutionary history and biological adaptation of Han Chinese people on the Mongolian Plateau. hLife. 2024:2(6):296–313. 10.1016/j.hlife.2024.04.005. [DOI] [Google Scholar]
- Liu D, Ko AMS, Stoneking M. The genomic diversity of Taiwanese Austronesian groups: implications for the “into- and out-of-Taiwan” models. PNAS Nexus. 2023:2:pgad122. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Macholdt E, Arias L, Duong NT, Ton ND, Van Phong N, Schröder R, Pakendorf B, Van Hai N, Stoneking M. The paternal and maternal genetic history of Vietnamese populations. Eur J Hum Genet. 2020:28:636–645. 10.1038/s41431-019-0557-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mallick S, Li H, Lipson M, Mathieson I, Gymrek M, Racimo F, Zhao M, Chennagiri N, Nordenfelt S, Tandon A, et al. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature. 2016:538(7624):201–206. 10.1038/nature18964. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mallick S, Micco A, Mah M, Ringbauer H, Lazaridis I, Olalde I, Patterson N, Reich D. The Allen Ancient DNA Resource (AADR) a curated compendium of ancient human genomes. Sci Data. 2024:11(1):182. 10.1038/s41597-024-03031-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Martiniano R, De Sanctis B, Hallast P, Durbin R. Placing ancient DNA sequences into reference phylogenies. Mol Biol Evol. 2022:39(2):msac017. 10.1093/molbev/msac017. [DOI] [PMC free article] [PubMed] [Google Scholar]
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010:20(9):1297–1303. 10.1101/gr.107524.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Miao B, Liu Y, Gu W, Wei Q, Wu Q, Wang W, Zhang M, Ding M, Wang T, Liu J, et al. Maternal genetic structure of a neolithic population of the Yangshao culture. J Genet Genomics. 2021:48(8):746–750. 10.1016/j.jgg.2021.04.005. [DOI] [PubMed] [Google Scholar]
- Miller NF, Spengler RN, Frachetti M. Millet cultivation across Eurasia: origins, spread, and the influence of seasonal climate. The Holocene. 2016:26(10):1566–1575. 10.1177/0959683616641742. [DOI] [Google Scholar]
- Ning C, Fernandes D, Changmai P, Flegontova O, Yüncü E, Maier R, Altınışık NE, Kassian AS, Krause J, Lalueza-Fox C, et al. The genomic formation of First American ancestors in East and Northeast Asia. bioRxiv. 2020a. 2020.2010. 2012.336628. [Google Scholar]
- Ning C, Li T, Wang K, Zhang F, Li T, Wu X, Gao S, Zhang Q, Zhang H, Hudson MJ, et al. Ancient genomes from northern China suggest links between subsistence changes and human migration. Nat Commun. 2020b:11(1):2700. 10.1038/s41467-020-16557-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Olson ND, Wagner J, Dwarshuis N, Miga KH, Sedlazeck FJ, Salit M, Zook JM. Variant calling and benchmarking in an era of complete human genome sequences. Nat Rev Genet. 2023:24(7):464–483. 10.1038/s41576-023-00590-0. [DOI] [PubMed] [Google Scholar]
- Poznik GD, Henn BM, Yee M-C, Sliwerska E, Euskirchen GM, Lin AA, Snyder M, Quintana-Murci L, Kidd JM, Underhill PA, et al. Sequencing Y chromosomes resolves discrepancy in time to common ancestor of males versus females. Science. 2013:341(6145):562–565. 10.1126/science.1237619. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Poznik GD, Xue Y, Mendez FL, Willems TF, Massaia A, Wilson Sayres MA, Ayub Q, McCarthy SA, Narechania A, Kashin S, et al. Punctuated bursts in human male demography inferred from 1,244 worldwide Y-chromosome sequences. Nat Genet. 2016:48(6):593–599. 10.1038/ng.3559. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Raghavan M, Skoglund P, Graf KE, Metspalu M, Albrechtsen A, Moltke I, Rasmussen S, Stafford TW, Orlando L, Metspalu E, et al. Upper Palaeolithic Siberian genome reveals dual ancestry of Native Americans. Nature. 2014:505(7481):87–91. 10.1038/nature12736. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rambaut A, Drummond AJ, Xie D, Baele G, Suchard MA. Posterior summarization in Bayesian phylogenetics using Tracer 1.7. Syst Biol. 2018:67(5):901–904. 10.1093/sysbio/syy032. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Robbeets M, Bouckaert R, Conte M, Savelyev A, Li T, An D-I, Shinoda K-I, Cui Y, Kawashima T, Kim G, et al. Triangulation supports agricultural spread of the Transeurasian languages. Nature. 2021:599(7886):616–621. 10.1038/s41586-021-04108-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014:30(9):1312–1313. 10.1093/bioinformatics/btu033. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Su B, Xiao J, Underhill P, Deka R, Zhang W, Akey J, Huang W, Shen D, Lu D, Luo J, et al. Y-Chromosome evidence for a northward migration of modern humans into Eastern Asia during the last Ice Age. Am J Hum Genet. 1999:65(6):1718–1724. 10.1086/302680. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Suchard MA, Lemey P, Baele G, Ayres DL, Drummond AJ, Rambaut A. Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. Virus Evol. 2018:4(1):vey016. 10.1093/ve/vey016. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sun J, Li Y, Ma P, Yan S, Cheng H, Fan Z, Deng X, Ru K, Wang C, Chen G, et al. Shared paternal ancestry of Han, Tai-Kadai-speaking, and Austronesian-speaking populations as revealed by the high resolution phylogeny of O1a-M119 and distribution of its sub-lineages within China. Am J Phys Anthropol. 2021:174(4):686–700. 10.1002/ajpa.24240. [DOI] [PubMed] [Google Scholar]
- Sun N, Ma P-C, Yan S, Wen S-Q, Sun C, Du P-X, Cheng H-Z, Deng X-H, Wang C-C, Wei L-H. Phylogeography of Y-chromosome haplogroup Q1a1a-M120, a paternal lineage connecting populations in Siberia and East Asia. Ann Hum Biol. 2019:46(3):261–266. 10.1080/03014460.2019.1632930. [DOI] [PubMed] [Google Scholar]
- Sun Y, Wang M, Sun Q, Liu Y, Duan S, Wang Z, Zhou Y, Zhong J, Huang Y, Huang X, et al. Distinguished biological adaptation architecture aggravated population differentiation of Tibeto-Burman-speaking people. J Genet Genomics. 2023:51(5):517–530. 10.1016/j.jgg.2023.10.002. [DOI] [PubMed] [Google Scholar]
- Tao Y, Wei Y, Ge J, Pan Y, Wang W, Bi Q, Sheng P, Fu C, Pan W, Jin L, et al. Phylogenetic evidence reveals early Kra-Dai divergence and dispersal in the late Holocene. Nat Commun. 2023:14(1):6924. 10.1038/s41467-023-42761-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wang M, He G, Zou X, Chen P, Wang Z, Tang R, Yang X, Chen J, Yang M, Li Y, et al. Reconstructing the genetic admixture history of Tai-Kadai and Sinitic people: insights from genome-wide SNP data from South China. J Syst Evol. 2022:61(1):157–178. 10.1111/jse.12825. [DOI] [Google Scholar]
- Wang M, Wang Z, He G, Liu J, Wang S, Qian X, Lang M, Li J, Xie M, Li C, et al. Developmental validation of a custom panel including 165 Y-SNPs for Chinese Y-chromosomal haplogroups dissection using the ion S5 XL system. Forensic Sci Int Genet. 2019:38:70–76. 10.1016/j.fsigen.2018.10.009. [DOI] [PubMed] [Google Scholar]
- Wang T, Wang W, Xie G, Li Z, Fan X, Yang Q, Wu X, Cao P, Liu Y, Yang R, et al. Human population history at the crossroads of East and Southeast Asia since 11,000 years ago. Cell. 2021a:184(14):3829–3841 e3821. 10.1016/j.cell.2021.05.018. [DOI] [PubMed] [Google Scholar]
- Wang H, Yang MA, Wangdue S, Lu H, Chen H, Li L, Dong G, Tsring T, Yuan H, He W, et al. Human genetic history on the Tibetan Plateau in the past 5100 years. Sci Adv. 2023:9(11):eadd5582. 10.1126/sciadv.add5582. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wang C-C, Yeh H-Y, Popov AN, Zhang H-Q, Matsumura H, Sirak K, Cheronet O, Kovalev A, Rohland N, Kim AM, et al. Genomic insights into the formation of human populations in East Asia. Nature. 2021b:591(7850):413–419. 10.1038/s41586-021-03336-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wei W, Ayub Q, Chen Y, McCarthy S, Hou Y, Carbone I, Xue Y, Tyler-Smith C. A calibrated human Y-chromosomal phylogeny based on resequencing. Genome Res. 2013:23(2):388–395. 10.1101/gr.143198.112. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wei L-H, Yan S, Lu Y, Wen S-Q, Huang Y-Z, Wang L-X, Li S-L, Yang Y-J, Wang X-F, Zhang C, et al. Whole-sequence analysis indicates that the Y chromosome C2*-Star Cluster traces back to ordinary Mongols, rather than Genghis Khan. Eur J Hum Genet. 2018:26(2):230–237. 10.1038/s41431-017-0012-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- World Medical Association (WMA) . World Medical Association Declaration of Helsinki: ethical principles for medical research involving human subjects. JAMA. 2013:310:2191–2194. 10.1001/jama.2013.281053. [DOI] [PubMed] [Google Scholar]
- Wu Q, Cheng H-Z, Sun N, Ma P-C, Sun J, Yao H-B, Xie Y-M, Li Y-L, Meng S-L, Zhabagin M, et al. Phylogenetic analysis of the Y-chromosome haplogroup C2b-F1067, a dominant paternal lineage in Eastern Eurasia. J Hum Genet. 2020:65(10):823–829. 10.1038/s10038-020-0775-1. [DOI] [PubMed] [Google Scholar]
- Yang MA, Fan X, Sun B, Chen C, Lang J, Ko Y-C, Tsang C-h, Chiu H, Wang T, Bao Q, et al. Ancient DNA indicates human population shifts and admixture in northern and southern China. Science. 2020:369(6501):282–288. 10.1126/science.aba0909. [DOI] [PubMed] [Google Scholar]
- Zerjal T, Xue Y, Bertorelle G, Wells RS, Bao W, Zhu S, Qamar R, Ayub Q, Mohyuddin A, Fu S, et al. The genetic legacy of the Mongols. Am J Hum Genet. 2003:72(3):717–721. 10.1086/367774. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhabagin M, Wei L-H, Sabitov Z, Ma P-C, Sun J, Dyussenova Z, Balanovska E, Li H, Ramankulov Y. Ancient components and recent expansion in the Eurasian heartland: insights into the revised phylogeny of Y-chromosomes from Central Asia. Genes (Basel). 2022:13(10):1776. 10.3390/genes13101776. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhang X, Kampuansai J, Qi X, Yan S, Yang Z, Serey B, Sovannary T, Bunnath L, Aun HS, Samnom H, et al. An updated phylogeny of the human Y-chromosome lineage O2a-M95 with novel SNPs. PLoS One. 2014:9(6):e101020. 10.1371/journal.pone.0101020. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhang Y, Lei X, Chen H, Zhou H, Huang S. Ancient DNAs and the Neolithic Chinese super-grandfather Y haplotypes. bioRxiv. 2018:487918. 10.1101/487918. [DOI] [Google Scholar]
- Zhang P, Luo H, Li Y, Wang Y, Wang J, Zheng Y, Niu Y, Shi Y, Zhou H, Song T, et al. NyuWa Genome resource: a deep whole-genome sequencing-based variation profile and reference panel for the Chinese population. Cell Rep. 2021a:37(7):110017. 10.1016/j.celrep.2021.110017. [DOI] [PubMed] [Google Scholar]
- Zhang F, Ning C, Scott A, Fu Q, Bjørn R, Li W, Wei D, Wang W, Fan L, Abuduresule I, et al. The genomic origins of the Bronze Age Tarim Basin mummies. Nature. 2021b:599(7884):256–261. 10.1038/s41586-021-04052-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhang M, Yan S, Pan W, Jin L. Phylogenetic evidence for Sino-Tibetan origin in northern China in the Late Neolithic. Nature. 2019:569(7754):112–115. 10.1038/s41586-019-1153-z. [DOI] [PubMed] [Google Scholar]
- Zhao Y, Zhang Y, Li H, Cui Y, Zhu H, Zhou H. Ancient DNA evidence reveals that the Y chromosome haplogroup Q1a1 admixed into the Han Chinese 3,000 years ago. Am J Hum Biol. 2014:26(6):813–821. 10.1002/ajhb.22604. [DOI] [PubMed] [Google Scholar]