Abstract
Determining the genetic ancestry of an individual is challenging when poorly preserved or mixed samples permit only ultra-low coverage sequencing, at depths below 0.1 × at target loci. Leveraging recent advances in telomere-to-telomere sequencing of whole genomes with long reads, we develop a new k-mer based method, Y-mer, and show how information from hundreds of thousands of k-mers in distance-based models enables accurate inference of chrY haplogroup from whole-genome sequence at depths below 0.01 ×. We test the performance of Y-mer on ancient DNA and prenatal screening data, showing its potential for genetic ancestry inference in cell-free, forensic and ancient DNA research.
Supplementary Information
The online version contains supplementary material available at 10.1186/s13059-025-03714-3.
Keywords: aDNA, ulcWGS, NIPS, NIPT, K-mer, Human, chrY, Haplogroups prediction
Background
Owing to its high complexity, the human Y chromosome was the last chromosome of the human genome to be completely sequenced [1]. Approximately half of its length is composed of euchromatin, regions that recombine with the human X chromosome (PAR1 and PAR2), and regions that do not recombine with it, while the other half, including its centromere and the largest heterochromatin region (Yq12) in the entire human genome, remained uncharted in the human reference genome until the recent completion of the Y chromosome assembly [1]. A study by Skov et al. [2] of 62 Danish Y chromosomes sequenced to high coverage with a wide range of library insert sizes, and a study of Y chromosome assemblies of 7 major Y chromosome haplogroups by Esteller-Cucala [3], highlighted major gaps in the reference (GRCh38) sequence and uncovered a high degree of variation in the centromeric region as well as structural variants restricted to certain haplogroups. Furthermore, two recent studies that used high coverage long-read sequence data to determine the sequence of the Y chromosome from telomere to telomere revealed highly variable numbers of amplicon genes and up to a two-fold difference in the total length of the Y chromosome [1, 4]. The main source of Y chromosome length variation is the Yq12 region, which showed a high degree of length variation not only across but also within the two most represented haplogroups, E1 and O2, in a sample of 43 Y chromosomes that have been completely sequenced [4]. Certain repeat regions, e.g., the TSPY array, showed, however, length variation that was constrained by phylogeny [4]. To what extent differences in the size of the individual constituent blocks of Y chromosome sequence are phylogenetically informative remains to be systematically studied with larger sample sizes.
Human Y chromosome haplogroups are defined as branches or clades in a phylogenetic tree drawn on the basis of variation in non-recombining regions. The haplogroups are conventionally labeled alphanumerically, with the base haplogroup structure shown in Fig. 1 and a more detailed version with sub-clades presented in Additional file 1: Table S4. Y chromosome haplogroups are routinely determined in ancient DNA (aDNA) studies as they are indicative of male-specific gene flow and important for clarifying specific relationships among individuals whose relatedness has been identified from autosomal data. They are also relevant for individual identification in forensic practice. However, Y chromosome haplogroup determination from shotgun sequence data is challenging below 0.1 × and especially below 0.01 × coverage, because most branches in the Y phylogeny are defined by fewer than 10 SNVs and only a few by more than 100. Increasing sequence coverage via capture or additional shotgun sequencing may be impractical for poorly preserved samples because of high fragmentation, damage, or a low proportion of endogenous DNA.
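A rough calculation illustrates why so few branch-defining SNVs are a problem at these depths. The sketch below assumes that reads fall uniformly and independently along the chromosome, so that the depth at any site is Poisson-distributed with mean equal to the coverage. The function name and the example values are illustrative and are not taken from the study.

```python
import math

def prob_branch_observed(n_snvs: int, coverage: float) -> float:
    """Probability that at least one of n_snvs branch-defining sites is
    covered by at least one read (assumption: Poisson-distributed depth)."""
    p_site_covered = 1.0 - math.exp(-coverage)
    return 1.0 - (1.0 - p_site_covered) ** n_snvs

# Compare branches defined by 10 and by 100 SNVs at two coverages.
for n_snvs in (10, 100):
    for coverage in (0.1, 0.01):
        p = prob_branch_observed(n_snvs, coverage)
        print(f"{n_snvs} SNVs at {coverage} x: {p:.3f}")
```

Under these assumptions, a branch defined by 10 SNVs has less than a 10% chance of having even one of its sites covered at 0.01 ×.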
Fig. 1.
Schematic phylogeny of Y chromosome haplogroups. The counts of individuals in each haplogroup are presented below the tree for 1243 individuals from the 1000 Genomes Project (1000 GP), 240 from its European subset, and 1160 individuals from the Estonian Biobank (EstBB)
In this study, we explore the potential of predicting Y chromosome haplogroups from ultra-low coverage whole-genome sequence (ulcWGS) data with k-mers. For this purpose, we develop a new tool, Y-mer, which uses the counts of k-mers unique to the complete sequence of the human Y chromosome. We first determine the optimal number of k-mers required and the lower boundary of sequence coverage for the approach by downsampling the coverage of Y chromosomes of known haplogroups from the 1000 GP [5] and EstBB high coverage sequences [6]. We then develop lower and higher resolution models of haplogroup prediction with Y-mer for broad-purpose usage of the tool, with training sets of regional and global samples, and determine the prediction accuracy in relevant validation sets. Finally, we test the approach on ulcWGS ancient DNA from three regional studies and on non-invasive prenatal screening (NIPS) data from Estonia [7] and China [8].
Results
Distance-based models for predicting Y chromosome haplogroups in multi-population data
As our aim is to develop a k-mer based approach of determining Y chromosome haplogroups from ulcWGS data, we need to know the minimum number of k-mers and source Y chromosome assemblies needed for robust predictions of base haplogroups. Furthermore, as we build distance-based haplogroup prediction models (Fig. 2), where k-mer frequencies are compared between competing haplogroups, we are interested to know whether the approach that works accurately on base haplogroups can be scaled up to a higher number of sub-clades and whether haplogroup models developed on a training set from population A can be successfully applied on individuals from population B. We also want to know how robust such an approach is to noise and, in the case of ancient DNA, to contamination.
Fig. 2.

Y chromosome haplogroup prediction models generated from k-mer lists and their testing on a validation set, aDNA and NIPT data. The workflow presented here has three stages, and the names of the R scripts and GenomeTester4 commands used to generate k-mer lists and models are shown next to relevant tasks in italics: (A) The chrY sources used for the generation of k-mer lists are chosen from high coverage (> 20X) long- or short-read sequence data. 1 M, 21 M, 110 M, 213 M, and 222 M refer to sources of 1, 21, 110, 213, and 222 male genomes. In the case of 1 M, three distinctive regions of the Y chromosome (Additional file 1: Table S1) were used for k-mer selection. In the case of other sources, the whole Y chromosome was used. From each source, k-mer lists are generated with glistmaker. K-mers mapping also to female genomes (Additional file 1: Table S6) are identified with glistcompare and removed. (B) Three training sets with global and local haplogroup distributions (Additional file 1: Table S4) are defined for haplogroup prediction, and 10 K to 100 K haplogroup-specific k-mers are identified in each training set with MWS.R. Models with specific k-mer numbers and haplogroup choices are trained with the training sets with MODEL.R. (C) Model accuracy testing is performed with PREDICTER.R on a validation set V110, 1000 GP, HGDP, SGDP, aDNA, and NIPT data
Firstly, to explore the numbers of chrY k-mers and their source genomes needed to robustly predict chrY haplogroups from multiple populations across the globe, we generated and tested models based on hundreds of thousands to millions of k-mers. We created models using five different sets of high-coverage genomes as sources of chrY data (Fig. 2), the first with only a single chrY (HG002, Additional file 1: Table S1) T2T long-read assembly, the second with 21 T2T Y chromosome assemblies from different haplogroups (Hallast et al., 2023, Additional file 1: Table S2), and, finally, three sets of short-read data for 110–222 Y chromosomes from 1000 GP [5] and EstBB data [6]. Using the glistmaker of GenomeTester4 [9] we extracted lists of canonical k-mers specific to chrY, including up to 14 million 25-mers (see the “Material and methods” section for details, Additional file 1: Table S3). Next, we extracted from these lists sets of 10,000 to 100,000 k-mers for determining the minimum number of k-mers required. We trained haplogroup predictions on three different geographic ranges, W set representing 11 worldwide haplogroups, E set representing 22 haplogroups common across Europe, and NE set representative of 23 haplogroups common specifically in Northeast Europe (Fig. 2, Additional file 1: Table S4, S5). In each model, the haplogroup prediction was made with distance-based clustering by assigning individuals to the haplogroup with the smallest distance, considering all k-mers in the model (Additional file 2, output description in Additional file 3).
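The assignment step described above can be sketched as a nearest-profile classifier. This is a minimal illustration under stated assumptions rather than the Y-mer implementation: the text does not specify the distance metric or the normalization, so Euclidean distance on count proportions is assumed here, and the function names and the toy profiles are hypothetical.

```python
import math

def normalize(counts):
    """Scale raw k-mer counts to proportions so that sequencing depth cancels out."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)

def predict_haplogroup(sample_counts, reference_profiles):
    """Return (haplogroup, distance) for the reference profile closest to the sample.

    reference_profiles maps a haplogroup label to a vector of k-mer proportions
    (hypothetical layout). Euclidean distance is an assumption.
    """
    sample = normalize(sample_counts)
    distances = {
        haplogroup: math.dist(sample, profile)
        for haplogroup, profile in reference_profiles.items()
    }
    best = min(distances, key=distances.get)
    return best, distances[best]

# Toy example with four k-mers and two haplogroups.
profiles = {
    "N": normalize([40, 40, 10, 10]),
    "R": normalize([10, 10, 40, 40]),
}
print(predict_haplogroup([3, 5, 1, 0], profiles))
```

Normalizing to proportions is one simple way to make a sample with a handful of k-mer hits comparable to profiles built from high coverage genomes.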
Model validation
We tested the performance of haplogroup prediction models first in a validation set of 110 individuals (V110) selected from the 1000 GP, HGDP, and EstBB data, including only individuals that had not been used in model training (Additional file 1: Table S7). The haplogroup of each individual in the validation set had been determined from SNV data. The validation set included 10 individuals from each of the 11 basal haplogroups (AB, C, E, G, H, IJ, LT, N, O, Q, and R) chosen to represent human Y chromosome diversity at the global scale. Each individual in the validation set was downsampled to 10 different coverage values, within the range of 1 × to 0.00001 ×, to test the limits of the approach.
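As an illustration of the downsampling step, the sketch below keeps each read independently with probability equal to the ratio of target to current coverage. The text does not state which tool or procedure was used, so this is only one plausible approach, and the function names and example coverages are hypothetical.

```python
import random

def keep_fraction(current_coverage: float, target_coverage: float) -> float:
    """Fraction of reads to retain to go from current to target coverage."""
    if target_coverage > current_coverage:
        raise ValueError("target coverage exceeds current coverage")
    return target_coverage / current_coverage

def downsample(reads, fraction, seed=0):
    """Keep each read independently with probability `fraction`."""
    rng = random.Random(seed)
    return [read for read in reads if rng.random() < fraction]

# Example: a 30 x genome reduced to two illustrative target coverages.
for target in (1.0, 0.001):
    print(target, keep_fraction(30.0, target))
```

Because reads are kept independently, the realized coverage fluctuates around the target, which matters most at the very lowest depths.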
Firstly, to clarify how many different Y chromosome assemblies are required as sources of k-mer lists to accurately determine Y chromosome haplogroups from ulcWGS data, we compared the performance of M1W, M21W, and M110W models (Fig. 3, Additional file 1: Table S8, S9) on the validation set. We found that M21W and M110W models both offered high accuracy (> 0.95) from 1 × down to 0.001 × coverage, while using only a single Y chromosome assembly as a source performed poorly at all coverages, with a 0.65 accuracy at maximum for 1 × coverage. The M110W model showed higher accuracy than M21W between 0.0005 × and 0.005 × coverage, while at higher coverage ranges, the difference was not notable.
Fig. 3.
Haplogroup prediction accuracy at low coverage range. The three prediction models shown were based on k-mer selections from repeat regions of a single Y chromosome (1Y) and from multiple (21Y, 110Y) chrY assemblies. In all three models, 50,000 k-mers were used in haplogroup prediction on a validation set of 110 individuals representing 11 basal haplogroups of global distribution
Next, to determine how many k-mers are required to accurately predict chrY haplogroups, we compared the performance of models based on 10 K to 100 K k-mers selected from each haplogroup. In each tested model, we used the 21Y, the long-read assembly of 21 Y chromosomes [4], as the source for k-mer lists. We found that all models performed highly accurately (> 0.95) at coverage equal to or higher than 0.0005 × (Fig. 4). Models based on 10 K k-mers were significantly less accurate at coverage ranges below 0.0005 ×. While at the lowest coverage range, 0.00001 ×, the 20 K model also performed poorly, we observed no major differences in the performance of models based on 30 K or more k-mers on the coverage ranges tested.
Fig. 4.
The effect of selecting a different number of k-mers per haplogroup. Illustrated using the M21W model; accuracy of HG prediction for the V110 validation set dilutions
We tested the effect of contamination on the accuracy of haplogroup prediction with Y-mer on downsampled replicas of a Finnish individual from haplogroup N3. To this test sample, we added reads from another individual, from haplogroup R1a, while keeping the total number of reads constant at coverage 0.01 ×. We determined with Y-mer the haplogroups of such pools of individual reads, showing that Y-mer haplogroup prediction remains highly robust with contamination rates up to 30% (Fig. 5).
Fig. 5.
Effect of contamination on the accuracy of haplogroup prediction with Y-mer. The effect of contamination was estimated by adding reads from a single donor (HG03687, haplogroup R1a) by increments of 5% (shown on the x-axis) to down-sampled replicas of the recipient sample (HG00280, haplogroup N3) while keeping constant at 0.01 × the aggregate coverage. HG predictions with uncertain HG-s (p-value ≥ 0.05) are labelled as “no call”. Uncertain, but correct HG-s are indicated by light-green, and uncertain wrong HG predictions are indicated by pink color
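The read-mixing design of this test can be sketched as follows. The function name, the toy read pools, and the pool size are hypothetical. The only features taken from the text are that donor reads replace recipient reads in 5% increments while the total number of reads stays constant.

```python
import random

def mix_reads(recipient, donor, contamination, total, seed=0):
    """Draw `total` reads without replacement: a `contamination` share
    from the donor and the remainder from the recipient."""
    n_donor = round(total * contamination)
    n_recipient = total - n_donor
    rng = random.Random(seed)
    return rng.sample(recipient, n_recipient) + rng.sample(donor, n_donor)

# Toy read pools; each read is tagged with its source.
recipient_reads = [("recipient", i) for i in range(1000)]
donor_reads = [("donor", i) for i in range(1000)]

# 5% increments; the upper bound used here is arbitrary.
for step in range(0, 7):
    rate = step * 5 / 100
    pool = mix_reads(recipient_reads, donor_reads, rate, total=200)
    n_donor = sum(1 for source, _ in pool if source == "donor")
    print(f"{rate:.2f}: {n_donor} donor reads of {len(pool)}")
```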
Finally, we also tested the models trained on 1000 GP data (1210 samples, Additional file 1: Table S10) on individuals from the HGDP (492 samples, Additional file 1: Table S11) and SGDP (160 samples, Additional file 1: Table S12) data. Additional file 1: Table S13 summarizes the estimates of prediction accuracy across the 1000 GP, HGDP, and SGDP data, showing consistently high (> 0.95) accuracy for W and E models across different data sets. The somewhat lower accuracy (0.9–0.95) of the E model in HGDP is largely due to the misassignment of the R1b13-FT289648 and N3a2d-M1932 lineages, which are represented by multiple individuals in the HGDP data but are absent from the 1000 GP data and thus not represented in the models.
Model testing on ancient DNA data
Our tests of haplogroup prediction accuracy on validation set V110 included individuals from the same modern populations from which the training sets were formed. To explore the behavior of our models on datasets with haplogroup composition different from the modern training sets, we turned to aDNA data. We first tested the haplogroup prediction models on ancient genomic data of 91 male individuals from the Eurasian Steppe Belt, dated to 1500–4500 years BP and whole genome sequenced to a depth range of 0.029–8.7 × [10]. Beyond the temporal difference between our modern references and the Damgaard et al. data, none of the modern Steppe Belt or Central Asian populations were included in the data that we used for model training. For 47 individuals, basal Y chromosome haplogroups were reported by Damgaard et al. (2018), while a sizable proportion of 44 male samples had no haplogroup assignment, likely due to their low coverage. To generate a list of expected haplogroups for model testing, we called 256,463 binary haplogroup informative SNVs and determined the likeliest chrY haplogroups of all 91 Steppe Belt individuals in the Damgaard et al. data (Additional file 1: Table S14), following the approach described in Hui et al. (2024) [11]. All haplogroup assignments we made for the individuals with previous assignments matched, at the base haplogroup level, those made by Damgaard et al. (2018) [10]. A number of haplogroups detected in this ancient data set, such as C3, I3, N5, O6a, Q1c, Q1g, R1b13, and R1b16, are either extremely rare in modern data sets and/or not included in our training sets.
To test the performance of Y-mer on the phylogenetically diverse ancient Steppe Belt data, we predicted the chrY haplogroups for all 91 individuals using 6 different k-mer based models (Additional file 1: Table S14). At the most basal level of 11 haplogroups predicted from k-mer lists derived from 21 Y chromosome T2T assemblies (model M21W), we observed a high rate (94%) of matches between the haplogroup calls made from SNV and k-mer data (Table 1). In models trained to predict haplogroups at higher resolution (M21E, M213E, M21NE, M222NE), we observed an accuracy range of 72–86% with more mismatches, in particular in haplogroups not used in model training, such as C and O in the M21NE and M222NE models. In the case of model M222NE, all 78 individuals whose SNV-determined haplogroup was included in the model were predicted correctly. In the 13 cases where the haplogroup expected from SNV data was not used in our model training, the model prediction was in 6 cases a phylogenetically closely related lineage (such as I2 instead of I3 and R1b3 instead of R1b16) and in 7 cases haplogroup LT, which was phylogenetically more distant than the closest haplogroup in the training set. Instances where haplogroup predictions were incorrect included, notably, haplogroup LT assignments made with high confidence (p < 0.05), whereas only a small subset of wrong predictions were due to low confidence calls. As applying a p < 0.05 threshold improved the accuracy of haplogroup predictions only modestly (Table 1), we considered the consistency of haplogroup predictions made by different models. We found that for 45 individuals, the haplogroup assignments were consistent across all 6 models we examined, and these predictions were always correct. Notably, the correct predictions include the individual with the lowest coverage.
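A consistency check across models of different resolution can be sketched by collapsing each call to its base haplogroup and accepting it only if all models agree. How consistency was assessed in the study is not described in detail, so the collapsing rule below (first letter of the label, with A/B, I/J and L/T merged as in the 11 basal haplogroups of the W set) and the function names are assumptions.

```python
# Letters that are merged into joint basal haplogroups in the W set.
JOINT_BASE = {"A": "AB", "B": "AB", "I": "IJ", "J": "IJ", "L": "LT", "T": "LT"}

def to_base(label: str) -> str:
    """Collapse a haplogroup or sub-clade label to its basal haplogroup
    (assumption: the first letter identifies the basal clade)."""
    first = label[0]
    return JOINT_BASE.get(first, first)

def consensus_base(calls: dict):
    """Return the basal haplogroup shared by all models, or None if they disagree."""
    bases = {to_base(label) for label in calls.values()}
    return bases.pop() if len(bases) == 1 else None

# Hypothetical calls for two samples.
print(consensus_base({"M21W": "R", "M213E": "R1a", "M222NE": "R1b3"}))
print(consensus_base({"M21W": "IJ", "M213E": "I2", "M222NE": "LT"}))
```

Samples for which the function returns None would be flagged for closer inspection rather than discarded outright.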
Table 1.
Accuracy of haplogroup assignment models on three ancient DNA data sets (Additional file 1: Table S14–16)
| Model | Metric | Gretzinger, no threshold | Gretzinger, p < 0.05 | Damgaard, no threshold | Damgaard, p < 0.05 | Saag, no threshold | Saag, p < 0.05 |
|---|---|---|---|---|---|---|---|
| Haplogroup assignment models | | | | | | | |
| M21W | N | 252 | 180 | 91 | 81 | 23 | 17 |
| | Accuracy | 0.571 | 0.700 | 0.945 | 0.963 | 0.826 | 0.941 |
| M110W | N | 252 | 225 | 91 | 82 | 23 | 21 |
| | Accuracy | 0.900 | 0.970 | 0.835 | 0.866 | 0.913 | 0.952 |
| M21E | N | 250 | 225 | 70 | 57 | 23 | 17 |
| | Accuracy | 0.784 | 0.822 | 0.771 | 0.860 | 0.826 | 0.941 |
| M213E | N | 250 | 248 | 70 | 68 | 23 | 23 |
| | Accuracy | 0.992 | 0.992 | 0.929 | 0.941 | 1 | 1 |
| M21NE | N | 123 | 78 | 26 | 24 | 15 | 14 |
| | Accuracy | 0.537 | 0.692 | 1 | 1 | 0.333 | 0.357 |
| M222NE | N | 123 | 70 | 26 | 26 | 15 | 8 |
| | Accuracy | 0.423 | 0.529 | 1 | 1 | 0.600 | 0.625 |
| Sub-clade assignment models | | | | | | | |
| M80R1 | N | 75 | 48 | 13 | 11 | | |
| | Accuracy | 0.853 | 0.979 | 0.846 | 0.909 | | |
| M43I1 | N | 15 | 10 | | | | |
| | Accuracy | 0.800 | 1 | | | | |
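The two columns per data set in Table 1 correspond to accuracy computed over all calls and over calls that pass the p-value threshold. A minimal sketch of that calculation is given below. The record layout and the toy calls are hypothetical and are not data from the study.

```python
def accuracy(calls, threshold=None):
    """Return (N, accuracy) over calls, optionally keeping only p < threshold.

    calls is a list of (predicted, expected, p_value) tuples (hypothetical layout).
    """
    kept = [c for c in calls if threshold is None or c[2] < threshold]
    if not kept:
        return 0, float("nan")
    correct = sum(1 for predicted, expected, _ in kept if predicted == expected)
    return len(kept), correct / len(kept)

# Toy calls: one confident error, one uncertain but correct call.
toy_calls = [
    ("R", "R", 0.01),
    ("N", "N", 0.20),
    ("LT", "Q", 0.03),
    ("I", "I", 0.001),
]
print(accuracy(toy_calls))
print(accuracy(toy_calls, threshold=0.05))
```

As the toy data show, thresholding lowers N and can even lower accuracy when wrong calls are made with high confidence.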
To test the performance of the higher resolution models M21E, M213E, M21NE, and M222NE on data with haplogroup composition more similar to our training set, we turned to two additional ancient DNA data sets, one from Early Medieval Northwest Europe [12] and one from Bronze Age and Iron Age Estonia [13, 14]. These data sets include individuals separated from our training sets by time while still having a similar chrY haplogroup composition at sub-clade level (Additional file 1: Table S14–16). We found limited or no improvement compared to the Damgaard et al. data in the performance of the M21E, M21NE, and M222NE models (Table 1). However, the M213E model yielded high (> 0.98) accuracy of basal haplogroup assignments in both the Gretzinger et al. and Saag et al. data sets. As the M213E model differs from M222NE by its lack of resolution between recently diverged R1b and I1 sub-clades, it is notable that the accuracy of haplogroup predictions made after applying the M80R1 and M43I1 models separately to individuals assigned to the R1 and I1 clades, respectively, was considerably higher (> 0.8) than in the case of the M222NE model (< 0.53), in which these sub-clade assignments were made in the context of higher level haplogroup assignments of the global chrY phylogeny.
Model testing on non-invasive prenatal screening (NIPS) data
Another source of low coverage data for which Y chromosome haplogroup prediction with conventional SNV-based methods would be extremely challenging is non-invasive prenatal screening (NIPS) data, generated from circulating free DNA (cfDNA) in samples of maternal blood and widely used to identify chromosomal structural abnormalities at various stages of fetal development. Using NIPS data of pregnant individuals from China [8] and Estonia [7] and focusing on male fetus cases, where the fetal sex had been independently determined, we show that Y-mer is able to predict fetal haplogroups with high confidence from sequencing depths of 0.006–0.12 × in Estonian samples (182 samples, Additional file 1: Table S17) and 0.0009–0.009 × in Chinese samples (259 samples, Additional file 1: Table S18). While we do not have reference data for individual NIPS samples to estimate the accuracy of Y-mer, the haplogroup frequencies predicted in both the Chinese and Estonian cohorts match, for most haplogroups, the frequencies expected for China and Estonia from independently sampled cohorts (Additional file 1: Table S19). In the case of Chinese NIPS haplogroup predictions, we observe an overall good match between predicted and expected haplogroup frequencies, with a somewhat higher than expected frequency of C and a lower than expected frequency of O3 (O2a2 by ISOGG nomenclature), which could reflect differences between the NIPS and 1000 GP data in the recruitment of individuals from Southern and Northern China, where these haplogroups show opposing frequency gradients [15].
Discussion
Determining individual ancestry from traces of DNA derived from forensic context or from poorly preserved ancient human remains from archaeological context is typically challenging due to the low number of host DNA molecules in the sample, their fragmentation, and post-mortem damage. Because of the lack of recombination in the Y chromosome, its variation accumulates over time along a simple phylogenetic tree. The main branches of this tree can be robustly inferred even from low coverage data because of the hierarchic redundancy of the accumulated variation. Furthermore, higher Y chromosome regional differentiation observed among present-day populations, compared to mitochondrial and autosomal diversity [16], makes Y chromosome haplogroup prediction attractive for genetic ancestry inference. In this study we have shown that human Y chromosome haplogroups can be predicted accurately and efficiently from ulcWGS data using methods that determine the relative abundance of male-specific k-mers.
The accuracy, resolution, and robustness of haplogroup prediction, whether using the SNVs or the Y-mer approach described here, would depend, besides the quality and sequence coverage of the sample, on the size and diversity of the reference panel as well as the number of informative variants being used. Our tests with global and European training and validation sets showed that models using just a single T2T Y chromosome reference genome as the source of k-mer extraction performed poorly, with lower than 80% accuracy across the tested coverage range of down-sampled validation data. Models using k-mers extracted from 21, 110, and 213 different Y chromosome sources performed better, with accuracy higher than 95%, in coverage range > 0.001 × (Fig. 3). As we did not observe major differences in the performance of M21W versus M110W and M21E versus M213E models, we can conclude that a phylogenetically diverse panel of more than 20 Y chromosomes suffices as a k-mer source for basal haplogroup determination.
We showed that haplogroup predictions with Y-mer are robust to contamination in a model (M21E) trained with individuals from phylogenetically distinct haplogroups (Fig. 5). However, models trained with regionally specific sets of haplogroups (M21NE), including sub-clades of I1 and R1b that have separated only within the last 5000 years [16], performed less well even without the presence of contamination (Table 1). This drop of performance is likely caused by increasingly higher proportions of k-mers that overlap the k-mer lists extracted for the given sub-clades. This result highlights the need to use additional filters that remove k-mer overlaps in future developments of detailed sub-clade prediction models, or, with our current approach, the need to use both diverse and balanced training sets. Even though we did not see improvement between M21W and M110W models (Fig. 3) it is possible that with models that require higher haplogroup resolution, the number of Y chromosome sources from which k-mers are retrieved will also need to be adapted.
Our analyses revealed that for robust haplogroup prediction, the number of k-mers selected to distinguish each individual haplogroup included in the model has to be sufficiently high (> 10,000). Our comparisons of models with increasingly higher numbers of k-mers, however, showed no detectable increase in accuracy (Fig. 4) in models with more than 20,000 haplogroup-specific k-mers, suggesting that models with 20,000–50,000 k-mers are likely to offer the best balance between accuracy and computational efficiency.
When applying Y-mer on validation sets different in their haplogroup composition from the training set, we observed higher rates of mismatches, particularly in association with rare sub-clades. Predictions supported by models trained at different haplogroup levels appeared to be mostly correct, suggesting that applying multiple models on the same data can be helpful in distinguishing predictions that are robustly supported by multiple models from those that have lower confidence and are supported only by individual predictions. The necessity to apply multiple models on the same data was further illustrated in the case of the ancient DNA data from the Steppe Belt, where our basal haplogroup prediction model M21W made haplogroup calls at high accuracy (~ 95%), while predictions with region-specific haplogroup compositions, adapted on Europe or, more specifically, in the case of some models on Northeast European data, showed a higher number of mismatches, particularly in the case of haplogroups that were not included in the model. While we could see that haplogroup predictions that were supported by multiple models appeared mostly to be correct, this case study highlighted the need also for caution in choosing the models with appropriate haplogroup composition and the training sets for the future use of the Y-mer tool in ancient DNA studies. Preliminary insights into haplogroup composition of an ancient cohort through SNV analyses of higher coverage individual samples will be advisable to inform such models. Where such high coverage data is obtainable, it can substantially increase the accuracy of Y-mer in predicting haplogroups from lower coverage range of data. Human chrY 0.001 × sequencing depth appeared to be sufficient for accurate haplogroup prediction in our tests, where the validation set was down-sampled to lower coverage (Fig. 3).
An advisable strategy of haplogroup determination in a broader world-wide context would be the application of hierarchical inference strategy. Generic models that cover all major haplogroups, such as M21W, can be applied first to determine the primary haplogroup, and further sub-haplogroup specific models can then be applied for further inference. We have illustrated this approach with haplogroup R1 and I1 models. Analyses of two ancient DNA data sets from Europe (Saag et al., 2017, 2019; Gretzinger et al., 2022) revealed high accuracy of base haplogroup prediction with M213E model, which does not differentiate between recently diverged sub-clades of R1b and I1. The M222NE model that was designed to predict these sub-clades with a training set that combines them with the global range of haplogroup diversity performed less well (accuracy < 0.53) than the case where either the M110W or M213E model’s haplogroup R1 or I1 predictions were further resolved with subclade-specific models, which showed high (> 0.95) accuracy for high confidence (p < 0.05) calls. These results suggest that a two-stage strategy in haplogroup prediction, whereby the base haplogroup is called first and the sub-clade determination is performed separately, may be preferable over models that combine different levels of haplogroup diversity.
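The two-stage strategy can be sketched as a simple dispatch: call the base haplogroup first, then apply a sub-clade model only if one exists for that base. Models are represented here as plain callables, which is a hypothetical interface and not that of the Y-mer scripts. The stand-in models and their return values are for illustration only.

```python
def two_stage_call(sample, base_model, subclade_models):
    """Return (base haplogroup, sub-clade or None).

    base_model and the values of subclade_models are callables that take a
    sample and return a label (hypothetical interface).
    """
    base = base_model(sample)
    refine = subclade_models.get(base)
    subclade = refine(sample) if refine is not None else None
    return base, subclade

# Stand-in models that ignore the sample and return fixed labels.
subclade_models = {"R1": lambda sample: "R1b3"}
print(two_stage_call({}, lambda sample: "R1", subclade_models))
print(two_stage_call({}, lambda sample: "N", subclade_models))
```

Keeping the stages separate means the sub-clade model only has to distinguish closely related lineages and is never asked to compete with distant haplogroups.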
Our analyses of Chinese and Estonian NIPS data further confirmed the good performance of Y-mer at low coverage range, 0.001–0.12 ×, of the male fetus Y chromosome data as we obtained haplogroup frequency profiles similar to those expected from relevant reference data. While these results show that the most common haplogroups in Europe and Asia can be predicted with sufficient accuracy, the drop of accuracy we observe with models that entail haplogroup distinction at sub-clade level further emphasizes the limitation of our currently described approach for purposes that require both high confidence and resolution, such as the determination of genetic relatedness. However, in cases where genetic relatedness has been determined independently, e.g., via identity-by-descent or identity-by-state methods [17, 18], Y-mer analyses can be useful for testing (ruling out) the plausibility of patrilineal relatedness, even when acknowledging that a match at a generic Y chromosome haplogroup level cannot constitute a proof of patrilineal relationship.
In conclusion, we present a new k-mer based tool, Y-mer, for predicting Y chromosome haplogroups. We show that Y-mer is able to accurately predict basal chrY haplogroups from ultra-low (< 0.1 ×) coverage data. As such, it is an approach that can be useful in situations where basic, low resolution information about individual ancestry is required while higher coverage sequencing of the samples is either not possible or not practical, e.g., due to costs or unavailability of sufficient quantities of the sample. For this purpose, we provide the tool and the models we have already tested, along with guidance for the development of new, more specific models, at https://github.com/bioinfo-ut/Y-mer/ [19]. For ancient DNA studies or forensic case analyses, this approach can potentially make more individual samples available for Y chromosomal ancestry analyses, which can be more informative when high coverage/quality data is already available for a subset of the samples. We show that Y-mer performs more accurately when its models are trained on data that match the haplogroup composition of the target group, which highlights the need for a tailored approach in cases where detailed sub-haplogroup level distinctions are required. Besides its possible uses for Y chromosome data, the k-mer based approach described here is potentially extendable also to ancestry analyses of the autosomal genome. Considering the high rate of genetic variation detected in centromeric regions assembled with long-read sequences and the low rate of recombination in the pericentromeric regions [20], the study of k-mers from (peri-)centromeric haplotypes may offer new prospects for autosomal ancestry scanning from low coverage data, similarly, though not identically, to the Y chromosome analyses described here, considering the differences in inheritance of autosomal and Y chromosome DNA.
Alternatively, genome-wide scans of population-specific peaks of autosomal k-mer abundance could be screened and used in ancestry mapping of low coverage data. Development of such tools would require larger ancestry diverse reference panels such as graph-based pangenomes that are currently being developed and likely to become available in the near future.
Methods
Human Y chromosome nomenclatures
Most previous population genetic studies have based ancestry analyses and Y chromosome haplogroup determination solely on the X-degenerate regions of the Y chromosome, which capture less than one fifth of its length and are deemed amenable to short-read sequence mapping [16, 21–26]. Human Y chromosome haplogroups are defined by the combined presence of unique sets of allele variants, typically SNVs, that co-occur in individuals who share patrilineal ancestry, to the exclusion of other individuals examined. Numerous attempts have been made to update the Y chromosome haplogroup nomenclature since its establishment on the basis of 245 markers in a globally representative sample [27]. This alphanumeric nomenclature system continued to be updated from 2005 to 2020 by a large group of citizen scientists (https://isogg.org/tree/). As a result of ever-increasing volumes of sequence data, some updated ISOGG haplogroup labels exceed 20 characters, while many sub-clades of the Y phylogeny remain poorly labeled. To find shorter and more stable alternatives, van Oven et al. [28] proposed a minimal reference tree, and Karmin et al. [16] proposed a shorter, time-depth constrained haplogroup labeling system. The latter system is used when referring to haplogroup names in this study. In cases where data sources have used parallel systems, such as ISOGG-based labeling, these are explicitly referred to for clarity.
Selecting chrY haplogroups to model (Table S4)
In order to explore the potential of using k-mer based metrics to capture Y chromosome variation at different evolutionary time depths, we dissected the single nucleotide variant (SNV)-based Y chromosome phylogeny of the 1000 GP data [24] at different levels of depth (Additional file 1: Table S4). Firstly, at the broader global level (level 1, “WORLD” or “W”), we distinguish 11 branches older than 20 thousand years that are representative of variation across the world. Next, at level 2, we focus on 22 branches that are informative of regional differences across Europe (“EUROPE”, “E”). At level 3, our focus on 23 branches includes sub-clades that are younger than 5000 years, and these clades are particularly informative for the study of Y chromosome variation in Northeast Europe (“NE_EUROPE”, “NE”). We assembled three lists of individuals, 110Y, 213Y, and 222Y, representing these three layers of phylogeny, respectively, for k-mer based model training and testing, including WGS data from 1000 GP [5], EstBB [6], and HGDP [29]. In model training, we used high coverage genomes, while in testing, we used low coverage data.
Selecting chrY specific k-mers for the haplogroup prediction models
We used a fixed k-mer length of 25 in this study, following the rationale and results of our earlier study [30]. Briefly, k-mers shorter than 20 were associated with a substantial loss of informative k-mers due to their low specificity in the human genome. On the other hand, the number of informative k-mers was observed to reach a plateau at k-mer length 24. Shorter k-mers are generally preferred because they are less likely to be disrupted by somatic mutations or sequencing errors, which would make the k-mer undetectable, and because fewer k-mers can be obtained from a sequencing read as k-mer length increases. Considering this, we have selected the length 25 in several previously published studies [31–34] as well as for this study.
The first selection (1Y, HG002) contains all k-mers in the DYZ1, DYZ2, DYZ3, and DYZ19 regions that are not present on other chromosomes in the CHM13v2.0 assembly, a total of 618,393 k-mers (Additional file 1: Table S1).
The second selection (21Y) is based on 21 Y chromosome assemblies of different haplogroups from Hallast et al. [4]. All chrY sequences of the 21 assemblies were merged, and a list of all possible 25-mers was extracted from the merged sequences. From this list, all 25-mers that occurred 6 or more times in a k-mer list of female WGSs (15 from 1000 GP and 15 from EstBB, Additional file 1: Table S6) were excluded. The k-mer lists were compared with the glistcompare function of the GenomeTester4 package [9]. In the end, 12,003,307 k-mers were retained for further analyses (Additional file 1: Table S3).
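The female-list exclusion step can be illustrated with a minimal Python sketch. It is not the GenomeTester4 implementation: the function names (kmers, filter_against_female) and the in-memory dictionary of female k-mer counts are hypothetical, and only the threshold (exclude k-mers seen 6 or more times in the female list) is taken from the text.

```python
K = 25
FEMALE_MAX_COUNT = 5  # k-mers seen 6 or more times in the female list are excluded


def kmers(seq, k=K):
    """Yield every k-length substring of seq, skipping windows with non-ACGT bases."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield kmer


def filter_against_female(chry_seqs, female_counts, k=K, max_count=FEMALE_MAX_COUNT):
    """Return chrY k-mers whose count in the female list is at most max_count.

    chry_seqs: iterable of chrY sequences (hypothetical stand-in for the merged assemblies)
    female_counts: dict mapping k-mer -> count in the female k-mer list (hypothetical input)
    """
    candidates = set()
    for seq in chry_seqs:
        candidates.update(kmers(seq, k))
    return {km for km in candidates if female_counts.get(km, 0) <= max_count}
```

For real data the same logic would operate on binary k-mer lists rather than Python sets, since tens of millions of 25-mers are involved.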
Lastly, the third selection (110W, 213E, 222NE) is based on unassembled short-read sequencing data from 110, 213, and 222 individuals (Additional file 1: Table S5) from across the world, including 1000 GP data [5] and EstBB high coverage whole genome sequence data [6]. For each model, the union list of different haplogroup intersection lists was merged after the exclusion of k-mers found in the female list (Additional file 1: Table S6). In total, approximately 14 million k-mers were retained for further analyses.
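One possible reading of this step, sketched in Python under stated assumptions: k-mers are first intersected across the individuals of each haplogroup, the per-haplogroup lists are then merged by union, and k-mers from the female list are removed. The order of operations, the function name haplogroup_union, and the input structures are assumptions made for illustration and are not confirmed by the text.

```python
def haplogroup_union(kmers_by_individual, haplogroup_of, female_kmers):
    """Union over haplogroups of the k-mers shared by all individuals of a haplogroup.

    kmers_by_individual: dict individual -> set of k-mers (hypothetical input)
    haplogroup_of: dict individual -> haplogroup label (hypothetical input)
    female_kmers: set of k-mers to exclude
    """
    per_haplogroup = {}
    for individual, kmer_set in kmers_by_individual.items():
        hg = haplogroup_of[individual]
        filtered = set(kmer_set) - female_kmers
        if hg not in per_haplogroup:
            per_haplogroup[hg] = filtered      # first individual of this haplogroup
        else:
            per_haplogroup[hg] &= filtered     # keep only k-mers shared so far
    union = set()
    for shared in per_haplogroup.values():
        union |= shared
    return union
```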
Canonical k-mer manipulations
GenomeTester4 allows the user to create k-mer lists from fasta and fastq files with its glistmaker function. The k-mer lists can be compared with the glistcompare function, and the frequency of each k-mer can be obtained with the glistquery function.
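As an illustration of what a canonical k-mer list contains, the following Python sketch counts k-mers so that a sequence and its reverse complement are tallied under one key. It assumes the common convention that the canonical form is the lexicographically smaller of the two strands; the functions canonical and count_canonical_kmers are hypothetical helpers and do not reproduce the glistmaker or glistquery interfaces.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_canonical_kmers(reads, k=25):
    """Count canonical k-mers over an iterable of read sequences."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip windows containing N or other symbols
                counts[canonical(kmer)] += 1
    return counts
```

Counting canonical forms means that reads from either strand contribute to the same k-mer frequency, which matters when reads are not mapped to a reference.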
Selecting haplogroup-specific k-mers for models applied on three training sets
In each haplogroup prediction model, based on different chrY data sources and k-mer lists (Fig. 2), we used as a training set one of three lists of individuals: the WORLD training set, corresponding to the 110Y set of Y chromosomes from chrY data sources; the EUROPE training set, corresponding to 213Y; and the NE_EUROPE training set, corresponding to 222Y. For each model, we queried Y chromosome-specific k-mer frequencies in all individuals (Additional file 1: Table S5) in a given training set and normalized them by the median chrY sequencing depth of each individual, estimated using 5,071,089 k-mers extracted from GRCh38 [35].
As the next step, we determined for each haplogroup the distinctive k-mers that were either more or less frequent in individuals of the given haplogroup compared to the pool of individuals from all other haplogroups for that model. The Mann–Whitney test, implemented in the MWS.R script, was used to determine the 10,000 k-mers with the highest and the 10,000 k-mers with the lowest test statistic values for the given haplogroup. We excluded k-mers used in sequence depth estimation and explored models with up to 100,000 k-mers per haplogroup (Additional file 1: Table S3).
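The depth normalization can be sketched as follows. This is an illustrative Python fragment, not the authors' R code: normalize_counts is a hypothetical function, and it assumes that the per-individual depth is taken as the median count over the depth-estimation k-mers and that each chrY k-mer count is divided by that median.

```python
from statistics import median


def normalize_counts(target_counts, depth_kmer_counts):
    """Divide each k-mer count by the individual's median depth.

    target_counts: dict k-mer -> raw count in one individual (hypothetical input)
    depth_kmer_counts: list of raw counts of the depth-estimation k-mers
    """
    depth = median(depth_kmer_counts)
    if depth <= 0:
        raise ValueError("median depth must be positive to normalize counts")
    return {kmer: count / depth for kmer, count in target_counts.items()}
```

After normalization, a value near 1 indicates a k-mer present at roughly single-copy chrY depth, which makes frequencies comparable across individuals sequenced to different depths.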
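A minimal Python sketch of this selection step is given below. It is not a port of MWS.R: the function names, the input layout, and the use of the raw U statistic as the ranking value are assumptions made for illustration.

```python
def mann_whitney_u(x, y):
    """Mann-Whitney U statistic of sample x versus y, using average ranks for ties."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i, n = 0, len(pooled)
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_x += average_rank * sum(1 for t in range(i, j) if pooled[t][1] == 0)
        i = j
    n_x = len(x)
    return rank_sum_x - n_x * (n_x + 1) / 2.0


def select_distinctive(freqs, in_group, out_group, n_each):
    """Return (highest, lowest): k-mers with the n_each largest and smallest U values.

    freqs: dict k-mer -> dict individual -> normalized frequency (hypothetical input)
    in_group: individuals of the haplogroup; out_group: all other individuals
    """
    scored = []
    for kmer, per_individual in freqs.items():
        x = [per_individual[i] for i in in_group]
        y = [per_individual[i] for i in out_group]
        scored.append((mann_whitney_u(x, y), kmer))
    scored.sort()
    lowest = [kmer for _, kmer in scored[:n_each]]
    highest = [kmer for _, kmer in scored[max(0, len(scored) - n_each):]]
    return highest, lowest
```

With n_each set to 10,000 this mirrors the counts given in the text; if fewer than 2 × n_each k-mers are scored, the two returned lists overlap.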
Validation set V110
To test the accuracy of the haplogroup prediction models, we generated a validation set V110 composed of 110 individuals chosen from high and low coverage data sources. Only individuals that had not been used in the training set were considered for inclusion in V110. The V110 set (Additional file 1: Table S7) mirrors in its haplogroup composition the global training set 110W (Additional file 1: Table S4): each of the eleven basal haplogroups is represented by 10 individuals. The V110 set is predominantly composed of individuals from the 1000 GP data [5]. As haplogroup AB samples have limited representation, we added AB individuals from the Human Genome Diversity Panel (HGDP) data [29]. Similarly, to increase the representation of haplogroup N variation, individuals from the EstBB dataset [6] were added.
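The composition constraint on V110 (a fixed number of individuals per haplogroup, none shared with the training set) can be expressed as a short Python sketch. The text states only that individuals were "chosen"; the seeded random sampling used here, the function name build_validation_set, and the input format are assumptions for illustration.

```python
import random


def build_validation_set(candidates, training_ids, per_haplogroup=10, seed=0):
    """Pick per_haplogroup individuals for each haplogroup, excluding training individuals.

    candidates: iterable of (individual_id, haplogroup) pairs (hypothetical input)
    training_ids: set of individual ids already used for training
    """
    rng = random.Random(seed)
    by_haplogroup = {}
    for individual, haplogroup in candidates:
        if individual not in training_ids:
            by_haplogroup.setdefault(haplogroup, []).append(individual)
    selected = {}
    for haplogroup, individuals in sorted(by_haplogroup.items()):
        if len(individuals) < per_haplogroup:
            raise ValueError(f"not enough unused individuals for haplogroup {haplogroup}")
        selected[haplogroup] = rng.sample(sorted(individuals), per_haplogroup)
    return selected
```

The error raised for under-represented haplogroups corresponds to the situation described in the text, where additional data sources had to be added for haplogroups AB and N.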
Supplementary Information
Additional file 1: Tables S1-S19. Contains key data for every step of the current work: haplogroup (HG) and sample selection, k-mer selection, and HG prediction accuracy summaries for the datasets, with the corresponding data.
Additional file 2: Statistical model of k-mer counts. Describes the basics of creating a distance-based model and calling HGs with the model.
Additional file 3: Describes the PREDICTER.R output in detail.
Acknowledgements
We thank Lehti Saag for sharing various Estonian aDNA samples, Richard Villems for discussions on human genomes, Mart Kals for valuable advice on data presentation, and Reidar Andreson for setting up the web-based tool for Y-mer. Data analysis was performed using the facilities of the High-Performance Computing Center at the University of Tartu.
Peer review information
Andrew Cosgrove was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team. The peer-review history is available in the online version of this article.
Authors’ contributions
T.P. and T.K. initiated the project. T.P., T.K., L.K., A.S., and M.R. designed the project until the final concept. K.K. prepared the Estonian NIPS samples. M.M. wrote the R scripts and built the statistical framework. K.M. named the project. T.P. and K.M. managed and processed the data. T.K., T.P., and M.M. wrote the first draft of the manuscript. All authors contributed to editing the manuscript. All authors read and approved the final manuscript. T.K. and M.R. coordinated and equally funded the study, sharing last authorship.
Funding
This work was funded by the Fonds Wetenschappelijk Onderzoek Grants G0A4521N (T.K.), G050822N (T.K.), and S003422N (T.K.); the Estonian Research Council Grant PRG1076 (A.S., K.K.) and PRG2706 (T.P., M.R.); the Horizon Europe NESTOR project Grant 101120075 (A.S., K.K.); and partly by the Estonian Business and Innovation Agency Grant RE.5.04.23–0214 (SCANS) (T.P., L.K., M.R.), as well as by Estonian Research Council Grant TEM-TA35, 2021–2027.1.01.24–0627 (T.P., L.K., M.R.).
Data availability
Y-mer depends on the GenomeTester4 package, R, and WGS data. The workflow outlining the steps involving the R scripts (MWS.R, MODEL.R, DILUTER.R, and PREDICTER.R) is presented in Fig. 2. These R scripts, along with relevant documentation on their input and output formats, are available at https://github.com/bioinfo-ut/Y-mer/ [19]. Running MWS.R and MODEL.R requires 35 GB of storage space per individual. Ready-to-use model files, including relevant k-mer dictionaries, PREDICTER.R, and instructions, are available at https://bioinfo.ut.ee/randomtandem/mudelid/ and/or 10.5281/zenodo.15089783 [36]. The web-based tool for testing Y-mer’s models is available at https://bioinfo.ut.ee/Y-mer. The Y-mer source code is published under the GNU General Public License v3.0.
Declarations
Ethics approval and consent to participate
No ethical approvals were required as the data were publicly available.
Competing interests
A.S. and K.K. are board members of Celvia CC, which also offers NIPT testing for commercial purposes. The other authors have no competing interests.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Rhie A, et al. The complete sequence of a human Y chromosome. Nature. 2023;621(7978):344–54. 10.1038/s41586-023-06457-y.
2. Skov L, Schierup MH. Analysis of 62 hybrid assembled human Y chromosomes exposes rapid structural changes and high rates of gene conversion. PLoS Genet. 2017;13(8):e1006834. 10.1371/journal.pgen.1006834.
3. Esteller-Cucala P, et al. Y chromosome sequence and epigenomic reconstruction across human populations. Commun Biol. 2023;6(1):623. 10.1038/s42003-023-05004-9.
4. Hallast P, et al. Assembly of 43 human Y chromosomes reveals extensive complexity and variation. Nature. 2023;621(7978):355–64. 10.1038/s41586-023-06425-6.
5. Auton A, et al. A global reference for human genetic variation. Nature. 2015;526(7571):68–74. 10.1038/nature15393.
6. Milani L, et al. The Estonian Biobank’s journey from biobanking to personalized medicine. Nat Commun. 2025;16(1):3270. 10.1038/s41467-025-58465-3.
7. Žilina O, et al. Creating basis for introducing non-invasive prenatal testing in the Estonian public health setting. Prenat Diagn. 2019;39(13):1262–8. 10.1002/pd.5578.
8. Xu H, et al. Informative priors on fetal fraction increase power of the noninvasive prenatal screen. Genet Med. 2018;20(8):817–24. 10.1038/gim.2017.186.
9. Kaplinski L, Lepamets M, Remm M. GenomeTester4: a toolkit for performing basic set operations - union, intersection and complement on k-mer lists. Gigascience. 2015;4(1):58. 10.1186/s13742-015-0097-y.
10. Damgaard PDB, et al. 137 ancient human genomes from across the Eurasian steppes. Nature. 2018;557(7705):369–74. 10.1038/s41586-018-0094-2.
11. Hui R, et al. Genetic history of Cambridgeshire before and after the Black Death. Sci Adv. 2024;10(3):eadi5903. 10.1126/sciadv.adi5903.
12. Gretzinger J, et al. The Anglo-Saxon migration and the formation of the early English gene pool. Nature. 2022;610(7930):112–9. 10.1038/s41586-022-05247-2.
13. Saag L, et al. Extensive farming in Estonia started through a sex-biased migration from the Steppe. Curr Biol. 2017;27(14):2185-2193.e6. 10.1016/j.cub.2017.06.022.
14. Saag L, et al. The arrival of Siberian ancestry connecting the Eastern Baltic to Uralic Speakers further East. Curr Biol. 2019;29(10):1701-1711.e16. 10.1016/j.cub.2019.04.026.
15. Li J, Song F, Lang M, Xie M. Comprehensive insights into the genetic background of Chinese populations using Y chromosome markers. R Soc Open Sci. 2023;10(9):230814. 10.1098/rsos.230814.
16. Karmin M, et al. A recent bottleneck of Y chromosome diversity coincides with a global change in culture. Genome Res. 2015;25(4):459–66. 10.1101/gr.186684.114.
17. Monroy Kuhn JM, Jakobsson M, Günther T. Estimating genetic kin relationships in prehistoric populations. PLoS One. 2018;13(4):e0195491. 10.1371/journal.pone.0195491.
18. Popli D, Peyrégne S, Peter BM. KIN: a method to infer relatedness from low-coverage ancient DNA. Genome Biol. 2023;24(1):10. 10.1186/s13059-023-02847-7.
19. Puurand T, et al. Y-mer GitHub. 2025. 10.5281/ZENODO.15754017.
20. Logsdon GA, et al. The variation and evolution of complete human centromeres. Nature. 2024;629(8010):136–45. 10.1038/s41586-024-07278-3.
21. Francalacci P, et al. Low-pass DNA sequencing of 1200 Sardinians reconstructs European Y-chromosome phylogeny. Science. 2013;341(6145):565–9. 10.1126/science.1237947.
22. Mendez FL, et al. An African American paternal lineage adds an extremely ancient root to the human Y chromosome phylogenetic tree. Am J Hum Genet. 2013;92(3):454–9. 10.1016/j.ajhg.2013.02.002.
23. Poznik GD, et al. Sequencing Y chromosomes resolves discrepancy in time to common ancestor of males versus females. Science. 2013;341(6145):562–5. 10.1126/science.1237619.
24. Poznik GD, et al. Punctuated bursts in human male demography inferred from 1,244 worldwide Y-chromosome sequences. Nat Genet. 2016;48(6):593–9. 10.1038/ng.3559.
25. Wei W, Ayub Q, Xue Y, Tyler-Smith C. A comparison of Y-chromosomal lineage dating using either resequencing or Y-SNP plus Y-STR genotyping. Forensic Sci Int Genet. 2013;7(6):568–72. 10.1016/j.fsigen.2013.03.014.
26. Hallast P, et al. The Y-chromosome tree bursts into leaf: 13,000 high-confidence SNPs covering the majority of known clades. Mol Biol Evol. 2015;32(3):661–73. 10.1093/molbev/msu327.
27. The Y Chromosome Consortium. A nomenclature system for the tree of human Y-chromosomal binary haplogroups. Genome Res. 2002;12(2):339–48. 10.1101/gr.217602.
28. van Oven M, Van Geystelen A, Kayser M, Decorte R, Larmuseau MH. Seeing the wood for the trees: a minimal reference phylogeny for the human Y chromosome. Hum Mutat. 2013;35(2):187–91. 10.1002/humu.22468.
29. Bergström A, et al. Insights into human genetic variation and population history from 929 diverse genomes. Science. 2020;367(6484):eaay5012. 10.1126/science.aay5012.
30. Pajuste FD, Kaplinski L, Möls M, Puurand T, Lepamets M, Remm M. FastGT: an alignment-free method for calling common SNVs directly from raw sequencing reads. Sci Rep. 2017;7(1):2537. 10.1038/s41598-017-02487-5.
31. Puurand T, Kukuškina V, Pajuste FD, Remm M. AluMine: alignment-free method for the discovery of polymorphic Alu element insertions. Mob DNA. 2019;10(1):31. 10.1186/s13100-019-0174-3.
32. Kaplinski L, Möls M, Puurand T, Pajuste F, Remm M. KATK: fast genotyping of rare variants directly from unmapped sequencing reads. Hum Mutat. 2021;42(6):777–86. 10.1002/humu.24197.
33. Kaplinski L, Möls M, Puurand T, Remm M. DOCEST—fast and accurate estimator of human NGS sequencing depth and error rate. Bioinforma Adv. 2023;3(1):vbad084. 10.1093/bioadv/vbad084.
34. Pajuste F-D, Remm M. GeneToCN: an alignment-free method for gene copy number estimation directly from next-generation sequencing reads. Sci Rep. 2023;13(1):17765. 10.1038/s41598-023-44636-z.
35. Sauk M, et al. NIPTmer: rapid k-mer-based software package for detection of fetal aneuploidies. Sci Rep. 2018;8(1):5616. 10.1038/s41598-018-23589-8.
36. Puurand T, et al. Y-mer models. Zenodo. 2025. 10.5281/ZENODO.15089783.