Abstract
Background
Accurate variant calling is essential for genomic studies but is highly dependent on sequence alignment (SA) quality. In non-human primates, the lack of well-curated variant resources limits alignment postprocessing, leading to suboptimal SA and increased miscalls. DeepVariant, a leading variant caller, demonstrates high accuracy in human genomes but exhibits performance degradation under suboptimal SA conditions.
Results
To address this, we developed a decision tree-based refinement model that integrates alignment quality metrics and DeepVariant confidence scores to filter miscalls effectively. We defined suboptimal SA and optimal SA based on the presence or absence of postprocessing steps and confirmed that suboptimal SA significantly increases miscalls in both human and rhesus macaque genomes. Applying the refinement model to human suboptimal SA reduced the miscalling rate (MR) by 52.54%, demonstrating its effectiveness. When applied to rhesus macaque genomes, the model achieved a 76.20% MR reduction, showing its potential for non-human primate studies. Alternative base ratio (ABR) analysis further revealed that the model refines homozygous SNVs more effectively than heterozygous SNVs, improving variant classification reliability.
Conclusions
Our refinement model significantly improves variant calling in suboptimal SA conditions, which is particularly beneficial for non-human primate studies where alignment postprocessing is often limited. We packaged our model into the Genome Variant Refinement Pipeline (GVRP), providing an accessible tool for researchers working on population genetics and molecular evolution. This work establishes a framework for enhancing variant calling accuracy in species with limited genomic resources.
Supplementary Information
The online version contains supplementary material available at 10.1186/s12864-025-11921-2.
Keywords: Genome variants, Non-Human Primates genome, Variant refinement pipeline, Machine learning
Background
Genomic variants, such as single nucleotide variants (SNVs) and small insertions and deletions (Indels), are critical for understanding genetic diversity and the mechanisms underlying various biological and pathological processes [1–3]. Accurate identification of these variants, a process known as variant calling, forms the foundation of modern genomic studies. Variant calling involves aligning sequencing reads from a target genome to a reference genome to identify positions where the sequences differ, allowing researchers to pinpoint genetic variations [4]. These variant identifications play essential roles in diverse applications, including predicting drug responses, understanding disease susceptibility, and exploring evolutionary relationships. For instance, SNVs are widely utilized in personalized medicine to tailor treatments based on an individual’s genetic profile [5, 6], while Indels can cause frameshift mutations or impact regulatory regions, influencing gene expression [7, 8]. Given their importance, ensuring the reliability and precision of variant calls is paramount to advancing both basic and applied genomic research.
Despite its importance, variant calling poses significant challenges due to the inherently skewed distribution of variants within sequencing data. Variants typically constitute only a tiny fraction of the overall sequence, making it difficult for classifiers to distinguish true variants from the overwhelming majority of non-variant regions [9, 10]. This imbalance mirrors a common issue in classification tasks, where models must operate under imbalanced data distributions. To improve sensitivity, classifiers often adopt an aggressive approach to identifying positive cases. However, this heightened sensitivity inevitably increases the likelihood of false positives (FPs), where non-target instances are mistakenly identified as positive cases [11, 12]. In the context of variant calling, such errors manifest as non-variant sites being misclassified as variants, further complicating the interpretation of genomic data. Addressing these issues is critical to improving the accuracy and reliability of variant calling in genomic studies.
In recent years, variant calling methodologies have seen remarkable advancements, transitioning from traditional statistical approaches to machine learning and deep learning techniques. Machine learning-based methods, such as SNooPer [13], RFcaller [14], and VEF [15], utilize features derived from alignment data, including mapping scores, read depth, and base letters of aligned reads, to improve the accuracy of variant detection. These models employ ensemble tree-based algorithms [16–18] to reduce false positives (FPs), significantly enhancing the reliability of variant calls. Building on these advancements, deep learning approaches have further expanded the potential of variant calling [19, 20]. Among them, deep learning-based DeepVariant [21] stands out as a state-of-the-art tool, consistently outperforming traditional methods and machine learning-based models [22–24]. DeepVariant employs a convolutional neural network (CNN) [25] to analyze sequencing data, transforming alignment features into image-like tensors. This innovative approach enables the model to capture complex patterns within sequencing data, achieving superior accuracy in variant detection.
However, despite these advancements, the quality of alignment remains a critical factor influencing the performance of both machine learning and deep learning-based variant callers. Since these models rely heavily on alignment-dependent features, such as mapping quality and sequence context, their effectiveness is inherently limited when alignment corrections are incomplete or omitted [26, 27]. This limitation is particularly pronounced in non-human primates, where alignment quality is often compromised due to the scarcity of curated reference data [28, 29] and the omission of alignment postprocessing steps. Non-human primates, such as Rhesus macaques and chimpanzees, share significant genetic similarities with humans, making them invaluable models for studying human biology, complex diseases, and drug responses [30, 31]. Their unique evolutionary position allows researchers to investigate genetic variations that offer insights into human-specific traits and vulnerabilities. However, the lack of robust genomic resources for these species poses significant challenges in accurately analyzing their genetic data. Addressing these challenges requires novel approaches to improve variant calling accuracy under such suboptimal conditions.
In this paper, we propose a refinement model for variant analysis in non-human primate genomes. The primary objective of our model is to address the limitations of existing tools in non-human genomic contexts by improving the accuracy of variant calls, particularly under suboptimal alignment conditions. These conditions, which reflect real-world scenarios with incomplete or inaccurate alignment corrections, lead to an increased rate of FP variant calls, even when using state-of-the-art variant callers like DeepVariant. To overcome this limitation, our approach introduces a lightweight refinement model that effectively filters FP variants, enhancing the reliability of variant calls without relying on alignment corrections that may be inaccessible or omitted due to data limitations. The model integrates DeepVariant likelihood predictions with supplementary alignment features, such as read depth, soft clipping ratio, and low mapping quality read ratio, to refine variant calls. These additional features enable the model to address miscalled variants caused by insufficient alignment correction, enhancing variant calling performance under suboptimal alignment conditions. Using this framework, the refinement model demonstrated the ability to significantly reduce the FP rate of variant calls. Remarkably, the refined results not only matched but, in some cases, exceeded the performance of DeepVariant applied to fully corrected alignments.
We employed the Light Gradient Boosting Model (LGBM) [32], a powerful gradient boosting framework that builds an ensemble of decision trees, as the foundation for our refinement model. LGBM sequentially trains trees, assigning higher weights to samples misclassified by previous iterations. This approach enables the model to capture complex patterns in the data [33], making it particularly effective for filtering FP variants. Additionally, LGBM's inclusion of a regularization term in its objective function mitigates overfitting, ensuring the model's generalizability. Using this robust machine learning approach, we demonstrated that the refinement model effectively filters FP variants from variant calls generated without alignment correction. To facilitate its application, we packaged the refinement model into the Genome Variant Refinement Pipeline (GVRP), which features a user-friendly command-line interface. GVRP is publicly available at https://github.com/Jeong-Hoon-Choi/GVRP, making it accessible for broader applications in non-human primate genome analysis. With its ability to refine variant calls efficiently and effectively, GVRP provides a promising solution to the challenges faced in non-human primate genome research.
Materials and methods
Genome sequence alignment under suboptimal conditions
Sequence alignment (SA) is the process of mapping sequencing reads onto a reference genome to identify sequence differences between the target genome and the reference. This process is a crucial step in variant calling, as it determines the accuracy of downstream analyses. To improve alignment accuracy and ensure high-quality variant detection, various postprocessing steps are applied, including alignment sorting, fixing mate information, tagging duplicate reads, Indel realignment, and base quality recalibration.
Alignment sorting arranges reads based on genomic coordinates, ensuring consistency for downstream variant calling. Fixing mate information corrects paired-end read inconsistencies, which can occur due to sequencing or alignment errors. Tagging duplicate reads identifies and marks duplicate reads resulting from PCR amplification, preventing overrepresentation of certain sequences and reducing bias in variant calling. Indel realignment adjusts misaligned reads around Indels, which helps reduce false positive variant calls by refining read placement in complex regions. Lastly, base quality recalibration refines base quality scores by correcting systematic sequencing errors, improving the accuracy of variant detection.
Postprocessing steps can be categorized based on the type of information required. Sorting, fixing, and tagging use only the reference and target genome sequences, whereas Indel realignment and base quality recalibration require additional known variant information. Although indel realignment is no longer performed separately in standard pipelines—since it has been incorporated into GATK’s HaplotypeCaller from version 3.6 onward—we employed DeepVariant for variant calling. As DeepVariant explicitly includes indel realignment in its genome sequence alignment post-processing steps, we incorporated this step as part of the optimal SA procedure. In conclusion, we define optimal SA as an alignment in which all postprocessing steps are applied. Conversely, suboptimal SA refers to an alignment in which Indel realignment and base quality recalibration are omitted, reflecting conditions where the required known variant resources are limited, as illustrated in Fig. 1. SA was performed using BWA [34] v0.7.17, while postprocessing steps were handled with SAMtools [35] v1.3.1, the Genome Analysis Tool Kit (GATK) [36], and Picard [37]. Sorting was conducted using SAMtools, while fixing mate information and marking duplicate reads were performed using Picard v3.1.0 and GATK 4.4.0. To ensure consistency with the experimental conditions used in DeepVariant evaluations, Indel realignment and base quality recalibration were performed using GATK 3.5.0. The commands used for sequence alignment, post-processing, variant calling, and variant call evaluation are described in detail in Supplementary Notes 1 to 8, and the versions of the tools employed are summarized in Supplementary Table S7.
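A rough sketch of this workflow in Python is shown below (the exact commands are listed in Supplementary Notes 1 to 8); file names, thread counts, and known-variant resources are placeholders, and the GATK 3.x steps at the end are precisely the ones omitted under suboptimal SA.

```python
import subprocess

def run(cmd):
    """Run one pipeline step and stop if it exits with an error."""
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)

ref, r1, r2 = "ref.fa", "reads_1.fastq.gz", "reads_2.fastq.gz"  # placeholder inputs

# 1) Alignment and sorting (BWA-MEM + SAMtools): shared by optimal and suboptimal SA.
with open("aligned.sam", "w") as sam:
    subprocess.run(["bwa", "mem", "-t", "8", ref, r1, r2], stdout=sam, check=True)
run(["samtools", "sort", "-o", "sorted.bam", "aligned.sam"])

# 2) Fix mate information and mark duplicates (Picard).
run(["java", "-jar", "picard.jar", "FixMateInformation", "I=sorted.bam", "O=fixed.bam"])
run(["java", "-jar", "picard.jar", "MarkDuplicates",
     "I=fixed.bam", "O=dedup.bam", "M=dup_metrics.txt"])
run(["samtools", "index", "dedup.bam"])

# 3) Indel realignment and base quality recalibration (GATK 3.x): optimal SA only.
#    These steps need known variant sites and are the ones omitted under suboptimal SA.
gatk3 = ["java", "-jar", "GenomeAnalysisTK.jar"]
run(gatk3 + ["-T", "RealignerTargetCreator", "-R", ref, "-I", "dedup.bam",
             "-known", "known_indels.vcf", "-o", "targets.intervals"])
run(gatk3 + ["-T", "IndelRealigner", "-R", ref, "-I", "dedup.bam",
             "-targetIntervals", "targets.intervals", "-o", "realigned.bam"])
run(gatk3 + ["-T", "BaseRecalibrator", "-R", ref, "-I", "realigned.bam",
             "-knownSites", "dbsnp138.vcf", "-o", "recal.table"])
run(gatk3 + ["-T", "PrintReads", "-R", ref, "-I", "realigned.bam",
             "-BQSR", "recal.table", "-o", "optimal.bam"])
```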
Fig. 1.

SA postprocessing workflow. (a) Optimal SA, where all postprocessing steps, including indel realignment and base quality recalibration, are applied. (b) Suboptimal SA, where Indel realignment and base quality recalibration are omitted
DeepVariant
We use DeepVariant, a variant caller developed by Google, to identify genome variants from processed SA data. DeepVariant is a leading variant caller that defines variant candidates based on allele counts at each sequence position within the SA. It then constructs pileup images by encoding the reference genome and aligned reads in the candidate window as RGB channels, incorporating base letters, alignment quality, and strand information. These images are processed using a CNN to classify variants with high accuracy. Although DeepVariant demonstrates outstanding performance as a variant caller [38], we identified certain limitations in its reliance on SA features. Specifically, DeepVariant utilizes only three SA features and does not incorporate CIGAR-related features, which describe complex alignment structures. Given that SA quality directly impacts variant calling accuracy, we hypothesize that DeepVariant’s performance would degrade under suboptimal SA conditions. To address this limitation, we identify additional SA features that could enhance variant calling accuracy.
Furthermore, while DeepVariant has been continuously updated since its initial release and remains a state-of-the-art tool, we employed an earlier version (v1.3.0) in our analysis. More recent versions incorporate enhancements such as base quality recalibration directly into the model. However, as its workflow is already optimized without the need for external base recalibration, such updates may introduce human genome-specific biases. To control for these variables and avoid potential overfitting, we deliberately used DeepVariant v1.3.0 to ensure consistent and generalizable evaluation in our non-human primate dataset.
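For orientation, a DeepVariant v1.3.0 run through its Docker image looks roughly like the sketch below; the mounted paths, shard count, and file names are placeholders rather than the exact commands from our Supplementary Notes.

```python
import subprocess

# Minimal sketch of a DeepVariant v1.3.0 run via its Docker image; /data holds the
# reference FASTA and the (optimal or suboptimal) BAM, and all paths are placeholders.
subprocess.run([
    "docker", "run", "-v", "/data:/data",
    "google/deepvariant:1.3.0",
    "/opt/deepvariant/bin/run_deepvariant",
    "--model_type=WGS",
    "--ref=/data/ref.fa",
    "--reads=/data/sample.bam",
    "--output_vcf=/data/sample.deepvariant.vcf.gz",
    "--num_shards=8",
], check=True)
```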
Feature extraction for refinement model
To improve variant refinement, we extract additional features beyond the three SA features used in DeepVariant (base letter, mapping quality, and strand information), as summarized in Table 1. These features are categorized into three groups: (1) DeepVariant-used features, consisting of pre-existing features incorporated in DeepVariant’s variant calling process. (2) Confidence features, which capture variant call confidence and genotype likelihoods. (3) Additional alignment features, which provide alignment-based metrics to improve the refinement model’s accuracy.
Table 1.
Feature groups and descriptions for the genome variant refinement model
| Feature group | Feature name | Explanation |
|---|---|---|
| DeepVariant-used features | Read base (not used) | Base letters of aligned reads |
| | Mapping quality | Average alignment quality of aligned reads |
| | Strand mapping | Strand orientation of mapped reads |
| Confidence features | Variant likelihood | Phred-scaled quality score representing the confidence level of DeepVariant’s variant call |
| | Genotype quality | Probability of the correct genotype assignment |
| | Phenotype likelihood | Likelihood estimates for genotype possibilities |
| Alignment features | Read depth | Number of aligned reads |
| | Allele depth | Number of reads supporting the reference and variant alleles |
| | Variant allele fraction | Proportion of reads supporting the variant allele |
| | Matching base ratio | Proportion of aligned bases that match the reference genome |
| | Soft clipping read ratio | Ratio of soft-clipped reads, which may indicate misalignment or structural variants |
| | Low mapping quality read ratio | Proportion of reads with low mapping quality |
| | Total aligned read count | Total number of reads aligned to the variant site |
The confidence feature group consists of variant call quality scores, extracted from DeepVariant’s variant calling output. These include variant likelihood, genotype quality, and phenotype likelihood, which collectively help assess the confidence and reliability of a called variant. These features serve as confidence metrics for variant calls, allowing the model to adjust the confidence learned from optimal SA and appropriately modify it for suboptimal SA conditions. This process is analogous to confidence calibration, a technique used in probabilistic modeling to refine prediction confidence and improve reliability in uncertain environments [39, 40]. By refining these confidence estimates, the model can effectively adapt to suboptimal SA, ensuring that variant calls remain reliable even when alignment quality is lower.
The additional alignment feature group consists of two subcategories: read-level features and Compact Idiosyncratic Gapped Alignment Report (CIGAR) string-based features. These features were not explicitly considered in DeepVariant’s original model but provide deeper insight into alignment quality and potential misalignment errors. Read-level features include read depth, allele depth, and variant allele fraction, which capture fundamental sequencing depth statistics at variant sites. CIGAR string-based features include matching base ratio, soft clipping read ratio, low mapping quality read ratio, and total aligned read count. CIGAR strings encode the alignment pattern of sequencing reads against a reference genome [41]. A CIGAR string consists of operations that describe how bases in a read align, including matches, insertions, deletions, soft clipping, and hard clipping. These operations help identify alignment artifacts and structural variations [42]. By leveraging CIGAR-derived features, the refinement model can detect alignment inconsistencies such as soft clipping and low mapping quality, which may indicate sequencing artifacts or structural variations. In total, the additional alignment information feature group consists of seven features, complementing the DeepVariant result features to enhance the refinement model’s ability to filter false positive variant calls.
Since base letter information is already represented by allele depth (AD) and matching base ratio, we exclude base letter information from the final feature set to avoid redundancy. As a result, the refinement model utilizes a total of 12 features. The DeepVariant result feature group is extracted from the variant call format (VCF) file generated by DeepVariant, while the additional alignment information features are obtained both from the DeepVariant VCF file and by parsing SA data with the Python Pysam library, which extracts alignment-related features at each variant position.
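As an illustration of how the CIGAR-based alignment features can be derived, the following sketch computes several of them at a single variant position with Pysam; the low-mapping-quality cutoff and function name are assumptions for demonstration, not the exact GVRP implementation.

```python
import pysam

LOW_MAPQ = 20  # illustrative cutoff for "low mapping quality" reads

def alignment_features(bam_path, chrom, pos):
    """Read-level and CIGAR-based features at a 1-based variant position (indexed BAM)."""
    total = soft_clipped = low_mapq = 0
    matched = aligned = 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(chrom, pos - 1, pos):       # pysam uses 0-based coordinates
            if read.is_unmapped or read.cigartuples is None:
                continue
            total += 1
            if read.mapping_quality < LOW_MAPQ:
                low_mapq += 1
            # CIGAR op codes: 0=M, 1=I, 4=S (soft clip), 7='=', 8=X
            lengths = {op: 0 for op in range(9)}
            for op, length in read.cigartuples:
                lengths[op] += length
            if lengths[4] > 0:
                soft_clipped += 1
            # Approximation: CIGAR M counts aligned (not necessarily identical) bases;
            # exact per-base matching would need the MD tag or =/X operations.
            matched += lengths[0] + lengths[7]
            aligned += lengths[0] + lengths[1] + lengths[7] + lengths[8]
    if total == 0:
        return None
    return {
        "total_aligned_read_count": total,
        "soft_clipping_read_ratio": soft_clipped / total,
        "low_mapping_quality_read_ratio": low_mapq / total,
        "matching_base_ratio": matched / max(aligned, 1),
    }

print(alignment_features("sample.bam", "chr1", 1234567))
```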
Genome variant refinement model
We adopt LGBM, a gradient boosting decision tree model, to refine genome variant calls under suboptimal SA conditions. LGBM handles structured genomic features effectively, making it a strong choice for tabular data processing. For model training, we use high-confidence labeled genome sequence data from human individuals. To ensure model robustness, we evaluate the model on a separate human individual and on non-human primate individuals, verifying its generalization capability. The refinement model uses a total of 12 features derived from SA and DeepVariant results (Fig. 2). To optimize the model, we perform grid search hyperparameter tuning using Python's scikit-learn GridSearchCV, adjusting the number of estimators, maximum tree depth, and learning rate. We apply early stopping and L2 regularization to prevent overfitting. The dataset follows an 80:20 train-validation split, ensuring robust hyperparameter selection.
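A condensed sketch of this training setup is shown below; the grid values, input file name, and early-stopping round count are illustrative assumptions rather than the tuned configuration used in GVRP.

```python
import lightgbm as lgb
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split

# features.csv: one row per called variant with the 12 features plus a binary label
# (1 = accurate call, 0 = miscall); the file name and grid values are placeholders.
data = pd.read_csv("features.csv")
X, y = data.drop(columns=["label"]), data["label"]

# 80:20 train-validation split used for hyperparameter selection and early stopping.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {
    "n_estimators": [200, 500, 1000],
    "max_depth": [4, 8, 16],
    "learning_rate": [0.01, 0.05, 0.1],
}
search = GridSearchCV(lgb.LGBMClassifier(reg_lambda=1.0),  # L2 regularization
                      param_grid, scoring="f1", cv=3, n_jobs=-1)
search.fit(X_train, y_train)

# Refit the best configuration with early stopping on the validation split.
model = lgb.LGBMClassifier(**search.best_params_, reg_lambda=1.0)
model.fit(X_train, y_train,
          eval_set=[(X_val, y_val)],
          callbacks=[lgb.early_stopping(stopping_rounds=50)])
print(search.best_params_)
```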
Fig. 2.
The refinement model training workflow
Human genomic data and reference variant resources for optimal SA
We conduct experiments using the Illumina Platinum Genome and Genome In A Bottle (GIAB) datasets, which have been reported to achieve high variant calling accuracy with DeepVariant. Specifically, we use the HG001 (NA12878) genome, published by the Illumina Platinum Genome project, and the HG002 (NA24385) genome, released as part of the Ashkenazi Trio dataset by GIAB. The HG001 data has a maximum coverage of 35×, which is comparable to standard genomic analysis coverage. In contrast, GIAB's HG002 dataset was sequenced at a much higher depth (up to 300× coverage), leading to a significant difference in sequencing depth between the two datasets. To balance the differences in coverage, we downsample the HG002 dataset to approximately 12% of its original reads, ensuring that the final coverage closely matches the maximum 35× coverage of HG001. This proportion reflects the coverage ratio between the two datasets. For optimal SA, we follow DeepVariant's recommended settings, using known Indel sites from the 1000 Genomes Project [43] for Indel realignment and dbSNP138 from the Single Nucleotide Polymorphism Database (dbSNP) [44] for base quality recalibration. To validate the performance of suboptimal SA and optimal SA variant calling, we use the high-confidence variant call datasets provided by GIAB (the latest v4.2.1 truth set) for both HG001 and HG002.
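The coverage balancing can be reproduced approximately as follows; the subsampling fraction is simply the ratio of the two coverages, and the seed value is arbitrary.

```python
import subprocess

hg001_cov, hg002_cov = 35, 300            # approximate maximum coverages of the two datasets
fraction = hg001_cov / hg002_cov          # ~0.117, i.e. roughly 12% of the HG002 reads

# samtools view -s takes SEED.FRACTION; the integer part (42) is an arbitrary seed and
# the fractional part is the proportion of read pairs to keep.
subsample_arg = f"{42 + fraction:.3f}"    # "42.117"
subprocess.run(["samtools", "view", "-b", "-s", subsample_arg,
                "-o", "HG002.downsampled.bam", "HG002.bam"], check=True)
```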
Non-human primate genomic data: Rhesus macaque
To validate the robustness of the refinement model, we use genome sequence data from rhesus macaques. The rhesus macaque (Macaca mulatta) has been extensively studied as a model organism for understanding mutations relevant to human diseases [45–47]. Early efforts to characterize genomic variation in rhesus macaques analyzed whole-genome sequencing data from 133 individuals aligned to the rheMac2 reference genome [48]. Recently, the latest rhesus macaque reference genome, Mmul_10, was published [49], accompanied by population-scale variant calling on a cohort of 853 macaques aligned to the new reference. To ensure a representative sample for evaluating model robustness, we selected 18 previously published rhesus macaque whole genome datasets, accounting for factors such as colony source, ancestry, sex, and coverage.
The number of individuals sampled for each characteristic is presented in Table 2, while detailed information on colony source, ancestry, sex, coverage, and biosample names of each individual is provided in Supplementary Table S1. For these 18 rhesus macaques, we generate suboptimal SA under the same conditions used for human genomic data and perform variant calling using DeepVariant. Additionally, although DeepVariant is capable of calling both SNVs and Indels, the ground truth variant set for Mmul-10 used in this study [49] contains only SNVs. Therefore, our analysis focuses solely on SNVs for rhesus macaque evaluation.
Table 2.
Categorical distribution of rhesus macaque samples for model validation
| Category | Element (# of each value) |
|---|---|
| Colony source | Oregon National Primate Research Center (10), California National Primate Research Center (5), Wild Caught (3) |
| Ancestry | Indian (10), Chinese (8) |
| Sex | Male (10), Female (8) |
| Coverage | 8.6 ~ 60.7 (mean: 34.83, SD: 11.90) |
Results
Variant call evaluation between optimal/suboptimal SA
Before training and evaluating the refinement model, we first compare the performance of variant calling under optimal and suboptimal SA conditions to assess the impact of SA postprocessing. Using DeepVariant's variant call results, we compare optimal SA vs. suboptimal SA against ground truth variant calls to quantify performance differences. Since variant callers report only positions with detected variants, the resulting dataset consists solely of the positive set from the caller’s perspective. Within this set, we distinguish between accurately and inaccurately predicted variants, and train a refinement model to filter out the miscalled variants. To quantify this, we define the Miscalling Rate (MR) as the proportion of miscalls within the total number of calls, as shown in Eq. 1.
$$\mathrm{MR} = \frac{\text{Number of miscalls}}{\text{Number of miscalls} + \text{Number of accurate calls}} \times 100\% \tag{1}$$
We used DeepVariant version 1.3.0 to perform variant calling on HG001 and HG002 under both optimal and suboptimal sequence alignment (SA) conditions, covering the entire autosomal genome (chromosomes 1–22). For variant call evaluation, we employed GATK’s GenotypeConcordance tool, which compares called genotypes against the GIAB v4.2.1 ground truth VCF to compute concordance metrics at known variant positions. This approach enables direct assessment of genotype-level concordance, including zygosity agreement and mismatch rates, across different alignment conditions.
To define training labels for the refinement model, we classified all variants called by DeepVariant based on their concordance with the ground truth. Specifically, variants were labeled as accurate calls (TP) if the site, reference allele, alternate allele, and zygosity exactly matched the GIAB ground truth. Variants were labeled as miscalls (FP) if the site was absent from the ground truth, or if any of the genotype components differed from the truth set at a shared position.
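This labeling rule can be expressed compactly as in the sketch below, which compares each called record against a truth VCF using Pysam; multi-allelic handling and confident-region filtering are omitted, so it is a simplified stand-in for the GenotypeConcordance logic rather than our exact evaluation code.

```python
import pysam

def truth_index(truth_vcf):
    """Map (chrom, pos) -> (ref, alts, sorted genotype) for the ground truth calls."""
    idx = {}
    for rec in pysam.VariantFile(truth_vcf):
        sample = next(iter(rec.samples.values()))
        idx[(rec.chrom, rec.pos)] = (rec.ref, rec.alts, tuple(sorted(sample["GT"])))
    return idx

def label_calls(call_vcf, truth):
    """Label each DeepVariant call as an accurate call (TP) or a miscall (FP)."""
    labels = []
    for rec in pysam.VariantFile(call_vcf):
        sample = next(iter(rec.samples.values()))
        key = (rec.chrom, rec.pos)
        match = (key in truth
                 and truth[key][0] == rec.ref
                 and truth[key][1] == rec.alts
                 and truth[key][2] == tuple(sorted(sample["GT"])))  # zygosity must agree
        labels.append((key, "TP" if match else "FP"))
    return labels

labels = label_calls("HG001.deepvariant.vcf.gz",
                     truth_index("HG001_GIAB_v4.2.1.vcf.gz"))
```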
The results for the HG001 and HG002 datasets are summarized in Table 3. While the overall F1 scores from the variant caller showed minimal differences between optimal and suboptimal SA across both datasets, we observed a higher number of miscalled variants under suboptimal SA, for both SNVs and Indels. In HG002, suboptimal SA increases the number of miscalls, raising the MR to 16.55%. This effect is even more pronounced in HG001, where the MR rises to 19.58%, suggesting that alignment postprocessing plays a crucial role in minimizing false positives. These results highlight the limitations of existing variant callers under suboptimal SA conditions, reinforcing the need for a refinement model to improve variant accuracy.
Table 3.
DeepVariant’s miscall and accurate call result on HG001 and HG002 optimal/suboptimal SA
| Sample | Variant type | Optimal SA: Miscall (#) | Optimal SA: Accurate call (#) | Optimal SA: MR | Suboptimal SA: Miscall (#) | Suboptimal SA: Accurate call (#) | Suboptimal SA: MR |
|---|---|---|---|---|---|---|---|
| HG001 | Indels | 301,842 | 504,522 | 37.43% | 376,077 | 515,540 | 42.18% |
| | SNVs | 344,490 | 3,247,597 | 9.59% | 543,627 | 3,262,744 | 14.28% |
| | Total variants | 646,332 | 3,752,119 | 14.69% | 919,704 | 3,778,284 | 19.58% |
| HG002 | Indels | 314,942 | 561,731 | 35.92% | 329,933 | 563,457 | 36.93% |
| | SNVs | 340,881 | 3,349,442 | 9.24% | 447,245 | 3,353,974 | 11.77% |
| | Total variants | 655,823 | 3,911,173 | 14.36% | 777,178 | 3,917,431 | 16.55% |
Refinement model evaluation in human suboptimal alignment
We conduct two experimental procedures to train, validate, and evaluate the refinement model. First, we evaluate the model using human genomic data to assess its potential applicability to non-human primate genomes. Specifically, we design an experiment where HG001 (NA12878) and HG002 (NA24385), originating from different ancestries, serve as the training and test sets, respectively. This approach allows us to assess the model’s applicability across diverse genetic backgrounds before adapting it for non-human primate genomes with fewer curated references. Second, we train a refinement model with a more generalized human dataset by mixing the SA of HG001 and HG002 and then splitting the dataset into two equal parts (50:50). This mixed dataset enables the model to generalize better for potential future applications in non-human primate variant refinement.
To determine the optimal refinement model, we compare multiple machine learning and deep learning approaches. The machine learning models include LGBM, XGBoost (XGB) [50], Random Forest (RF), Logistic Regression (LR) [51], k-Nearest Neighbor (k-NN) [52], and Naive Bayes (NB) [53]. Additionally, we evaluate deep learning models such as Multi-Layer Perceptron (MLP) [54] and FT-Transformer (FTT) [55]. For model evaluation, we employ widely used classification metrics, including precision, recall, F1-score, and area under curve of receiver operating characteristic curve (AUC-ROC), to measure predictive performance.
$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$

$$F1\text{-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$
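These metrics can be computed directly from the validation predictions with scikit-learn; the sketch below reuses the model and validation split from the training sketch above (variable names are assumptions).

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# model, X_val, y_val come from the training sketch; 1 = accurate call, 0 = miscall.
y_pred = model.predict(X_val)
y_prob = model.predict_proba(X_val)[:, 1]      # probability of the accurate-call class

print("Precision:", precision_score(y_val, y_pred))
print("Recall:   ", recall_score(y_val, y_pred))
print("F1:       ", f1_score(y_val, y_pred))
print("Accuracy: ", accuracy_score(y_val, y_pred))
print("AUC-ROC:  ", roc_auc_score(y_val, y_prob))
```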
As shown in Tables 4 and 5, the LGBM-based refinement model trained on suboptimal SA from HG001 and HG002 effectively filters miscalled variants. In Table 4, it achieves F1 scores of 0.934 (HG001) and 0.949 (HG002) under suboptimal SA. To assess robustness, we applied the model to optimal SA and observed similarly high F1 scores of 0.948 (HG001) and 0.946 (HG002), indicating reliable performance even with high-quality alignments. In Table 5, applying the refinement model reduced the MR under suboptimal SA conditions, decreasing it to 10.29% for HG001 and 6.92% for HG002. These results demonstrate that the refinement model improves variant calling performance even beyond the accuracy achieved under optimal SA conditions, as reported in Table 3. Further evaluation results, including ROC curves, are provided in Supplementary Tables S2 and S3 and Figure S1.
Table 4.
LGBM based refinement model performance on HG001 and HG002 dataset
| Train data (Suboptimal SA) | Test data | F1 score | Precision | Recall | Accuracy | AUC-ROC |
|---|---|---|---|---|---|---|
| HG001 | Optimal HG002 | 0.946 | 0.948 | 0.944 | 0.908 | 0.909 |
| | Suboptimal HG002 | 0.949 | 0.931 | 0.968 | 0.913 | 0.905 |
| HG002 | Optimal HG001 | 0.948 | 0.940 | 0.957 | 0.911 | 0.889 |
| | Suboptimal HG001 | 0.934 | 0.897 | 0.975 | 0.890 | 0.861 |
Table 5.
Variant call results before and after applying the refinement model
| Sample | Variant type | DeepVariant on suboptimal SA: Miscall (#) | DeepVariant on suboptimal SA: Accurate call (#) | DeepVariant on suboptimal SA: MR | After refinement: Miscall (#) | After refinement: Accurate call (#) | After refinement: MR |
|---|---|---|---|---|---|---|---|
| HG001 | Indels | 376,077 | 515,540 | 42.18% | 131,368 | 475,254 | 21.66% |
| | SNVs | 543,627 | 3,262,744 | 14.28% | 291,010 | 3,208,803 | 8.32% |
| | Total | 919,704 | 3,778,284 | 19.58% | 422,378 | 3,684,057 | 10.29% |
| HG002 | Indels | 329,933 | 563,457 | 36.93% | 82,357 | 495,442 | 14.25% |
| | SNVs | 447,245 | 3,353,974 | 11.77% | 199,369 | 3,296,591 | 5.70% |
| | Total | 777,178 | 3,917,431 | 16.55% | 281,726 | 3,792,033 | 6.92% |
Table 6 summarizes the performance of different models on the mixed suboptimal SA dataset. Among all evaluated models, LGBM and XGB achieve the highest F1-score of 0.946, outperforming both traditional machine learning models and deep learning architectures. Similarly, as illustrated in Fig. 3, the AUC-ROC curves indicate that LGBM and XGBoost consistently achieve the best performance, with LGBM marginally outperforming XGBoost. Based on these results, we select LGBM as the final model for non-human primate variant refinement.
Table 6.
Performance of refinement models on HG001 and HG002 mixed dataset
| Model | F1 score | Precision | Recall | Accuracy | AUC-ROC |
|---|---|---|---|---|---|
| LGBM | 0.946 | 0.920 | 0.973 | 0.909 | 0.894 |
| XGB | 0.946 | 0.920 | 0.973 | 0.909 | 0.894 |
| RF | 0.945 | 0.918 | 0.973 | 0.907 | 0.890 |
| LR | 0.940 | 0.909 | 0.972 | 0.898 | 0.869 |
| KNN | 0.945 | 0.918 | 0.973 | 0.907 | 0.883 |
| NB | 0.937 | 0.912 | 0.963 | 0.894 | 0.869 |
| MLP | 0.903 | 0.823 | 1.000 | 0.824 | 0.877 |
| FTT | 0.945 | 0.917 | 0.974 | 0.907 | 0.888 |
Fig. 3.
ROC curve of refinement models on HG001 and HG002 mixed data
Ablation experiments on feature groups
To assess the effectiveness of different feature groups in the refinement model, we conduct an ablation study, evaluating the impact of each feature set on model performance. The study is performed across the same three dataset configurations described previously, where HG001 and HG002 are used separately as training and test sets, as well as a mixed 50:50 train-test split. For each dataset setting, we train LGBM models using different feature groups to measure their individual contributions. Table 7 presents the ablation study results, highlighting the relative importance of confidence features (CF) and additional alignment features (AF).
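Operationally, each ablation retrains the same LGBM configuration on a reduced feature matrix; a minimal sketch is given below, where the feature-group column names, the train/test matrices, and the tuned parameter dictionary are assumptions carried over from the earlier sketches.

```python
import lightgbm as lgb
from sklearn.metrics import f1_score, roc_auc_score

# Illustrative feature-group column names (the real lists follow Table 1); X_train/X_test,
# y_train/y_test, and best_params are assumed from the per-experiment training setup.
CF = ["variant_likelihood", "genotype_quality", "phenotype_likelihood"]
AF = ["read_depth", "allele_depth", "variant_allele_fraction", "matching_base_ratio",
      "soft_clipping_read_ratio", "low_mapping_quality_read_ratio",
      "total_aligned_read_count"]

ablations = {
    "all": [],                 # keep every feature
    "w/o AF": AF,              # drop additional alignment features
    "w/o CF": CF,              # drop confidence features
    "w/o AF + CF": AF + CF,    # keep only the DeepVariant-used features
}

for name, dropped in ablations.items():
    Xtr, Xte = X_train.drop(columns=dropped), X_test.drop(columns=dropped)
    clf = lgb.LGBMClassifier(**best_params).fit(Xtr, y_train)
    prob = clf.predict_proba(Xte)[:, 1]
    print(name,
          "F1:", round(f1_score(y_test, clf.predict(Xte)), 3),
          "AUC-ROC:", round(roc_auc_score(y_test, prob), 3))
```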
Table 7.
Ablation study results of refinement model on HG001 and HG002
| Train data | Test data | Training features | Suboptimal SA: F1 score | Suboptimal SA: AUC-ROC | Optimal SA: F1 score | Optimal SA: AUC-ROC |
|---|---|---|---|---|---|---|
| HG001 | HG002 | all | 0.949 | 0.905 | 0.946 | 0.909 |
| | | w/o AF | 0.946 | 0.882 | 0.955 | 0.914 |
| | | w/o CF | 0.944 | 0.894 | 0.939 | 0.890 |
| | | w/o AF + CF | 0.921 | 0.819 | 0.932 | 0.822 |
| HG002 | HG001 | all | 0.934 | 0.861 | 0.948 | 0.889 |
| | | w/o AF | 0.934 | 0.850 | 0.944 | 0.880 |
| | | w/o CF | 0.929 | 0.852 | 0.949 | 0.894 |
| | | w/o AF + CF | 0.909 | 0.807 | 0.930 | 0.833 |
| Mixed 50% | Mixed 50% | all | 0.946 | 0.894 | * | * |
| | | w/o AF | 0.941 | 0.876 | * | * |
| | | w/o CF | 0.940 | 0.883 | * | * |
| | | w/o AF + CF | 0.915 | 0.812 | * | * |
*This experiment was not performed to avoid data leakage
Across all datasets, CF and AF contribute similarly to model performance, indicating that both feature groups provide critical information for refinement. When using only one of the feature groups (either CF or AF), F1-scores remain comparable to the full-feature model. However, AUC-ROC scores show a consistent improvement when both feature groups are included, suggesting that the two feature sets complement each other in improving classification robustness. These results confirm that both CF and AF significantly contribute to the refinement model's performance, reinforcing the necessity of incorporating alignment-based and DeepVariant-derived features for optimal variant refinement.
Notably, under optimal SA conditions for both HG001 and HG002, the addition of AF and CF generally improved model performance. However, we observed that adding CF in HG001 and AF in HG002 led to a slight decrease in performance compared to using the other feature set alone. While the model was trained under suboptimal SA conditions, making direct interpretation under optimal conditions more challenging, this result indicates that applying both confidence calibration and additional alignment information to optimal SA may still contribute to effective refinement even when the underlying variant calls are of high quality. These findings suggest the potential of our approach to enhance DeepVariant calls beyond its default performance, even under ideal alignment scenarios.
Application of the refinement model to the non-human primate: Rhesus macaque
To validate the performance of the refinement model on non-human primates, we conducted experiments on 18 rhesus macaques. Since the ground truth for Mmul-10 includes only SNVs, we restrict our analysis to SNVs and further classify them into heterozygous SNVs (HT-SNVs) and homozygous SNVs (HM-SNVs) for a more detailed evaluation. To perform variant calling on rhesus macaques, we adopted the same pipeline used for generating suboptimal sequence alignments and variant calls in the human genome. However, this differs from the processing pipeline used to produce the ground truth dataset, which was generated using a GATK/HaplotypeCaller-based approach [48]. Although such methodological differences may introduce labeling discrepancies—particularly in regions sensitive to variant caller behavior—we used this dataset as a proxy for ground truth in order to systematically assess DeepVariant’s performance in non-human primates.
As shown in Table 8, the miscalling rate (MR) of DeepVariant on rhesus macaque samples reaches 20.77% for total SNVs, with similar MR levels observed in both HT-SNVs and HM-SNVs. This MR is higher than what was observed in human suboptimal SA, indicating that DeepVariant struggles with accurate variant calling in rhesus macaques. Encouragingly, the refinement model reduced the MR for total SNVs to 15.83%, highlighting its ability to improve variant calling accuracy. While this does not reach the variant calling accuracy of optimal SA in humans, it demonstrates a substantial improvement over the original DeepVariant calls.
Table 8.
Variant call results before and after applying the refinement model on rhesus macaque
| SNV type | DeepVariant: Miscall (#) | DeepVariant: Accurate call (#) | DeepVariant: MR | After refinement: Miscall (#) | After refinement: Accurate call (#) | After refinement: MR |
|---|---|---|---|---|---|---|
| HT-SNVs | 24,180,823 | 95,463,124 | 20.21% | 14,876,718 | 78,445,215 | 15.94% |
| HM-SNVs | 15,726,287 | 56,736,570 | 21.70% | 7,143,601 | 38,645,533 | 15.60% |
| Total SNVs | 39,907,110 | 152,199,694 | 20.77% | 22,020,319 | 117,090,748 | 15.83% |
Additionally, we observed a notable pattern related to sequencing coverage. As shown in Fig. 4(a), when plotting F1 scores of individual rhesus macaques against sequencing coverage, the refinement model exhibits poor performance at coverage levels below 25×. This decline is likely due to the model being trained on human genomic data with 35× coverage, making it less effective for lower-coverage samples. To further validate this observation, we conducted a downsampling experiment using human suboptimal SA data, applying the refinement model to coverage levels of 5×, 10×, 15×, and 20×. Consistent with the trend observed in the rhesus macaque data, the model's performance improved with increasing coverage, indicating that the effectiveness of alignment-derived features is highly dependent on sequencing depth (Supplementary Figure S6). However, for samples with coverage above 25×, the model maintains stable performance, even at 60× coverage. To further analyze this effect, we compared ROC curves for all individuals and those with coverage above 25×. As illustrated in Figs. 4(b) and 4(c), removing low-coverage individuals (<25×) results in a notable improvement in AUC-ROC, demonstrating that the refinement model performs optimally when applied to higher-coverage samples.
Fig. 4.
Refinement model performance on rhesus macaque. F1 scores (a) and ROC curves (b) of HT/HM-SNVs and total SNVs for all individuals. ROC curves of SNV types of over 25 × coverage rhesus macaque individuals (c)
Based on these findings, we conducted additional robustness evaluations on only the 25x + coverage rhesus macaque samples, focusing on the impact of colony source, ancestry, and sex. Table 9 presents the updated sample distribution, while Fig. 5 illustrates the model’s F1-score distributions across these categories. The refinement model demonstrates consistent performance across all tested subgroups, confirming its robustness across diverse rhesus macaque populations.
Table 9.
Categorical distribution of high coverage rhesus macaque samples for model validation
| Category | Element (# of each value) |
|---|---|
| Colony source | Oregon National Primate Research Center (10), California National Primate Research Center (3), Wild Caught (2) |
| Ancestry | Indian (10), Chinese (5) |
| Sex | Male (9), Female (6) |
| Coverage | 29.6 ~ 60.7 |
Fig. 5.
Comparative analysis of Refinement Model performance across diversity criteria: Colony Source (a), Ancestry (b), and Sex (c), at over 25 × rhesus macaque individuals
Alternative base ratio comparison between ground truth and refined SNVs
To validate GVRP from a qualitative standpoint, we examine the Alternative Base Ratio (ABR) relative to the reference base for the rhesus macaque SNVs and plot the resulting SNV distributions. The ABR is the proportion of sequencing reads that show an alternative base at each SNV position, relative to the total number of reads covering that position. It is calculated as follows:
$$\mathrm{ABR} = \frac{\text{Number of reads supporting the alternative base}}{\text{Sequencing depth}} \times 100\% \tag{5}$$
where the sequencing depth is the total number of reads that align to a particular genomic location. For ABR calculation, we did not apply any filtering based on read mapping quality or base quality scores. This decision aligns with our model design, where the refinement model is trained using alignment-derived features extracted from all mapped reads.
The ABR is crucial for zygosity classification, as HM-SNVs should exhibit an ABR close to 100%, whereas HT-SNVs are expected to have an ABR near 50%. To analyze whether the refinement model maintains the expected ABR distribution, we randomly sample 100 SNVs from both HT-SNVs and HM-SNVs for both the refined SNVs and the ground truth SNVs. We performed read-level analysis at the selected variant sites using the Pysam library in Python. For each site, we examined all aligned reads in the corresponding BAM file by generating a pileup at the variant position. We then visualize their distributions using Kernel Density Estimation (KDE) [56], a non-parametric method that estimates the probability density function of a dataset. KDE smooths the distribution by applying Gaussian kernels to each data point, aggregating them to form a continuous density estimate. This visualization was performed using the Seaborn library in Python, which supports Gaussian kernels for KDE.
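A condensed sketch of this ABR computation and KDE visualization with Pysam and Seaborn follows; in line with the no-filter design stated above, no mapping- or base-quality thresholds are applied, and the file names and example site are placeholders.

```python
import pysam
import seaborn as sns
import matplotlib.pyplot as plt

def abr(bam_path, chrom, pos, alt_base):
    """Alternative base ratio (%) at a 1-based SNV position, without quality filtering."""
    alt = depth = 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        # min_base_quality=0 and min_mapping_quality=0 keep reads regardless of quality,
        # matching the no-filter design described in the text.
        for col in bam.pileup(chrom, pos - 1, pos, truncate=True,
                              min_base_quality=0, min_mapping_quality=0):
            for p in col.pileups:
                if p.is_del or p.is_refskip or p.query_position is None:
                    continue
                depth += 1
                if p.alignment.query_sequence[p.query_position].upper() == alt_base:
                    alt += 1
    return 100.0 * alt / depth if depth else None

# sampled_snvs: (chrom, 1-based position, alternative base) for the 100 sampled SNVs.
sampled_snvs = [("chr1", 1234567, "T")]            # placeholder example
ratios = [abr("macaque.bam", c, p, a) for c, p, a in sampled_snvs]

sns.kdeplot([r for r in ratios if r is not None])  # Gaussian-kernel density estimate
plt.xlabel("Alternative base ratio (%)")
plt.show()
```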
Figure 6 (a) compares the ABR distribution of ground truth and refined HT-SNVs. The refined HT-SNVs exhibit a distribution closely matching that of the ground truth, indicating that the refinement model maintains the original variant distribution. A two-sample t-test was performed to assess statistical similarity, yielding a p-value of 0.262, confirming that there is no significant difference at the 95% confidence level. This suggests that the refinement model does not introduce bias into HT-SNV classification and preserves the original variant characteristics. In contrast, Fig. 6 (b) reveals an interesting discrepancy in the ABR distribution of ground truth HM-SNVs versus refined HM-SNVs. While the refined HM-SNVs cluster around 100% ABR, the ground truth HM-SNVs exhibit a bimodal distribution (peaks at 0 and 100%). This distribution pattern may arise from differences in data processing pipelines: whereas DeepVariant-based refinement was trained on all aligned reads regardless of quality, the ground truth data were generated via a GATK-based pipeline using hard filters. In our ABR computation, we also included all reads without applying read or alignment quality thresholds, potentially amplifying low-confidence signals. These discrepancies could result in apparent homozygous calls (ABR ≈ 0%) at sites that are in fact non-variant, suggesting the presence of artifacts in the ground truth.
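The two-sample comparison reported here corresponds to an independent-samples t-test, for example with SciPy (the ABR arrays are placeholders):

```python
from scipy import stats

# ABR values (%) for the 100 sampled ground-truth and refined HT-SNVs (placeholder arrays).
t_stat, p_value = stats.ttest_ind(ground_truth_ht_abr, refined_ht_abr)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")  # p > 0.05: no significant difference at 95% level
```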
Fig. 6.
Alternative base ratio (ABR) distributions before and after applying the refinement model. a Comparison of ABR distributions between ground truth and refined HT-SNVs. b Comparison of ABR distributions between ground truth and refined HM-SNVs, showing a notable shift. c ABR distribution of newly detected HT-SNVs by the DeepVariant-Refinement Model overlaid with DeepVariant’s raw-called HT-SNVs. (d) ABR distribution of newly detected HM-SNVs by the DeepVariant-Refinement Model overlaid with DeepVariant’s raw-called HM-SNVs
To further investigate this phenomenon, we compare DeepVariant raw SNV calls with newly detected SNVs by the refinement model in Fig. 6 (c) and (d). The ABR distributions for HT-SNVs (Fig. 6c) and HM-SNVs (Fig. 6d) reveal key insights into the impact of refinement on SNV classification. In HT-SNVs, although the overall shape of the ABR distribution remains consistent between DeepVariant’s raw variant calls and the DeepVariant-refinement model's newly detected variants, the latter exhibits a slightly more concentrated distribution. This suggests a minor enhancement in ABR density while maintaining the integrity of HT-SNV classifications, indicating that the refinement model refines variant confidence without significantly altering the original variant call landscape. To further validate the reliability of newly detected HT-SNVs, we compare their ABR distribution against the ground truth HT-SNVs. A t-test (p-value = 0.398, at a 95% confidence level) confirms that the two distributions are not significantly different, supporting the hypothesis that the refinement model’s newly detected HT-SNVs exhibit a similar ABR pattern to true variants.
For HM-SNVs, Fig. 6 (d) shows that DeepVariant’s raw HM-SNVs already cluster around 100% ABR, suggesting that DeepVariant assigns high confidence to HM-SNVs classifications. However, the density of refined HM-SNVs is noticeably higher, indicating that the refinement model reinforces HM-SNVs classifications while effectively filtering uncertain calls. These findings suggest that while DeepVariant already performs well in calling HT-SNVs, the refinement model enhances the robustness of HM-SNVs classification by reducing misclassified non-variant sites in the ground truth data. This highlights the potential of the refinement model to improve variant calling accuracy, particularly for HM-SNVs in non-human primate genomes. The sampled SNVs for our experiments, along with the ABR statistics of the sampled SNVs and the ABR statistics for each individual, are provided in Supplementary Tables S4, S5 and Figure S2 to S3.
Discussion
The decision tree-based refinement model outperformed simple deep neural network models and demonstrated performance comparable to the state-of-the-art FTT, which leverages attention mechanisms [57, 58] to capture contextual and positional relationships within sequences. Despite the similar performance, FTT requires extensive computational resources due to the need to calculate the attention matrix, which reduces its practicality for large-scale variant refinement tasks. This observation aligns with previous research [59, 60] suggesting that tree-based models remain highly competitive with deep learning on tabular data, supporting the tree-based approach used in our refinement model. Recent studies have explored improving deep learning performance by refining input representations through contrastive learning [61] or optimizing attention mechanisms with additional contextual features [62]. Future work could integrate these techniques to explore whether deep learning approaches can further enhance variant refinement beyond traditional tabular data processing.
Despite its effectiveness in reducing miscalls, our refinement model also resulted in a slight reduction of accurate calls. In human genome data, miscall filtering reached 63.75%, while accurate call loss was only 3.20%, indicating an efficient filtering process with minimal trade-offs. However, in rhesus macaque data, the refinement model showed a lower miscall filtering rate (44.82%) but a higher accurate call loss (23.07%), suggesting a greater trade-off between improving specificity and reducing sensitivity. This trade-off may be partially due to instability in the ground truth labels for non-human primates, as identified in our ABR analysis. These results highlight the importance of refining SNV classification while minimizing the loss of true variants, particularly in species with less well-annotated reference genomes.
The improvement in variant calling through our refinement model was more pronounced for HM-SNVs compared to HT-SNVs. As shown in Table 8, the MR reduction for HM-SNVs was approximately 7% greater than for HT-SNVs. This pattern is further supported by ABR analysis, which revealed potential misclassification in the ground truth HM-SNVs. Specifically, the bimodal distribution of ground truth HM-SNVs (Fig. 6b) suggests that some variants may have been misclassified as homozygous when they were, in fact, non-variants. The refinement model effectively filtered out these uncertain calls, supporting its role in improving variant classification accuracy. This suggests that applying a refinement model in non-human primate variant calling could significantly enhance the quality of SNV annotations, particularly for homozygous variants.
These observations raise the possibility that certain discrepancies in HM-SNV classification may stem from limitations in the ground truth dataset. In addition, they suggest that DeepVariant may, in some cases, generate more biologically plausible genotype calls than the reference set. To explore this further, we compared the alternative base ratio (ABR) distributions between the non-refined DeepVariant calls and the ground truth, as shown in Fig. 7. For HT-SNVs (Fig. 7a), the two distributions were statistically similar, with a t-test yielding a p-value of 0.398, indicating no significant difference at the 95% confidence level. However, for HM-SNVs (Fig. 7b), the distributions diverged considerably, suggesting that the DeepVariant calls may be more consistent with expected allele balance than the ground truth at certain sites.
Fig. 7.
ABR distributions between non-refined DeepVariant calls and ground truth SNVs in rhesus macaque at HT-SNVs (a) and HM-SNVs (b)
Despite our efforts to evaluate the performance of the refinement model using a publicly available ground truth dataset for the rhesus macaque genome, several important limitations should be noted. First, the ground truth variant set for the rhesus macaque used in this study was downloaded from the UCSC Genome Browser, with a last update dated April 29, 2020. This dataset contains variant calls from a cohort of 853 individuals but predates the release of mGAP v3.0, the most recent and extensively curated reference set. As a result, it may lack certain quality enhancements present in newer versions. Notably, all mGAP versions prior to v3.0, including the dataset used in this study, applied hard filters and masked variants in repetitive regions, which could lead to reduced sensitivity in complex genomic contexts. Second, the ground truth variant set used in this paper does not explicitly document the exact variant calling pipeline used. In contrast, mGAP employs a standardized GATK/HaplotypeCaller-based pipeline, which may differ from our own approach based on DeepVariant. Lastly, it is important to recognize that the rhesus macaque ground truth data, regardless of source, does not reach the same level of validation and accuracy as the GIAB human datasets. The latter are derived from high-depth sequencing of idealized samples and undergo extensive manual and computational curation to ensure consensus-quality truth sets. Taken together, these limitations highlight the challenges in benchmarking variant calling methods in non-human primates and underscore the need for continued efforts to improve truth set quality for comparative genomic studies.
While our refinement model significantly enhances SNV calling accuracy, several limitations remain. These include its dependence on a specific variant caller, species-specific characteristics, and sensitivity to coverage depth. First, the model is not a standalone variant caller but instead relies on DeepVariant’s confidence scores for calibration. This dependency restricts its applicability to scenarios where DeepVariant or a similar variant caller is available. Future efforts should explore incorporating independent confidence calibration mechanisms to increase model flexibility. Additionally, our model was validated using rhesus macaque genome sequence data, which is among the most well-studied non-human primates. However, further validation is necessary across other species to assess its broader applicability. Different primates or non-model organisms may exhibit unique alignment challenges and variant-calling biases, necessitating further refinement. Another consideration is that the model is optimized for DeepVariant under suboptimal SA conditions. Investigating its compatibility with other variant callers, such as GATK HaplotypeCaller and FreeBayes, as well as alternative alignment strategies, could further improve its robustness. Furthermore, since our model was trained using alignment-derived features extracted from genome sequence data at 35 × coverage, its performance tends to degrade under lower coverage conditions—particularly below 25 × —due to reduced read depth and lower-quality alignment information.
Future research will focus on enhancing the model’s ability to learn alignment and variant caller confidence information more effectively, ensuring its adaptability across diverse genomic datasets. Additionally, we aim to expand its application beyond non-human primates to other species, evaluating its effectiveness in refining variant calls across a broader range of genomes. As part of this effort, we also consider the use of updated and standardized variant sets such as mGAP v3.0 in future evaluations, which may further improve benchmarking consistency and provide new opportunities for assessing or adapting the model to evolving reference data. Ultimately, this research seeks to establish a versatile and scalable refinement framework for genomic variant analysis.
Conclusion
In this study, we developed a decision tree-based refinement model for improving variant analysis in non-human primate genomes. Since certain postprocessing steps in SA require known variant sites, they are often unavailable for non-human primates. To address this limitation, we defined suboptimal SA and optimal SA based on the level of postprocessing applied and evaluated their impact on variant calling using DeepVariant, a leading variant caller. Our results demonstrated that in humans, suboptimal SA led to significantly higher FP compared to optimal SA, underscoring the challenges in non-human primate variant calling under similar conditions. To mitigate these issues, we incorporated DeepVariant’s confidence scores along with additional alignment features that were not originally considered by DeepVariant. The refinement model was trained to filter FP effectively, leading to a substantial reduction in the MR. When applied to rhesus macaque genomic data, our model demonstrated significant improvements in variant call accuracy, validating its applicability beyond human datasets. Furthermore, ABR analysis provided deeper insights into the effectiveness of our refinement approach. The ABR distribution of refined HT-SNVs closely matched the ground truth, confirming the model’s robustness in preserving true variants. Additionally, for HM-SNVs, the refinement model significantly improved DeepVariant’s results, producing a more stable ABR distribution concentrated around 100%. This suggests that our model not only enhances variant detection but also corrects potential inconsistencies present in existing ground truth datasets, particularly for HM-SNVs in rhesus macaques. To facilitate the use of our approach, we integrated the trained refinement model into the GVRP, providing a streamlined command-line interface for ease of use. This pipeline enables researchers to apply the refinement model efficiently across non-human primate datasets, addressing the limitations of existing variant calling methods. By improving variant calling accuracy in non-human primates, our approach has the potential to significantly contribute to population genetics and molecular evolution research. The framework established in this study can be extended to other non-human primates and even broader genomic studies, further advancing variant analysis in comparative genomics and evolutionary biology.
Supplementary Information
Acknowledgements
The authors thank Dr. Alexander Eckehart Urban of the Department of Psychiatry and Behavioral Sciences at Stanford University for helpful discussions.
Abbreviations
- SNV
Single Nucleotide Variant
- Indel
Small Insertion and Deletion
- FP
False Positive
- TP
True Positive
- CNN
Convolutional Neural Network
- LGBM
Light Gradient Boosting Model
- GVRP
Genome Variant Refinement Pipeline
- SA
Sequence Alignment
- AD
Allele Depth
- VCF
Variant Call Format
- CF
Confidence Feature
- AF
Alignment Feature
- HT-SNVs
Heterozygous single nucleotide variant
- HM-SNVs
Homozygous single nucleotide variant
- GATK
Genome Analysis Tool Kit
- GIAB
Genome In A Bottle
- dbSNP
Single Nucleotide Polymorphism Database
- ONPRC
Oregon National Primate Research Center
- CNPRC
California National Primate Research Center
- MR
Miscalling rate
- XGB
XGBoost
- RF
Random Forest
- LR
Logistic Regression
- k-NN
k-Nearest Neighbor
- NB
Naive Bayes
- MLP
Multi-Layer Perceptron
- FTT
FT-Transformer
- ABR
Alternative Base Ratio
- KDE
Kernel Density Estimation
- NCBI
National Center for Biotechnology Information
- CIGAR
Compact Idiosyncratic Gapped Alignment Report
- AUC-ROC
Area Under Curve of Receiver Operating Characteristic
Authors’ contributions
B.Z. and G.S. conceived the project and supervised the study. B.Z. and J.C. processed the raw data. J.C. designed the model framework, wrote the code, and trained the models. J.C. evaluated the models and analyzed the results. J.C., B.Z., and G.S. drafted the manuscript. All authors read, commented on, and approved the manuscript.
Funding
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. 2021R1A2C2010775) and the Institute of Information & Communications Technology Planning & Evaluation (IITP) under the Artificial Intelligence Convergence Innovation Human Resources Development (IITP-2023–00254177). This work was also supported by startup funds from Texas A&M University.
Data availability
All data used in this manuscript are publicly available. Genome sequences of HG001 and HG002 are from NCBI, and the ground truth call sets for HG001 and HG002 are from GIAB. Known human SNP sites are from the Broad Institute's public genome resources, and known human Indel sites are from the 1000 Genomes Project. Mmul_10, the rhesus macaque reference genome, and all rhesus macaque individuals are from NCBI. The rhesus macaque ground truth is from the UCSC Genome Browser. The source code for refinement model training and analysis is available on GitHub: https://github.com/Jeong-Hoon-Choi/Learning-a-refinement-model-for-variant-analysis-in-non-human-primate-genomes. The VCF files we generated are also available on Google Drive: https://drive.google.com/drive/folders/1NnNwLoyMKejJRQfjwn55dqqJxGXAaAcG?usp=drive_link.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Bo Zhou, Email: bo.zhou@tamu.edu.
Giltae Song, Email: gsong@pusan.ac.kr.
References
- 1.Veltman JA, Brunner HG. De novo mutations in human genetic disease. Nat Rev Genet. 2012;13(8):565–75. [DOI] [PubMed] [Google Scholar]
- 2.Kondrashov FA. Gene duplication as a mechanism of genomic adaptation to a changing environment. Proc R Soc B. 2012;279(1749):5048–57. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Alexandrov LB, et al. Signatures of mutational processes in human cancer. Nature. 2013;500(7463):415–21. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Kosugi S, et al. Coval: improving alignment quality and variant calling accuracy for next-generation sequencing data. PLoS ONE. 2013;8(10):e75402. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Zou H, et al. Significance of single-nucleotide variants in long intergenic non-protein coding RNAs. Front Cell Dev Biol. 2020;8:347. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Qian W, et al. Identification of novel single nucleotide variants in the drug resistance mechanism of Mycobacterium tuberculosis isolates by whole-genome analysis. BMC Genomics. 2024;25(1):478. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Rockah-Shmuel L, et al. Correlated occurrence and bypass of frame-shifting insertion-deletions (InDels) to give functional proteins. PLoS Genet. 2013;9(10):e1003882. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Lalonde S, et al. Frameshift indels introduced by genome editing can lead to in-frame exon skipping. PLoS ONE. 2017;12(6):e0178700. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Li H. Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics. 2014;30(20):2843–51. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Rimmer A. Calling Variants from Sequence Data, in Assessing Rare Variation in Complex Traits: Design and Analysis of Genetic Studies, E. Zeggini and A. Morris, Editors. 2015, Springer New York: New York, NY. p. 15–31.
- 11.He H, Ma Y. Imbalanced learning: foundations, algorithms, and applications. Hoboken: Wiley; 2013. 10.1002/9781118646106.
- 12.Duda RO, Hart PE, Stork DG. Pattern classification. 2nd ed. Hoboken: Wiley; 2000.
- 13.Spinella J-F, et al. SNooPer: a machine learning-based method for somatic variant identification from low-pass next-generation sequencing. BMC Genomics. 2016;17:1–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Díaz-Navarro A, et al. RFcaller: a machine learning approach combined with read-level features to detect somatic mutations. NAR Genom Bioinform. 2023;5(2):lqad056. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Zhang C, Ochoa I. VEF: a variant filtering tool based on ensemble methods. Bioinformatics. 2020;36(8):2328–36. [DOI] [PubMed] [Google Scholar]
- 16.Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001:29(5);1189–232. 10.1214/aos/1013203451.
- 17.Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci. 1997;55(1):119–39. [Google Scholar]
- 18.Breiman L. Random forests. Mach Learn. 2001;45:5–32. [Google Scholar]
- 19.Kolesnikov A, et al. DeepTrio: variant calling in families using deep learning. bioRxiv. 2021. 10.1101/2021.04.05.438434.
- 20.Khazeeva G, et al. DeNovoCNN: a deep learning approach to de novo variant calling in next generation sequencing data. Nucleic Acids Res. 2022;50(17):e97–e97. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Poplin R, et al. A universal SNP and small-indel variant caller using deep neural networks. Nat Biotechnol. 2018;36(10):983–7. [DOI] [PubMed] [Google Scholar]
- 22.Supernat A, et al. Comparison of three variant callers for human whole genome sequencing. Sci Rep. 2018;8(1):17851. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Lin Y-L, et al. Comparison of GATK and DeepVariant by trio sequencing. Sci Rep. 2022;12(1):1809. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Betschart RO, et al. Comparison of calling pipelines for whole genome sequencing: an empirical study demonstrating the importance of mapping and alignment. Sci Rep. 2022;12(1):21502. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. Adv Neural Inf Process Syst. 2012;25:1097–105.
- 26.McKenna A, et al. The genome analysis toolkit: a mapreduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20(9):1297–303. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011;43(5):491–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Prado-Martinez J, et al. Great ape genetic diversity and population history. Nature. 2013;499(7459):471–5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Phillips KA, et al. Why primate models matter. Am J Primatol. 2014;76(9):801–27. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Varki A, Altheide TK. Comparing the human and chimpanzee genomes: searching for needles in a haystack. Genome Res. 2005;15(12):1746–58. [DOI] [PubMed] [Google Scholar]
- 31.Scally A, Durbin R. Revising the human mutation rate: implications for understanding human evolution. Nat Rev Genet. 2012;13(10):745–53. [DOI] [PubMed] [Google Scholar]
- 32.Ke G, et al. LightGBM: A highly efficient gradient boosting decision tree. Adv Neural Inf Process Syst. 2017;30:3146–54.
- 33.Bentéjac C, Csörgő A, Martínez-Muñoz G. A comparative analysis of gradient boosting algorithms. Artif Intell Rev. 2021;54:1937–67. [Google Scholar]
- 34.Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. 2013. https://arxiv.org/abs/1303.3997.
- 35.Danecek P, et al. Twelve years of SAMtools and BCFtools. Gigascience. 2021;10(2):giab008. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Van der Auwera GA, O'Connor BD. Genomics in the cloud: using Docker, GATK, and WDL in Terra. Sebastopol: O'Reilly Media; 2020.
- 37.Picard Toolkit. Broad Institute, GitHub repository; 2019. Retrieved from http://broadinstitute.github.io/picard.
- 38.Zhao S, et al. Accuracy and efficiency of germline variant calling pipelines for human genome data. Sci Rep. 2020;10(1):20222. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Guo C, et al. On calibration of modern neural networks. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70. PMLR; 2017. pp. 1321–30.
- 40.Niculescu-Mizil A, Caruana R. Predicting good probabilities with supervised learning. In: Proceedings of the 22nd International Conference on Machine Learning. ACM; 2005. pp. 625–32. 10.1145/1102351.1102430.
- 41.Li H, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25(16):2078–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011;27(21):2987–93. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526(7571):68. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Sherry ST, Ward M, Sirotkin K. dbSNP—database for single nucleotide polymorphisms and other classes of minor genetic variation. Genome Res. 1999;9(8):677–9. [PubMed] [Google Scholar]
- 45.Bailey JA, Eichler EE. Primate segmental duplications: crucibles of evolution, diversity and disease. Nat Rev Genet. 2006;7(7):552–64. [DOI] [PubMed] [Google Scholar]
- 46.Bimber BN, et al. Whole genome sequencing predicts novel human disease models in rhesus macaques. Genomics. 2017;109(3–4):214–20. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Gibbs RA, et al. Evolutionary and biomedical insights from the rhesus macaque genome. Science. 2007;316(5822):222–34. [DOI] [PubMed] [Google Scholar]
- 48.Xue C, et al. The population genomics of rhesus macaques (Macaca mulatta) based on whole-genome sequences. Genome Res. 2016;26(12):1651–62. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Warren WC, et al. Sequence diversity analyses of an improved rhesus macaque genome enhance its biomedical utility. Science. 2020;370(6523):6617. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Chen T, Guestrin C. Xgboost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. pp. 785–94. 10.1145/2939672.2939785.
- 51.Menard S. Applied logistic regression analysis, vol. 106. 2nd ed. Thousand Oaks: Sage Publications; 2002.
- 52.Mucherino A, et al. K-nearest neighbor classification. In: Data Mining in Agriculture. New York: Springer; 2009. pp. 83–106.
- 53.Zhang H. The optimality of naive Bayes. AAAI Conference on Artificial Intelligence. 2004;1:562–7.
- 54.Popescu M-C, et al. Multilayer perceptron and neural networks. WSEAS Transactions on Circuits and Systems. 2009;8(7):579–88. [Google Scholar]
- 55.Gorishniy Y, et al. Revisiting deep learning models for tabular data. Adv Neural Inf Process Syst. 2021;34:18932–43. [Google Scholar]
- 56.Chen Y-C. A tutorial on kernel density estimation and recent advances. Biostatistics & Epidemiology. 2017;1(1):161–87. [Google Scholar]
- 57.Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. Adv Neural Inf Process Syst. 2017;30:5998–6008.
- 58.Bahdanau D. Neural machine translation by jointly learning to align and translate. 2014. https://arxiv.org/abs/1409.0473
- 59.Shwartz-Ziv R, Armon A. Tabular data: Deep learning is not all you need. Information Fusion. 2022;81:84–90. [Google Scholar]
- 60.Grinsztajn L, Oyallon E, Varoquaux G. Why do tree-based models still outperform deep learning on typical tabular data? Adv Neural Inf Process Syst. 2022;35:507–20. [Google Scholar]
- 61.Le-Khac PH, Healy G, Smeaton AF. Contrastive representation learning: A framework and review. IEEE Access. 2020;8:193907–34. [Google Scholar]
- 62.Jumper J, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596(7873):583–9. [DOI] [PMC free article] [PubMed] [Google Scholar]