Virus Evolution. 2024 Feb 7;10(1):veae013. doi: 10.1093/ve/veae013

Improved detection of low-frequency within-host variants from deep sequencing: A case study with human papillomavirus

Sambit K Mishra 1,2,†, Chase W Nelson 3,†,§, Bin Zhu 4, Maisa Pinheiro 5, Hyo Jung Lee 6,7, Michael Dean 8,**, Laurie Burdett 9,10, Meredith Yeager 11,12, Lisa Mirabello 13,*
PMCID: PMC10919477  PMID: 38455683

Abstract

High-coverage sequencing allows the study of variants occurring at low frequencies within samples, but is susceptible to false positives caused by sequencing error. Ion Torrent has a very low single nucleotide variant (SNV) error rate and has been employed for the majority of human papillomavirus (HPV) whole genome sequences. However, benchmarking of intrahost SNVs (iSNVs) has been challenging, partly due to limitations imposed by the HPV life cycle. We address this problem by deep sequencing three replicates for each of 31 samples of HPV type 18 (HPV18). Errors, defined as iSNVs observed in only one of three replicates, are dominated by C→T (G→A) changes, independently of trinucleotide context. True iSNVs, defined as those observed in all three replicates, instead show a more diverse SNV type distribution, with particularly elevated C→T rates in CCG context (CCG→CTG; CGG→CAG) and C→A rates in ACG context (ACG→AAG; CGT→CTT). Characterization of true iSNVs allowed us to develop two methods for detecting true variants: (1) VCFgenie, a dynamic binomial filtering tool which uses each variant’s allele count and coverage instead of fixed frequency cut-offs; and (2) a machine learning binary classifier which trains eXtreme Gradient Boosting models on variant features such as quality and trinucleotide context. Each approach outperforms fixed-cut-off filtering of iSNVs, and performance is enhanced when both are used together. Our results provide improved methods for identifying true iSNVs in within-host applications across sequencing platforms, specifically using HPV18 as a case study.

Keywords: deep sequencing, intrahost single nucleotide variant (iSNV), machine learning, sequencing error, VCFgenie

Introduction

Deep (high-coverage) next-generation sequencing (NGS) allows the study of genetic variation in ‘pooled’ samples that contain multiple genomes. Common targets of this approach include within-host (intrahost) virus populations (multiple virus genomes) (Nelson and Hughes 2015) and tumour cell populations (multiple somatic genomes) (Poduri et al. 2013). Deep coverage enables the evaluation of variants that occur in a low fraction of the sequence reads, including intrahost single nucleotide variants (iSNVs; Fig. 1). Here, the relative frequency of each iSNV is estimated from its variant allele fraction (VAF), e.g. a C→T iSNV in 10 per cent of reads is inferred to have a within-host allele frequency of 10 per cent. However, low-VAF variant calling can suffer from high rates of both false negatives and false positives (Kim et al. 2019). Sequencing error alone can produce false-positive low-VAF variants, necessitating careful quality control (McCrone, Lauring, and Dermody 2016). Given the increasing interest in evaluating low-VAF variants in biomedical and evolutionary research, improved approaches for accurate detection are warranted.

Figure 1.


Illustration of within- and between-host variants. iSNVs are contrasted with consensus-level SNPs between two hosts. Minor (VAF < 50 per cent; less common) alleles are indicated (bold, bright pink). iSNVs are present in host 1 at viral nucleotide sites 3, 5, and 8, and in host 2 at site 5. In contrast, one consensus-level SNP occurs between the hosts at site 6. iSNVs and SNPs may or may not occur at the same sites. AF = allele frequency (between-host).

Common criteria for curating iSNVs include fixed cut-offs such as (1) a minimum VAF (e.g. 5 per cent); (2) a minimum total read coverage (e.g. 200); (3) a minimum number of variant reads (e.g. 10); (4) elimination of variants displaying strand bias; and (5) elimination of variants in amplicons with primer mismatches (Grubaugh et al. 2019; Lauring 2020). Machine learning and deep learning approaches have also been employed for discriminating between true and false somatic SNVs in tumours (Spinella et al. 2016; Ainscough et al. 2018; Wu et al. 2020). However, in both approaches, the specific cut-offs chosen may be arbitrary and fail to incorporate useful metadata (e.g. nucleotide context, strand bias, and quality), and benchmarking may be prohibitive in certain biological systems. To our knowledge, machine learning has not been used to identify true iSNVs in deeply sequenced within-host virus populations.

Human papillomavirus (HPV) is a ∼7.9 kb, double-stranded DNA virus, and persistent infection with one of the ∼13 high-risk (HR)-HPV types causes over 690,000 cancers worldwide per year, including virtually all ∼604,000 cervical cancers (de Martel et al. 2020; Sung et al. 2021; Singh et al. 2023). Within a given HR-HPV type, viral genetic variation can be classified into evolutionary lineages and sublineages differing by 0.5–10 per cent, and on a finer scale, as individual single nucleotide polymorphisms (SNPs), and both have been related to cervical precancer and cancer risk (Nelson and Mirabello 2023). With deep whole-genome NGS of HPV, more recent studies have also identified iSNVs within hosts, including variants that may be clinically relevant (e.g. those induced by the antiviral human APOBEC3 enzyme) (Mirabello et al. 2017; Hirose et al. 2018; Zhu et al. 2020; Warren, Santiago, and Pyeon 2022; Kogure et al. 2023). Many of these deep NGS analyses have been made possible by a high-throughput Ion Torrent HPV sequencing assay (Cullen et al. 2015), particularly given the low single-base substitution error rates of updated Ion chemistry (Pereira et al. 2016). However, HPV is dependent on the full differentiation programme of host cells to complete its life cycle, and is therefore not amenable to laboratory culture (Rowson and Mahy 1967; Meyers et al. 1992; Fausch et al. 2003). This has made dilution benchmarking of iSNV frequencies (e.g. variants mixed at known ratios) and evaluations of accurate iSNV detection impossible.

In this study, we overcome the inability to perform HPV dilution iSNV benchmarking by instead sequencing 31 HPV type 18 (HPV18) samples three times (i.e. three technical replicates). After identifying a set of high-confidence iSNVs, we develop two distinct but complementary methods for identifying true variants: (1) VCFgenie, a binomial test filter that implements a dynamic VAF cut-off for each variant, replacing the need for arbitrary fixed cut-offs; and (2) a supervised machine learning technique that uses gradient boosting to analyse exhaustive sequencing metadata, here Ion Torrent data. VCFgenie and machine learning both perform better than fixed-cut-off filtering for identification of true iSNVs, and using the two approaches together further improves performance. Our study highlights the value of combining simple dynamic filtering and supervised approaches for variant detection. These methods can be easily applied to other sequencing platforms and biological systems, including both virus and somatic tumour cell populations.

Materials and methods

Study population

We selected 31 samples from the Persistence and Progression (PaP) cohort at Kaiser Permanente Northern California (KPNC) in the USA. The PaP cohort has been previously described (Castle et al. 2011), and includes ∼55,000 women, aged 21–70 years, who underwent routine cervical cancer screening between December 2007 and January 2011 using HPV and cytology co-testing. Cervical exfoliated cells were tested clinically using Hybrid Capture 2 (HC2; Qiagen Inc., Gaithersburg, MD) to detect 13 HR-HPV types. Typing of archived specimens in neutralized specimen transport medium (STM; Qiagen Inc., Gaithersburg, MD) was performed using a variety of assays, including Onclarity (BD, Franklin Lakes, NJ), Linear Array (Roche Diagnostics, Indianapolis, IN), and MY09-MY11 PCR. We randomly selected 31 HPV18-positive controls for our study, defined as women positive for HPV18 and diagnosed with cervical intraepithelial neoplasia grade 1 (CIN1) or lower (within normal limits or atypia), whose HPV18 infection subsequently became undetectable or who did not progress to precancer or cancer (i.e. CIN2+) throughout the study follow-up. The National Cancer Institute and Kaiser Permanente Institutional Review Boards approved this study.

NGS: Independent technical replicates

Three technical replicates were prepared for each sample as follows. DNA was extracted from exfoliated cervical cells using proteinase K, where 30 µL of the banked STM cells were transferred to 100 µL of K buffer containing 200 µg/mL proteinase K and incubated at 55°C for 2 h followed by a 10 min incubation at 95°C. Following DNA extraction, three separate replicates were aliquoted from each sample. A total of 93 HPV18-positive DNA reactions (31 samples × 3 technical replicates) were loaded onto a 96-well plate (5.5 µL per well) for independent PCR amplification, library preparation, and sequencing, performed as previously described (Cullen et al. 2015). Three wells with water were also included as negative controls. For NGS, distinctly barcoded libraries were prepared using Ion AmpliSeq Library Kit 2.0 (Thermo Fisher Scientific, Carlsbad, CA) and a custom AmpliSeq panel comprising 45 overlapping amplicons covering the entire 7857 bp HPV18 genome. The libraries were pooled and sequenced on the Ion GeneStudio S5 System (Thermo Fisher Scientific, Carlsbad, CA). Raw sequencing reads were subjected to adapter trimming and then aligned to the HPV18 reference genome, HPV18REF (NCBI Ref. Seq. AY262282; Burk, Harari, and Chen 2013; Van Doorslaer et al. 2017; https://pave.niaid.nih.gov/) using the Torrent Mapping Alignment software v5.4.11 (https://github.com/iontorrent/TS/tree/master/Analysis/TMAP). Variant calling was performed using the Torrent Variant Caller (TVC) suite (https://github.com/domibel/IonTorrent-VariantCaller). The Snakemake workflow and parameters used to process and analyse the sequenced data can be found at https://github.com/NCI-CGR/HPV_low_VAF_SNV_prediction.

Variant classification

From a total of 10,950 putative within-sample variants called across all replicates, 9,225 were iSNVs, of which we focused on the 9,155 that passed Ion Torrent flow correction (corrected allele count >0) (Supplementary File 1). Median flow-corrected iSNV coverage (sequencing depth) for individual replicates was 1,984 (interquartile range: 1,980–1,991; range: 13–1,997; Supplementary Fig. S1). Each iSNV was assigned a replicate frequency: 3/3 (present in all replicates; true iSNVs); 2/3 (two replicates; ambiguous iSNVs); and 1/3 (one replicate; errors). Replicate frequency was then used as a proxy for iSNV validity, i.e. we assumed that a true iSNV would be detected consistently across more than one replicate, while false iSNVs would not appear in all replicates. While these criteria are justified by the independence of our technical replicates, it is important to note that replicate frequency technically measures reproducibility rather than ground truth, e.g. true iSNVs could be missed (see Discussion). As we demonstrate below, our approach gives optimal performance when considering true iSNVs as those called in all replicates (3/3), false iSNVs as those called in only one replicate (1/3), and excluding the ambiguous iSNVs (2/3 replicates).
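The replicate-frequency labelling can be sketched as follows (a minimal illustration with hypothetical sample and variant identifiers, not the study's actual pipeline code):

```python
from collections import defaultdict

def label_by_replicate_frequency(calls):
    """Label each putative iSNV by how many of a sample's three technical
    replicates it was called in (3/3 = true, 2/3 = ambiguous, 1/3 = error).

    `calls` is an iterable of (sample_id, replicate_id, variant) tuples,
    where `variant` identifies the iSNV, e.g. (position, ref, alt).
    """
    reps = defaultdict(set)
    for sample, replicate, variant in calls:
        reps[(sample, variant)].add(replicate)
    names = {3: "true", 2: "ambiguous", 1: "error"}
    return {key: names[len(r)] for key, r in reps.items()}

# Hypothetical calls for one sample sequenced in triplicate
calls = [
    ("S1", "r1", (104, "C", "T")), ("S1", "r2", (104, "C", "T")), ("S1", "r3", (104, "C", "T")),
    ("S1", "r1", (2210, "C", "A")),
    ("S1", "r2", (530, "T", "C")), ("S1", "r3", (530, "T", "C")),
]
labels = label_by_replicate_frequency(calls)
```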

VCFgenie: A dynamic VAF cut-off using the binomial distribution

We developed the command-line tool VCFgenie, written in Python, to introduce a reproducible (no custom code) approach for filtering variant call format (VCF) files. This includes a new method for dynamic VAF-based filtering.

First, VCFgenie implements rules to filter and modify records based on values in the standard VCF file’s INFO or FORMAT/sample columns. For example, to eliminate variants with STB (strand bias) values falling outside the range 0.5–0.9, VCFgenie can be called with the command-line argument: --INFO_rules="STB>0.5,STB<0.9". Either reference (REF) or alternate (ALT) allele(s) will fail if they do not meet a criterion. All alleles at multi-allelic sites are considered concurrently, e.g. if a REF allele is the minor allele and fails, its reads are re-allocated to the passing ALT allele(s).
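The rule syntax can be approximated with a few lines of Python (a minimal sketch of how such a rule string might be evaluated, not VCFgenie's actual implementation; `passes_info_rules` is an illustrative name):

```python
import operator
import re

# Ordered so that ">=" and "<=" are matched before ">" and "<"
OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt,
       "<": operator.lt, "=": operator.eq}

def passes_info_rules(info, rules):
    """Check a variant's INFO values against a rule string such as
    "STB>0.5,STB<0.9". `info` maps INFO keys to numeric values."""
    for rule in rules.split(","):
        m = re.match(r"(\w+)(>=|<=|>|<|=)(.+)", rule)
        key, op, threshold = m.group(1), m.group(2), float(m.group(3))
        if key not in info or not OPS[op](info[key], threshold):
            return False  # the allele fails this rule
    return True

ok = passes_info_rules({"STB": 0.7}, "STB>0.5,STB<0.9")       # within range
bad = passes_info_rules({"STB": 0.95}, "STB>0.5,STB<0.9")     # too biased
```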

Second, VCFgenie implements a new method for VAF-based variant filtering (Nelson et al. 2020) which improves performance compared to fixed—often arbitrary—cut-offs (e.g. eliminating iSNVs with VAF <0.05). Specifically, each variant is tested relative to a binomial distribution defined by the sequencing error rate and the site’s coverage. The null hypothesis that the variant is due to sequencing error is tested using a P value calculated as P(x ≥ ac) = 1 – binom.cdf(x = ac – 1, n = DP, p = E), where E = the sequencing error rate (binomial ‘success’); DP = the site’s read depth (coverage); and ac = the variant’s allele count (the observed number of ‘successes’). If the P value is sufficiently small, this indicates that the observed ac is unlikely to be due to sequencing error, and the null hypothesis is rejected in favour of the alternative hypothesis that the variant is true. For example, given a single base substitution error rate of 0.003 for a particular iSNV type (e.g. C→T), if this iSNV is observed with ac = 8 and DP = 1,000 (i.e. VAF = 0.008), then its P value is P(x ≥ 8) = 1 – binom.cdf(7, 1000, 0.003) = 0.012 (Supplementary Fig. S2).
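The worked example above can be reproduced with the standard library alone (a minimal sketch of the binomial calculation, not VCFgenie's implementation; `binomial_error_p` is an illustrative name):

```python
from math import comb

def binomial_error_p(ac, dp, error_rate):
    """P(x >= ac) under the null hypothesis that all variant-supporting
    reads arose from sequencing error, i.e. 1 - binom.cdf(ac - 1, dp, E)."""
    cdf = sum(comb(dp, k) * error_rate**k * (1 - error_rate)**(dp - k)
              for k in range(ac))
    return 1 - cdf

# Worked example from the text: ac = 8, DP = 1,000, E = 0.003
p = binomial_error_p(8, 1000, 0.003)  # ~0.012
```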

VCFgenie accepts as input the overall mean single base substitution error rate per read-base sequenced. This should reflect the total error rate from all sources, including sample preparation, library preparation, and sequencing. Because there are three possible single base errors at a given nucleotide site (e.g. C→A, C→G, or C→T at a C site), this total error rate is internally divided by 3, i.e. the above example would occur when the user supplied an error rate of 0.009 (0.009/3 = 0.003). More complex error models are possible but beyond the scope of this study. Our analysis used an error rate of 1.29 × 10⁻⁴ (0.0049 single nucleotide errors per base × 2.64 per cent substitutions), based on results for the Ion Torrent Hi-Q chemistry used in our study (Pereira et al. 2016). Finally, we implemented a Bonferroni P value cut-off of 0.05/9,225 = 5.42 × 10⁻⁶, where 9,225 is the number of putative iSNVs before flow correction or subsequent filtering.
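The error model constants used in the analysis follow directly from the values quoted above (plain arithmetic, shown here only as a check):

```python
# Values from the text (Pereira et al. 2016): 0.0049 single nucleotide
# errors per base sequenced, of which 2.64 per cent are substitutions.
total_error_rate = 0.0049 * 0.0264     # ~1.29e-4 substitutions per read-base
per_type_rate = total_error_rate / 3   # three possible substitutions per site
bonferroni_cutoff = 0.05 / 9225        # ~5.42e-6, over all putative iSNVs
```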

Parameters and machine learning

For machine learning, we considered three possible definitions of true and false variants (i.e. ground truth) and four fixed VAF lower limits, ultimately testing all combinations of the parameter values shown in Table 1. Performance was maximized when considering iSNVs present in 3/3 replicates to be true; iSNVs present in 1/3 replicates to be errors; and using VCFgenie instead of a VAF lower limit.

Table 1.

Parameter definitions and thresholds.

Parameter Description Criteria
True and False iSNV (replicate frequencies) The minimum/maximum number of replicates in which an iSNV should be detected to label it as a true/false variant (1) ≥2/3 = True, 1/3 = False;
(2) 3/3 = True, ≤2/3 = False;
(3) 3/3 = True, 1/3 = False
VAF lower limit^a The minimum value for the variant allele fraction of a variant (1) 1%;
(2) 2%;
(3) 5%;
(4) 10%
^a Performance was maximized using VCFgenie rather than a fixed VAF lower limit.

We identified 31 variant features for use in our machine learning models (Table 2). Of these, 29 were parameters obtained from the VCF files produced by Torrent Variant Caller (e.g. QUAL = quality). We also included the 5′ and 3′ flanking nucleotides of each variant position in the reference genome, i.e. trinucleotide context. To consider multicollinearity, these features were classified into one or more of three feature groups, termed Exhaustive, Moderate, or Strict (Table 3). Here, the Moderate group included only features that are not a direct function of another feature(s), while the Strict group included only features that are neither a function of, nor expected to be highly correlated with, another feature(s). For example, QD, calculated as 4 × QUAL/FDP, is included only in the Exhaustive category, because it is a function of QUAL and FDP.

Table 2.

Features used as predictors in machine learning.

Predictor Description
AO Alternate allele observations
DP Total read depth at the iSNV position
FAO Flow evaluator alternate allele observations
FDP Flow evaluator read depth at the iSNV position
FRO Flow evaluator reference allele observations
FSAF Flow evaluator alternate allele observations on the forward strand
FSAR Flow evaluator alternate allele observations on the reverse strand
FSRF Flow evaluator reference observations on the forward strand
FSRR Flow evaluator reference observations on the reverse strand
FWDB Forward strand bias in prediction
FXX Flow evaluator failed read ratio
GQ Genotype quality, the Phred-scaled marginal (or unconditional) probability of the called genotype
HRUN Run length: the number of consecutive repeats of the alternate allele in the reference genome
LEN Allele length
MLLD Mean log-likelihood delta per read
QD Quality by depth; 4 × QUAL/FDP
QUAL Quality
RBI Distance of bias parameters from zero
REFB Reference hypothesis bias in prediction
REVB Reverse strand bias in prediction
RO Reference allele observation count
SAF Alternate allele observations on the forward strand
SAR Alternate allele observations on the reverse strand
SRF Number of reference observations on the forward strand
SRR Number of reference observations on the reverse strand
SSSB Strand-specific strand bias for allele
STB Strand bias in variant relative to reference
STBP P value of strand bias in variant relative to reference
VARB Variant hypothesis bias in prediction
5PC Five prime context; nucleotide flanking the iSNV site on the 5′ side
3PC Three prime context; nucleotide flanking the iSNV site on the 3′ side

Parameter descriptions were obtained from VCF files produced using Torrent Variant Caller.

Table 3.

Feature categories for machine learning.

Category Description Features
Exhaustive (n = 31) All features AO, DP, FAO, FDP, FRO, FSAF, FSAR, FSRF, FSRR, FWDB, FXX, GQ, HRUN, LEN, MLLD, QD, QUAL, RBI, REFB, REVB, RO, SAF, SAR, SRF, SRR, SSSB, STB, STBP, VARB, 5PC, 3PC
Moderate (n = 20) Only those features that are not a function of another feature(s) FSAF, FSAR, FSRF, FSRR, FWDB, FXX, GQ, MLLD, QUAL, REFB, REVB, SAF, SAR, SRF, SRR, SSSB, STB, VARB, 5PC, 3PC
Strict (n = 14) Non-redundant features that are neither a function of, nor expected to be highly correlated with, another feature(s) FSAF, FSAR, FSRF, FSRR, FWDB, FXX, MLLD, QUAL, REFB, REVB, SSSB, VARB, 5PC, 3PC

Three feature categories used to train the machine learning models. Features are described in Table 2.

We tested all 36 combinations of the above parameter thresholds (Table 1) and feature categories (Table 3), i.e. 3 true/false thresholds × 4 VAF lower limits × 3 feature categories. Additionally, we tested each combination using VCFgenie instead of a VAF lower limit, or VCFgenie in addition to a VAF lower limit, for a total of 36 + 9 + 36 = 81 models. These models can be summarized as representing one of three strategies:

  • (i) FM (VAF Filter + Machine learning): minimum VAF cut-off filtering; 36 parameter/feature combinations; machine learning.

  • (ii) VM (VCFgenie + Machine learning): VCFgenie filtering (no minimum VAF cut-off); 9 parameter/feature combinations excluding a VAF lower limit; machine learning.

  • (iii) FVM (VAF Filter + VCFgenie + Machine learning): minimum VAF cut-off filtering; VCFgenie filtering; 36 parameter/feature combinations; machine learning.

In practice, iSNVs were first identified for a given definition of true and false (e.g. True = 3/3; False = 1/3), a given VAF threshold (e.g. VAF lower limit = 1 per cent), and usage of VCFgenie (used or not used), followed by training and testing with metadata from each of the three feature categories. An overview is presented in Fig. 2.
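The model count can be verified by enumerating the parameter grid (a sketch over the combinations defined in Tables 1 and 3; the labels are shorthand, not the study's identifiers):

```python
from itertools import product

truth_defs = ["2/3+ true, 1/3 false", "3/3 true, <=2/3 false",
              "3/3 true, 1/3 false"]                       # Table 1
vaf_limits = [0.01, 0.02, 0.05, 0.10]                      # Table 1
feature_sets = ["Exhaustive", "Moderate", "Strict"]        # Table 3

fm = len(list(product(truth_defs, vaf_limits, feature_sets)))  # VAF filter
vm = len(list(product(truth_defs, feature_sets)))              # VCFgenie only
fvm = fm                                        # VAF filter + VCFgenie
total = fm + vm + fvm                           # 36 + 9 + 36 = 81 models
```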

Figure 2.


Overview of classification approaches used in this study. (A) Hybrid prediction method using the VCFgenie binomial filter and supervised machine learning (XGBoost). Raw VCF files from the variant calling pipeline are first processed using VCFgenie (VM, FVM) and/or a low-VAF cut-off (FM, FVM). The iSNVs and their labels (true/ambiguous/false based on replicate frequency) are extracted and used to train machine learning classification algorithms (see Fig. 3). (B) VCFgenie uses the binomial distribution to filter each variant based on its depth (DP) and allele count (AC) to generate processed VCF files, where alleles that FAIL have their AC and VAF values set to 0.

Model training/testing was implemented with discrete eXtreme Gradient Boosting (XGBoost) binary classification models (Chen and Guestrin 2016) using Python’s xgboost and scikit-learn packages (Pedregosa et al. 2011). XGBoost was chosen because of the method’s success in tumour variant calling pipelines and data science competitions (McLaughlin et al. 2023), and because our preliminary analyses with random forest yielded slightly inferior performance and higher variance (Supplementary Fig. S3). To prevent data leakage wherein the same iSNVs from different replicates of the same sample are present in both the training and testing data, we devised a sample-based strategy (Fig. 3): allocation of a sample to either the training or testing dataset involved placing all three of the sample’s replicates in that dataset. Six samples were randomly selected for placement in the testing dataset; the remaining 25 were placed in the training dataset. These groups were then used to train a single model for each of FM, VM, and FVM.
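The leakage-free split can be sketched as follows (a minimal illustration of the sample-level allocation, with hypothetical sample identifiers):

```python
import random

def sample_level_split(sample_ids, n_test=6, seed=0):
    """Split samples (not replicates) into train/test sets so that all
    three replicates of a sample land on the same side, preventing the
    same iSNVs from appearing in both training and testing data."""
    rng = random.Random(seed)
    test = set(rng.sample(sorted(sample_ids), n_test))
    train = set(sample_ids) - test
    return train, test

samples = {f"S{i:02d}" for i in range(1, 32)}  # 31 samples, as in the study
train, test = sample_level_split(samples)
```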

Figure 3.


Machine learning binary classification methodology. Depending on the definitions of false and true variants, each iSNV is labelled either as 0 (false) or 1 (true). For VM and FVM models, the iSNVs are obtained from the processed VCF files generated by VCFgenie; for the FM model, raw VCF files are used. To train prediction models for a parameter/feature combination, the samples are randomly split into training (25 samples) and testing (6 samples) datasets. To address class imbalance (i.e. more false than true iSNVs), the majority class is undersampled. The testing data are used to independently assess model performance using the AUC, MCC, F1 score, accuracy, and MSE metrics. Sample allocation, training, and testing are repeated until model rankings based on median MCC converge; 50 iterations were sufficient in our study (Supplementary Fig. S5).

To address class imbalance, each round of training involved undersampling the majority class (false iSNVs) to match the number of observations in the minority class (true iSNVs; Supplementary Fig. S4). Performance on testing data was evaluated using five metrics: area under the receiver operator characteristic curve (AUC), Matthews correlation coefficient (MCC), F1 score, accuracy, and mean squared error (MSE; Supplementary Information). Unless otherwise noted, models were compared using MCC, given this metric’s robustness to class imbalance. Finally, the entire process was iterated, starting at the training/testing split, until results converged with respect to which parameter/feature combination was ranked highest by a performance metric (stopping criterion). This was necessary given the small size of our dataset, as any one training/testing split is subject to substantial stochastic fluctuation. For our data, 50 iterations proved more than sufficient for convergence (Supplementary Fig. S5).
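Both the undersampling step and the MCC metric are straightforward to express in plain Python (a sketch, not the study's code; in practice scikit-learn's `matthews_corrcoef` would give the same result):

```python
import math
import random

def undersample(labels, seed=0):
    """Return indices after undersampling the majority class down to the
    minority class size, balancing true (1) and false (0) iSNVs."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    major, minor = (neg, pos) if len(neg) > len(pos) else (pos, neg)
    rng = random.Random(seed)
    return sorted(minor + rng.sample(major, len(minor)))

def mcc(y_true, y_pred):
    """Matthews correlation coefficient from the binary confusion matrix."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```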

Initial model hyperparameter values were booster = gbtree, use_label_encoder = False, n_estimators = 100, eta = 0.2, max_depth = 10, gamma = 1, reg_lambda = 1, eval_metric = logloss, and otherwise the default value provided in xgboost v. 1.5. To evaluate whether performance could be improved by changing model structure, hyperparameter tuning was performed as follows. After identifying the top-performing model for each strategy (FM, VM, and FVM) above, a grid search was performed in which two or three possible values of eight important hyperparameters were evaluated (Supplementary Table S1). These values always included but were not limited to the default values used in xgboost and scikit-learn, for a total of 3⁶ × 2² = 2,916 hyperparameter combinations. Performance was evaluated using the same methodology and metrics as for initial training/testing (Supplementary Fig. S6). The top-scoring hyperparameter combinations were selected by comparing median performance across all 50 iterations for each of the FM, VM, and FVM strategies (Supplementary File 2), which was again sufficient for convergence (Supplementary Fig. S7).
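A grid of this shape (six hyperparameters with three candidate values, two with two) yields the stated number of combinations. The sketch below uses hypothetical candidate values for illustration; the actual grid is given in Supplementary Table S1:

```python
from itertools import product

# Hypothetical candidate values; only the 3^6 x 2^2 shape matches the text.
grid = {
    "n_estimators": [50, 100, 200],
    "eta": [0.1, 0.2, 0.3],
    "max_depth": [6, 10, 15],
    "gamma": [0, 1, 5],
    "reg_lambda": [0.5, 1, 2],
    "subsample": [0.5, 0.8, 1.0],
    "booster": ["gbtree", "dart"],
    "grow_policy": ["depthwise", "lossguide"],
}
combos = list(product(*grid.values()))  # every hyperparameter combination
```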

Results

Characterization of iSNV replicates

We deep sequenced HPV18 whole genomes in triplicate from 31 HPV18-positive women with benign infections from the NCI-Kaiser PaP cohort, identifying 9,225 putative iSNVs (alternate alleles). Each iSNV was classified by presence in one, two, or all three replicates of a sample, as 1/3 (false), 2/3 (ambiguous), or 3/3 (true). The replicate frequency indicated reproducibility across fully independent technical replicates (PCR amplification, library preparation, and sequencing) and was considered a proxy for a variant’s veracity, i.e. an iSNV called in all three replicates was considered true (Table 1).

Of 9,225 initial iSNVs, 9,155 were retained after processing by the Ion Torrent flow correction algorithm (corrected observed allele count [FAO] ≥ 1). This set included 4,810, 673, and 3,672 iSNVs called in 3/3, 2/3, and 1/3 replicates, respectively (Supplementary Fig. S8A), and was used for all downstream analyses. The distribution of iSNVs by VAF (i.e. the site frequency spectrum relative to the reference allele) was bimodal (U shaped), with peaks at VAF ≤5 per cent and VAF >95 per cent (Fig. 4). Consistent with previous analyses of HPV iSNVs (Zhu et al. 2020), the lowest counts were observed near VAF = 60 per cent. Given our reference-based assembly and focus on low-VAF variants, we grouped the iSNVs into two frequency categories: minor (VAF <50 per cent; 4,614 iSNVs) and major (VAF ≥50 per cent; 4,541 iSNVs) (Fig. 4; Supplementary Fig. S8). Minor iSNVs included more false variants (77.9 per cent false), whereas major iSNVs were dominated by true variants (95.6 per cent true).

Figure 4.


Site frequency spectrum of iSNVs. Results are shown for all iSNVs passing Ion Torrent flow correction (FAO ≥1; n = 9,155) and reported as an ALT (alternate) allele relative to the PaVE reference sequence HPV18REF (Van Doorslaer et al. 2017; https://pave.niaid.nih.gov/). Bar height (y axis) shows the total number of iSNVs observed at a given within-host VAF (x axis) across all replicates. Colour denotes replicate frequency, with true iSNVs (turquoise) defined as those present in all three replicates. Minor = less common allele (VAF <50 per cent); Major = most common allele (VAF ≥50 per cent).

iSNV evaluations by trinucleotide context reveal a unique pattern for true vs. false variants

C→T variants were the most common SNV type among true (3/3), ambiguous (2/3), and false (1/3 replicates) iSNVs (Supplementary Fig. S9). True iSNVs were dominated by C→T (31.0 per cent), C→A (23.3 per cent), and T→C (20.2 per cent), but still included substantial numbers of other SNV types; strand enrichment was observed for T→C (the complement A→G was less common) and C→A (G→T was less common). In contrast, for false iSNVs, almost all variants were C→T (88.1 per cent), and no strand enrichment was observed (i.e. A→G and T→C did not differ; overlapping binomial 95 per cent confidence intervals) (Supplementary Fig. S9; Supplementary File 3).

We observed distinct context-dependent mutation spectra for each replicate frequency (Fig. 5). First, we determined the nucleotide, dinucleotide, and trinucleotide content of the reference genome, HPV18REF, obtained from PaVE (Burk, Harari, and Chen 2013; Van Doorslaer et al. 2017; https://pave.niaid.nih.gov/) (Fig. 5A; Supplementary File 4). Next, raw iSNV counts were divided by the frequency of the appropriate trinucleotide in the reference genome and normalized to yield trinucleotide context-based iSNV rates (i.e. rates depending on the flanking 5′ and 3′ nucleotides). This revealed that single base substitution sequencing errors are dominated by C→T in a largely context-independent manner (Fig. 5B). In contrast, true iSNVs show a more diverse profile, with C→T elevated in CCG context and C→A elevated in ACG context (Fig. 5B), both consistent with CpG mutability in vivo. CCG and ACG are among the rarest trinucleotides in the HPV18 genome (Fig. 5A).
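The normalization described above can be sketched in a few lines (a toy example with a short artificial sequence, not the HPV18 genome or the study's code):

```python
from collections import Counter

def trinucleotide_counts(genome):
    """Count all overlapping 3-mers in a (linear, for simplicity) genome."""
    return Counter(genome[i:i + 3] for i in range(len(genome) - 2))

def context_rates(isnv_counts, genome):
    """Divide raw iSNV counts per (trinucleotide, substitution) pair by that
    trinucleotide's abundance, then normalize rates to sum to 100 per cent."""
    tri = trinucleotide_counts(genome)
    raw = {k: n / tri[k[0]] for k, n in isnv_counts.items() if tri[k[0]]}
    total = sum(raw.values())
    return {k: 100 * v / total for k, v in raw.items()}

genome = "ACGTCCGACGTACGTCCG"                       # toy sequence
isnvs = {("CCG", "C>T"): 4, ("ACG", "C>A"): 3}      # toy iSNV counts
rates = context_rates(isnvs, genome)
```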

Figure 5.


HPV18 sequence composition and molecular spectrum of iSNVs. (A) Sequence composition of the HPV18 reference genome, HPV18REF, obtained from PaVE (Van Doorslaer et al. 2017; https://pave.niaid.nih.gov/). Nucleotide, dinucleotide, and trinucleotide contents were calculated as the sum of all 1-, 2-, and 3-mer substrings of the genome, respectively, where each k-mer length category separately sums to 100 per cent. Colour denotes sequence contexts known to be subject to elevated mutation rates by deamination of methylated CpGs (yellow), mutation by APOBEC3 enzymes (blue), or both (green). (B) Molecular spectrum of iSNVs in trinucleotide context, stratified by replicate frequency. Trinucleotide labels on the x axis indicate the 5′ and 3′ flanking nucleotides, where the focal (central) nucleotide undergoes mutation. For each trinucleotide and substitution type, an initial rate was determined by dividing the observed number of iSNVs by the number of trinucleotides in the reference genome, as reflected in panel A. These rates were then normalized to sum to 100 per cent for each replicate frequency (row). S = strongly bonded (C:G) and W = weakly bonded (A:T) base pairs, where colour denotes category of change as S→W (orange; C→A, C→T), S→S/W→W (green; C→G, T→A), or W→S (blue; T→C, T→G). Source data: Supplementary File 4.

HPV18 G:C content was 40.4 per cent, with dinucleotides depleted for CpG and TpC, attributable to the deamination of methylated CpGs and mutagenic activity of APOBEC3 enzymes on TpCs (Nelson and Mirabello 2023). TCG was the least common trinucleotide (0.30 per cent). Consistent with previous reports on sequence content in papillomaviruses (Chen et al. 2021; King et al. 2022), TpCs placing the C at the third trinucleotide position are also very uncommon.
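Dinucleotide depletion of this kind is conventionally quantified as an observed/expected ratio; a minimal sketch (toy sequences, not the HPV18 genome):

```python
from collections import Counter

def dinucleotide_obs_exp(genome, dinuc):
    """Observed/expected frequency ratio for a dinucleotide; the expected
    value assumes the two bases occur independently. Ratios well below 1
    indicate depletion (e.g. CpG loss via deamination of methylated CpGs)."""
    n = len(genome)
    base = Counter(genome)
    obs = sum(genome[i:i + 2] == dinuc for i in range(n - 1)) / (n - 1)
    exp = (base[dinuc[0]] / n) * (base[dinuc[1]] / n)
    return obs / exp if exp else 0.0

depleted = dinucleotide_obs_exp("CCAATTGGCCAATTGG", "CG")  # no CpG at all
enriched = dinucleotide_obs_exp("CGCGCGCG", "CG")          # CpG-rich repeat
```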

iSNV features and prediction approaches

We compiled 31 iSNV features: 29 from the VCF files produced by the variant calling pipeline, as well as the 5′ and 3′ sequence context from the HPV18 reference genome (Table 2). We first explored whether any of the 29 predictors could distinguish between true and false variants. Specifically, we compared the distributions of FAO, FDP, FSAF, FSAR, QD, and QUAL for different replicate frequencies for both minor and major iSNVs. For major variants, some predictors could clearly differentiate between true and false iSNVs, e.g. true variants had higher values of FAO, FDP, FSAF, FSAR, QD, and QUAL compared to false variants (Fig. 6). Conversely, for minor iSNVs, no clear distinction in predictor values was obvious between true and false variants (Fig. 6, Supplementary Fig. S10 and S11). Thus, more sophisticated methods were necessary for detecting true iSNVs at lower VAFs.

Figure 6.


Distribution of iSNV feature values by frequency class and replicate frequency. The 9,155 iSNVs were divided into minor (VAF < 50 per cent; less common allele) and major (VAF ≥ 50 per cent; most common allele) frequency classes, relative to the reference genome. Six VCF features are shown according to replicate frequency of false (1/3), ambiguous (2/3), and true (3/3) iSNVs. For major variants, true iSNVs have higher values of FAO, FDP, FSAF, FSAR, QD, and QUAL than the false iSNVs. No such pattern is apparent for the minor variants.

We developed two independent approaches for predicting true iSNVs: (1) VCFgenie, a binomial filter that processes VCF files from any sequencing platform or variant caller; and (2) machine learning that is dependent on the metadata generated from a particular sequencing platform (Ion Torrent in our study) (Fig. 2). We then combined the two approaches, as together they exhibit superior performance (see below).

VCFgenie performance

VCFgenie can be used to replace arbitrary, fixed-VAF and -depth cut-offs. When applied to all 9,155 iSNVs, VCFgenie excluded 16 variants as false using a Bonferroni P value cut-off of 5.42 × 10–6 (Supplementary Fig. S8A, B, C). Thus, VCFgenie considered the vast majority (99.8 per cent) of iSNVs retained by the Torrent Variant Caller pipeline to be true. The 16 failing iSNVs had lower values for nearly all 19 features expected to be positively correlated with variant confidence; QUAL and QD were ambiguous (Fig. 7A). Complementarily, the failing iSNVs had higher values for nearly all 10 features expected to be negatively correlated with variant confidence; FXX and STB were ambiguous (Fig. 7B). Failing iSNVs were not simply restricted to the lowest VAFs; instead, their VAFs ranged from 0.05 per cent to 2.6 per cent, while other iSNVs with VAFs as low as 0.3 per cent were retained. Failing iSNVs also had coverage values ranging from 38 to 1,966 (Supplementary File 1). Thus, no single fixed-VAF or -coverage cut-off could identify those false variants eliminated by VCFgenie.
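The core idea of this dynamic filtering can be sketched as a per-site binomial test: under the null hypothesis that every ALT read is a sequencing error, the probability of observing at least the variant's allele count given its coverage is compared to a Bonferroni-corrected cut-off. This is a simplified illustration of the principle, not VCFgenie's actual code; the error rate and number of tests below are assumed values for the example.

```python
from math import exp, lgamma, log

def binom_sf(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), computed in log space to
    avoid overflow at deep-sequencing coverage values."""
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_term = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                    + i * log(p) + (n - i) * log(1.0 - p))
        total += exp(log_term)
    return min(total, 1.0)

def passes_dynamic_filter(ac: int, dp: int, error_rate: float,
                          alpha: float = 0.05, num_tests: int = 9155) -> bool:
    """Retain a variant only if its allele count (ac) out of coverage (dp)
    is improbable under sequencing error alone (Bonferroni-corrected)."""
    return binom_sf(ac, dp, error_rate) < alpha / num_tests

# Two variants at 1 per cent VAF: only the deeply covered one is retained,
# a distinction no single fixed-VAF or fixed-coverage cut-off can make
shallow = passes_dynamic_filter(2, 200, 0.001)    # False: consistent with error
deep = passes_dynamic_filter(20, 2000, 0.001)     # True: unlikely under error
</```

The same allele fraction thus passes or fails depending on how much evidence (coverage) supports it, which is the behaviour described above for the 16 failing iSNVs.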

Figure 7.

Comparison of features between iSNVs passing and failing VCFgenie. Features are grouped into (A) those indicative of confidence/quality (higher values desirable) and (B) those indicative of uncertainty/bias (lower values desirable), with results based on the full initial set of 9,155 iSNVs. Variants that failed VCFgenie (n = 16) had inferior values for all features except QUAL (lower mean but higher median), QD (higher mean and median), FXX, and STB (higher means but lower medians). All features are described in Table 1. Source data: Supplementary File 1.

Because VCFgenie rejected only 16 variants, it had very high true and false positive rates (>99 per cent) but very low true and false negative rates (<1 per cent). Nevertheless, compared to the standard method of implementing a fixed-VAF cut-off, VCFgenie performed better at low VAFs. Indeed, a relatively high fixed-VAF cut-off of 2 per cent was necessary to outperform VCFgenie for most metrics, whereas VCFgenie retained variants with VAFs as low as 0.3 per cent. Thus, VCFgenie has improved power and overall performance for retaining low-VAF iSNVs when compared to the standard practice of implementing fixed cut-offs.

Machine learning performance

To determine if the detection of true iSNVs at low VAFs could be further improved, we developed machine learning binary classification models using XGBoost (see Supplementary Information). Only minor iSNVs were used, i.e. ALT alleles with VAF < 50 per cent. Models were trained and tested using three strategies (Figs 2 and 3). First, we applied a low-VAF filter without VCFgenie (FM = Filter + Machine learning). Second, we replaced the low-VAF filter with VCFgenie (VM = VCFgenie + Machine learning). Last, we applied both the low-VAF filter and VCFgenie (FVM = Filter + VCFgenie + Machine learning). We evaluated the median performance of models across 50 iterations for all three strategies, where each iteration involved randomly training on 25 samples (all replicates) and testing on 6 samples (all replicates).
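The per-iteration split can be sketched as a sample-level partition, so that all three replicates of a sample always land on the same side of the train/test boundary; the record structure and the field name "sample" are illustrative assumptions.

```python
import random

def sample_level_split(variants, n_train=25, seed=0):
    """Randomly assign whole samples (with all their replicates) to the
    training set; the remaining samples form the testing set."""
    samples = sorted({v["sample"] for v in variants})
    rng = random.Random(seed)
    train_ids = set(rng.sample(samples, n_train))
    train = [v for v in variants if v["sample"] in train_ids]
    test = [v for v in variants if v["sample"] not in train_ids]
    return train, test

# Toy data: 31 samples x 3 replicates, one record each
toy = [{"sample": f"s{i:02d}", "replicate": r}
       for i in range(31) for r in (1, 2, 3)]
train, test = sample_level_split(toy)   # 25 samples train, 6 samples test
```

Splitting at the sample level rather than the variant level prevents leakage: replicate calls of the same biological variant never appear in both training and testing sets.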

For FM models, optimal performance (median AUC = 0.73, MCC = 0.33) was observed when employing the 14 features in the Strict category with the following parameters: true iSNV = 3/3 replicates, false iSNV = 1/3 replicates, VAF lower limit = 1 per cent (Table 4; Supplementary File 5). For VM models, optimal performance (AUC = 0.71, MCC = 0.35) was observed when employing Moderate features with the following parameters: true iSNV = 3/3 replicates, false iSNV = 1/3 replicates, VCFgenie with no lower VAF limit (Table 5; Supplementary File 6). Finally, for FVM models, optimal performance (AUC = 0.73, MCC = 0.32) was observed when employing Strict features with the following parameters: true iSNV = 3/3 replicates, false iSNV = 1/3 replicates, VCFgenie with VAF lower limit = 1 per cent (Table 6; Supplementary File 7). For all models, the worst performance was observed when defining true iSNVs as those present in ≥2/3 replicates (i.e. including the ambiguous group).
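The metrics reported in Tables 4–6 derive from confusion-matrix counts; MCC in particular is informative here because false iSNVs greatly outnumber true ones. A self-contained sketch of two of the headline metrics:

```python
from math import sqrt

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient: a balanced summary in [-1, 1];
    returns 0.0 when any marginal is empty (degenerate denominator)."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def f1(tp: int, fp: int, fn: int) -> float:
    """F1 score: harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# A classifier that labels everything "true" gets perfect recall but,
# under class imbalance, a degenerate MCC of 0.0:
always_true = mcc(10, 0, 90, 0)
```

This is why a model can post a respectable accuracy (Tables 4–6 show ~0.77–0.80) while its MCC remains modest (~0.3).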

Table 4.

Prediction performance for FM (VAF Filter + Machine learning) models. Median performance metrics across 50 iterations (random training/testing datasets) for the parameter/feature combinations with the top 10 MCC values.

True var | False var | VAF lower limit | Feature category | AUC | MCC | F1 score | Accuracy | MSE
3/3 | 1/3 | 0.01 | Strict | 0.726 | 0.331 | 0.399 | 0.781 | 0.220
3/3 | 1/3 | 0.01 | Moderate | 0.712 | 0.328 | 0.402 | 0.788 | 0.213
3/3 | 1/3 | 0.01 | Exhaustive | 0.715 | 0.321 | 0.386 | 0.792 | 0.209
3/3 | 1/3 and 2/3 | 0.01 | Exhaustive | 0.698 | 0.292 | 0.336 | 0.766 | 0.234
3/3 | 1/3 | 0.02 | Strict | 0.709 | 0.291 | 0.319 | 0.779 | 0.222
3/3 | 1/3 | 0.02 | Moderate | 0.711 | 0.290 | 0.322 | 0.784 | 0.216
3/3 | 1/3 and 2/3 | 0.01 | Moderate | 0.695 | 0.282 | 0.356 | 0.769 | 0.232
3/3 | 1/3 and 2/3 | 0.01 | Strict | 0.711 | 0.279 | 0.341 | 0.771 | 0.230
3/3 | 1/3 | 0.02 | Exhaustive | 0.699 | 0.265 | 0.316 | 0.781 | 0.219
3/3 | 1/3 and 2/3 | 0.02 | Moderate | 0.694 | 0.257 | 0.287 | 0.765 | 0.235

Each row represents a distinct parameter/feature combination. Separate FM models were trained for each combination. Each FM model was evaluated on 50 randomly sampled training/testing datasets, and the median performance value is reported. True var = replicate frequency for true iSNVs; False var = replicate frequency for false iSNVs; VAF = variant allele fraction; AUC = area under the receiver operating characteristic curve; MCC = Matthews correlation coefficient; MSE = mean squared error. Rows are ordered by descending MCC value. See Supplementary File 5 for all combinations.

Table 5.

Prediction performance for VM (VCFgenie + Machine learning) models. Median performance metrics across 50 iterations (random training/testing datasets) for the parameter/feature combinations with the top 10 MCC values.

True var | False var | VAF lower limit^a | Feature category | AUC | MCC | F1 score | Accuracy | MSE
3/3 | 1/3 | None | Moderate | 0.714 | 0.352 | 0.421 | 0.803 | 0.198
3/3 | 1/3 | None | Exhaustive | 0.722 | 0.337 | 0.415 | 0.806 | 0.195
3/3 | 1/3 | None | Strict | 0.725 | 0.336 | 0.407 | 0.795 | 0.205
3/3 | 1/3 and 2/3 | None | Strict | 0.725 | 0.336 | 0.407 | 0.795 | 0.205
3/3 | 1/3 and 2/3 | None | None | 0.699 | 0.699 | 0.314 | 0.365 | 0.780
3/3 | 1/3 and 2/3 | None | None | 0.707 | 0.707 | 0.308 | 0.369 | 0.780
2/3 and 3/3 | 1/3 | None | Strict | 0.641 | 0.252 | 0.434 | 0.689 | 0.311
2/3 and 3/3 | 1/3 | None | Moderate | 0.638 | 0.239 | 0.432 | 0.686 | 0.314
2/3 and 3/3 | 1/3 | None | Exhaustive | 0.625 | 0.231 | 0.424 | 0.687 | 0.314
^a For all VM models, the VAF lower limit was replaced with filtering by VCFgenie. All other details as in Table 4. See Supplementary File 6 for all combinations.

Table 6.

Prediction performance for FVM (VAF Filter + VCFgenie + Machine learning) models. Median performance metrics across 50 iterations (random training/testing datasets) for the parameter/feature combinations with the top 10 MCC values.

True var | False var | VAF lower limit^a | Feature category | AUC | MCC | F1 score | Accuracy | MSE
3/3 | 1/3 | 0.01 | Strict | 0.730 | 0.323 | 0.394 | 0.785 | 0.215
3/3 | 1/3 | 0.01 | Exhaustive | 0.720 | 0.320 | 0.387 | 0.798 | 0.203
3/3 | 1/3 | 0.01 | Moderate | 0.715 | 0.317 | 0.394 | 0.792 | 0.209
3/3 | 1/3 and 2/3 | 0.01 | Moderate | 0.695 | 0.297 | 0.354 | 0.772 | 0.228
3/3 | 1/3 and 2/3 | 0.01 | Exhaustive | 0.699 | 0.295 | 0.336 | 0.770 | 0.231
3/3 | 1/3 | 0.02 | Moderate | 0.701 | 0.289 | 0.323 | 0.788 | 0.213
3/3 | 1/3 | 0.02 | Strict | 0.704 | 0.287 | 0.321 | 0.781 | 0.220
3/3 | 1/3 and 2/3 | 0.01 | Strict | 0.710 | 0.282 | 0.343 | 0.775 | 0.225
3/3 | 1/3 | 0.02 | Exhaustive | 0.704 | 0.264 | 0.316 | 0.781 | 0.220
3/3 | 1/3 and 2/3 | 0.02 | Moderate | 0.688 | 0.253 | 0.281 | 0.763 | 0.238
^a In addition to filtering by VCFgenie, an explicit low-VAF cut-off was imposed for FVM models. All other details as in Table 4. See Supplementary File 7 for all combinations.

VM outperformed the two other strategies (Fig. 8). Specifically, across its 50 iterations, VM exhibited superior performance compared to the FM and FVM models, for both all (Fig. 8A) and optimal (Fig. 8B) parameter/feature combinations. Following hyperparameter tuning based on the optimal parameter/feature combinations (top rows in Tables 4, 5, and 6), VM remained the best strategy and its median performance slightly improved (AUC increased from 0.71 to 0.73; MCC increased from 0.35 to 0.37; Table 7; Supplementary File 8). Our iterative approach also allowed us to evaluate the impact of testing set characteristics on performance: the best performance occurred when the testing set had a false/true iSNV ratio of 4.4 (AUC = 0.85, MCC = 0.69), while the worst performance occurred when this ratio was 15.4 (AUC = 0.44, MCC = –0.06; VM models), suggesting that testing set imbalance substantially impacts prediction (Supplementary File 8). Finally, we assessed feature importance by calculating SHAP (SHapley Additive exPlanations) values (Lundberg and Lee 2017). QUAL (variant quality) was by far the most important feature, followed by FSRR, GQ, SRR, and FWDB (Fig. 9).

Figure 8.

Performance on testing data. (A) Performance on 50 iterations (randomly sampled training/testing datasets) for the FM, VM, and FVM prediction strategies across all parameter/feature combinations. (B) Performance on 50 iterations for the FM, VM, and FVM prediction strategies for the optimal parameter/feature combinations (first rows in Tables 4, 5, and 6). Source data: Supplementary Files 5, 6, and 7.

Table 7.

Hyperparameter tuning: performance metrics for optimal hyperparameter values.

Strategy | AUC^a | MCC^a | F1 score^a | Accuracy^a | MSE^a,b
FM | 0.45, 0.72, 0.84 | −0.04, 0.35, 0.65 | 0.09, 0.41, 0.71 | 0.52, 0.80, 0.89 | 0.48, 0.20, 0.10
VM | 0.44, 0.73, 0.85 | −0.06, 0.37, 0.69 | 0.08, 0.43, 0.75 | 0.54, 0.81, 0.91 | 0.46, 0.19, 0.09
FVM | 0.46, 0.72, 0.83 | −0.04, 0.36, 0.64 | 0.09, 0.42, 0.70 | 0.54, 0.80, 0.89 | 0.46, 0.20, 0.11
^a Worst, median, and optimal performance values. All optimal hyperparameter and performance values are given in Supplementary File 8.

^b For MSE, lower values signify superior performance.

Figure 9.

iSNV feature importance in machine learning models. SHAP (SHapley Additive exPlanations) feature importance for the 20 features used to train and test the optimal parameter combination for the VM strategy (Moderate category features). (A) Mean feature importance, with features sorted by decreasing value. Mean values were calculated across 50 iterations (randomly sampled training/testing datasets) using the optimal hyperparameters. (B) Dot plot representing the impact of feature value on prediction success. Each dot represents a single iSNV in the training set in a single iteration, where results from all 50 iterations of random training sets are shown. The colour scale denotes feature value from low (blue) to high (red). For example, the predominance of red dots with positive SHAP values for QUAL indicates that high values of this feature contribute substantially to prediction success.
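The ranking in panel A corresponds to averaging absolute SHAP values over iSNVs. Assuming a per-iSNV SHAP matrix has already been computed (e.g. with the shap library's explainers), the ranking step reduces to:

```python
def rank_by_mean_abs_shap(shap_matrix, feature_names):
    """Rank features by mean |SHAP| across rows (one row per iSNV),
    largest first."""
    n_rows = len(shap_matrix)
    means = [sum(abs(row[j]) for row in shap_matrix) / n_rows
             for j in range(len(feature_names))]
    return sorted(zip(feature_names, means), key=lambda pair: -pair[1])

# Hypothetical SHAP values for two features over two iSNVs
ranking = rank_by_mean_abs_shap([[1.0, -0.1], [-3.0, 0.2]], ["QUAL", "FXX"])
```

The absolute value matters: a feature that pushes predictions strongly in either direction is "important" even if its signed contributions cancel.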

Discussion

In this study, we developed new methods that go beyond using fixed allele frequency, coverage, and other cut-offs for identifying true iSNVs. As a case study, we used HPV, a virus which is not amenable to laboratory culture, and therefore not suited to dilution benchmarking. Instead, 31 samples were each assayed in triplicate to yield sets of true (3/3 replicates), ambiguous (2/3 replicates), and false (1/3 replicates) iSNVs. These benchmarking sets allowed us to define the error profile of Ion Torrent, as well as the features most important for determining true iSNVs. Of note, while replicates were used to establish ground truth in our study, none of our classification approaches require them, e.g. dilution benchmarking could be used instead of replicates for viruses amenable to culture.
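The triplicate benchmarking scheme reduces to counting, for each distinct variant within a sample, the number of replicates in which it was called. A minimal sketch (the call-tuple layout is illustrative):

```python
from collections import defaultdict

def label_by_replicate_frequency(calls):
    """Label each distinct within-sample variant as true (3/3 replicates),
    ambiguous (2/3), or false (1/3)."""
    replicates_seen = defaultdict(set)
    for sample, replicate, pos, ref, alt in calls:
        replicates_seen[(sample, pos, ref, alt)].add(replicate)
    names = {3: "true", 2: "ambiguous", 1: "false"}
    return {variant: names[len(reps)]
            for variant, reps in replicates_seen.items()}

# Toy calls: (sample, replicate, position, ref, alt)
labels = label_by_replicate_frequency([
    ("s1", 1, 100, "C", "T"), ("s1", 2, 100, "C", "T"), ("s1", 3, 100, "C", "T"),
    ("s1", 1, 200, "A", "G"),
    ("s1", 2, 300, "G", "A"), ("s1", 3, 300, "G", "A"),
])
```

Only the labelling depends on replicates; the downstream classifiers trained on these labels do not require replicate sequencing at prediction time.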

Our study provides a basis for distinguishing between true iSNVs and those resulting from sequencing error. Specifically, Ion Torrent sequencing errors were dominated by C→T iSNVs and were relatively independent of trinucleotide context. In contrast, two important HPV mutation mechanisms are context-dependent, namely deamination of methylated CpGs and APOBEC3-induced mutation of TpCs. Thus, accounting for context can disentangle these mechanisms from error, and confirms that the context-specific signals observed in previous studies (e.g. Zhu et al. 2020) were not artefactual.
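As a rough heuristic, the two context-dependent mechanisms can be flagged from an iSNV's trinucleotide context. This simplification is for illustration only (the study's spectrum analyses are more detailed); note that a TCG context is compatible with both mechanisms, and CpG deamination is checked first here:

```python
def candidate_mechanism(five_prime: str, ref: str, alt: str,
                        three_prime: str) -> str:
    """Flag a candidate mutational mechanism from trinucleotide context."""
    if ref == "C" and alt == "T" and three_prime == "G":
        return "CpG deamination"      # C->T at a CpG site
    if ref == "C" and alt in ("T", "G") and five_prime == "T":
        return "APOBEC3"              # C->T or C->G at a TpC site
    return "other"
```

For example, a C→T change in ACG context would be flagged as CpG deamination, while a C→G change in TCA context would be flagged as APOBEC3.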

We designed VCFgenie to introduce a reproducible method of dynamic variant filtering. Variant calling typically yields a VCF file, but current VCF processing tools fall short in one or more ways: they filter but cannot modify VCF records; offer only a predefined set of filtering options; treat reference and alternate alleles inconsistently; or provide no options for handling multiallelic sites (i.e. more than one alternate allele). In addition to meeting these needs, and without requiring custom code, VCFgenie also implements a dynamic low-VAF cut-off that performs better than fixed cut-offs. This approach yields improved power for detecting iSNVs without sacrificing specificity, is applicable to studies using any sequencing technology or variant caller, and can be used with or without downstream machine learning. Finally, even though VCFgenie is limited to considering only the allele count, coverage, and error rate of an iSNV, failing variants nevertheless display inferior values for nearly all features expected to be indicative of variant confidence.
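The multiallelic case can be illustrated by splitting a VCF data line into one record per ALT allele. The INFO keys used here (DP for depth, AO for comma-separated per-allele counts) are illustrative and vary by variant caller; this is a parsing sketch, not VCFgenie's implementation.

```python
def split_multiallelic(vcf_line: str):
    """Split one multiallelic VCF data line into per-ALT-allele records,
    computing an allele count (ac), depth (dp), and VAF for each allele."""
    chrom, pos, _id, ref, alt, qual, filt, info = vcf_line.split("\t")[:8]
    fields = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
    depth = int(fields["DP"])
    counts = [int(x) for x in fields["AO"].split(",")]
    return [{"chrom": chrom, "pos": int(pos), "ref": ref, "alt": allele,
             "ac": count, "dp": depth, "vaf": count / depth}
            for allele, count in zip(alt.split(","), counts)]

# A site with two alternate alleles becomes two per-allele records
records = split_multiallelic("HPV18\t100\t.\tC\tT,A\t50\tPASS\tDP=1000;AO=30,10")
```

Treating each allele as its own record lets the same binomial test (or any per-variant filter) apply uniformly to reference and alternate alleles alike.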

Subsequent to filtering with VCFgenie, we developed supervised non-linear machine learning methods to further improve power for detecting true iSNVs at low VAFs. This hybrid filter/supervised learning approach allows model training to take advantage of preliminary data processing, resulting in superior performance that is consistent with earlier studies (Omta et al. 2020). Such complex models are useful given that individual features can readily help distinguish between true and false iSNVs at high but not low VAFs, the latter of which is challenging due to the preponderance of sequencing errors at low frequencies. Machine learning improves performance and identifies metadata important for the identification of true iSNVs. Furthermore, our random selection of training and testing sets over many iterations allowed robust assessment of model performance and highlights the impact of testing set characteristics: despite balanced training sets, imbalanced testing sets impair performance.

Our study has some limitations. First, our findings depend on our definitions of true vs. false variants, which are based on the frequency of replicates in which an iSNV is observed. Thus, true vs. false could instead be conceptualized as reproducible vs. non-reproducible variants. Although all reproducible iSNVs are likely to be true, all true iSNVs may not be reproducible. For example, some iSNVs categorized as errors could in fact be true, but present in such a small fraction of the sequenced material that they do not replicate easily. Improving benchmarking sets for biological systems that are not amenable to dilution is an important direction for future research. Next, the parameters for our variant calling pipeline were specifically designed to detect HPV variants from Ion Torrent Ampliseq NGS data, and may therefore need to be customized for other platforms. Moreover, our pipeline resulted in an already high-quality set of initial iSNVs, such that VCFgenie failed very few variants and it was not possible to empirically determine a P value cut-off. We opted for a conservative Bonferroni cut-off, but an optimal value may be determinable for other datasets, and is likely to differ by biological system, methodology, and goal. The machine learning models are also specific to our study (although our scripts may be used to re-train models on other datasets), and were limited to minor iSNVs in a reference-based context (i.e. minor alleles that match the reference were not included). Finally, given the scope of our study, we have not examined datasets generated using other platforms or sample types (e.g. acute respiratory viruses or somatic tumour cells).

Conclusions

We report new methods for detecting true iSNVs at low frequencies in deeply sequenced samples. Benchmarking is performed by sequencing samples in triplicate, a first for within-host HPV analyses. The results provide VCFgenie, a binomial filtering tool applicable to any sequencing platform, as well as supervised machine learning models that can be customized to various technologies and biological systems. Together, the combined methods show substantial improvement in accuracy for distinguishing low-frequency errors from true variants in within-host viral studies, and are extendable to other types of within-sample data.

Supplementary Material

veae013_Supp
veae013_supp.zip (9.3MB, zip)

Acknowledgements

We thank Xinzhu (April) Wei, Zachary Ardern, Kristine Jones, and two anonymous reviewers for feedback and discussion; Leonardo Varuzza for assistance interpreting Ion Torrent errors rates and variant caller output; and Ming-Hsueh Lin for feedback on figures.

Contributor Information

Sambit K Mishra, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, 9609 Medical Center Drive, Rockville, MD 20850, USA; Cancer Genomics Research Laboratory, Leidos Biomedical Research, Inc., Frederick National Laboratory for Cancer Research, P.O. Box B, Bldg. 430, Frederick, MD 21702, USA.

Chase W Nelson, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, 9609 Medical Center Drive, Rockville, MD 20850, USA.

Bin Zhu, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, 9609 Medical Center Drive, Rockville, MD 20850, USA.

Maisa Pinheiro, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, 9609 Medical Center Drive, Rockville, MD 20850, USA.

Hyo Jung Lee, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, 9609 Medical Center Drive, Rockville, MD 20850, USA; Cancer Genomics Research Laboratory, Leidos Biomedical Research, Inc., Frederick National Laboratory for Cancer Research, P.O. Box B, Bldg. 430, Frederick, MD 21702, USA.

Michael Dean, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, 9609 Medical Center Drive, Rockville, MD 20850, USA.

Laurie Burdett, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, 9609 Medical Center Drive, Rockville, MD 20850, USA; Cancer Genomics Research Laboratory, Leidos Biomedical Research, Inc., Frederick National Laboratory for Cancer Research, P.O. Box B, Bldg. 430, Frederick, MD 21702, USA.

Meredith Yeager, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, 9609 Medical Center Drive, Rockville, MD 20850, USA; Cancer Genomics Research Laboratory, Leidos Biomedical Research, Inc., Frederick National Laboratory for Cancer Research, P.O. Box B, Bldg. 430, Frederick, MD 21702, USA.

Lisa Mirabello, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, 9609 Medical Center Drive, Rockville, MD 20850, USA.

Data availability

All data and statistical analyses were implemented using R v4.2.2 (tidyverse) (R Core Team 2021), Python (v3.8), or Microsoft Excel. Visualization was performed using R (ggplot2, scales, patchwork), Python (seaborn), and Microsoft PowerPoint. Data and scripts are freely available at https://github.com/NCI-CGR/HPV_low_VAF_SNV_prediction. VCFgenie is licensed under GNU General Public License v3.0 and is freely available at https://github.com/chasewnelson/VCFgenie.

Supplementary data

Supplementary data is available at Virus Evolution Journal online.

Funding

This research was funded by the intramural research program of the Division of Cancer Epidemiology and Genetics, National Cancer Institute, NIH. C.W.N. was supported by the NCI Research Participation Program administered by the Oak Ridge Institute for Science and Education (ORISE) through an interagency agreement between the U.S. Department of Energy (DOE) and the National Institutes of Health (NIH). ORISE is managed by ORAU under DOE contract number DESC0014664. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government. All opinions expressed in this paper are the authors’ and do not necessarily reflect the policies and views of NIH, NCBI, DOE, or ORAU/ORISE.

Conflict of interest:

The authors declare no conflicts of interest. Maisa Pinheiro is currently an employee of GlaxoSmithKline (GSK) (Rockville, Maryland) but completed all work associated with this project while employed at the NCI.

References

  1. Ainscough B. J. et al. (2018) ‘A Deep Learning Approach to Automate Refinement of Somatic Variant Calling from Cancer Sequencing Data’, Nature Genetics, 50: 1735–43.
  2. Burk R. D., Harari A., and Chen Z. (2013) ‘Human Papillomavirus Genome Variants’, Virology, 445: 232–43.
  3. Castle P. E. et al. (2011) ‘Human Papillomavirus (HPV) Genotypes in Women with Cervical Precancer and Cancer at Kaiser Permanente Northern California’, Cancer Epidemiology, Biomarkers & Prevention, 20: 946–53.
  4. Chen Z. et al. (2021) ‘K-mer Analyses Reveal Different Evolutionary Histories of Alpha, Beta, and Gamma Papillomaviruses’, International Journal of Molecular Sciences, 22: 9657.
  5. Chen T., and Guestrin C. (2016) ‘XGBoost: A Scalable Tree Boosting System’, in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016.
  6. Cullen M. et al. (2015) ‘Deep Sequencing of HPV16 Genomes: A New High-throughput Tool for Exploring the Carcinogenicity and Natural History of HPV16 Infection’, Papillomavirus Research, 1: 3–11.
  7. de Martel C. et al. (2020) ‘Global Burden of Cancer Attributable to Infections in 2018: A Worldwide Incidence Analysis’, The Lancet Global Health, 8: e180–90.
  8. Fausch S. C. et al. (2003) ‘HPV Protein/peptide Vaccines: From Animal Models to Clinical Trials’, Frontiers in Bioscience, 8: 81–91.
  9. Grubaugh N. D. et al. (2019) ‘An Amplicon-based Sequencing Framework for Accurately Measuring Intrahost Virus Diversity Using PrimalSeq and iVar’, Genome Biology, 20: 1–19.
  10. Hirose Y. et al. (2018) ‘Within-Host Variations of Human Papillomavirus Reveal APOBEC Signature Mutagenesis in the Viral Genome’, Journal of Virology, 92: 1–14.
  11. Kim J. et al. (2019) ‘The Use of Technical Replication for Detection of Low-level Somatic Mutations in Next-generation Sequencing’, Nature Communications, 10: 1–11.
  12. King K. M. et al. (2022) ‘Synonymous Nucleotide Changes Drive Papillomavirus Evolution’, Tumour Virus Research, 14: 200248.
  13. Kogure G. et al. (2023) ‘Intra-Patient Genomic Variations of Human Papillomavirus Type 31 in Cervical Cancer and Precancer’, Viruses, 15: 2104.
  14. Lauring A. S. (2020) ‘Within-Host Viral Diversity: A Window into Viral Evolution’, Annual Review of Virology, 7: 63–81.
  15. Lundberg S. M., and Lee S.-I. (2017) ‘A Unified Approach to Interpreting Model Predictions’, in Advances in Neural Information Processing Systems 30 (NIPS 2017). doi:10.48550/arXiv.1705.07874
  16. McCrone J. T., Lauring A. S., and Dermody T. S. (2016) ‘Measurements of Intrahost Viral Diversity are Extremely Sensitive to Systematic Errors in Variant Calling’, Journal of Virology, 90: 6884–95.
  17. McLaughlin R. T. et al. (2023) ‘Fast, Accurate, and Racially Unbiased Pan-cancer Tumor-only Variant Calling with Tabular Machine Learning’, npj Precision Oncology, 7: 4.
  18. Meyers C. et al. (1992) ‘Biosynthesis of Human Papillomavirus from a Continuous Cell Line upon Epithelial Differentiation’, Science, 257: 971–3.
  19. Mirabello L. et al. (2017) ‘HPV16 E7 Genetic Conservation Is Critical to Carcinogenesis’, Cell, 170: 1164–1174.e6.
  20. Nelson C. W. et al. (2020) ‘Dynamically Evolving Novel Overlapping Gene as a Factor in the SARS-CoV-2 Pandemic’, eLife, 9: e59633.
  21. Nelson C. W., and Hughes A. L. (2015) ‘Within-host Nucleotide Diversity of Virus Populations: Insights from Next-generation Sequencing’, Infection, Genetics and Evolution, 30: 1–7.
  22. Nelson C. W., and Mirabello L. (2023) ‘Human Papillomavirus Genomics: Understanding Carcinogenicity’, Tumour Virus Research, 15: 200258.
  23. Omta W. A. et al. (2020) ‘Combining Supervised and Unsupervised Machine Learning Methods for Phenotypic Functional Genomics Screening’, SLAS Discovery, 25: 655–64.
  24. Pedregosa F. et al. (2011) ‘Scikit-learn: Machine Learning in Python’, Journal of Machine Learning Research, 12: 2825–30.
  25. Pereira F. L. et al. (2016) ‘Evaluating the Efficacy of the New Ion PGM Hi-Q Sequencing Kit Applied to Bacterial Genomes’, Genomics, 107: 189–98.
  26. Poduri A. et al. (2013) ‘Somatic Mutation, Genomic Variation, and Neurological Disease’, Science, 341: 1237758.
  27. Rowson K. E., and Mahy B. W. (1967) ‘Human Papova (Wart) Virus’, Bacteriological Reviews, 31: 110–31.
  28. Singh D. et al. (2023) ‘Global Estimates of Incidence and Mortality of Cervical Cancer in 2020: A Baseline Analysis of the WHO Global Cervical Cancer Elimination Initiative’, The Lancet Global Health, 11: e197–206.
  29. Spinella J. F. et al. (2016) ‘SNooPer: A Machine Learning-based Method for Somatic Variant Identification from Low-pass Next-generation Sequencing’, BMC Genomics, 17: 1–11.
  30. Sung H. et al. (2021) ‘Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries’, CA: A Cancer Journal for Clinicians, 71: 209–49.
  31. R Core Team (2021) R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing.
  32. Van Doorslaer K. et al. (2017) ‘The Papillomavirus Episteme: A Major Update to the Papillomavirus Sequence Database’, Nucleic Acids Research, 45: D499–506.
  33. Warren C. J., Santiago M. L., and Pyeon D. (2022) ‘APOBEC3: Friend or Foe in Human Papillomavirus Infection and Oncogenesis?’, Annual Review of Virology, 9: 375–95.
  34. Wu C. et al. (2020) ‘Using Machine Learning to Identify True Somatic Variants from Next-generation Sequencing’, Clinical Chemistry, 66: 239–46.
  35. Zhu B. et al. (2020) ‘Mutations in the HPV16 Genome Induced by APOBEC3 are Associated with Viral Clearance’, Nature Communications, 11: 1–12.



Articles from Virus Evolution are provided here courtesy of Oxford University Press
