NAR Molecular Medicine. 2024 Jan 24;1(1):ugae003. doi: 10.1093/narmme/ugae003

Geno2pheno: recombination detection for HIV-1 and HEV subtypes

Martin Pirkl 1,2, Joachim Büch 3,4, Georg Friedrich 5, Michael Böhm 6,7, Dan Turner 8, Olaf Degen 9, Rolf Kaiser 10,11,12, Thomas Lengauer 13,14,15
PMCID: PMC12430010  PMID: 41255566

Abstract

Even after three decades of antiretroviral therapy for HIV-1 (human immunodeficiency virus 1), therapy failure is a continual challenge. This is especially so if the viral variant is a recombinant of subtypes. Thus, improved diagnosis of recombined subtypes can help with the selection of therapy. We use a new implementation of the previously published computational method recco to detect de novo recombination of known subtypes, independent of and in addition to known circulating recombinant forms (CRFs). We detect an optimal path in a multiple alignment of viral reference sequences based on mutation calls and probable breakpoints for recombination. A tuning parameter is used to favor either mutation calls or breakpoints. Besides novel recombinants, our tool g2p-recco, integrated in the geno2pheno web service (https://geno2pheno.org), can successfully detect known recombinant events with breakpoints given only the full consensus references (without CRFs) of the involved subtypes. In addition, the tool can be applied to other viruses, e.g. hepatitis E virus (HEV). In this fashion, we could also detect several previously unknown recombinations in HEV.

Graphical Abstract


Introduction

The human immunodeficiency virus 1 (HIV-1) pandemic continues to be a major burden on society worldwide. Therapeutic success is still not ensured, especially in low-income countries (1). The identification of the correct HIV-1 subtype can support the choice of a successful therapy (2). Recombination events (3–6), known to occur in HIV-1, provide additional insight into the epidemiology of HIV-1 and may become more prominent. Like HIV-1, hepatitis E virus (HEV) is prevalent worldwide, occurring in different genotypes (1–4) and subtypes. Genotype 3 is pandemic in animals (pigs, rabbits) in Europe. In most cases, HEV genotype 3 causes a mild disease and a nonchronic infection, except in immunocompromised patients, who can suffer from severe and chronic courses. So far, HEV subtyping has not been a major focus of interest. In contrast to HIV-1, recombinants of different HEV subtypes are considered rare, but some have been reported (7,8).

The majority of subtyping tools focus on either known subtypes or known circulating recombinant forms (CRFs). Novel recombination events, however, are hardly considered, or are considered only in specific, closely scrutinized cases (9,10). Our objective was therefore to develop a subtyping tool that not only identifies known subtypes or recombinants, but also detects new recombinant virus variants.

We are employing a new implementation g2p-recco of the computational method recco (11) to detect recombination events, in general, and we are able to detect de novo recombinants of known subtypes, independent of and in addition to CRFs. Our method scans along the positions of the multiple alignment of reference sequences for relevant subtypes and penalizes mutations in the query sequence with a cost factor α in (0, 1). If the cost of mutations along the current reference sequence can be offset by a jump to another reference sequence, the method does so, eventually calculating an optimal path through the multiple alignment. That path either follows just one subtype or combines different subtypes to a recombined sequence. The method efficiently identifies the optimal path with dynamic programming.

Besides novel recombinants, our tool can successfully detect known recombination events given only the full references (without CRFs) of the participating subtypes. Moreover, the recombinant sequences are described in more detail, with explicit breakpoints. For example, if a CRF is annotated as part subtype A, our method differentiates between all sub-subtypes from A1 up to A8. g2p-recco is the default tool in our geno2pheno (https://geno2pheno.org) software suite used for subtyping of HIV-1. In addition, the tool can be applied to other viruses, e.g. HEV (https://hev.geno2pheno.org). In this way, we could also detect several previously unknown recombinations in HEV. For example, from the unidentified genotype 4 sequences (HEV-GLUE: hev.glue.cvr.ac.uk), the longest one is a recombinant with parts from 4g (77%), 4c (16%) and 4h (7%).

Materials and methods

g2p-recco

We review the method recco (11) and introduce our new simplified implementation (g2p-recco).

The idea of recco is that we move along a multiple sequence alignment A = (Aip) with reference i ∈ {1, …, M} and position p ∈ {1, …, N}. We move along a possible path and tally the cost of the accumulated mutations. Each mutation adds a cost penalty α ∈ (0, 1). If the number of mutations exceeds a certain threshold, a jump to a different reference with fewer mutations is preferred. For example, let α = 0.1. If reference i = 1 has two mutations compared to a query sequence s = (s1, …, sN) between position p = 1 and position 100 and reference 2 has 20 mutations, the best path from 1 to 100 is along reference 1. However, if the mutation count between position 101 and position 200 is 1 in reference 2 and 11 in reference 1, a jump from reference 1 to reference 2 with a penalty of 1 − α = 0.9 is preferred. In this case, reference 1 has 10 more mutations than reference 2, which exceeds the threshold of (1 − α)/α = 9: staying on reference 1 costs 11 × 0.1 = 1.1, whereas jumping costs 0.9 + 1 × 0.1 = 1.0. Thus, a recombination is introduced to the subtype.
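The arithmetic of this example can be checked in a few lines of Python. This is only an illustrative sketch; the helper name `path_cost` is hypothetical and not part of g2p-recco.

```python
def path_cost(mutations: int, jumps: int, alpha: float) -> float:
    """Cost of a path segment: alpha per mutation plus (1 - alpha) per jump."""
    return mutations * alpha + jumps * (1 - alpha)


alpha = 0.1

# Positions 101-200 of the example in the text.
stay_on_ref1 = path_cost(mutations=11, jumps=0, alpha=alpha)
jump_to_ref2 = path_cost(mutations=1, jumps=1, alpha=alpha)

# A jump pays off once it saves more than (1 - alpha) / alpha mutations.
break_even = (1 - alpha) / alpha

print(f"stay: {stay_on_ref1:.2f}, jump: {jump_to_ref2:.2f}, "
      f"break-even: {break_even:.1f} mutations")
```

With α = 0.1, the jump is cheaper than staying, so the path switches to reference 2.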

Formally let kp ∈ {1, …, M} be a decision variable indicating the reference on our path at position p. We can compute the cost of each path k = (k1, …, kN) for a given query sequence s and cost factor α. The best path is the one that minimizes the function

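$$\hat{k} = \operatorname*{arg\,min}_{k \in \{1, \dots, M\}^N} C(k; s, \alpha) \qquad (1)$$

$$C(k; s, \alpha) = \alpha \sum_{p=1}^{N} \left(1 - \delta_{A_{k_p p},\, s_p}\right) + (1 - \alpha) \sum_{p=2}^{N} \left(1 - \delta_{k_{p-1},\, k_p}\right) \qquad (2)$$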

with the Kronecker symbol δ.

We solve this optimization problem with dynamic programming. We can reformulate Equation (2) to be used in the forward algorithm.

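$$f_{j,p} = \alpha \left(1 - \delta_{A_{jp},\, s_p}\right) + \min_{i \in \{1, \dots, M\}} \left[ f_{i,p-1} + (1 - \alpha)\left(1 - \delta_{i,j}\right) \right] \qquad (3)$$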

with fj,0 = 0 for all j. The backward algorithm is an alternative to the forward algorithm (11).
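The forward recursion with backtracking can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the C++ code of g2p-recco: the function name `best_path` is hypothetical, every character difference (including gaps) is treated as a mutation, and ties are broken in favor of staying on the current reference.

```python
def best_path(references, query, alpha):
    """Return (cost, path) minimising alpha * mutations + (1 - alpha) * jumps.

    references: list of aligned strings, all as long as the query.
    path[p] is the index of the reference used at position p.
    """
    m, n = len(references), len(query)
    f = [0.0] * m   # f[j]: minimal cost of any path ending on reference j
    back = []       # back[p][j]: reference used at p - 1 when ending on j at p
    for p in range(n):
        cheapest = min(range(m), key=f.__getitem__)
        new_f, choice = [], []
        for j in range(m):
            # Either stay on j, or jump in from the cheapest reference.
            if f[j] <= f[cheapest] + (1 - alpha):
                prev, base = j, f[j]
            else:
                prev, base = cheapest, f[cheapest] + (1 - alpha)
            mutation = alpha if references[j][p] != query[p] else 0.0
            new_f.append(base + mutation)
            choice.append(prev)
        f = new_f
        back.append(choice)
    # Backtrack from the cheapest end point.
    end = min(range(m), key=f.__getitem__)
    path = [end]
    for p in range(n - 1, 0, -1):
        path.append(back[p][path[-1]])
    path.reverse()
    return f[end], path


# Toy alignment: the query switches from reference 0 to reference 1 halfway.
refs = ["A" * 24, "C" * 24]
query = "A" * 12 + "C" * 12
cost, path = best_path(refs, query, alpha=0.1)
print(cost, path)
```

Only the cheapest predecessor needs to be considered as a jump source, which keeps the run time proportional to M × N rather than M² × N.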

The version of recco (g2p-recco) used in our web service (https://geno2pheno.org) is optimized for speed and implemented in C++. It does not provide any robustness analysis like P-values or confidence intervals. The full web service is implemented in R (12–15).

We compute the fit of a query sequence s to the predicted path in the following way. Let k̂ be the optimal path for s. We compute its cost C by evaluating Equation (2) at k̂. The fit is computed as 1 − C/Cmax, with Cmax = N × min{α, 1 − α}. Cmax is the theoretical maximum cost of a sequence for a given path. That is, if α < 0.5, the sequence s has a mutation at every position of the path, and if α ≥ 0.5, the path jumps to a different reference at every position to avoid mutations.
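As a rough illustration, the fit can be computed as below. The sketch assumes that the fit is 1 − C/Cmax, which is an assumption wherever the formula is not legible in this copy; the function name `fit` and the example numbers (3 breakpoints and 200 mutations on an alignment of 8892 positions) are hypothetical.

```python
def fit(cost: float, n_positions: int, alpha: float) -> float:
    """Fit of a query to its path, assuming fit = 1 - C / Cmax."""
    c_max = n_positions * min(alpha, 1 - alpha)
    return 1 - cost / c_max


alpha = 0.09
# Hypothetical path with 200 mutations and 3 jumps.
cost = 200 * alpha + 3 * (1 - alpha)
example_fit = fit(cost, n_positions=8892, alpha=alpha)
print(f"fit = {example_fit:.3f}")
```

A path without mutations or jumps has cost 0 and therefore a fit of 1.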

Accuracy

Prediction accuracy

We compare the original labels of CRFs to our inferred recombinants. Subtype A is mostly not further specified in known recombinants. Hence, if we predict a recombinant including any of the subtypes A1–A8, we reduce this prediction to A for the accuracy calculations.

We compute two different measures of assessing the accuracy of our predictions. The first measure counts how many of the known subtypes are included in our prediction (reference in prediction). For example, when the known subtype is a recombinant of B, C, A and G, and our prediction is B, A and G, we compute an accuracy of 75%. However, if we predict B, C, A, G and F, the accuracy is 100%, because all original subtypes are predicted. The second measure is the exact same, but switches known subtypes with our prediction (prediction in reference). The first measure focuses on false-negative (sensitivity) predictions and the second focuses on false-positive (specificity) predictions.
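The two measures can be written down compactly with sets. The following Python sketch reproduces the example above; the function names are hypothetical, and the collapsing of A1–A8 to A follows the rule described in this section.

```python
A_SUBSUBTYPES = {f"A{i}" for i in range(1, 9)}  # A1 ... A8


def collapse(subtypes):
    """Reduce sub-subtypes A1-A8 to A; leave all other labels unchanged."""
    return {"A" if s in A_SUBSUBTYPES else s for s in subtypes}


def reference_in_prediction(known, predicted):
    """Fraction of annotated subtypes that were predicted (sensitivity-like)."""
    known, predicted = collapse(known), collapse(predicted)
    return len(known & predicted) / len(known)


def prediction_in_reference(known, predicted):
    """Fraction of predicted subtypes that are annotated (specificity-like)."""
    known, predicted = collapse(known), collapse(predicted)
    return len(known & predicted) / len(predicted)


known = {"B", "C", "A", "G"}
print(reference_in_prediction(known, {"B", "A1", "G"}))
print(reference_in_prediction(known, {"B", "C", "A", "G", "F"}))
print(prediction_in_reference(known, {"B", "C", "A", "G", "F"}))
```

In the last call, the extra subtype F lowers the second measure to four out of five predicted subtypes, although the first measure for the same prediction is 100%.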

Sequence similarity

We compute the similarity s(S1, S2) of two aligned sequences S1 = {s1,1, …, s1,N} and S2 = {s2,1, …, s2,N} by

$$s(S_1, S_2) = \frac{1}{N} \sum_{p=1}^{N} \delta_{s_{1,p},\, s_{2,p}} \qquad (4)$$

We do not count gaps in the alignment. We define the similarity matrix $S = \left(s(S_i, S_j)\right)_{i,j}$ over all pairs of sequences.
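A possible reading of this similarity measure is sketched below in Python. How gaps are handled in detail is an assumption here: a position where either sequence has a gap is never counted as identical, and the denominator remains the full alignment length N. The function names are hypothetical.

```python
def similarity(seq1: str, seq2: str, gap: str = "-") -> float:
    """Fraction of positions with identical, non-gap characters."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(a == b and a != gap for a, b in zip(seq1, seq2))
    return matches / len(seq1)


def similarity_matrix(seqs):
    """Pairwise similarity of all aligned sequences."""
    return [[similarity(a, b) for b in seqs] for a in seqs]


paths = ["ACGTAC", "ACGTTC", "AC-TAC"]
for row in similarity_matrix(paths):
    print([round(v, 2) for v in row])
```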

Breakpoint accuracy

We compute the accuracy of a breakpoint by computing the absolute distance to the closest predicted breakpoint. For example, if the simulated sequence has a breakpoint at position 500 and our predicted path has a breakpoint at position 550, the distance is 50. That is, the higher the distance, the less accurate is our prediction. If we predict no breakpoint, we use half of the number of positions of our multiple sequence alignment as distance (4446).
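The distance computation can be sketched in Python as follows. The function name `breakpoint_distances` is hypothetical; the fallback of half the alignment length and the alignment length of 8892 positions are taken from the numbers given in the text (4446 and a maximal distance of 8891).

```python
def breakpoint_distances(true_breakpoints, predicted_breakpoints, alignment_length):
    """Distance from each true breakpoint to the closest predicted breakpoint.

    If nothing was predicted, half the alignment length is used as distance.
    """
    if not predicted_breakpoints:
        return [alignment_length // 2] * len(true_breakpoints)
    return [
        min(abs(t - p) for p in predicted_breakpoints)
        for t in true_breakpoints
    ]


print(breakpoint_distances([500], [550], alignment_length=8892))
print(breakpoint_distances([500, 3000], [], alignment_length=8892))
```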

Sequences and references

Our final reference set consists of subtypes A1–A4, A6–A7, B, C, D, F1, F2, G, H, J, K, L, AE, O, N and P and 26 nonhuman primate sequences (not included in https://geno2pheno.org). We included CRF01_AE because there is no reference E; without it, we could never detect a recombination of A and E. In contrast, reference G is included in our set, so we can detect a recombination of A and G without including CRF02_AG. We reduced the reference set to 22 by creating a consensus sequence for each distinct subtype. For the special subtype A6, we added the Russian reference ‘A6.RU.2020.RU_ERS_3635_2020.MZ427710’ to cover more sequence variability. Potential A6 sequences not in our reference set were collected in Tel Aviv, Israel (14), and Hamburg, Germany (28), and from GenBank (1017). All sequences come from A6 relevant regions, mostly Ukraine. A test set used by COMET (16) (https://comet.lih.lu/) for training holds 572 sequences.

HEV references consist of 135 sequences from subtypes 1a–1f, 2, 3a–3c, 3e–3h, 4a–4d, 4f–4i and 7.

The references used in the analysis were aligned against each other with the tool ‘mafft’ (17).

Synthetic sequences

References used for synthetic recombination were A1–A4, A6–A7, B, C, D, F1, F2, G, H, J, K, L, AE, O, N and P. For each sequence, we first uniformly sampled the number r of references (r ∈ {2, …, 5}). Next, we sampled r − 1 breakpoints, and for each section between two breakpoints or start/end point, we sampled a corresponding reference. Additionally, we uniformly replaced a fraction of nucleotides to introduce noise. We randomly substituted 0%, 10%, 25% and 50% of all nucleotides in the synthetic recombination. For each noise level, we created 100 independent synthetic recombinations.
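The sampling procedure can be sketched in Python as follows. This is an illustration under assumptions, not the code used for the study: the function name and the toy references are hypothetical, the r references are drawn without replacement (one per section), breakpoints are drawn uniformly over the alignment, and every noisy position is replaced by a different nucleotide.

```python
import random


def synthetic_recombinant(references, noise, rng):
    """Build one synthetic recombinant from aligned reference sequences.

    references: dict mapping subtype name to aligned sequence (equal lengths).
    noise: fraction of positions to substitute with a different nucleotide.
    Returns (sequence, breakpoints, parents).
    """
    names = list(references)
    length = len(references[names[0]])
    r = rng.randint(2, 5)                                  # number of references
    breakpoints = sorted(rng.sample(range(1, length), r - 1))
    parents = rng.sample(names, r)                         # one reference per section
    bounds = [0] + breakpoints + [length]
    seq = []
    for parent, start, end in zip(parents, bounds, bounds[1:]):
        seq.extend(references[parent][start:end])
    # Substitute a fixed fraction of positions to introduce noise.
    for pos in rng.sample(range(length), round(noise * length)):
        seq[pos] = rng.choice([b for b in "ACGT" if b != seq[pos]])
    return "".join(seq), breakpoints, parents


# Toy references of length 100 (hypothetical, for illustration only).
toy_refs = {
    "A1": "A" * 100, "B": "C" * 100, "C": "G" * 100,
    "D": "T" * 100, "G": "ACGT" * 25,
}
rng = random.Random(0)
for noise in (0.0, 0.1, 0.25, 0.5):
    seq, bps, parents = synthetic_recombinant(toy_refs, noise, rng)
    print(noise, parents, bps)
```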

Results

Predicting CRFs

We analyzed the performance of our new g2p-recco implementation by excluding CRFs from our set of known references and trying to correctly predict the excluded recombinations. In this analysis, we excluded complex (cpx) CRFs, because we decided to analyze them separately. Our differentiation of simple and complex CRFs is based on the annotation by the Los Alamos Sequence Database (https://www.hiv.lanl.gov/).

α ∈ (0, 1) is the tuning parameter. In general, the larger α, the more recombinations are favored over mutations. We optimize α by running the prediction once for each α ∈ {0.01, 0.02, …, 0.49, 0.5}. We use two accuracy measures with the first computing the fraction of reference subtypes annotated to the query sequence that are also included in the predicted subtypes (reference in prediction) and the second computing the predicted subtypes included in the references annotated to the query sequence (prediction in reference). A low accuracy for reference in prediction implies a high false-negative rate, annotated references are not predicted, and a low prediction in reference implies a high false-positive rate, predicted subtypes are not included in the annotated reference.

We choose the optimal α at the point where the two opposing accuracy measures are approximately equal (α = 0.09, Figure 1A). We also computed a conservative α that minimizes the false-positive rate (α = 0.03). For this, we take the maximum of the ‘prediction in reference’ accuracy, rounded to two decimal places, at the maximum value of the ‘reference in prediction’ accuracy. Conversely, we computed an α = 0.14 minimizing the false-negative rate. However, since both accuracies are already high at the point where they are equal, we choose the value α = 0.09, which is biased toward neither the false-positive nor the false-negative rate, for all subsequent analyses unless stated otherwise. Predicting CRFs purely from consensus base subtypes shows that we find the correct recombination in most cases (Figure 1B).
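Selecting the balanced α from a grid of evaluated values can be sketched as below. The function name `balanced_alpha` is hypothetical, and the accuracy values in the example are invented toy numbers, not the results shown in Figure 1.

```python
def balanced_alpha(alphas, ref_in_pred, pred_in_ref):
    """Return the alpha at which the two accuracy measures are closest."""
    best = min(
        zip(alphas, ref_in_pred, pred_in_ref),
        key=lambda row: abs(row[1] - row[2]),
    )
    return best[0]


# Invented toy curves: one measure rises with alpha, the other falls.
alphas = [0.03, 0.06, 0.09, 0.12]
ref_in_pred = [0.80, 0.90, 0.95, 0.97]
pred_in_ref = [0.99, 0.97, 0.95, 0.90]
print(balanced_alpha(alphas, ref_in_pred, pred_in_ref))
```

In the analysis described above, the grid would be α ∈ {0.01, 0.02, …, 0.5}, with one mean accuracy per measure and grid point.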

Figure 1.


Optimized α for noncomplex CRF prediction. We assessed the accuracy of the CRF prediction over a wide range of different α values (A). Accuracy for the optimized value α = 0.09 (B) shows a high accuracy for both predicted subtypes included in the CRF references and CRF references included in the prediction. Black bars denote the median and circles denote outliers. We also optimized α for a minimal false-positive rate (α = 0.03) and a minimal false-negative rate (α = 0.14).

We performed an analogous analysis and predicted all sequence subtypes for the much larger training set (572 sequences) used by COMET (16) (https://comet.lih.lu/) with the same consensus reference set as before. The results are virtually identical (Supplementary Figure S1).

Complex CRFs

Complex CRFs are recombinants of more than two subtypes, including unclassified subtypes. However, ignoring unclassified subtypes, two complex CRFs consist of only two and several of three subtypes. Supplementary Table S1 shows our manually curated list of complex subtypes and the literature references. We used g2p-recco to predict complex references (Supplementary Table S1, third column) and computed the accuracy as in the previous analysis. With regard to unclassified subtypes, for CRF45_cpx our results show no evidence of any third known subtype, because the predictions for both available sequences only include the classified subtypes. Conversely, for CRF25_cpx we predict an additional subtype J in two out of three sequences. In this case, subtype J could be the previously unclassified subtype.

Predictions of complex CRFs entail more false positives than predictions of noncomplex CRFs (Figure 2A); that is, they more often include a predicted subtype that is not contained in the annotation of the complex CRF. However, median accuracy over all complex CRFs still remains at 100%. The drop in accuracy is not surprising, because complex CRFs are complex not only by annotation but evidently also by their recombination complexity, which is defined by the number of breakpoints (Figure 2B). Complex CRFs have more predicted breakpoints, on average, than noncomplex CRFs (Wilcoxon rank-sum test, one-sided, P-value < 10⁻⁹). There are complex CRFs for which we predict only 2 breakpoints and others for which we predict up to 12. A high number of breakpoints usually, but not necessarily, correlates with a larger number of different subtypes. For example, a large number of breakpoints could just be the result of jumping back and forth between few subtypes along the sequence. In the next analysis, we show an example. The lower accuracy for ‘prediction in reference’ of complex CRFs compared to noncomplex CRFs implies that the number of subtypes also increases compared to the annotation; that is, we predict more and different subtypes than in the annotation.

Figure 2.


Complex CRFs. The accuracies for the complex CRFs (A) show more false positives than for the noncomplex CRFs. Black bars denote the median, circles denote outliers, the thin black bar denotes the minimum and the box denotes the range of the second quartile. The number of identified breakpoints is significantly greater (P-value < 10⁻⁹, Wilcoxon rank-sum test, one-sided) for complex CRFs compared to noncomplex CRFs (B). For visualization purposes, the distribution of breakpoints has been normalized (density area equal to 1) for both groups.

The similarity (fraction of identical nucleotides at the same position; see the ‘Materials and methods’ section) matrix among the paths of all complex sequences shows that, in general, sequences annotated by the same complex CRF have the highest pairwise similarity (Figure 3A). However, the paths for complex CRFs of type CRF36_cpx have the highest within-group diversity (Figure 3B). Two sequences share many segments along the path except for the end where they divert between subtypes G and AE. The third sequence is more similar to subtypes A3 and G, where the other two share the similarity to A1. A1 and A3 are similar sub-subtypes, but G is evidence for a strong difference in that segment.

Figure 3.


Complex diversity. The similarity matrix among all paths over all complex CRFs (A). Similarity is shown as a gradient of percentages from low to high. As expected, the different complex CRF groups cluster together showing that they are more similar to each other than to other complex CRFs. CRF36_cpx sequences are the most diverse (B), but still share many parts along their paths.

Synthetic recombinations

We randomly combined known references into new recombinant forms (n = 100) and assessed our predictive performance as before. We first uniformly sampled the number r of references (r ∈ {2, …, 5}). Next, we sampled r − 1 breakpoints, and for each section between two breakpoints, we sampled a corresponding reference. Additionally, we uniformly replaced a fraction of nucleotides to introduce noise. We randomly substituted 0%, 10%, 25% and 50% of all nucleotides in the synthetic recombinations. We compared the performance of our implementation g2p-recco (G2P) to COMET (16) (https://comet.lih.lu/) and the Stanford algorithm (STA) (18) (https://hivdb.stanford.edu/hivdb/by-sequences/). COMET is also designed to detect new recombinations (without breakpoints), while STA is restricted to known CRFs. Additionally, COMET uses a very fast algorithm and can therefore use 572 unique sequences as references for prediction.

Both G2P and COMET perform similarly well without noise (Figure 4A). STA is already unable to detect most recombinations without noise, but at least most of what STA predicts is included in the recombinations (low false-positive rate). As noise increases, performance naturally drops (Figure 4B–D). However, G2P’s accuracy remains high over all noise levels compared to COMET. Curiously, at 50% noise we found that COMET basically only predicts chimpanzee sequences. While we consider G2P to be a subtyper for HIV-1, we included chimpanzee reference sequences into our reference set for this analysis as well for a fair comparison. We have to add that the highest noise level (50%) in our analysis seems quite unrealistic. The median similarity fit falls significantly below 70% (Figure 4E). Sequences with such a low similarity are usually discarded as not of type HIV-1. However, up to noise levels of 10%, similarity is still around a quite reasonable 90%.

Figure 4.


Accuracy predicting synthetic recombinants. Left violins show the accuracy of references included in the prediction and right violins show the accuracy of our prediction included in the references for different noise levels (A–D) for each method, respectively. An X marks the mean accuracy. Black bars denote median (thick) and minimum/maximum (thin), circles denote outliers and boxes denote the second and third quartiles. The histograms (E) show the G2P fit of the query sequences to the predicted path over all noise levels. The violin plots corresponding to the histograms show the distance (in log to base 10) of synthetic breakpoints to predicted breakpoints (F).

For g2p-recco, we additionally examined the predicted breakpoints. We computed the distance to the synthetic breakpoints as described in the ‘Materials and methods’ section. The median distance between synthetic and predicted breakpoints was <100 (broken down to noise levels: 94, 63, 73, 48) positions (Figure 4F). We consider this a low number relative to the length of the multiple sequence alignment, i.e. the maximal distance of 8891 (length of alignment minus 1). This maximal distance was actually achieved by some outliers. Like the accuracy of subtype prediction, the precision of predicted breakpoints was stable against noise.

A6 analysis

HIV-1 subtype A6 has been getting more widespread over the last few years (19). Correct identification is important for deciding on appropriate treatment strategies (20,21). G2P supports A6 identification and also the detection of A6 as part of a recombination. We assess the practicability by the analysis of a data set of potential A6 sequences. The sequences were deemed to be potential A6 subtypes due to their sequence similarity to A6 references and their country of origin.

G2P predicts most (928) of the collected sequences as subtype A6 (Figure 5A). A substantial number of sequences (71) are classified as subtype B. The rest are mostly mixtures of different subtypes (A, B, D, AE, G) with some recombinants mostly including A6. Fit to the optimal reference(s) is very high (88–98%), which means that the non-A6 subtypes are unlikely to be artifacts (Figure 5B). The sequences from Israel and Germany are mostly A6 and not recombinant (Supplementary Figures S2 and S3).

Figure 5.


A6 analysis. Distribution of different subtypes (A) and the corresponding fit of each sequence to the prediction (B).

We further analyzed all A6 sequences for their stability. How large can the parameter α get until a sequence obtains the first breakpoint, second breakpoint and so on? For this purpose, we divided the sequences into the two groups ‘recombinant’ and ‘nonrecombinant’ defined by our previous results (Figure 5A). Naturally, the number of breakpoints increases monotonically with α (Figure 6A). The number of breakpoints is significantly higher in the recombinant group than in the nonrecombinant group for each α value. A closer look at a conservative (i.e. high breakpoint penalty) interval for α ∈ (0, 0.26) reveals that there seems to be a distinct difference in diversity between the recombinant and nonrecombinant groups (Figure 6B). Our default value of 0.09 identifies all recombinants as such, as expected. A smaller value of 0.08 would also be sufficient to identify 97.5% of all recombinants. Up until α = 0.15, all of the nonrecombinant group sequences have exactly zero identified breakpoints. This shows that this group is much more stable, on average: the cost of a mutation must be raised to almost twice our default value (α = 0.17) before breakpoints appear.

Figure 6.


A6 recombinant population. The median and 95% intervals of the number of breakpoints as a function of α. The number of breakpoints is significantly larger for each increasing α for the sequences identified as recombinants with default α = 0.09 (A). Especially for small α, the sequences identified as single sequence subtypes show no breakpoints at all (zoomed in, B).

Hepatitis E virus

Unclassified sequences

Recombinations of different HEV genotypes or subtypes are not considered to be frequent events. We analyzed previously unidentified subtypes of HEV genotype 4, but without having any prior information. That is, we do not know whether they are of genotype 4.

Out of nine previously unidentified subtypes, three were predicted as 4c, others as 4a, 4b, 4d and 4h, one sequence had a fit of below 70% to its predicted subtype and one was predicted as the only recombinant of 4g (77%), 4c (16%) and 4h (7%). While the percentage (7%) is low, the length of the partial sequence to be predicted as 4h is over 500 bp. The fit to the predicted subtype of all identified sequences ranged from 86% to 94%. However, the recombinant sequence is the only one with a reasonable length of 7227 bp, while all other sequences are mere fractions of lengths <350 bp. For comparison, we submitted all nine sequences to https://www.genomedetective.com/app/typingtool/hev/, which uses a phylogenetic approach to subtype sequences. The recombinant sequence was put closest to 4g, which is the subtype predicted by g2p-recco for 77% of the sequence. The next closest subtypes were 4h, 4c and 4f, which, except for 4f, is also in agreement with the g2p-recco prediction. From the other eight short sequences, both methods made the same prediction for only two sequences. In general, predictions are quite uncertain for sequences as short as these.

The output of our G2P web services consists of not only the information of subtypes and their respective segment along the sequence, but additionally the identified breakpoints with respect to the query itself. Additionally, g2p-recco returns a figure of the sequence’s path with respect to the predicted references (Figure 7).

Figure 7.


Path of an HEV recombinant. The path along references for the previously unidentified HEV sequence. Rows show the sequences and columns show the positions. This picture was taken from our HEV tool at https://hev.geno2pheno.org. The recombinant sequence starts the path with 4h, jumps to 4g, then 4c and back to 4g till the end.

We inspected the recombinant HEV sequence in more detail. We inspected the cost as a function of α and noted all values of α at which a novel breakpoint was introduced (Figure 8). The first breakpoint already occurs at α = 0.05, almost half of our default value (0.09). At α = 0.06, the number of breakpoints and subtype references has already reached three as shown in Figure 7. The number of breakpoints is monotonically increasing with α as expected. The number of involved references could theoretically decrease, but is also mostly increasing, except from α = 0.06 to 0.07 where it decreases from three to two only to increase again to three at α = 0.09 (not shown), while the number of breakpoints stays constant at three. The cost is also monotonically increasing with α up to 0.5, when cheap jumps decrease the overall cost. We consider results from large α values with up to 602 breakpoints and 32 of 34 subtypes not realistic.

Figure 8.


Breakpoints and subtypes for the HEV sequence identified as recombinant as a function of α. Vertical lines indicate values of α at which the number of breakpoints changes. For some such lines, the numbers of involved subtypes are shown in the top row and the numbers of breakpoints are shown below to indicate the magnitude. As the number of breakpoints increases, the color of the vertical lines changes.

Worldwide HEV recombinations

We looked at recombination events of the HEV-GLUE database stratified by continent (Africa, Americas, Asia, Europe, Oceania). We used the R package ‘countrycode’ (22) to assign sequences to continents. Apart from Asia and Europe, sample sizes were low (Table 1). Recombination events with 1–3% are very rare except for African sequences with 10% recombinants. Fisher’s exact test (one-sided) implies that African sequences are recombinants more often than Asian or European sequences (P-values <0.01). The average fit is also among the highest for Africa with 0.96 and is similar to that of the sequences without recombinations. Our data imply that recombinants have a worse fit, in general.

Table 1.

Recombination events (rec) per continent

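| Continent | Recombinants | Fit (rec) | Fit (overall) | N (rec) | N (overall) |
|---|---|---|---|---|---|
| Asia | 0.01 | 0.86 | 0.94 | 27 | 2353 |
| Europe | 0.02 | 0.89 | 0.94 | 68 | 4414 |
| Africa | 0.1 | 0.96 | 0.96 | 4 | 40 |
| Oceania | 0 | NA | 0.96 | 0 | 6 |
| Americas | 0.03 | 0.93 | 0.95 | 3 | 113 |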

The table shows the continent, fraction of recombinants, fit of sequences classified as recombinants, average fit over all sequences, and number of recombinants and number of sequences.
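The one-sided comparison of continents can be retraced from the counts in Table 1. The sketch below implements a one-sided Fisher's exact test with the Python standard library only; the function name `fisher_one_sided` is hypothetical, and this is not the code used for the published analysis.

```python
from math import comb


def fisher_one_sided(rec_a, n_a, rec_b, n_b):
    """P-value for group A having a higher recombinant fraction than group B.

    Computes P(X >= rec_a) for a hypergeometric X: n_a sequences drawn from
    the pooled n_a + n_b sequences, of which rec_a + rec_b are recombinants.
    """
    total, successes = n_a + n_b, rec_a + rec_b
    upper = min(n_a, successes)
    tail = sum(
        comb(successes, x) * comb(total - successes, n_a - x)
        for x in range(rec_a, upper + 1)
    )
    return tail / comb(total, n_a)


# Counts from Table 1: (recombinants, sequences).
africa, asia, europe = (4, 40), (27, 2353), (68, 4414)
p_africa_vs_asia = fisher_one_sided(*africa, *asia)
p_africa_vs_europe = fisher_one_sided(*africa, *europe)
print(f"Africa vs Asia:   P = {p_africa_vs_asia:.4f}")
print(f"Africa vs Europe: P = {p_africa_vs_europe:.4f}")
```

Because Python integers have arbitrary precision, the binomial coefficients are exact and only the final division is carried out in floating point.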

Discussion

We have introduced a new web service for geno2pheno (https://g2pi.geno2pheno.org) that predicts HIV-1 subtypes based on a nucleotide sequence. G2P can predict novel recombinants using a C++ implementation of the algorithm recco (11) included in https://geno2pheno.org (g2p-recco).

We have shown that a limited set of consensus HIV-1 subtypes without CRFs is enough to successfully predict CRFs and synthetic recombinants with high accuracy. Even highly diverse complex CRFs are well predicted by geno2pheno. The slightly higher false-positive rate of 8% on average (6% for noncomplex CRFs), meaning that we predict more subtypes than annotated, could be evidence for a much more complex picture than previously thought. Our inferred high number of breakpoints for complex variants, significantly higher than for other CRFs, supports this hypothesis. While the different groups of complex CRFs were highly distinct from each other, within-group variance for sequences of the same complex CRF can still be high.

Our simulations with synthetic recombinants show that COMET is much more sensitive to noise than g2p-recco. Both methods are conservative in the sense that they have a low false-positive rate, in general. That is, most of the predicted subtypes are correct. g2p-recco is extremely robust. Even in the setting where the sequences are so highly mutated that we would not consider them HIV-1 anymore, g2p-recco’s predictive power remains strong. STA can detect only pure subtypes and known CRFs, and at no point was it able to compete with g2p-recco and COMET. However, when it comes to speed, COMET easily outperforms g2p-recco by almost an order of magnitude (15 s versus 110 s for 100 full genome HIV-1 sequences). Hence, if thousands of high-quality (low-noise) sequences need a speedy analysis, COMET might be the better choice. However, g2p-recco additionally provides accurate breakpoints with reference to the input sequence.

Due to recent sociopolitical developments, the HIV-1 subtype A6 is again of high interest. Correct classification can offer therapy options with a high success rate. For the most part, the sequences in our cohort of A6 sequences were predicted as pure A6. However, 71 of them (about 7%) were predicted as B and several were predicted as recombinants, sometimes but not always involving A6. Our analysis showed that the recombinant sequences were highly distinct from the other sequences. Varying the values of our tuning parameter α showed that the sequences not predicted as recombinants were quite stable. Only lowering the cost of a recombination event to 1 − α = 0.83 (α = 0.17) led to breakpoints. On the other hand, the sequences predicted to be recombinants got their first breakpoints at a much lower level of α (α = 0.08). This is evidence that recombinants are not formed continuously, e.g. by iterative accumulation of mutations, but by more drastic events, i.e. the breaking and fusing of two or more different virus variants.

Our subtyper is not restricted just to HIV-1. Currently, we also include HEV into our prediction (https://hev.geno2pheno.org). Our proof of concept showed that recombination events in HEV seem infrequent indeed (worldwide). However, we showed for a special case that recombinations are possible even with a much stricter penalty (α = 0.05) than our default in g2p-recco (α = 0.09). The extension of the tool to other viruses is planned. Our tool solely dedicated to subtype prediction without any other analysis on top is accessible at https://subtyping.geno2pheno.org.

In conclusion, we have shown that our integration of g2p-recco into geno2pheno, together with a limited set of consensus subtypes without CRFs, is a competitive and useful tool for subtype prediction. Currently, speed is our major limitation, which we plan to address in future versions. Novel emerging subtypes can be integrated into the current version without much effort and time.

Supplementary Material

ugae003_Supplemental_File

Acknowledgements

Author contributions: M.P., R.K. and T.L. conceptualized the study. M.P., J.B., G.F. and M.B. integrated the method into the web service (https://geno2pheno.org). D.T. and O.D. provided A6 sequences. D.T., M.B. and O.D. managed data curation and access. M.P. analyzed the data under the supervision of T.L. M.P., R.K. and T.L. interpreted the results. M.P., R.K. and T.L. wrote the initial draft of the manuscript. All authors have read and agreed to the published version of the manuscript.

Contributor Information

Martin Pirkl, Institute of Virology, Faculty of Medicine and University Hospital Cologne, University of Cologne, 50935 Cologne, Germany; German Center for Infection Research (DZIF), Partner Site Cologne-Bonn, 50935 Cologne, Germany.

Joachim Büch, Institute of Virology, Faculty of Medicine and University Hospital Cologne, University of Cologne, 50935 Cologne, Germany; German Center for Infection Research (DZIF), Partner Site Cologne-Bonn, 50935 Cologne, Germany.

Georg Friedrich, Max Planck Institute for Informatics, Saarland Informatics Campus, 66123 Saarbrücken, Germany.

Michael Böhm, Institute of Virology, Faculty of Medicine and University Hospital Cologne, University of Cologne, 50935 Cologne, Germany; German Center for Infection Research (DZIF), Partner Site Cologne-Bonn, 50935 Cologne, Germany.

Dan Turner, Crusaid Kobler AIDS Center, Tel Aviv Sourasky Medical Center, affiliated to the Sackler Faculty of Medicine, Tel Aviv University, 69978 Tel Aviv, Israel.

Olaf Degen, Outpatient Center of UKE GmbH, Division of Infectious Diseases, University Medical Center Hamburg-Eppendorf (UKE), 20251 Hamburg, Germany.

Rolf Kaiser, Institute of Virology, Faculty of Medicine and University Hospital Cologne, University of Cologne, 50935 Cologne, Germany; German Center for Infection Research (DZIF), Partner Site Cologne-Bonn, 50935 Cologne, Germany; EuResist Network GEIE, 00152 Rome, Italy.

Thomas Lengauer, Institute of Virology, Faculty of Medicine and University Hospital Cologne, University of Cologne, 50935 Cologne, Germany; German Center for Infection Research (DZIF), Partner Site Cologne-Bonn, 50935 Cologne, Germany; Max Planck Institute for Informatics, Saarland Informatics Campus, 66123 Saarbrücken, Germany.

Data availability

HIV-1 reference sequences for known subtypes and CRFs are available at the Los Alamos Sequence Database (https://www.hiv.lanl.gov/). We downloaded HIV-1 premade alignments of full genome HIV-1 subtype references. Most A6 sequences are publicly available at GenBank via accession numbers OK474701–OK474757, MW756383–MW756427, OL792300–OL792612, OP352923–OP353328 and OP353329–OP353518. A test set used by COMET (16) (https://comet.lih.lu/) for training was downloaded from the COMET website. A6 sequences from Tel Aviv and Hamburg have been deposited at GenBank with accession numbers PP116030–PP116071. HEV sequences were downloaded from HEV-GLUE (http://hev.glue.cvr.ac.uk/), including nine unidentified genotype 4 sequences. The source code accompanying this article is available at https://doi.org/10.5281/zenodo.10554802.
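For readers who want to retrieve the deposited sequences, the accession ranges above can be expanded into individual identifiers. The following is a small sketch with a hypothetical helper name; it assumes that both ends of a range share the same letter prefix and that the numeric part is contiguous, and it does not query GenBank itself.

```python
import re
from typing import List


def expand_accession_range(text: str) -> List[str]:
    """Expand e.g. 'PP116030–PP116071' into every accession of the inclusive range.

    ASSUMPTION: both ends share one letter prefix and the numbers are contiguous.
    """
    parts = re.split(r"\s*[–-]\s*", text.strip())
    if len(parts) == 1:
        return parts  # a single accession, nothing to expand
    if len(parts) != 2:
        raise ValueError(f"not a simple range: {text!r}")
    first = re.fullmatch(r"([A-Z]+)(\d+)", parts[0])
    last = re.fullmatch(r"([A-Z]+)(\d+)", parts[1])
    if not first or not last or first.group(1) != last.group(1):
        raise ValueError(f"range ends do not match: {text!r}")
    prefix, width = first.group(1), len(first.group(2))
    lo, hi = int(first.group(2)), int(last.group(2))
    if hi < lo:
        raise ValueError(f"descending range: {text!r}")
    # Keep the zero-padded width of the original numeric part.
    return [f"{prefix}{n:0{width}d}" for n in range(lo, hi + 1)]


if __name__ == "__main__":
    ids = expand_accession_range("PP116030–PP116071")
    print(len(ids), ids[0], ids[-1])
```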

Supplementary data

Supplementary Data are available at NARMME Online.

Funding

This work was funded by DZIF TTU HIV Immunecontrol and EuResist Network GEIE.

Conflict of interest statement. None declared.

References

  • 1. Deng P., Chen M., Si L. Temporal trends in inequalities of the burden of HIV/AIDS across 186 countries and territories. BMC Public Health. 2023; 23:981.
  • 2. Lessells R., Katzenstein D., de Oliveira T. Are subtype differences important in HIV drug resistance? Curr. Opin. Virol. 2012; 2:636–643.
  • 3. Burke D.S. Recombination in HIV: an important viral evolutionary strategy. Emerg. Infect. Dis. 1997; 3:253–259.
  • 4. Levy D.N., Aldrovandi G.M., Kutsch O., Shaw G.M. Dynamics of HIV-1 recombination in its natural target cells. Proc. Natl Acad. Sci. U.S.A. 2004; 101:4204–4209.
  • 5. Rhodes T.D., Nikolaitchik O., Chen J., Powell D., Hu W.-S. Genetic recombination of human immunodeficiency virus type 1 in one round of viral replication: effects of genetic distance, target cells, accessory genes, and lack of high negative interference in crossover events. J. Virol. 2005; 79:1666–1677.
  • 6. Song H., Giorgi E.E., Ganusov V.V., Cai F., Athreya G., Yoon H., Carja O., Hora B., Hraber P., Romero-Severson E. et al. Tracking HIV-1 recombination to resolve its contribution to HIV-1 evolution in natural infection. Nat. Commun. 2018; 9:1928.
  • 7. Zhang W., He Y., Wang H., Shen Q., Cui L., Wang X., Shao S., Hua X. Hepatitis E virus genotype diversity in eastern China. Emerg. Infect. Dis. 2010; 16:1630–1632.
  • 8. Shen H., Liu S., Ding M., Gu H., Chang M., Li Y., Wang H., Bai X., Shen H. A quadruple recombination event discovered in hepatitis E virus. Arch. Virol. 2021; 166:3405–3408.
  • 9. Feng Y., Wei H., Hsi J., Xing H., He X., Liao L., Ma Y., Ning C., Wang N., Takebe Y. et al. Identification of a novel HIV type 1 circulating recombinant form (CRF65_cpx) composed of CRF01_AE and subtypes B and C in western Yunnan, China. AIDS Res. Hum. Retroviruses. 2014; 30:598–602.
  • 10. Reis M.N.G., Guimarães M.L., Bello G., Stefani M.M.A. Identification of new HIV-1 circulating recombinant forms CRF81_cpx and CRF99_BF1 in central western Brazil and of unique BF1 recombinant forms. Front. Microbiol. 2019; 10:97.
  • 11. Maydt J., Lengauer T. Recco: recombination analysis using cost optimization. Bioinformatics. 2006; 22:1064–1071.
  • 12. R Core Team. R: A Language and Environment for Statistical Computing. 2022; Vienna, Austria: R Foundation for Statistical Computing.
  • 13. Bodenhofer U., Bonatesta E., Horejs-Kainrath C., Hochreiter S. msa: an R package for multiple sequence alignment. Bioinformatics. 2015; 31:3997–3999.
  • 14. Charif D., Lobry J., Bastolla U., Porto M., Roman H., Vendruscolo M. SeqinR 1.0-2: a contributed package to the R project for statistical computing devoted to biological sequences retrieval and analysis. Structural Approaches to Sequence Evolution: Molecules, Networks, Populations (Biological and Medical Physics, Biomedical Engineering). 2007; NY: Springer; 207–232.
  • 15. Bengtsson H. A unifying framework for parallel and distributed processing in R using futures. R J. 2021; 13:208–227.
  • 16. Struck D., Lawyer G., Ternes A.-M., Schmit J.-C., Bercoff D.P. COMET: adaptive context-based modeling for ultrafast HIV-1 subtype identification. Nucleic Acids Res. 2014; 42:e144.
  • 17. Katoh K., Toh H. Parallelization of the MAFFT multiple sequence alignment program. Bioinformatics. 2010; 26:1899–1900.
  • 18. Liu T.F., Shafer R.W. Web resources for HIV type 1 genotypic-resistance test interpretation. Clin. Infect. Dis. 2006; 42:1608–1618.
  • 19. Serwin K., Chaillon A., Scheibe K., Urbańska A., Aksak-Wąs B., Ząbek P., Siwak E., Cielniak I., Jabłonowska E., Wójcik-Cichy K. et al. Circulation of human immunodeficiency virus 1 A6 variant in the eastern border of the European Union—dynamics of the virus transmissions between Poland and Ukraine. Clin. Infect. Dis. 2023; 76:1716–1724.
  • 20. Orkin C., Schapiro J.M., Perno C.F., Kuritzkes D.R., Patel P., DeMoor R., Dorey D., Wang Y., Han K., Van Eygen V. et al. Expanded multivariable models to assist patient selection for long-acting cabotegravir + rilpivirine treatment: clinical utility of a combination of patient, drug concentration, and viral factors associated with virologic failure. Clin. Infect. Dis. 2023; 77:1423–1431.
  • 21. Cutrell A.G., Schapiro J.M., Perno C.F., Kuritzkes D.R., Quercia R., Patel P., Polli J.W., Dorey D., Wang Y., Wu S. et al. Exploring predictors of HIV-1 virologic failure to long-acting cabotegravir and rilpivirine: a multivariable analysis. AIDS. 2021; 35:1333–1342.
  • 22. Arel-Bundock V., Enevoldsen N., Yetman C. countrycode: an R package to convert country names and country codes. J. Open Source Softw. 2018; 3:848.

Articles from NAR Molecular Medicine are provided here courtesy of Oxford University Press
