PLOS Computational Biology. 2025 Nov 25;21(11):e1013676. doi: 10.1371/journal.pcbi.1013676

T-shaped alignments integrating HIV-1 near full-length genome and partial pol sequences can improve phylogenetic inference of transmission clusters

August Guang 1,*, Casey W Dunn 2, Vlad Novitsky 3, Mark Howison 4, Rami Kantor 3
Editor: Roger Dimitri Kouyos5
PMCID: PMC12685204  PMID: 41289307

Abstract

Molecular epidemiology and reconstruction of HIV-1 transmission networks can provide insights into transmission dynamics and inform public health strategies. Long HIV sequences, such as near full-length (nFL) genomes, can improve the accuracy of phylogenetic inference. However, relatively short pol sequences are still broadly used for inferring molecular HIV clusters. Whether a mix of long and short HIV-1 sequences can improve phylogenetic inference of molecular HIV clusters remains unknown.

We propose a flexible approach called T-shaped alignments that incorporates both nFL HIV-1 genomes and partial pol sequences, and investigate whether this approach improves phylogenetic reconstruction of molecular clusters. Under the assumption that clustering from 100% of long sequences is the most accurate, we obtained 1196 subtype B nFL HIV-1 sequences from the Los Alamos National Laboratory Database and a single-study subset, varied the proportion of long and short sequences in our T-shape alignments, systematically masked all non-pol regions with missing characters in proportional increments, and compared tree similarity and cluster inference among datasets.

With the full dataset, we found that when more than 50% of available sequences are nFL, the T-shaped alignment gradually yields results closer to the 100% nFL alignment, with more and larger clusters identified. However, below the 50% threshold accuracy did not increase. Stringent bootstrap thresholds decreased cluster accuracy gaps but also decreased the number of clusters found and the mean cluster size. For the subset dataset, we found that the introduction of nFL sequences to the T-shaped alignment improves accuracy in clustering either after a 30% threshold or immediately, depending on bootstrap choice.

Our new approach and results suggest that using T-shape alignments to mix HIV-1 sequences of different lengths can improve phylogenetic and clustering accuracy, with needed nFL proportion depending on analysis goals. The T-shape alignment provides a straightforward method for utilizing all available sequences to improve phylogenetic analysis.

Author summary

We introduce and explore a novel approach to analyzing HIV-1 clustering through advanced genomic sequence analysis techniques. Unlike traditional molecular cluster inference methods that focus on widely available short segments of the virus’s genetic code, our research leverages longer sequences, potentially offering a more detailed view of the virus’s transmission patterns, paving the way for improved public health strategies and more effective efforts to control the spread of HIV-1.

Recognizing that obtaining long sequences from each individual diagnosed with HIV is not always feasible, our approach combines both short and long sequences, enabling us to utilize all available data effectively. This innovative method, which we term “T-shaped alignment,” allows integration of sequences of varying lengths without requiring new software, maximizing flexibility and efficiency in analysis.

Using publicly available global sequence data, we found that this method may improve accuracy in inferring HIV-1 molecular clusters when more than a certain proportion of the data consists of long sequences. The threshold differs between different datasets. This finding, which should be validated in regional HIV sequence datasets, may ensure better use of genetic data to understand and control HIV-1 transmission. Our work underscores the importance of sequence length proportions in genetic analysis, providing preliminary insights that may be relevant for future public health initiatives.

Introduction

The use of molecular epidemiology in inferring transmission clusters and shared characteristics between individuals with HIV-1 has been critical to providing insights into disease transmission dynamics and informing public health strategies [1-3]. Molecular epidemiology defines clusters of infections through molecular sequencing of viruses and other pathogens, utilizing metrics such as genetic distance and phylogenetic bootstrap. While public health contact tracing methods aim to provide known personal connection histories, molecular epidemiology can infer partial or complete transmission chains through viral sequence similarities and reconstructing viral phylogenies [4]. Researchers have traditionally relied on HIV-1 partial pol sequences to construct phylogenetic trees and infer transmission clusters as they contain substantial phylogenetic signal [5], can be accurately used for HIV-1 transmission reconstruction [1], and are available through routine drug resistance testing in clinical care [6].

The development of long-read and next generation sequencing (NGS) covering near-full length (nFL) HIV-1 genomes has made such genomes increasingly available for use in phylogenetic analysis and clustering [7,8]. Studies using nFL HIV-1 genomes have demonstrated improved accuracy when using long nFL genomes over short partial pol sequences in both phylogenetic and clustering analyses [9-11]. While using nFL genomes may be desirable, it may not be feasible in most settings. Many individuals previously diagnosed with HIV may be virally suppressed, have transferred care, or may not be interested in providing samples for sequencing, and resources might be limited. Importantly, combining partial pol sequences with newly generated nFL sequences in a single alignment for phylogenetic inference can present technical challenges, and trimming sequences to the shortest length prior to phylogenetic inference is common.

A flexible approach that can incorporate both nFL HIV-1 genome sequences and partial pol sequences to possibly improve transmission cluster inference would be greatly advantageous because of the additional resolution that nFL genome sequences may provide. One such approach is to treat all partial pol sequences in the alignment as having missing characters for the rest of the genome, while still using the non-pol data, rather than only using pol sequences from the whole dataset. We refer to this as a T-shaped alignment, where, if sequences were sorted by completeness, the top of the T is the nFL genome sequences and the stem of the T is the pol-only sequences. Alignment methods that deal with sequence length heterogeneity in this way exist already, typically by computing a backbone alignment with the nFL sequences and then aligning fragments to the backbone alignment [12-14]. Standard phylogenetic methods can incorporate the T-shape alignment, bypassing the need for new software implementations. An additional advantage of this approach is that a sequence of any length from any overlapping region of the genome can be incorporated into this alignment. To the best of our knowledge, no prior studies have addressed such heterogeneous HIV-1 alignments.

We investigated whether the T-shape alignment approach would improve HIV phylogenetic reconstruction and molecular cluster analysis by looking into alignments with different proportions of nFL HIV-1 genomes and partial pol sequences from publicly available data, under the assumption that a 100% nFL genome alignment would provide the most accurate results. While the impact of different kinds of missing data on phylogenetic inference is a continuous area of research [15,16], generally increasing data in both characters and taxa has been found to improve phylogenetic tree resolution [17,18], which suggests that the T-shaped alignment should improve HIV phylogenetic inference and clustering, potentially better informing public health.

However, we found that simply adding nFL genome sequences into the alignment does not always lead to increased accuracy in both the phylogenetic tree topology and cluster inference, and in fact appears to be more misleading at lower mixture proportions. Instead, the topology and cluster inference gradually improve after a majority of sequences in the alignment are nFL genomes.

Materials and methods

We downloaded a filtered nFL genome alignment of all HIV-1 subtype B sequences published through the end of 2018 from the Los Alamos National Laboratory HIV database (full dataset) as our initial input [19] (Fig 1A) and assessed hypermutation in the alignment with HYPERMUT [20]. We found that out of the 1196 sequences, 4 were hypermutated. As this was a minimal number, we kept all sequences. To address potential biases arising from dataset heterogeneity and sequence source, we additionally created a subset of the filtered nFL genome alignment comprising the 116 subtype B nFL genomes sequenced in [21] (subset dataset), selected because it is the largest subset of sequences from within a single study and also contains geographic linkage information. This subset dataset served as a sensitivity analysis.

Fig 1. Workflow for alignment dataset generation and phylogenetic analysis.

(A) The process begins with a downloaded, filtered near full-length (nFL) HIV-1 subtype B genome alignment as input. (B) To generate multiple alignment samples, the order of taxa within the input alignment is randomly shuffled. (C) For each shuffled sample, various datasets are created by masking specific genomic regions outside the protease-reverse transcriptase (PRRT) region, using HIV-1 HXB2 coordinates to introduce gaps. (D) The processed alignment datasets are then used as input for IQTree to generate maximum likelihood phylogenies.

For both the full alignment dataset and the subset dataset, we created 100 alignment samples by shuffling the order of the taxa in the alignment randomly with a different seed each time (Fig 1B). For each sample, we created 13 datasets with different mixtures of nFL (referred to as wgs for whole genome sequences) and pol sequences by using the HIV-1 HXB2 (Genbank accession number K03455) genome coordinates to replace (mask) all characters outside of the protease-reverse transcriptase (PRRT) region with gaps according to the mixture of pol specified (Fig 1C). Eleven of the datasets had mixtures in increments of 10% of the taxa being nFL in the alignment, from 0% (pol only) up to 100%, while 2 of the datasets were mixtures of 95% nFL and 5% pol and 99% nFL and 1% pol, respectively. Alignments were labeled by the proportion of nFL sequences in the mixture, e.g. wgs40 represents 40% wgs and 60% pol sequences. This created 1,300 alignment datasets (100 samples × 13 mixtures) for each of the full and subset datasets.
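The shuffling and masking steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function name `make_t_shaped`, the choice to keep the first taxa after shuffling as nFL, and the column indices `region_start` and `region_end` (standing in for the PRRT boundaries located via HXB2) are all assumptions made for the example.

```python
import random

def make_t_shaped(alignment, nfl_fraction, region_start, region_end, seed=0):
    """Build one T-shaped mixture from an nFL alignment.

    alignment: dict mapping taxon name -> aligned sequence (all the same length).
    nfl_fraction: proportion of taxa left as full-length (e.g. 0.4 for wgs40).
    region_start, region_end: hypothetical alignment columns (0-based, end
    exclusive) of the region that is kept for the masked (pol-only) taxa.
    """
    names = sorted(alignment)
    random.Random(seed).shuffle(names)          # a different seed per sample
    n_nfl = round(nfl_fraction * len(names))    # assumption: first n_nfl stay nFL
    mixture = {}
    for rank, name in enumerate(names):
        seq = alignment[name]
        if rank < n_nfl:
            mixture[name] = seq                 # top of the T: untouched nFL row
        else:
            # Stem of the T: everything outside the region becomes gaps.
            mixture[name] = (
                "-" * region_start
                + seq[region_start:region_end]
                + "-" * (len(seq) - region_end)
            )
    return mixture

# Toy alignment of four taxa and ten columns; columns 3..6 play the role of PRRT.
toy = {"t1": "ACGTACGTAC", "t2": "TTTTTTTTTT", "t3": "GGGGGGGGGG", "t4": "CCCCCCCCCC"}
wgs50 = make_t_shaped(toy, 0.5, 3, 7, seed=1)
```

Because masked rows keep the original alignment length, the output can be passed to standard phylogenetic software without any format change.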

Each alignment, as well as the 100% whole genome alignment (wgs100), was run through the maximum likelihood phylogenetic inference program IQTree [22] to generate phylogenies with bootstrapping (Fig 1D). The IQTree algorithm treats gaps and missing characters as having no information, similar to most other maximum likelihood-based software such as RAxML and PhyML, and thus a site with missing taxa will have the same likelihood as if the missing taxa were not there. The midpoint clustering algorithm with bootstrapping thresholds 70, 80, 85, 90, 95, 99 was used to identify clusters on all phylogenies [23]. Genetic distance was not used due to the differing lengths of the sequences in the analysis and the lack of an established threshold for molecular clusters based on entire HIV genome sequences.
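The statement that a site with missing taxa has the same likelihood as if those taxa were absent can be checked on a toy example. The sketch below uses Felsenstein's pruning algorithm under the Jukes-Cantor (JC69) model, which is a deliberate simplification and not the substitution model used in this study; the tree, branch lengths, and function names are illustrative assumptions.

```python
import math

STATES = "ACGT"

def jc69_prob(i, j, t):
    """JC69 probability of observing state j after time t, starting from state i."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def leaf_partial(char):
    """Conditional likelihoods at a leaf; a gap or missing character is all ones."""
    if char in "-?N":
        return [1.0] * 4
    return [1.0 if s == char else 0.0 for s in STATES]

def partial(node, site):
    """Pruning step. A node is a leaf name (str) or a list of (child, branch_length)."""
    if isinstance(node, str):
        return leaf_partial(site[node])
    out = [1.0] * 4
    for child, t in node:
        child_l = partial(child, site)
        for i, s in enumerate(STATES):
            out[i] *= sum(jc69_prob(s, STATES[j], t) * child_l[j] for j in range(4))
    return out

def site_likelihood(tree, site):
    """Likelihood of one alignment column, with equal base frequencies at the root."""
    return sum(0.25 * x for x in partial(tree, site))

# Tree ((A:0.1, B:0.2):0.05, C:0.3) where taxon C is masked at this site...
cherry = [("A", 0.1), ("B", 0.2)]
tree_with_c = [(cherry, 0.05), ("C", 0.3)]
lik_missing = site_likelihood(tree_with_c, {"A": "A", "B": "G", "C": "-"})
# ...versus the same site evaluated on the tree with C removed entirely.
lik_pruned = site_likelihood(cherry, {"A": "A", "B": "G"})
```

The two values agree up to floating-point error because the all-ones vector for the masked leaf contributes a factor of one along its branch, and the stationary root distribution is unchanged by the extra internal branch.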

For each sample we compared phylogenies and clustering between the different mixture datasets and wgs100 under the assumption that the wgs100 phylogeny and clustering was the most accurate and thus could act as the reference results. For phylogenies we evaluated tree similarity to the wgs100 reference by computing the Robinson-Foulds distances [24] between wgs100 and all other mixture trees. We also ran a one-tailed paired t-test on the Robinson-Foulds distances.
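As an illustration of the tree comparison described above, the following sketch computes the Robinson-Foulds distance between two trees given as nested tuples of taxon names. It is a toy stand-in for the software actually used in the study; the tuple representation, the function names, and the normalization by 2(n - 3) (the maximum for fully resolved unrooted trees on n taxa) are assumptions of this example.

```python
def leaves(tree):
    """Set of taxon names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def splits(tree):
    """Non-trivial bipartitions of the tree, each stored as the side that does
    not contain a fixed reference taxon, so that rooting does not matter."""
    all_taxa = leaves(tree)
    ref = min(all_taxa)
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(walk(child) for child in node))
        side = all_taxa - clade if ref in clade else clade
        if 1 < len(side) < len(all_taxa) - 1:   # skip trivial splits
            found.add(side)
        return clade

    walk(tree)
    return found

def robinson_foulds(t1, t2):
    """Number of bipartitions present in exactly one of the two trees."""
    return len(splits(t1) ^ splits(t2))

def normalized_rf(t1, t2):
    """RF distance divided by its maximum for binary unrooted trees, 2(n - 3)."""
    n = len(leaves(t1))
    return robinson_foulds(t1, t2) / (2 * (n - 3))

reference = ((("A", "B"), "C"), ("D", "E"))   # plays the role of the wgs100 tree
mixture = ((("A", "C"), "B"), ("D", "E"))     # plays the role of a mixture tree
rf = robinson_foulds(reference, mixture)
rf_norm = normalized_rf(reference, mixture)
```

In this toy case the two trees share the split separating D and E from the rest but disagree on one split each, so two of the four possible splits differ.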

The Robinson-Foulds distances give us an idea of how similar mixture trees and the wgs100 tree are, but not how similar mixture trees are to other mixture trees. To further visualize and assess the similarity between phylogenies inferred from the different mixtures, we computed the Clustering Information Distance from the R package TreeDist as the tree distance metric between all pairs of trees, based on the recommendations outlined in [25], then projected the distance matrix onto 2D with a Principal Coordinates Analysis (PCoA).

For clustering we computed the mean number of clusters and mean cluster size for each mixture dataset at each bootstrap threshold to understand how cluster attributes changed as mixtures and bootstrap thresholds changed. We evaluated clustering congruence through two metrics: the Adjusted Mutual Information (AMI) [26] and the overlap coefficient. AMI was chosen due to the unbalanced cluster sizes [27], while the overlap coefficient was chosen due to its ability to capture subset relationships, as it directly measures the proportion of shared elements relative to the smaller of the two clusters. Both metrics were computed between 100% wgs and all other mixtures at the same bootstrap, i.e. clustering between 100% wgs and mixtures at 80 bootstrap were compared to each other but not to clustering at 90 bootstrap. For AMI, clusters were defined as matching if they shared all of the same members, and non-matching otherwise, including cases where one cluster was a superset of another. An AMI closer to 1 represented higher congruence, while an AMI closer to 0 represented lower congruence. For the overlap coefficient, a coefficient closer to 1 meant that more members of one cluster are also members of the other cluster, while a coefficient closer to 0 meant that almost no members of either cluster are members of the other.
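To make the AMI concrete, the sketch below computes it from two label vectors using only the Python standard library. Several choices here are assumptions rather than details taken from the study: every sequence is given a cluster label (an unclustered sequence would need its own label), the expected mutual information uses the hypergeometric model of random clusterings, and the normalization uses the arithmetic mean of the two entropies.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a clustering given as a label list."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(u, v):
    """Mutual information between two clusterings of the same items."""
    n = len(u)
    cu, cv, joint = Counter(u), Counter(v), Counter(zip(u, v))
    return sum(
        (nij / n) * math.log(n * nij / (cu[a] * cv[b]))
        for (a, b), nij in joint.items()
    )

def expected_mutual_information(u, v):
    """Expected MI of random clusterings with the same cluster sizes."""
    n = len(u)
    emi = 0.0
    for a in Counter(u).values():
        for b in Counter(v).values():
            for nij in range(max(1, a + b - n), min(a, b) + 1):
                # Hypergeometric probability that the two clusters share nij items.
                prob = math.comb(a, nij) * math.comb(n - a, b - nij) / math.comb(n, b)
                emi += prob * (nij / n) * math.log(n * nij / (a * b))
    return emi

def adjusted_mutual_information(u, v):
    """AMI with arithmetic-mean normalization (an assumption of this sketch)."""
    mi = mutual_information(u, v)
    emi = expected_mutual_information(u, v)
    denom = (entropy(u) + entropy(v)) / 2 - emi
    return 1.0 if abs(denom) < 1e-15 else (mi - emi) / denom

reference_labels = [0, 0, 0, 1, 1, 1, 2, 2]
same_partition = ["x", "x", "x", "y", "y", "y", "z", "z"]   # relabeled only
unrelated = [0, 1, 0, 1, 0, 1, 0, 1]
ami_same = adjusted_mutual_information(reference_labels, same_partition)
ami_unrelated = adjusted_mutual_information(reference_labels, unrelated)
```

Identical partitions score 1 regardless of how the clusters are named, while a clustering that cuts across the reference scores far lower, which is the sense in which a higher AMI indicates higher congruence.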

To investigate our first hypothesis for trends in phylogeny and clustering as mixture proportions changed, we looked at solely the wgs50 phylogeny and computed the number of clusters that had all pol sequences, all wgs sequences, and a mix of both. To investigate our second hypothesis, we masked all gp120 regions from the alignment and then reran the same methodology as above, i.e. IQTree followed by bootstrap-based clustering at different thresholds along with computing Robinson-Foulds distances and AMI. To investigate our third hypothesis, we extracted substitution model parameters from each prior IQTree sample run and created a boxplot of the parameters. We then reran the same methodology as above on all samples, but with the substitution model parameters from the wgs100 tree instead.

All analysis source code is available from: https://github.com/dunnlab/hiv_wide.

Results

Tree similarity increases after at least 40-50% of sequences are near full-length genomes

We assessed similarity between all pol- and whole genome sequence (wgs) mixture phylogenies and the wgs100 trees by plotting the normalized Robinson-Foulds distances, a measure of topological similarity between phylogenetic trees (Figs 2 and 3). For the full dataset, the wgs50 through wgs99 mixtures show decreasing distance to wgs100, indicating that their topologies are closer to wgs100 as the proportion of wgs grows (Fig 2). However, pol through wgs40 have varying distances to wgs100, and similarity to wgs100 does not necessarily increase with the proportion of wgs. The one-tailed paired t-test showed that wgs50 trees are closer to wgs100 than pol trees are (p-value 4.791e-14), but trees from mixtures with lower wgs proportions are not. Additionally, the spread of distance to wgs100 appeared to increase as the proportion of wgs increased. For the subset dataset, the wgs40 through wgs99 mixtures show decreasing distance to wgs100 (Fig 3), and the one-tailed paired t-test showed that wgs40 trees are closer to wgs100 than pol trees are (p-value 0.002534), but trees from mixtures with lower wgs proportions are not. As the proportion of wgs grows, the variation in distances to wgs100 also increases (larger whiskers and boxes), likely due to the longer sequences with more variable regions (e.g. env) filling in gaps and thus leading to more variation at those sites.

Fig 2. Phylogenetic distance between partial pol-wgs mixtures and nFL genome for full dataset.

Boxplot of Robinson-Foulds distance (Y axis) between maximum-likelihood phylogenies generated from different mixtures of pol and wgs sequences (X axis; e.g. wgs60 means 60-40 wgs-pol, respectively) and the phylogeny generated from an nFL genome alignment. Boxplots contain the first and third quartiles as well as the mean, with the vertical lines indicating the range. Points represent individual outlier distances.

Fig 3. Phylogenetic distance between partial pol-wgs mixtures and nFL genome for subset dataset.

Boxplot of Robinson-Foulds distance (Y axis) between maximum-likelihood phylogenies generated from different mixtures of pol and wgs sequences (X axis; e.g. wgs60 means 60-40 wgs-pol, respectively) and the phylogeny generated from an nFL genome alignment. Boxplots contain the first and third quartiles as well as the mean, with the vertical lines indicating the range. Points represent individual outlier distances.

Plotting mixtures onto the two-dimensional Principal Coordinates Analysis (2D PCoA) projection, percent variance explained over both axes (the measure of how much the first 2 principal coordinates contribute to variation in the distance matrix) was fairly low (Fig 4, 10.98% x-axis + 3.25% y-axis; Fig 5, 24.4% x-axis + 3.63% y-axis). For both the full dataset and the subset dataset, along the first variance dimension (X-axis), similarity to wgs100 directly increased as the proportion of wgs sequences increased (moving from right to left on the X-axis), suggesting that there is a direct relationship between increasing the proportion of wgs sequences and similarity to the wgs100 tree. Along the second variance dimension (Y-axis), similarity to wgs100 did not directly increase as the proportion of wgs sequences increased, with the points forming a U-shape instead. Additionally, for the full dataset, trees from different mixtures clearly clustered and separated on the 2D projection. For the subset dataset, trees from different mixtures did not clearly cluster at lower proportions of wgs. This suggests that while increasing the proportion of wgs sequences is the dominant factor and does linearly increase similarity to wgs100, there is a second non-linear factor impacting phylogenetic distance.

Fig 4. Principal Coordinates Analysis (PCoA) projected onto 2 dimensions between pol-wgs mixtures and nFL genome, full dataset.

X-axis is the first principal coordinate which explains the largest amount of data change, Y-axis is the second principal coordinate which explains the second largest amount of data change. Distances on axes summarize variability and are relative to the samples plotted, and colors indicate different sequence mixtures according to the legend. Along the first (highest) variance dimension, different mixtures varied directly from pol to wgs, while along the second variance dimension, pol and wgs were closer to each other than to wgs30.

Fig 5. Principal Coordinates Analysis (PCoA) projected onto 2 dimensions between pol-wgs mixtures and nFL genome, subset dataset.

X-axis is the first principal coordinate which explains the largest amount of data change, Y-axis is the second principal coordinate which explains the second largest amount of data change. Distances on axes summarize variability and are relative to the samples plotted, and colors indicate different sequence mixtures according to the legend. Along the first (highest) variance dimension, different mixtures varied directly from pol to wgs, while along the second variance dimension, most between-sample distances fell between 10 and -10. pol through wgs30 had higher spreads with less distinct clustering compared to the rest of the mixtures.

Cluster numbers and mean cluster size decrease until wgs50, then gradually increase

For each bootstrap threshold and dataset mixture, we computed the mean number of clusters and mean cluster size averaged over the dataset. For the full dataset, the number of clusters and mean cluster size decrease as bootstrap thresholds increase (Fig 6 and S1 Table). At bootstrap thresholds of 90, 95 and 99 the mean number of clusters decreases by a large degree as wgs proportion increases up to wgs50, then increases as wgs proportion increases from there. At bootstrap thresholds of 70, 80 and 85 the mean number of clusters decreases slightly or stays about the same as wgs proportion increases up to wgs50, then increases slightly as wgs proportion increases from there. Additionally, mean cluster size slightly decreases as wgs proportion increases up to wgs50, then increases from there.

Fig 6. Scatter and line plots of cluster number and mean cluster size as wgs percentage in mixture increases, full dataset.

The figure demonstrates (A) the number of clusters (Y axis) for each pol-wgs mixture dataset (X axis) at different bootstrap thresholds (colors; legend) and (B) mean cluster size (Y axis) for each pol-wgs mixture dataset (X axis) at different bootstrap thresholds (colors; legend). Trend lines with error bars are drawn for each bootstrap threshold as well as individual sample points. Sample points were plotted with a jitter to better display the spread across cluster number and mean cluster sizes, but only represent mixtures in proportions of 10, i.e. wgs0 (pol), wgs10, wgs20, wgs30, etc.

For the subset dataset, at bootstrap thresholds of 90, 95, and 99 the mean number of clusters stays around the same until wgs50, then increases as wgs proportion increases from there (Fig 7 and S2 Table). At bootstrap thresholds of 70, 80, and 85 the mean number of clusters appears to decrease slightly or stay about the same as wgs proportion increases up to wgs50, then increase slightly. Mean cluster size appears to increase steadily as wgs proportion increases. This provides some indication that increasing wgs sequence proportion after wgs50 can lead to detecting more and larger clusters.

Fig 7. Scatter and line plots of cluster number and mean cluster size as wgs percentage in mixture increases, subset dataset.

The figure demonstrates (A) the number of clusters (Y axis) for each pol-wgs mixture dataset (X axis) at different bootstrap thresholds (colors; legend) and (B) mean cluster size (Y axis) for each pol-wgs mixture dataset (X axis) at different bootstrap thresholds (colors; legend). Trend lines with error bars are drawn for each bootstrap threshold as well as individual sample points. Sample points were plotted with a jitter to better display the spread across cluster number and mean cluster sizes, but only represent mixtures in proportions of 10, i.e. wgs0 (pol), wgs10, wgs20, wgs30, etc.

Clustering improves after wgs50 for full dataset, and depends on bootstrap threshold for subset dataset

For each bootstrap threshold, we plotted the Adjusted mutual information (AMI; a measure to evaluate clustering similarity that accounts for chance) between clustering from each mixture phylogeny and the wgs phylogeny (Figs 8 and 9). For the full dataset, at all bootstrap thresholds, there is no linear relationship between including wgs and closeness of clustering results to the wgs tree (Fig 8). Wgs20 has the lowest AMI across all bootstrap thresholds, indicating that the transmission clusters inferred with the wgs20 tree are least congruent with wgs100. Pol, wgs10, wgs30, wgs40, and wgs50 all share similar AMIs. After wgs50 the congruence to wgs100 increases as proportion of wgs sequences increases, with higher similarity to wgs100 than to pol based on the t-test. This indicates that including wgs sequences can increase clustering accuracy, but only after at least 50% of sequences are wgs.

Fig 8. Adjusted mutual information (AMI) boxplot between phylogenetic clustering of different mixtures of pol and wgs sequences and all wgs, full dataset.

The figure demonstrates six panels for different bootstrap thresholds (indicated in gray in the top bars). Boxes in panels indicate the central 50% (top and bottom of boxes) and the median (thicker black line in boxes) of the AMI (Y axes) for a given set of pol-wgs mixture datasets (X axes). Whiskers indicate the range, and dots indicate outliers. Higher values of AMI indicate higher congruency in cluster inference between the mixture dataset and the wgs100 tree at the given bootstrap threshold.

Fig 9. Adjusted mutual information (AMI) boxplot between phylogenetic clustering of different mixtures of pol and wgs sequences and all wgs, subset dataset.

The figure demonstrates six panels for different bootstrap thresholds (indicated in gray in the top bars). Boxes in panels indicate the central 50% (top and bottom of boxes) and the median (thicker black line in boxes) of the AMI (Y axes) for a given set of pol-wgs mixture datasets (X axes). Whiskers indicate the range, and dots indicate outliers. Higher values of AMI indicate higher congruency in cluster inference between the mixture dataset and the wgs100 tree at the given bootstrap threshold.

For the subset dataset, at bootstrap thresholds of 70 and 99, AMI increases with the addition of wgs sequences, while at the other bootstrap thresholds, AMI increases after wgs30 based on t-test (Fig 9). At all other bootstrap thresholds, the AMI increases immediately. We note that the AMI is relatively low across all bootstrap thresholds – this result is expected because of the exponentially large number of possible clusters.

We additionally measured the overlap coefficient due to its ability to capture subset relationships where one cluster is contained within another. We plotted the proportion of wgs100 clusters recovered at two different overlap coefficient thresholds: 50% and 66% (Figs 10 and 11). If a cluster in a mixture phylogeny had an overlap coefficient of at least 50% or 66% with a cluster in the wgs100 tree, then we considered the cluster recovered in the mixture phylogeny. We found that for the full dataset (Fig 10), the improvement followed the same pattern as before, with at least 50% of sequences needing to be nFL before the proportion of recovered wgs100 clusters improved relative to an all-pol phylogeny. For the subset dataset (Fig 11), the increase in the proportion of recovered wgs100 clusters varied depending on bootstrap or overlap coefficient threshold, but generally increased steadily after 30%, if not before. With this metric the proportion of clusters recovered was generally higher, with recovery proportion approaching 100% at all bootstrap and overlap coefficient thresholds for the subset dataset.
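The recovery criterion above can be written down directly. The sketch below is one reading of it, not the authors' code: clusters are represented as sets of sequence names, a wgs100 cluster counts as recovered if any mixture cluster reaches the overlap threshold with it, and the function names and sequence identifiers are hypothetical.

```python
def overlap_coefficient(a, b):
    """Shared members divided by the size of the smaller cluster."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def proportion_recovered(reference_clusters, mixture_clusters, threshold=0.5):
    """Fraction of reference (wgs100) clusters matched by at least one mixture
    cluster with overlap coefficient >= threshold."""
    if not reference_clusters:
        return 0.0
    recovered = sum(
        1
        for ref in reference_clusters
        if any(overlap_coefficient(ref, mix) >= threshold for mix in mixture_clusters)
    )
    return recovered / len(reference_clusters)

wgs100_clusters = [{"s1", "s2", "s3"}, {"s4", "s5"}, {"s6", "s7", "s8"}]
mixture_clusters = [{"s1", "s2"}, {"s4", "s9"}, {"s6", "s10", "s11"}]
recovered_50 = proportion_recovered(wgs100_clusters, mixture_clusters, threshold=0.50)
recovered_66 = proportion_recovered(wgs100_clusters, mixture_clusters, threshold=0.66)
```

In the toy data the first mixture cluster is a strict subset of a wgs100 cluster and so has an overlap coefficient of 1, which illustrates why this metric is more forgiving than an exact-match comparison when one cluster is nested in another.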

Fig 10. Proportion of clusters recovered between different mixtures of pol and wgs sequences and all wgs based on overlap coefficient, full dataset.

The figure demonstrates six panels for different bootstrap thresholds (indicated in gray in the top bars). Boxes in panels indicate the central 50% (top and bottom of boxes) and the median (thicker black line in boxes) of the proportion of clusters recovered (Y axes) for a given set of pol-wgs mixture datasets (X axes). Whiskers indicate the range, and dots indicate outliers. Colors indicate overlap coefficient threshold to consider a cluster recovered. Higher values on the Y-axis indicate higher congruency in cluster inference between the mixture dataset and the wgs100 tree at the given bootstrap threshold.

Fig 11. Proportion of clusters recovered between different mixtures of pol and wgs sequences and all wgs based on overlap coefficient, subset dataset.

The figure demonstrates six panels for different bootstrap thresholds (indicated in gray in the top bars). Boxes in panels indicate the central 50% (top and bottom of boxes) and the median (thicker black line in boxes) of the proportion of clusters recovered (Y axes) for a given set of pol-wgs mixture datasets (X axes). Whiskers indicate the range, and dots indicate outliers. Colors indicate overlap coefficient threshold to consider a cluster recovered. Higher values on the Y-axis indicate higher congruency in cluster inference between the mixture dataset and the wgs100 tree at the given bootstrap threshold.

Possible explanations

We looked into three possible explanations for why combining long nFL HIV-1 sequences and short pol sequences does not always improve clustering and, at lower proportions of wgs, can even lead to less accurate results for large datasets with many unrelated sequences, such as the full dataset.

Pol and WGS do not cluster together.

One hypothesis about the reasoning for the incongruence in mixtures is that pol-only tips could spuriously group together, and wgs tips could spuriously group together. We found a slight bias towards pol sequences grouping together. We calculated the mean proportion of pol sequences in clusters at each mixture proportion, with the expectation that the proportion of pol sequences in clusters should be equal to the proportion of pol sequences in the phylogeny, but found that for every mixture proportion it was higher. For example, for wgs50, the mean proportion of pol sequences in clusters was around 0.64, suggesting a bias towards pol sequences in clusters.
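The composition check described above amounts to simple bookkeeping over the inferred clusters. The following sketch is illustrative only: it assumes clusters are non-empty lists of taxon names, that the set of masked (pol-only) taxa is known, and that the "mean proportion" is averaged per cluster, which is one possible reading of the text.

```python
def cluster_composition(clusters, pol_taxa):
    """Count all-pol, all-wgs, and mixed clusters, and return the mean
    per-cluster fraction of pol sequences."""
    pol_taxa = set(pol_taxa)
    counts = {"all_pol": 0, "all_wgs": 0, "mixed": 0}
    fractions = []
    for cluster in clusters:
        n_pol = sum(1 for taxon in cluster if taxon in pol_taxa)
        fractions.append(n_pol / len(cluster))
        if n_pol == len(cluster):
            counts["all_pol"] += 1
        elif n_pol == 0:
            counts["all_wgs"] += 1
        else:
            counts["mixed"] += 1
    mean_fraction = sum(fractions) / len(fractions) if fractions else float("nan")
    return counts, mean_fraction

# Hypothetical wgs50-style example: five pol-only taxa (p*) and five nFL taxa (w*).
clusters = [["p1", "p2"], ["w1", "w2", "w3"], ["p3", "w4"], ["p4", "p5", "w5"]]
pol_only = {"p1", "p2", "p3", "p4", "p5"}
counts, mean_pol_fraction = cluster_composition(clusters, pol_only)
```

Under the null expectation stated above, the mean pol fraction would sit near the proportion of pol sequences in the alignment (0.5 in this toy case); a consistently higher value would point to pol sequences preferentially grouping together.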

Fastest evolving sites do not contribute to phylogeny.

Another hypothesis was that the fastest evolving and polymorphic sites outside of pol skewed the phylogeny due to their extremely fast mutation rates leading more wgs tips to pull together due to long branch attraction [28,29]. In order to investigate this, we masked the gp120 region from all phylogenies and reproduced the analysis described above. However, we did not see any difference compared to using the whole genome.

Substitution model parameters do not affect phylogeny.

The third hypothesis was that due to the missing data, the model would be unable to infer the correct substitution model parameters for the phylogenetic tree inference, which would then impact both the phylogeny and the clustering results. To investigate this, we first looked at the substitution model parameters inferred from IQTree; the parameters differed significantly between the wgs alignment and the mixture datasets. The parameter estimates from the pol alignment in these cases were very different from the mixtures, with the pattern of estimates becoming closer to wgs as the proportion of wgs increased (Fig 12, S1 Figure, and S2 Figure). This suggests that when there is lots of missing data, the model is unable to estimate parameters for those regions appropriately. The pol alignment having very different parameters is unsurprising as pol evolves quite differently from the other regions. However, when we reran IQTree fixing the parameter estimates to those derived from wgs to see if the substitution model parameters affected the phylogeny, we found that they did not.

Fig 12. Impact of different substitution rate category parameters on cluster inference in different mixture datasets.

The figure explores the impact of different substitution rates in our choice of substitution model for IQTree (GTR+F+I+G4), with the 6 different substitution rates (A-C, A-G, A-T, C-G, C-T, G-T, gray bar on top of each panel) on the Y axis, ordered by the different mixture datasets (X axis). Boxes in panels indicate the central 50% (top and bottom of boxes) and the median (thicker black line in boxes) of the substitution rate, whiskers indicate the range, and dots indicate outliers.

Discussion

We assessed a T-shape alignment approach that can combine sequences of different lengths into phylogenetic analyses and cluster inference and does not require new software implementations. Applying this approach to HIV-1 subtype B sequences, we assessed its potential advantages over conventional short-only or long-only sequence datasets. We found that for large datasets with many unrelated sequences, at lower proportions of nFL genome sequences, the T-shape alignment does not seem to offer any improvement over the use of just pol. For smaller datasets, the T-shape alignment can offer improvement over the use of just pol, but the proportion of nFL genome sequences needed depends on the choice of bootstrap threshold as well. The difference between the large dataset and the smaller dataset could be due to the larger dataset coming from different studies and techniques, while the smaller dataset comes from a single region.

When over 50% of sequences are whole genome, combining whole genome and pol sequences into a T-shape alignment generates phylogenetic trees and clusters that are gradually closer to those from a 100% whole genome alignment as the proportion of nFL genome sequences in the alignment increases. Assuming that wgs has benefits over sequencing a single gene, this suggests that mixing sequences of different lengths can improve phylogenetic and clustering accuracy, contingent on reaching an nFL proportion threshold. The nFL proportion threshold depends on both dataset features and bootstrap threshold. These results are somewhat unexpected, as the phylogenetic literature suggests adding characters may improve tree resolution [17]. Contrary to the conventional belief that increasing data, both in characters and taxa, invariably enhances phylogenetic tree resolution, our findings underscore the nuanced dynamics of incorporating diverse sequence types.
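Closeness between two cluster assignments can be quantified with adjusted mutual information (AMI) [26]. The sketch below is a self-contained Python illustration of AMI with arithmetic-mean normalization; it is not the implementation used in the study, and how unclustered sequences are labeled (for example, each as its own singleton) is an assumption left to the caller.

```python
from collections import Counter
from math import exp, lgamma, log

def _entropy(counts, n):
    """Shannon entropy (natural log) of a partition with the given cluster sizes."""
    return -sum(c / n * log(c / n) for c in counts if c > 0)

def adjusted_mutual_info(labels_a, labels_b):
    """AMI of two clusterings given as equal-length lists of cluster labels."""
    n = len(labels_a)
    a = Counter(labels_a)                    # cluster sizes in clustering A
    b = Counter(labels_b)                    # cluster sizes in clustering B
    joint = Counter(zip(labels_a, labels_b)) # contingency table

    mi = sum(nij / n * log(n * nij / (a[i] * b[j]))
             for (i, j), nij in joint.items())

    # Expected MI when cluster sizes are fixed and labels are permuted at random
    # (hypergeometric model); factorials are handled in log space via lgamma.
    emi = 0.0
    for ai in a.values():
        for bj in b.values():
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                log_p = (lgamma(ai + 1) + lgamma(bj + 1)
                         + lgamma(n - ai + 1) + lgamma(n - bj + 1)
                         - lgamma(n + 1) - lgamma(nij + 1)
                         - lgamma(ai - nij + 1) - lgamma(bj - nij + 1)
                         - lgamma(n - ai - bj + nij + 1))
                emi += nij / n * log(n * nij / (ai * bj)) * exp(log_p)

    mean_h = (_entropy(a.values(), n) + _entropy(b.values(), n)) / 2
    denom = mean_h - emi
    # Degenerate case (e.g. both clusterings put everything in one cluster)
    return 1.0 if abs(denom) < 1e-12 else (mi - emi) / denom
```

Identical clusterings score close to 1, while clusterings that agree no better than chance score close to 0 or below.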

We explored possible explanations for this result. We looked at excluding fast evolving genes, which has been shown to produce different phylogenies in other contexts [30], but we did not see any difference compared to using the whole genome. We also looked at whether changing substitution model parameters would affect the results, but did not see any effect, although there is some prior phylogenetic literature on the limited impact of different substitution model parameters in other systems [31]. Finally, we assessed whether pol sequences may be clustering with other pol sequences, which would subsequently bias the clustering results as well. We found a slight bias towards pol sequences grouping together, suggesting a possible reason behind needing a specific nFL proportion threshold and an avenue for future investigation.
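One simple way to look for a tendency of pol-only sequences to group together is to compare the observed share of co-clustered pairs in which both members are pol-only with the share expected if the pol-only label were assigned at random among clustered sequences. The Python sketch below illustrates that idea only; it is not the test performed in the study, the function name `pol_pair_enrichment` is hypothetical, and clusters are assumed to be disjoint.

```python
from itertools import combinations

def pol_pair_enrichment(clusters, pol_only):
    """Return (observed, expected) fraction of co-clustered pairs that are pol-pol.

    clusters : list of disjoint sets of sequence names
    pol_only : set of names of sequences that were masked down to pol
    Returns None when there are no within-cluster pairs to evaluate.
    """
    members = [name for c in clusters for name in c]
    n = len(members)
    p = sum(name in pol_only for name in members)

    pairs = [(x, y) for c in clusters for x, y in combinations(sorted(c), 2)]
    if not pairs or n < 2:
        return None

    observed = sum(x in pol_only and y in pol_only for x, y in pairs) / len(pairs)
    # Probability that a random pair of clustered sequences is pol-pol
    expected = p * (p - 1) / (n * (n - 1))
    return observed, expected
```

An observed value above the expected value would be consistent with pol-only sequences preferentially grouping together, although a permutation test would be needed to judge whether the difference exceeds chance.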

Our findings have potential implications for HIV-1 phylogenetic and clustering studies. Specifically, in molecular HIV cluster analyses, mixing nFL HIV-1 genome sequences with partial pol sequences may be advantageous only if the proportion of nFL genome sequences exceeds a specific threshold. The threshold appears to differ based on dataset features. In analyses of large datasets with many unrelated sequences, improvement occurs when the majority of sequences are nFL, but in more local settings a lower proportion of nFL sequences may be sufficient. Future work will look into how the T-shaped alignment performs on a Rhode Island regional dataset, as regional clustering in a real setting may provide greater insight into transmission networks that could be relevant for informing public health interventions [32–35].

Bootstrap thresholds further influence clustering outcomes. Stricter thresholds reduce the gap in cluster accuracy between mixed datasets and wgs100, likely due to the conservative nature of bootstrap support [36], and also decrease the number and size of clusters found. For smaller datasets with many sequences from the same region, bootstrap threshold also affects the proportion of nFL sequences needed to improve accuracy. At the most relaxed and strictest bootstrap thresholds (70 and 99), the addition of nFL sequences into a T-shaped phylogeny always improves accuracy, while at thresholds in between, at least 30% nFL sequences are needed to improve accuracy. The number of clusters and mean cluster size are also influenced by the proportion of wgs sequences and bootstrap thresholds, with increases in number of clusters and mean cluster size after wgs50. These trends suggest that increasing wgs proportions facilitates the detection of larger and potentially more meaningful clusters, but depends on reaching a certain threshold of nFL sequences. Bootstrap threshold choice is often dependent on the stated public health or scientific goals, with stricter thresholds suggested for rapidly growing clusters or low viral diversity epidemics, and more relaxed thresholds suggested for routine public health tracking and longer time periods [37]. Our findings from varying the bootstrap threshold reinforce these suggestions. We further suggest that regardless of bootstrap threshold choice, incorporating at least 50% wgs sequences may provide an improvement in cluster accuracy and in the number and size of clusters detected. The impact of this approach on regional HIV epidemics and public health interventions is yet to be determined.

One possible explanation for our results that we did not fully explore is that the clusters fluctuate in size and composition but maintain a degree of consistency depending on the bootstrap threshold chosen. A preliminary analysis of a set of 10 randomly chosen clusters revealed that about half of the clusters simply lost or gained members as bootstrap threshold varied, but that the other half completely disappeared or reappeared. Further analysis following the categorizations and cluster typing outlined in [33] could shed further light on this.
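The distinction drawn above, between clusters that gain or lose members and clusters that disappear entirely, can be made concrete with a small Python sketch. The rule used here (a cluster has a counterpart if at least two of its members are still clustered together) and the function name `classify_cluster_fate` are illustrative assumptions, not the categorization scheme of [33].

```python
def classify_cluster_fate(cluster, other_clusters, min_shared=2):
    """Classify what happened to `cluster` under a different bootstrap threshold.

    cluster        : set of sequence names found at one threshold
    other_clusters : list of sets of names found at another threshold
    min_shared     : hypothetical cutoff; fewer shared members than this
                     means the cluster has no counterpart
    """
    # Counterpart = the cluster at the other threshold sharing the most members
    best = max(other_clusters, key=lambda c: len(cluster & c), default=set())
    shared = len(cluster & best)

    if shared < min_shared:
        return "disappeared"
    if best == cluster:
        return "unchanged"
    return "changed membership"
```

Applied in both directions (strict to relaxed and relaxed to strict), the same function would also flag clusters that reappear.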

Our study comes with limitations. First, we did not choose a genetic distance threshold when performing our analyses due to the heterogeneity of sequence types as well as lack of studies validating genetic distance thresholds for nFL genome sequences. Genetic distance thresholds are commonly used in pol based phylogenetic analyses, particularly as related to defining public health associated clusters, but choosing the relevant threshold is an active area of research [37–39]. Recent methods such as AUTO-TUNE [40] that systematically select distance thresholds based only on the sequencing data could be applied to our data in order to generate a set of distance thresholds for different mixture proportions that create more comparable clustering with consistent and desirable properties such as ratio of largest cluster to second-largest cluster size. It is also possible that within this set of distance thresholds, one in nFL genome sequences exists that makes clustering results congruent regardless of pol and nFL genome mixture level or makes results more congruent to 100% nFL genome alignments. We did perform an initial investigation into this approach, but did not find any clear distance thresholds. Second, recent work also suggests that bootstrap and genetic distance thresholds are heuristics and proposes using sampling dates to better estimate emerging clusters [41]. As the dataset used in this study does not contain precise sequencing or sampling dates, repeating this analysis with a dataset with available sample or sequencing dates would shed further light on these questions. Lastly, we used the wgs100 dataset, and the number and sizes of clusters identified in it, as the gold standard; however, the accuracy of this approach still needs to be determined.

In conclusion, we introduced T-shape alignment as an approach to leverage sequences of different lengths for phylogenetic cluster inference without requiring new software implementations. Practically, our results suggest that integrating such datasets into analyses of HIV molecular clusters is feasible, but they call for a cautious approach: inclusion of mixed-length sequences may be justified only when the majority of sequences are nFL genomes, and bootstrap threshold selection should be considered carefully. Since phylogenetic trees enhance pure genetic distance-based methods and supplement contact tracing in public health settings, being able to use nFL HIV-1 genome sequences may provide additional measures of cluster certainty when performing phylogenetic-based analyses.

Supporting information

S1 Table. Mean number of clusters and cluster size at different bootstrap thresholds and wgs proportions, full dataset.

Each row represents a different mixture of wgs and pol sequences, going from 100% pol to 100% wgs. Each column represents a bootstrap threshold for clustering, with the subcolumns indicating the mean number of clusters and mean cluster size found at that bootstrap. Cluster number was averaged over all samples for a given mixture, and cluster size was averaged over both all cluster sizes in a given sample, and over all samples.

(DOCX)

pcbi.1013676.s001.docx (66KB, docx)
S2 Table. Mean number of clusters and cluster size at different bootstrap thresholds and wgs proportions, subset dataset.

Each row represents a different mixture of wgs and pol sequences, going from 100% pol to 100% wgs. Each column represents a bootstrap threshold for clustering, with the subcolumns indicating the mean number of clusters and mean cluster size found at that bootstrap. Cluster number was averaged over all samples for a given mixture, and cluster size was averaged over both all cluster sizes in a given sample, and over all samples.

(DOCX)

pcbi.1013676.s002.docx (15.5KB, docx)
S1 Fig. Rates of different base frequencies from different mixture datasets.

The figure explores the impact of the 4 different base frequency proportions (pi(A), pi(C), pi(G), pi(T), gray bar on top of each panel) in our choice of substitution model for IQTree (GTR+F+I+G4). Proportion is on the Y-axis, and the different mixture datasets are ordered from pol to wgs100 on the X axis. Boxes in panels indicate the central 50% (top and bottom of boxes) and the median (thicker black line in boxes) of the proportions, whiskers indicate the range, and dots indicate outliers. There is a monotonic relationship between relative rate values and wgs proportion beginning with wgs10, when wgs sequences are introduced.

(TIFF)

pcbi.1013676.s003.tif (9.8MB, tif)
S2 Fig. Rates of different gamma relative rate categories from different mixture datasets.

The figure explores the impact of the 4 different gamma rate categories (gray bar on top of each panel) in our choice of substitution model for IQTree (GTR+F+I+G4). Rate is on the Y-axis, and the different mixture datasets are ordered from pol to wgs100 on the X axis. Boxes in panels indicate the central 50% (top and bottom of boxes) and the median (thicker black line in boxes) of the proportions, whiskers indicate the range, and dots indicate outliers. There is a monotonic relationship between relative rate values and wgs proportion beginning with wgs10, when wgs sequences are introduced.

(TIFF)

pcbi.1013676.s004.tif (9.2MB, tif)

Acknowledgments

Part of this research was conducted using computational resources and services at the Center for Computation and Visualization (CCV), Brown University. Thank you to Ashok Ragavendran, Joselynn Wallace, Eric Salomaki, and the rest of the CCV team for feedback and suggestions.

Data Availability

All data required to replicate the study’s findings can be found at https://www.hiv.lanl.gov/content/sequence/HIV/mainpage.html. All analysis source code is available at https://github.com/dunnlab/hiv_wide.

Funding Statement

This study was supported by grants from the National Institute of Allergy & Infectious Diseases of the National Institutes of Health (RO1AI136058, K24AI134359, P30AI042853) to RK and by an Institutional Development Award (IDeA) from the National Institute of General Medical Sciences of the National Institutes of Health (P20GM109035). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Leitner T, Escanilla D, Franzén C, Uhlén M, Albert J. Accurate reconstruction of a known HIV-1 transmission history by phylogenetic tree analysis. Proc Natl Acad Sci U S A. 1996;93(20):10864–9. doi: 10.1073/pnas.93.20.10864
2. Peters PJ, Pontones P, Hoover KW, Patel MR, Galang RR, Shields J, et al. HIV infection linked to injection use of oxymorphone in Indiana 2014–2015. N Engl J Med. 2016;375(3):229–39. doi: 10.1056/NEJMoa1515195
3. Fauci AS, Lane HC. Four decades of HIV/AIDS - much accomplished, much to do. N Engl J Med. 2020;383(1):1–4. doi: 10.1056/NEJMp1916753
4. Gore DJ, Schueler K, Ramani S, Uvin A, Phillips G 2nd, McNulty M, et al. HIV response interventions that integrate HIV molecular cluster and social network analysis: a systematic review. AIDS Behav. 2022;26(6):1750–92. doi: 10.1007/s10461-021-03525-0
5. Hué S, Clewley JP, Cane PA, Pillay D. HIV-1 pol gene variation is sufficient for reconstruction of transmissions in the era of antiretroviral therapy. AIDS. 2004;18(5):719–28. doi: 10.1097/00002030-200403260-00002
6. Hassan AS, Pybus OG, Sanders EJ, Albert J, Esbjörnsson J. Defining HIV-1 transmission clusters based on sequence data. AIDS. 2017;31(9):1211–22. doi: 10.1097/QAD.0000000000001470
7. Banin AN, Tuen M, Bimela JS, Tongo M, Zappile P, Khodadadi-Jamayran A, et al. Development of a versatile, near full genome amplification and sequencing approach for a broad variety of HIV-1 group M variants. Viruses. 2019;11(4):317. doi: 10.3390/v11040317
8. Lambrechts L, Bonine N, Verstraeten R, Pardons M, Noppe Y, Rutsaert S, et al. HIV-PULSE: a long-read sequencing assay for high-throughput near full-length HIV-1 proviral genome characterization. Nucleic Acids Res. 2023;51(20):e102. doi: 10.1093/nar/gkad790
9. Novitsky V, Moyo S, Lei Q, DeGruttola V, Essex M. Importance of viral sequence length and number of variable and informative sites in analysis of HIV clustering. AIDS Res Hum Retroviruses. 2015;31(5):531–42. doi: 10.1089/AID.2014.0211
10. Yebra G, Hodcroft EB, Ragonnet-Cronin ML, Pillay D, Brown AJL, PANGEA_HIV Consortium, et al. Using nearly full-genome HIV sequence data improves phylogeny reconstruction in a simulated epidemic. Sci Rep. 2016;6:39489. doi: 10.1038/srep39489
11. Guang A, Howison M, Ledingham L, D’Antuono M, Chan PA, Lawrence C, et al. Incorporating within-host diversity in phylogenetic analyses for detecting clusters of new HIV diagnoses. Front Microbiol. 2022;12:803190. doi: 10.3389/fmicb.2021.803190
12. Katoh K, Frith MC. Adding unaligned sequences into an existing alignment using MAFFT and LAST. Bioinformatics. 2012;28(23):3144–6. doi: 10.1093/bioinformatics/bts578
13. Shen C, Zaharias P, Warnow T. MAGUS+eHMMs: improved multiple sequence alignment accuracy for fragmentary sequences. Bioinformatics. 2022;38(4):918–24. doi: 10.1093/bioinformatics/btab788
14. Park M, Ivanovic S, Chu G, Shen C, Warnow T. UPP2: fast and accurate alignment of datasets with fragmentary sequences. Bioinformatics. 2023;39(1):btad007. doi: 10.1093/bioinformatics/btad007
15. Simmons MP. A confounding effect of missing data on character conflict in maximum likelihood and Bayesian MCMC phylogenetic analyses. Mol Phylogenet Evol. 2014;80:267–80. doi: 10.1016/j.ympev.2014.08.021
16. Smith BT, Mauck WM, Benz BW, Andersen MJ. Uneven missing data skew phylogenomic relationships within the lories and Lorikeets. Genome Biol Evol. 2020;12(7):1131–47. doi: 10.1093/gbe/evaa113
17. Wiens JJ, Morrill MC. Missing data in phylogenetic analysis: reconciling results from simulations and empirical data. Syst Biol. 2011;60(5):719–31. doi: 10.1093/sysbio/syr025
18. Molloy EK, Warnow T. To include or not to include: the impact of gene filtering on species tree estimation methods. Syst Biol. 2018;67(2):285–303. doi: 10.1093/sysbio/syx077
19. Apetrei C, Hahn B, Rambaut A, Wolinsky S, Brister J, Keele B. HIV sequence compendium. 2021. https://www.hiv.lanl.gov/content/sequence/HIV/COMPENDIUM/2021/sequence2021.pdf
20. Lapp Z, Yoon H, Foley B, Leitner T. Hypermut 3: identifying specific mutational patterns in a defined nucleotide context that allows multistate characters. Bioinform Adv. 2025;5(1):vbaf025. doi: 10.1093/bioadv/vbaf025
21. Pessôa R, Loureiro P, Esther Lopes M, Carneiro-Proietti ABF, Sabino EC, Busch MP, et al. Ultra-deep sequencing of HIV-1 near full-length and partial proviral genomes reveals high genetic diversity among Brazilian blood donors. PLoS One. 2016;11(3):e0152499. doi: 10.1371/journal.pone.0152499
22. Nguyen L-T, Schmidt HA, von Haeseler A, Minh BQ. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol Biol Evol. 2015;32(1):268–74. doi: 10.1093/molbev/msu300
23. Ragonnet-Cronin M, Hodcroft E, Hué S, Fearnhill E, Delpech V, Brown AJL, et al. Automated analysis of phylogenetic clusters. BMC Bioinformatics. 2013;14:317. doi: 10.1186/1471-2105-14-317
24. Robinson DF, Foulds LR. Comparison of phylogenetic trees. Mathematical Biosciences. 1981;53(1–2):131–47. doi: 10.1016/0025-5564(81)90043-2
25. Smith MR. Robust analysis of phylogenetic tree space. Syst Biol. 2022;71(5):1255–70. doi: 10.1093/sysbio/syab100
26. Vinh NX, Epps J, Bailey J. Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. Journal of Machine Learning Research. 2010;11.
27. Romano S, Vinh NX, Bailey J, Verspoor K. Adjusting for chance clustering comparison measures. Journal of Machine Learning Research. 2016;17.
28. Felsenstein J. Cases in which parsimony or compatibility methods will be positively misleading. Systematic Biology. 1978;27(4):401–10. doi: 10.1093/sysbio/27.4.401
29. Bergsten J. A review of long-branch attraction. Cladistics. 2005;21(2):163–93. doi: 10.1111/j.1096-0031.2005.00059.x
30. Superson AA, Battistuzzi FU. Exclusion of fast evolving genes or fast evolving sites produces different archaean phylogenies. Mol Phylogenet Evol. 2022;170:107438. doi: 10.1016/j.ympev.2022.107438
31. Abadi S, Azouri D, Pupko T, Mayrose I. Model selection may not be a mandatory step for phylogeny reconstruction. Nat Commun. 2019;10(1):934. doi: 10.1038/s41467-019-08822-w
32. Novitsky V, Steingrimsson J, Howison M, Dunn C, Gillani FS, Manne A, et al. Longitudinal typing of molecular HIV clusters in a statewide epidemic. AIDS. 2021;35(11):1711–22. doi: 10.1097/QAD.0000000000002953
33. Novitsky V, Steingrimsson J, Howison M, Dunn CW, Gillani FS, Fulton J, et al. Not all clusters are equal: dynamics of molecular HIV-1 clusters in a statewide Rhode Island epidemic. AIDS. 2023;37(3):389–99. doi: 10.1097/QAD.0000000000003426
34. Kantor R, Steingrimsson J, Fulton J, Novitsky V, Howison M, Gillani F, et al. Prospective evaluation of routine statewide integration of molecular epidemiology and contact tracing to disrupt human immunodeficiency virus transmission. Open Forum Infect Dis. 2024;11(10):ofae599. doi: 10.1093/ofid/ofae599
35. Novitsky V, Steingrimsson J, Guang A, Dunn CW, Howison M, Gillani FS, et al. Dynamics of clustering rates in the Rhode Island HIV-1 epidemic. AIDS. 2025;39(2):105–14. doi: 10.1097/QAD.0000000000004062
36. Zharkikh A, Li WH. Statistical properties of bootstrap estimation of phylogenetic variability from nucleotide sequences: II. Four taxa without a molecular clock. J Mol Evol. 1992;35(4):356–66. doi: 10.1007/BF00161173
37. Novitsky V, Steingrimsson JA, Howison M, Gillani FS, Li Y, Manne A, et al. Empirical comparison of analytical approaches for identifying molecular HIV-1 clusters. Sci Rep. 2020;10(1):18547. doi: 10.1038/s41598-020-75560-1
38. Rose R, Cross S, Lamers SL, Astemborski J, Kirk GD, Mehta SH, et al. Persistence of HIV transmission clusters among people who inject drugs. AIDS. 2020;34(14):2037–44. doi: 10.1097/QAD.0000000000002662
39. Liu M, Han X, Zhao B, An M, He W, Wang Z, et al. Dynamics of HIV-1 molecular networks reveal effective control of large transmission clusters in an area affected by an epidemic of multiple HIV subtypes. Front Microbiol. 2020;11:604993. doi: 10.3389/fmicb.2020.604993
40. Weaver S, Dávila Conn VM, Ji D, Verdonk H, Ávila-Ríos S, Leigh Brown AJ, et al. AUTO-TUNE: selecting the distance threshold for inferring HIV transmission clusters. Front Bioinform. 2024;4:1400003. doi: 10.3389/fbinf.2024.1400003
41. Chato C, Feng Y, Ruan Y, Xing H, Herbeck J, Kalish M, et al. Optimized phylogenetic clustering of HIV-1 sequence data for public health applications. PLoS Comput Biol. 2022;18(11):e1010745. doi: 10.1371/journal.pcbi.1010745
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1013676.r001

Decision Letter 0

Roger Dimitri Kouyos, Katharina Kusejko

31 Mar 2025

PCOMPBIOL-D-25-00292

T-shaped alignments integrating HIV-1 near full-length genome and partial pol sequences can improve phylodynamic inference of transmission clusters

PLOS Computational Biology

Dear Dr. Guang,

Thank you for submitting your manuscript to PLOS Computational Biology. After careful consideration, we feel that it has merit but does not fully meet PLOS Computational Biology's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

​Please submit your revised manuscript within 60 days. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at ploscompbiol@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pcompbiol/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

* A rebuttal letter that responds to each point raised by the editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. This file does not need to include responses to formatting updates and technical items listed in the 'Journal Requirements' section below.

* A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

* An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, competing interests statement, or data availability statement, please make these updates within the submission form at the time of resubmission. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

We look forward to receiving your revised manuscript.

Kind regards,

Katharina Kusejko

Academic Editor

PLOS Computational Biology

Roger Kouyos

Section Editor

PLOS Computational Biology

Additional Editor Comments (if provided):

The two reviewers have raised several important points that should be addressed. In addition, I have the following comments:

1) In the "Author Summary", the authors write that a novel approach for analyzing the spread of HIV-1 is explored. This is somewhat misleading, as this manuscript does not look into HIV-1 transmission, but into clustering results in a convenience sample. No conclusions about any HIV transmission dynamics can be drawn here. Similarly, lines 226-229 in the Discussion concerning public health go, in my opinion, far beyond what was done here; a recommendation concerning cluster certainty should be avoided.

2) For this manuscript, as this describes mainly the workflow for exploring the impact of different fractions of full-length genomes, having the "Methods" section before the "Results" section would improve the reading flow.

3) Methods: To me it is not clear - also Reviewer 1 pointed this out - what was shuffled 1000 times randomly. This is a crucial point: Did you shuffle the order of sequences every time, and thus different sequences were "cropped" to consist of pol only? I am worried that if this was not shuffled properly, and thus always the same sequences were cropped, this might lead to artificial clustering due to sequences in the Los Alamos Database being more similar. Please clarify this aspect in the discussion. You might also think about a "methods" Figure clearly explaining the algorithm and different measures you applied to assess the impact of the fraction of full-genomes.

4) Selection of all available Los Alamos sequences: As these sequences stem from different studies and techniques (as Reviewer 2 pointed out), it would be interesting to see a sensitivity analysis looking into a subset of the sequences stemming from one study only (i.e., you could take the largest sub-study with available wgs). In my opinion, with this quite biased selection of available sequences, it is not clear which phylogenetic patterns would be expected at all - the current selection of sequences do not belong to one sub-epidemic in a country/region/time-frame/sub-population/etc. Thus the results are also hard to interpret.

5) Cluster Definition: You defined clusters as matching only if all members were the same. This is a very strict definition, and in my opinion might be too strict. E.g., if you have a cluster with 10 members, even if this cluster is still there with only one member difference, this would lead to a "non-match". I suggest to at least run a sensitivity analysis applying other cluster definition, e.g. more than 50%, or more than 2/3 of the members being the same. The result presented in Figure 4 might be a consequence of this strict definition: It is concerning that already when comparing all wgs to wgs90, the AMI is only between 0.3 to 0.5. If the AMI is anyway quite low, does the difference between the lower wgs values (wgs90, wgs80, wgs70,...) matter so much? It seems that even 10% missing wgs is leading to a much worse result. To understand this better: could you include results of wgs99, wgs95 or other intermediate values as well to understand this "jump" at the beginning? This also questions the whole "take-home message" that at least 50% wgs is beneficial: the actual changes in AMI are minimal.

6) Discussion: please do not use the term "phylodynamic" here, as no dynamic aspect (e.g. cluster growth) was investigated

Journal Requirements:

1) We ask that a manuscript source file is provided at Revision. Please upload your manuscript file as a .doc, .docx, .rtf or .tex. If you are providing a .tex file, please upload it under the item type ‘LaTeX Source File’ and leave your .pdf version as the item type ‘Manuscript’.

2) We noticed that you used the phrase 'data not shown' in the manuscript. We do not allow these references, as the PLOS data access policy requires that all data be either published with the manuscript or made available in a publicly accessible database. Please amend the supplementary material to include the referenced data or remove the references.

3) Please ensure that the funders and grant numbers match between the Financial Disclosure field and the Funding Information tab in your submission form. Note that the funders must be provided in the same order in both places as well.

- State the initials, alongside each funding source, of each author to receive each grant. For example: "This work was supported by the National Institutes of Health (####### to AM; ###### to CJ) and the National Science Foundation (###### to AM)."

- State what role the funders took in the study. If the funders had no role in your study, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.".

If you did not receive any funding for this study, please simply state: “The authors received no specific funding for this work.”

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The study by Guang et al assessed the impact of having sequences with varying length on clustering analysis, or more specific how much improvement including full-length sequences can have on the overall clustering outputs. This is an important aspect of clustering, particularly when it is used for public health use in HIV, where information on clusters may be used to inform policies. Thus, accuracy is important. Overall, it is a neat study looking at a common problem in sequence analysis. The study design seems sound, though I am not familiar with the Robinson-Foulds distance and the adjusted mutual index. Thus, I cannot comment on the results for these.

Though, this is a computational journal I struggled to fully understand the study and believe the manuscript could benefit from more detail and clarifications. Also, how does the IQTree algorithm deal with missing data, for example does it ignore positions with incomplete coverage? I may have missed it but some explanations on that would be helpful. Apologies if I missed this, but did including wgs improve bootstrapping support overall?

Minor comments:

Lines 74 to 79 could be re-written to clarify the outcomes. I didn’t understand fully what the distance along y-axis represents and what the distance. I understand how the points move along the axis from wgs20 to wgs90 but not why pol and wgs10 are ‘off’ the line, as in wgs10 is worse than pol?

Lines 253 to 260 not very clear

Line 144. Replace ‘developed’ with ‘assessed’ or similar as per my understanding the study did not ‘develop’ the approach, rather compared outputs.

Line 244. While sampling dates are important, I don’t think we can say bootstrap and genetic distances are ‘arbitrary’. I would rephrase this.

Line 245. Unclear what exactly was repeated 1000 times? The clustering? The generation of mixed length datasets?

Figures

Figure 1. It seems the more ‘wgs’ sequences are present the more variation is in the distance comparison (larger confidence intervals), why is that?

Figure 3 is a bit chaotic. Can you separate by bootstrap? Make each panel small, I think that should be sufficient.

Reviewer #2: See attachment.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: None

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

Figure resubmission:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. If there are other versions of figure files still present in your submission file inventory at resubmission, please replace them with the PACE-processed versions.

Reproducibility:

To enhance the reproducibility of your results, we recommend that authors of applicable studies deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Attachment

Submitted filename: Reviewer_comments.docx

pcbi.1013676.s005.docx (18.3KB, docx)
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1013676.r003

Decision Letter 1

Roger Dimitri Kouyos, Katharina Kusejko

9 Oct 2025

PCOMPBIOL-D-25-00292R1

T-shaped alignments integrating HIV-1 near full-length genome and partial pol sequences can improve phylogenetic inference of transmission clusters

PLOS Computational Biology

Dear Dr. Guang,

Thank you for submitting your manuscript to PLOS Computational Biology. After careful consideration, we feel that it has merit but does not fully meet PLOS Computational Biology's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript within 30 days (by Dec 09 2025 11:59PM). If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at ploscompbiol@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pcompbiol/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

* A rebuttal letter that responds to each point raised by the editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. This file does not need to include responses to formatting updates and technical items listed in the 'Journal Requirements' section below.

* A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

* An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, competing interests statement, or data availability statement, please make these updates within the submission form at the time of resubmission. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

We look forward to receiving your revised manuscript.

Kind regards,

Katharina Kusejko

Academic Editor

PLOS Computational Biology

Roger Kouyos

Section Editor

PLOS Computational Biology

Journal Requirements:

1) Thank you for stating "Publicly available data at https://www.hiv.lanl.gov/content/sequence/HIV/mainpage.html was used for the analysis." Please confirm whether all data needed to replicate the study's findings are at this link. If so, please update your Data Availability statement to "All data required to replicate the study's findings can be found here https://www.hiv.lanl.gov/content/sequence/HIV/mainpage.html."

Note: If the reviewer comments include a recommendation to cite specific previously published works, please review and evaluate these publications to determine whether they are relevant and should be cited. There is no requirement to cite these works unless the editor has indicated otherwise.

Reviewers' comments:

Reviewer's Responses to Questions

Reviewer #2: This manuscript has improved significantly through the additional analyses and the points discussed. Although the study is very thoroughly conducted and has been improved, the setup reflects a highly experimental scenario with many uncertainties. Thus, I would suggest taking an even more neutral stance on public health implementation and on recommendations for incorporating such nFL genomes (e.g. L278-287).

Minor points to address:

- L 106/107: You have two unfinished sentences regarding the overlap coefficient.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

Figure resubmission:

While revising your submission, we strongly recommend that you use PLOS’s NAAS tool (https://ngplosjournals.pagemajik.ai/artanalysis) to test your figure files. NAAS can convert your figure files to the TIFF file type and meet basic requirements (such as print size, resolution), or provide you with a report on issues that do not meet our requirements and that NAAS cannot fix.

After uploading your figures to PLOS’s NAAS tool - https://ngplosjournals.pagemajik.ai/artanalysis, NAAS will process the files provided and display the results in the "Uploaded Files" section of the page as the processing is complete. If the uploaded figures meet our requirements (or NAAS is able to fix the files to meet our requirements), the figure will be marked as "fixed" above. If NAAS is unable to fix the files, a red "failed" label will appear above. When NAAS has confirmed that the figure files meet our requirements, please download the file via the download option, and include these NAAS processed figure files when submitting your revised manuscript.

Reproducibility:

To enhance the reproducibility of your results, we recommend that authors of applicable studies deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1013676.r005

Decision Letter 2

Roger Dimitri Kouyos, Katharina Kusejko

29 Oct 2025

Dear Dr. Guang,

We are pleased to inform you that your manuscript 'T-shaped alignments integrating HIV-1 near full-length genome and partial pol sequences can improve phylogenetic inference of transmission clusters' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Katharina Kusejko

Academic Editor

PLOS Computational Biology

Roger Kouyos

Section Editor

PLOS Computational Biology

***********************************************************

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1013676.r006

Acceptance letter

Roger Dimitri Kouyos, Katharina Kusejko

PCOMPBIOL-D-25-00292R2

T-shaped alignments integrating HIV-1 near full-length genome and partial pol sequences can improve phylogenetic inference of transmission clusters

Dear Dr Guang,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

For Research, Software, and Methods articles, you will receive an invoice from PLOS for your publication fee after your manuscript has reached the completed accept phase. If you receive an email requesting payment before acceptance or for any other service, this may be a phishing scheme. Learn how to identify phishing emails and protect your accounts at https://explore.plos.org/phishing.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Zsofia Freund

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Table. Mean number of clusters and cluster size at different bootstrap thresholds and wgs proportions, full dataset.

    Each row represents a different mixture of wgs and pol sequences, going from 100% pol to 100% wgs. Each column represents a bootstrap threshold for clustering, with the subcolumns indicating the mean number of clusters and the mean cluster size found at that bootstrap threshold. Cluster number was averaged over all samples for a given mixture, and cluster size was averaged both over all cluster sizes in a given sample and over all samples.

    (DOCX)

    pcbi.1013676.s001.docx (66KB, docx)
    S2 Table. Mean number of clusters and cluster size at different bootstrap thresholds and wgs proportions, subset dataset.

    Each row represents a different mixture of wgs and pol sequences, going from 100% pol to 100% wgs. Each column represents a bootstrap threshold for clustering, with the subcolumns indicating the mean number of clusters and the mean cluster size found at that bootstrap threshold. Cluster number was averaged over all samples for a given mixture, and cluster size was averaged both over all cluster sizes in a given sample and over all samples.

    (DOCX)

    pcbi.1013676.s002.docx (15.5KB, docx)
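The averaging described for the S1 and S2 tables can be illustrated with a minimal Python sketch. It is not taken from the authors' code: the function name `summarize_mixture`, the toy cluster sizes, and the reading of "averaged over both" as a mean of per-sample means are all assumptions made for illustration, as is the choice to skip samples with no clusters when averaging cluster size.

```python
from statistics import mean


def summarize_mixture(samples):
    """Summarize one (mixture, bootstrap threshold) cell of the table.

    `samples` is a list with one entry per sampled dataset; each entry is
    the list of cluster sizes found in that sample (hypothetical layout).
    Returns (mean number of clusters, mean cluster size).
    """
    # Mean number of clusters: average the cluster count over all samples.
    mean_n_clusters = mean(len(sizes) for sizes in samples)

    # Mean cluster size: average within each sample first, then across
    # samples. Samples with no clusters are skipped (an assumption).
    per_sample_means = [mean(sizes) for sizes in samples if sizes]
    mean_cluster_size = mean(per_sample_means) if per_sample_means else float("nan")

    return mean_n_clusters, mean_cluster_size


# Toy data: three samples for one mixture at one bootstrap threshold.
toy_samples = [[2, 3, 4], [2, 2], [5]]
n_clusters, cluster_size = summarize_mixture(toy_samples)
print(f"mean clusters = {n_clusters}, mean cluster size = {cluster_size:.2f}")
```

Note that a mean of per-sample means weights every sample equally, whereas pooling all clusters across samples would weight samples with many clusters more heavily; the two can differ when cluster counts vary between samples.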
    S1 Fig. Rates of different base frequencies from different mixture datasets.

    The figure explores the impact of the 4 different base frequency proportions (pi(A), pi(C), pi(G), pi(T), gray bar on top of each panel) in our choice of substitution model for IQTree (GTR+F+I+G4). Proportion is on the Y-axis, and the different mixture datasets are ordered from pol to wgs100 on the X axis. Boxes in panels indicate the central 50% (top and bottom of boxes) and the median (thicker black line in boxes) of the proportions, whiskers indicate the range, and dots indicate outliers. There is a monotonic relationship between relative rate values and wgs proportion beginning with wgs10, when wgs sequences are introduced.

    (TIFF)

    pcbi.1013676.s003.tif (9.8MB, tif)
    S2 Fig. Rates of different gamma relative rate categories from different mixture datasets.

    The figure explores the impact of the 4 different gamma rate categories (gray bar on top of each panel) in our choice of substitution model for IQTree (GTR+F+I+G4). Rate is on the Y-axis, and the different mixture datasets are ordered from pol to wgs100 on the X axis. Boxes in panels indicate the central 50% (top and bottom of boxes) and the median (thicker black line in boxes) of the proportions, whiskers indicate the range, and dots indicate outliers. There is a monotonic relationship between relative rate values and wgs proportion beginning with wgs10, when wgs sequences are introduced.

    (TIFF)

    pcbi.1013676.s004.tif (9.2MB, tif)
    Attachment

    Submitted filename: Reviewer_comments.docx

    pcbi.1013676.s005.docx (18.3KB, docx)
    Attachment

    Submitted filename: reviewer_response_v3.docx

    pcbi.1013676.s006.docx (142.1KB, docx)
    Attachment

    Submitted filename: response_to_reviewers.docx

    pcbi.1013676.s007.docx (110.1KB, docx)

    Data Availability Statement

    All data required to replicate the study’s findings can be found here https://www.hiv.lanl.gov/content/sequence/HIV/mainpage.html. All analysis source code is available at https://github.com/dunnlab/hiv_wide.


    Articles from PLOS Computational Biology are provided here courtesy of PLOS
