PLOS Computational Biology
2021 Feb 1;17(2):e1008537. doi: 10.1371/journal.pcbi.1008537

Factors influencing estimates of HIV-1 infection timing using BEAST

Bethany Dearlove 1,2, Sodsai Tovanabutra 1,2, Christopher L Owen 1,2, Eric Lewitus 1,2, Yifan Li 1,2, Eric Sanders-Buell 1,2, Meera Bose 1,2, Anne-Marie O’Sullivan 1,2, Gustavo Kijak 1,2, Shana Miller 1,2, Kultida Poltavee 1,2, Jenica Lee 1,2, Lydia Bonar 1,2, Elizabeth Harbolick 1,2, Bahar Ahani 1,2, Phuc Pham 1,2, Hannah Kibuuka 3, Lucas Maganga 4, Sorachai Nitayaphan 5, Fred K Sawe 6, Jerome H Kim 7, Leigh Anne Eller 1,2, Sandhya Vasan 1,2, Robert Gramzinski 1, Nelson L Michael 8, Merlin L Robb 1,2, Morgane Rolland 1,2,*; the RV217 Study Team
Editor: Roger Dimitri Kouyos
PMCID: PMC7877758  PMID: 33524022

Abstract

While large datasets of HIV-1 sequences are increasingly being generated, many studies rely on a single gene or fragment of the genome and few comparative studies across genes have been done. We performed genome-based and gene-specific Bayesian phylogenetic analyses to investigate how certain factors impact estimates of the infection dates in an acute HIV-1 infection cohort, RV217. In this cohort, HIV-1 diagnosis corresponded to the first RNA positive test and occurred a median of four days after the last negative test, allowing us to compare timing estimates using BEAST to a narrow window of infection. We analyzed HIV-1 sequences sampled one week, one month and six months after HIV-1 diagnosis in 39 individuals. We found that shared diversity and temporal signal were limited in acute infection, and insufficient to allow timing inferences in the shortest HIV-1 genes; dated phylogenies were therefore primarily analyzed for env, gag, pol and near full-length genomes. There was no one best-fitting model across participants and genes, though relaxed molecular clocks (73% of best-fitting models) and the Bayesian skyline (49%) tended to be favored. For infections with single founders, the infection date was estimated to be around one week pre-diagnosis for env (IQR: 3–9 days) and gag (IQR: 5–9 days), whilst the genome placed it at a median of 10 days (IQR: 4–19). Multiply-founded infections proved problematic to date. Our ability to compare timing inferences to precise estimates of HIV-1 infection (within a week) highlights that molecular dating methods can be applied to within-host datasets from early infection. Nonetheless, our results also suggest caution when using uniform clock and population models or short genes with limited information content.

Author summary

Molecular dating using phylogenetics allows us to estimate the date of an infection from time-stamped within-host sequences alone. There are large datasets of HIV-1 sequences, but genome and gene analyses are not often performed in parallel and rarely with the possibility to compare results against a known narrow window of infection. We showed that all but the longest genes are near-clonal in acute infection, with little information for dating purposes. For infections with single founders, we estimated the eclipse phase—the time between HIV-1 exposure and the first positive diagnostic test—to last between one and two weeks using env, gag, pol and near full-length genomes. This approach could be used to narrow the date of suspected infection in ongoing clinical trials for the prevention of HIV-1 infection.

Introduction

The exact age of a human immunodeficiency virus (HIV-1) infection is rarely known, as there is a lag between the time of infection and diagnosis. However, knowing when infection occurred is important for assessing incidence rates, transmission dynamics and preventive interventions. Fiebig stages are used to define the progression of the early stages of clinical infection, using the time until laboratory tests for HIV-1 antigens and HIV-1-specific antibodies become positive [1]. Fiebig stage I refers to the earliest time that the virus can be identified in the bloodstream, by viral RNA test using PCR. Preceding this stage is the eclipse phase, corresponding to the interval between initial exposure and HIV-1 detection, during which replication occurs first near the site of HIV-1 transmission. Previously, studies using serologic and RNA tests on plasma serially-sampled from blood bank donors estimated the eclipse phase to last between 3 and 33 days, with several reports emphasizing a window of ten to twelve days [2–5]. Besides diagnostic assays, HIV-1 sequences derived from infected participants can be used to estimate the age of an infection. Using the within-host sequence diversity that accumulates through error-prone replication after infection, it is possible to reconstruct the evolutionary history of a viral infection within an individual. Previous work has shown that polymorphic nucleotides in the pol region can be used to identify recent infection events within one year [6], and the precision of these estimates has improved with deep-sequencing efforts [7]. We recently showed that it is possible to estimate the date of infection with a resolution on the scale of days using a cohort with a known narrow window of infection, the RV217 cohort [8,9].
This cohort enrolled more than 3,000 seronegative high-risk individuals in four countries (Kenya, Tanzania, Thailand and Uganda) for twice-weekly HIV-1 RNA tests and identified 155 acute infections, thereby providing a unique opportunity to compare dating methods against real data, as the tight bounds on the time between the last negative and first positive test can be used to evaluate the accuracy of molecular dating methods. Diagnosis, or Day 0, was defined as the first sample which was reactive for HIV-1 RNA. Near full-length (NFL) genome sequences were obtained for a subset of 39 acutely infected participants, including 10 men, 5 transgender women and 1 woman from Thailand, and 23 women from East Africa.

We previously analyzed env sequences to estimate the eclipse phase using the time to most recent common ancestor (TMRCA) of the within-host phylogeny of sequences from the first six months of infection [8]. Here, we use the same methodology to compare estimates of the date of infection for the same study participants using individual genes and NFL genomes. Large numbers of HIV-1 sequences are publicly available (>850,000), yet few studies have evaluated how findings can differ based on genome- or gene-specific analyses. We evaluated how certain parameters influenced the results of our BEAST analyses and identified best-practice approaches for minimizing their effect. We showed that genetic and temporal signal, model choice and founder multiplicity greatly affected timing estimates. Whilst estimation of the date of infection remained problematic for participants with infections founded by multiple variants, the estimated date of infection for participants with single founder infections tended to fall within one to two weeks prior to diagnosis when estimated using the NFL genome, env, pol and gag.

Results

Characterization of infections with single and multiple founder variants

We analyzed HIV-1 genomes from 39 participants sampled longitudinally at three time points after the initial detection of HIV-1 viremia by a positive HIV-1 RNA test: at approximately one week (median: 4 days), one month (median: 32 days) and six months (median: 170 days); this diagnosis occurred a median of four days after the last negative test [8]. Approximately ten NFL genomes were derived via endpoint-dilution from plasma samples for each participant at each time point, yielding about thirty NFL genomes per participant (total = 761 NFL genomes, 379 5’ half genomes, 416 3’ half genomes and 81 additional env sequences). In East Africa (n = 23 women), the infections were most commonly identified as subtype A1 (n = 9), followed by subtype C (n = 3) and various A1/D recombinants. In Thailand (n = 10 men, 5 transgender women and 1 woman), all but two of the infections were CRF01_AE (n = 14); the remaining infections were subtype B (male participant) and a B/CRF01_AE recombinant (transgender participant).

Infections were classified into single or multiple founders using the genome sequences across all time points and the current standard approach: visual inspection of sequence alignments, Highlighter plots and phylogenetic trees reconstructed in IQ-TREE, and quantitative measures of intra-host diversity including the maximum pairwise diversity and the ratio of shared to private mutations [8,10]. Additionally, we used a relatively new metric, the principal eigenvalue of the modified graph Laplacian (MGL), which summarizes the diversification patterns of phylogenetic trees and has been shown to positively correlate with the multiplicity of infection [8,11].

The majority of infections were founded by a single population (n = 28), with 11 infections founded by two or more distinct (but still closely related) lineages of variants. One week after diagnosis, the number of polymorphic sites was ten-fold higher in multiple founders (median: 136, IQR: 52.5–171) compared to those with single founders (median: 13, IQR: 11–16.5) for the NFL genome, with a two-fold difference still seen at six months (Fig 1A and 1B). Infections with multiple founders by definition had more phylogenetically informative sites (polymorphisms found in at least two sequences), due to the relatively deep branching events between founder variants (Fig 1C and 1D).
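The site counts above follow standard definitions: a column is polymorphic if more than one base is observed, and phylogenetically (parsimony) informative if at least two bases each appear in at least two sequences. A minimal sketch of that bookkeeping; the function name and toy alignment are illustrative, not from the study pipeline:

```python
from collections import Counter

def site_classes(alignment):
    """Count polymorphic and parsimony-informative columns.

    `alignment` is a list of equal-length nucleotide strings; gaps and
    ambiguity codes are ignored within each column.
    """
    polymorphic = informative = 0
    for col in zip(*alignment):
        counts = Counter(base for base in col if base in "ACGT")
        if len(counts) > 1:                      # more than one base observed
            polymorphic += 1
            # informative: at least two bases, each present in >=2 sequences
            if sum(n >= 2 for n in counts.values()) >= 2:
                informative += 1
    return polymorphic, informative

# Toy alignment: column 2 carries a singleton mutation, column 4 a
# mutation shared by two sequences (parsimony-informative).
seqs = ["ACGTA", "AAGTA", "ACGCA", "ACGCA"]
print(site_classes(seqs))  # → (2, 1)
```

On real within-host data, applying this per gene reproduces the pattern in Fig 1: single founder datasets yield many polymorphic but few informative columns.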

Fig 1. Informative sites identified in the first six months of infection across the 9 HIV-1 genes and genome for 39 participants in the RV217 cohort.

Fig 1

A) The number of polymorphic sites. B) The proportion of polymorphic sites. C) The number of informative sites. D) The proportion of polymorphic sites that were only found in one sequence. Genes are ordered by median sequence length across participants. Points are colored gray for participants with infections founded by a single variant, and pink for those founded by multiple variants.

The single versus multiple founders classification above was obtained using NFL genomes, and we wanted to investigate whether one of the methods used, the principal eigenvalue of the modified graph Laplacian (MGL), could yield similar results across the different genes despite the low number of polymorphic sites in some genes (Fig 2). To define a cutoff between the two subsets, we used the median, jump and partition criteria; all three gave consistent results for NFL genomes. Though results were broadly consistent, there was a discrepancy between the three criteria for four borderline cases (participant 20225 in env, 40061 in nef, 30112 in tat and 40112 in vpu). This was due to the distribution of the data, with the threshold effectively rounded up to the value of λ* at the threshold [11]. We also found discrepancies between genes. For example, participant 10220, who had an infection with multiple founders, was classified with the single founders for env, tat and rev, where the sequences were more homogeneous than in the rest of the genome. Likewise, the infection in participant 40265 corresponded to a single founder but was misclassified as a multiple founder for five out of nine genes.
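In the spectral framework of [11], the MGL is built from the matrix of pairwise path (patristic) distances between tree nodes: the usual adjacency matrix is replaced by that distance matrix and the degree matrix by its row sums, and the principal eigenvalue then summarizes tree shape. A hedged sketch of that construction; the 3-node input matrix is a toy example, and the exact normalization used in [11] may differ:

```python
import numpy as np

def mgl_principal_eigenvalue(patristic):
    """Principal eigenvalue of a modified graph Laplacian (MGL).

    `patristic` is a symmetric matrix of pairwise path distances
    between tree nodes; the MGL is L = D - patristic, where D is the
    diagonal matrix of row sums of the distance matrix.
    """
    d = np.asarray(patristic, dtype=float)
    lap = np.diag(d.sum(axis=1)) - d
    # eigvalsh returns eigenvalues of a symmetric matrix in ascending order
    return float(np.linalg.eigvalsh(lap).max())

# Toy distance matrix for three nodes on a path; deeper, more
# structured trees yield larger principal eigenvalues.
print(mgl_principal_eigenvalue([[0, 1, 2],
                                [1, 0, 1],
                                [2, 1, 0]]))  # ≈ 5.0
```

Larger principal eigenvalues are expected for multiple founder infections, whose trees carry deep branching events between founder lineages.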

Fig 2. Identification of infections with single versus multiple HIV-1 founders.

Fig 2

The principal eigenvalues from the modified graph Laplacian are compared for each participant and gene. Barplots are sorted in increasing order, and values shifted so that the smallest value is zero. Bars are colored according to whether participants were classified as a single founder (gray) or multiple founder (pink) with NFL genomes. White bars show sequence datasets in which there were no informative sites. Lines indicating thresholds inferred from the median (brown), jump (blue), and partition (red) criteria of the principal eigenvalue test of founder multiplicity are shown.

As we previously showed that estimates of the date of infection were poor for participants with multiple founders [8], we will focus on participants infected by a single founder for the majority of this paper. In the last section, we compare estimates of the eclipse phase using NFL genomes for multiple founders, including after splitting the sequences into founder subpopulations.

Low genetic diversity in the first month of infection for single founders

HIV-1 is measurably evolving, even over relatively short timescales. However, as we have seen, the information content varies across the genome and by participant. Not all HIV-1 genes were equally informative in the first six months of infection, and diversity was generally low in sequences from participants with single founders (Fig 1). The number of polymorphic sites increased with the length of the gene; the median number of polymorphic sites was ≤10 for all genes shorter than gag (vpu: 3; vpr: 4; tat: 4; rev: 6.5; vif: 8; nef: 10), with env showing the highest diversity (median: 48) (Fig 1A). Fig 1B shows that the proportion of polymorphic sites was generally consistent between genes (overall median: 0.013, IQR: 0.0087–0.0188), though it tended to be higher in env (median: 0.018, IQR: 0.013–0.021) and one of the genes overlapping with it, rev (median: 0.018, IQR: 0.011–0.03). The majority of mutations in single founder infections were singletons. The median percentage of polymorphic sites across participants found in only a single sequence ranged from 67% in tat to 100% in vpu and vpr (Fig 1D). Nearly two thirds of participants had no informative sites in vpu and vpr (63% and 61% respectively). This proportion was higher when only considering the first month since diagnosis (the first two time points), as shared mutations were rare even at the NFL genome level (median: 2, IQR: 1–4); the percentage of participants with no informative sites at the second sampling time point was greater than 60% for all genes except env (46%) and NFL genome (15%) (S1 Fig). Two participants, 10435 and 20263, had no informative sites in the NFL genomes after six months (NFL genome median for single founders: 25, IQR: 14–39). These two participants also had some of the lowest numbers of polymorphic sites over the first 6 months of infection, 55 and 66 respectively, compared to a median of 130 sites in the NFL genome for infections with single HIV-1 founders.

For accurate estimation of the root of our within-participant phylogenies using phylogenetic reconstruction in BEAST, it is important that there is sufficient information in the sequences [12,13]. A first step was to exclude participant datasets in which there was insufficient genetic signal, i.e. those that contained no informative sites (all the mutations were singletons). We also tested for temporal signal, by regressing the root-to-tip distance calculated from phylogenies reconstructed in IQ-TREE against the sampling times for each within-participant sequence set. Datasets without a significant, positive slope were removed from the phylogenetic analyses in BEAST (S2 Fig). This resulted in the exclusion of participant 20442 for env, tat and rev; 20245 for env and nef; and 20314 (vpr) and 40265 (vpu). Additionally, participant 20368 was removed from the NFL genome analysis as no full-length sequences were available, only half sequences from either the 3’ or 5’ end with a 1,293-site overlap. This left between 9 (vpu) and 25 (NFL genome) single founder participant datasets to analyze in BEAST.
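The temporal-signal filter described above is a root-to-tip regression: distances from the root of a rooted ML tree to each tip are regressed against sampling times, and datasets without a significant positive slope are dropped. A minimal sketch, with illustrative names and a synthetic example rather than the study's actual filtering code:

```python
from scipy import stats

def temporal_signal(root_to_tip, sampling_days, alpha=0.05):
    """Root-to-tip regression check for temporal signal.

    `root_to_tip`: root-to-tip distances (subs/site) from a rooted ML
    tree; `sampling_days`: the matching sampling times. Returns the
    slope (a crude evolutionary rate estimate), the p-value, and
    whether the dataset passes the filter (significant positive slope).
    """
    res = stats.linregress(sampling_days, root_to_tip)
    keep = res.slope > 0 and res.pvalue < alpha
    return res.slope, res.pvalue, keep

# Synthetic example: distances accumulating at ~1e-5 subs/site/day
# across three sampling visits pass the filter.
days = [4, 4, 32, 32, 170, 170]
slope, p, keep = temporal_signal([d * 1e-5 for d in days], days)
```

A negative or flat slope, as seen for the excluded participant datasets, means the root placement and rate cannot be informed by the sampling times.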

Estimates of eclipse phase for single founders were between one and two weeks

We used the phylogenetic software BEAST v1.8.3 to implement 16 combinations of clock and population model for each participant, considering each gene individually as well as the NFL genome for comparison purposes [14]. We took the median time to most recent common ancestor (TMRCA) of the best-fitting model (selected using the highest marginal likelihood estimated by stepping-stone sampling) as a point estimate for the date of infection (Fig 3).

Fig 3. Estimates of the date of infection by gene and founder type.

Fig 3

Points are colored gray for participants with infections founded by a single variant (A), and pink for those founded by multiple variants (B). The scale is shown with a power modulus transformation for visibility, and is different for infections with single or multiple HIV-1 founders.

Across infections with single founders, the median point estimate for the date of infection was 6 days before diagnosis for gag and env (gag IQR: 4–9; env IQR: 2–9). The estimates of the date of infection for the NFL genome and pol fell a little earlier relative to diagnosis: 10 days (IQR: 5–20) for the whole genome, and 17 days (IQR: 5–40) for pol (S1 Table). Overall, most of the participants had estimates for the dates of infection that fell in the two weeks prior to diagnosis: 79% for env, 71% for gag, 52% for NFL genomes and 43% for pol (Fig 3A and S1 Table). Shorter genes tended to estimate a date of infection much closer to diagnosis, with the IQR for nef, rev, vpr and tat encompassing estimates of infection date after diagnosis (36%, 39%, 40% and 55% of participants respectively) (S1 Table and S3 Fig).

When comparing the three longest genes and NFL genome, estimates for the same participant were not necessarily consistent (Fig 4). Curiously, trees reconstructed from pol, and to some extent from NFL genome sequences, seemed to overestimate the eclipse phase compared to gag and env (Figs 4 and S4A). Particularly striking were participants 20442 and 20245, with estimates for the date of infection for the NFL genome more than 100 days prior to diagnosis (20442: 371 days prior to diagnosis, 20245: 133 days). These two Kenyan participants quickly controlled viremia after peak viral loads of 6.54 and 6.25 log copies/ml for 20442 and 20245 respectively, and only had 5 (participant 20442) and 6 (participant 20245) informative sites across the three time points for the NFL genome (indeed, only the NFL genome and one other gene had sufficient genetic and temporal signal to be included in the BEAST analysis). The lack of substitutions led to very low evolutionary rate estimates (20442: 6.66 × 10−7 subs/site/day; 20245: 2.64 × 10−6 subs/site/day; overall median: 1.45 × 10−5 subs/site/day), pushing the root of the phylogenies back in time.

Fig 4. Posterior distributions of the date of infection for participants with a single founder population.

Fig 4

Vertical lines mark the median. The shaded blue area corresponds to the interval between the last negative and first positive HIV-1 RNA test (or diagnosis date). The shaded gray rectangle highlights the period between 7 and 14 days before diagnosis.

We used the overlap coefficient, which ranges from 0 (no overlap) to 1 (complete overlap), to capture the overlap in posterior distributions between genes for each participant [15,16]. Coefficients varied widely overall but tended to show more overlap between posterior distributions for env and gag, and env and NFL genome (S4B Fig).
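The overlap coefficient can be approximated directly from MCMC draws by integrating the pointwise minimum of the two estimated densities; a simple histogram-based sketch (the bin count is an arbitrary choice, and finer density estimators such as KDE would also work):

```python
import numpy as np

def overlap_coefficient(samples_a, samples_b, bins=100):
    """Histogram approximation of OVL = ∫ min(f, g) dx for two sets of
    posterior draws; 0 means disjoint distributions, 1 identical ones."""
    lo = min(np.min(samples_a), np.min(samples_b))
    hi = max(np.max(samples_a), np.max(samples_b))
    edges = np.linspace(lo, hi, bins + 1)
    # density=True normalizes each histogram to integrate to 1
    f, _ = np.histogram(samples_a, bins=edges, density=True)
    g, _ = np.histogram(samples_b, bins=edges, density=True)
    return float(np.minimum(f, g).sum() * (edges[1] - edges[0]))
```

Applied to per-gene posteriors of the infection date for one participant, values near 1 indicate that two genes tell a consistent dating story.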

The best-fitting model combination varied across participants and genes

The best-fitting model combination, as chosen by the highest marginal likelihood estimated using stepping-stone sampling [17], varied across both participants and genes (Fig 5). For no single participant did all 16 models run and converge. The random local clock (RLC) was particularly poor, only fitted for env, pol, gag and the genome. Even so, it was never fitted with the exponential population model, and in all but one case there was substantial evidence (Bayes factor>3.2 [18]) towards rejecting models fit with the RLC in favor of the best-fitting model (S5 Fig). The exception was participant 20509, with a BF of 2.1 in favor of the best-fitting model (UCLD-skyline) relative to the second best-fitting (RLC-skyline) model.
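Since stepping-stone sampling reports log marginal likelihoods, the Bayes factor between two models is simply the exponentiated difference; the BF > 3.2 "substantial evidence" cutoff used above [18] can be coded as (function names are illustrative):

```python
import math

def bayes_factor(log_ml_a, log_ml_b):
    """Bayes factor for model A over model B from their log marginal
    likelihoods (e.g. stepping-stone estimates reported by BEAST)."""
    return math.exp(log_ml_a - log_ml_b)

def substantial_evidence(log_ml_best, log_ml_alt, threshold=3.2):
    """The cutoff used above: BF > 3.2 counts as substantial evidence
    for the better-fitting model over the alternative."""
    return bayes_factor(log_ml_best, log_ml_alt) > threshold

# A log-ML difference of ~0.74 corresponds to BF ≈ 2.1, as for
# participant 20509: too weak to reject the runner-up model.
```

Working on the log scale avoids numerical underflow, since the marginal likelihoods themselves are astronomically small.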

Fig 5. Comparison of model rankings for each model combination for single founders for each gene and NFL genome.

Fig 5

Each combination of the four clock (strict, uncorrelated exponential (UCED) relaxed, uncorrelated lognormal relaxed (UCLD), and random local (RLC)) and four population models (constant, exponential, skyline and birth-death) are represented on the x-axis, and the model placing on the y-axis. Each dot represents the model fitted for one participant. Models were ranked by their estimated marginal likelihood and rankings scaled by the total number of models fitted for that participant and gene.

For the NFL genome, the most frequently selected population model was the skyline, selected as the best-fitting for all but one of the participants, in combination with the strict clock for 16 participants, the uncorrelated lognormal relaxed clock (UCLD) for 5 participants, and the uncorrelated exponential relaxed clock (UCED) for 3 participants. The remaining participant was fitted with the UCLD clock and birth-death population model. All participants had at least one model combination including a skyline in the top three ranked models, while 14 out of 25 participants had the skyline population model for all of the top three models selected. In addition to participant 20509 mentioned above, there were four other participants with little evidence separating the top two model combinations, with the top two models both involving the skyline: for participants 20314 and 30190, the UCLD was slightly favored over the strict clock (BF = 2 and 1.2 respectively), for participant 10066 the UCED was slightly favored over the strict clock (BF = 1.8), and for participant 40007 the strict clock was slightly favored over the UCLD relaxed clock (BF = 1.2) (S5 Fig).

The skyline was also the best population model for pol, env and vpr, though it was selected less frequently (61%, 58% and 40% of participants respectively). Indeed, for env the best single model combination was equally likely to be the skyline with a strict clock, or the UCLD relaxed clock and birth-death model (n = 8). For the remaining genes, simpler population models were favored with either of the relaxed clocks: the UCLD relaxed clock with constant population for tat and rev and the UCED and birth-death model for vpu, rev, vif, nef and gag.

Holding the population model at its best-fitting choice, the strict clock tended to estimate a longer time between infection and diagnosis than the other clock models (S6 Fig). For NFL genomes, strict clock estimates fell a median of 2.7 (IQR: 1–12) days earlier than estimates using other clocks; strict clock estimates also tended to be more consistent with the UCLD relaxed clock (median difference: 1.2, IQR: 0.41–4.8) than the UCED relaxed clock (median difference: 5.7, IQR: 2.2–13). Similar patterns were seen across the other genes, with the strict clock giving estimates 2.5 (IQR: 1.8–5) days prior to other clock models for gag, 3.4 (IQR: 0.93–7.2) for env, and 3.4 (IQR: 0.93–7.2) for pol.

‘Standard’ models tended to fit worse and gave poorer estimates than best-fitting models

We showed that although the random local clock model was clearly inferior, there was no single best-fitting combination of clock and population model for estimating the date of infection across all genes and participants. We considered how our estimates of the date of infection would have changed for single founders, had we only fitted one of the most commonly fitted ‘standard’ models: the strict clock with constant population size (the simplest possible model), and the uncorrelated log-normal relaxed clock and skyline model (allowing for most flexibility).

When the UCLD relaxed clock with skyline was fitted, it tended to be ranked highly (Fig 5); it was the top fitted model for 20% of participants for the NFL genome. However, it often had convergence problems—in no gene or NFL genome did it fit for every participant, and even for gag and pol it was only fitted for one and two participants respectively. In contrast, the strict-constant model was fitted for all participants, but tended to rank much lower—for no participant or informative gene/NFL genome was it chosen as the best model, and only one participant, 40257, had it ranked within the top five for both gag and pol. Estimates for the eclipse phase were also much longer for the strict-constant model, with a median estimate across participants of 27 days for gag, 41 days for pol, 33 days for env, and 34 days for the NFL genome (Fig 6).

Fig 6. Estimates of the date of infection for the best-fitting model, compared to the UCLD-skyline and strict-constant models.

Fig 6

Points are colored according to the relative rank of that model out of all models fitted for that participant and gene. Not all participants had a UCLD-skyline or strict-constant model fitted.

Splitting multiple founders into founding subpopulations gave more realistic, but not robust, results

We previously showed that phylogenetic methods revealed discordant dating estimates for infections with multiple founders that reflected the age of the infection in the transmitter rather than the current host. We also showed that our limited number of sequences did not allow us to accurately resolve infections into founder variants for estimating the date of infection [8]. Here, point estimates (calculated as the median) from the best-fitting model were also wide-ranging for multiple founders when considering all sequences, from 17,127 days prior to diagnosis to 5.1 days post-diagnosis (Fig 3B). The estimates for multiple founders tended to be significantly further from diagnosis than those for single founders (Wilcoxon rank sum test: p < 0.05 for all genes except tat, rev and genome), with the median estimated date of infection being 225 days (IQR: 3–356) for the NFL genome. The estimates for gag and pol were closer to diagnosis, but still implausible given our knowledge of the recency of the infections (gag—median: 141, IQR: 17–266; pol—median: 184, IQR: 105–371), whereas estimates for env tended to be worse (median: 200, IQR: 25–689).

We previously split env sequences into founder subpopulations, which improved estimates on the whole (IQR: 14 days pre-diagnosis to 1 day post-diagnosis), but gave disparate results between founders [8]. Of the eleven participants infected with multiple founders, seven showed two variants with sufficient phylogenetic signal for estimates of the date of infection to be compared directly. In five participants, the two subpopulations gave estimates of the date of infection that were more than 24 days apart [8]. To see if the increased information in longer sequences could help better resolve estimates, we repeated the analysis using NFL genome sequences, which are over three times longer than env.

We used the Gap Procedure [19] on sequences from the initial two time points to identify founders. Five of eleven infections with multiple founders were split into two founder variants and four participants showed three founder lineages; sequences from participants 30812 and 40363 were split into six and five subpopulations respectively. In some participants, the proportions of sequences attributed to each founder were fairly even (e.g. 10203, 10220, 40123), whereas in other participants one variant dominated with >75% majority (e.g. 10463, 20502, 40100, 40436) (Fig 7). After removing founder datasets lacking sufficient information (less than 5 sequences, no temporal signal, or no informative sites), we were left with one founder variant to analyze in BEAST for two participants (20502 and 40100), and two founder variants for the rest.
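The Gap Procedure itself is specified in [19]; its core intuition, placing a distance cutoff at the largest gap in the sorted pairwise distances and then grouping sequences that fall below it, can be sketched as follows. This is a toy illustration using Hamming distances and single-linkage merging, not the published algorithm:

```python
def hamming(a, b):
    """Number of mismatched positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))

def gap_threshold(seqs):
    """Cutoff at the largest gap in the sorted pairwise distances."""
    d = sorted(hamming(a, b) for i, a in enumerate(seqs) for b in seqs[i + 1:])
    k = max(range(len(d) - 1), key=lambda i: d[i + 1] - d[i])
    return (d[k] + d[k + 1]) / 2

def founder_clusters(seqs):
    """Single-linkage grouping below the cutoff: sequences closer than
    the threshold are assumed to descend from the same founder."""
    cut = gap_threshold(seqs)
    labels = list(range(len(seqs)))
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if hamming(seqs[i], seqs[j]) < cut:
                old, new = labels[j], labels[i]
                labels = [new if lab == old else lab for lab in labels]
    return labels

# Two tight clusters separated by ~10 mutations are split apart.
seqs = ["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC", "CCCCCCCCCA"]
```

On acute-infection data the within-founder distances are small (mostly singletons), so the deep between-founder distances create the gap this heuristic exploits.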

Fig 7. Improved BEAST estimates on the subpopulations from infections with multiple founder variants for the NFL genome.

Fig 7

The posterior distributions for the best-fitting model for each identified founder population are shown, with vertical lines marking the median. The shaded blue area corresponds to the interval between the last negative and first positive HIV-1 RNA test (or diagnosis date). Black crosses show the median estimate from assuming a single population (crosses not shown for estimates beyond 365 days prior to diagnosis, which are figured in Fig 3B). The number of visits, sequences and polymorphic sites corresponding to each subpopulation are reported, along with the overlap coefficient for posterior distributions when two subpopulations were analyzed. Only subpopulations with sequences covering at least two time points, a minimum of five sequences with more than one informative site, and significant temporal signal were analyzed.

Overall, results were much improved by splitting the founder populations (Fig 7). Comparing the best-fitting model estimate from each subpopulation versus all sequences combined, there was a median improvement of 228 days (IQR: 86–404), though the range was still wider than the previous env analysis (IQR: 20 days pre-diagnosis to 0.051 days post-diagnosis). In four participants, at least one of the estimates for the date of infection was post-diagnosis (10463, 30124, 40363 and 40436), and four participants (10203, 20337, 30124 and 40123) had estimates more than a month before their last negative test. Within a single participant, estimates tended to be disparate, and were more than 30 days apart in five out of nine cases where two subpopulations were analyzed, with very little overlap between the posterior distributions (Fig 7). However, two participants, 10220 and 30812, had concordant estimates between subpopulations (overlap coefficient > 0.7), which fell only a few days prior to diagnosis.

Discussion

We have used phylogenetic modelling to estimate the date of infection in 39 participants infected with HIV-1, comparing across genes and NFL genome. We showed that the estimated date of infection for single founder infections was around a week prior to diagnosis when using the best-fitting model for env and gag, whilst the NFL genome placed it at around 10 days. Our results emphasized that, in early infection, only the longest genes (gag, pol, env and NFL genomes) tended to have sufficient nucleotide substitutions to be amenable to phylogenetic dating, though pol tended to give poorer estimates relative to the known windows between the last negative and first positive HIV-1 RNA tests. It remains unclear why the results from pol tended to overestimate the date of infection. Previous work has shown that env has an elevated evolutionary rate compared to gag and pol, and that selection acts differently upon the 3’ and 5’ halves of the genome [20]. The NFL genome results likely reflect the average of these effects across genes; however, given that the estimates from gag were similar to those from env, these factors alone do not explain the discrepancy with pol.

We also highlight that the simplest models (i.e. a strict clock with constant population size), which are often included as null models in studies, were not typically identified as best-fitting in our cohort (ranking outside the top 5 for all but one participant in gag and pol), and tended to overestimate the eclipse phase compared to other models. Given the dynamics of viral load in acute infection, which increases rapidly before a steep decline and a final plateau at the set point, it is not unexpected that more complex population dynamics were favored over the single-parameter constant population size model. Where running the full suite of available tree and clock models for model selection is avoided because it is computationally intensive, our results indicate that it is preferable to default to the most flexible combination, i.e. the uncorrelated log-normal relaxed clock and skyline population, which makes the fewest assumptions.

We found that shared diversity in acute infection is limited: in single-founder infections, the majority of polymorphic sites were singletons, and up to two thirds of participants had insufficient genetic or temporal signal for the shortest genes. At the viral load nadir, about one month after diagnosis (the second time point of our study), sequences remained relatively homogeneous for single founders, with a maximum of 10 parsimony-informative sites across the genome, most of which were found in env. Mutations accumulated rapidly after this point, with evidence of selection; most phylogenies showed one long branch connecting sequences sampled around six months to those from the previous two time points. This is consistent with previous findings, which identified little evidence of selected mutations in the first month of infection [21–23]. Most genes had fewer than 30 informative sites across the three sampled time points, and more than 60% of participants had no informative sites in the shortest genes (vpu and vpr). This is problematic for parameter-rich phylodynamic analyses: with so little signal, estimates were heavily influenced by the prior distribution rather than being biologically plausible (S7 Fig). For example, we would expect estimates of the evolutionary rate on the order of 10^-5 substitutions per site per day (derived from [24]). However, over one third of the point estimates for vpr, tat and rev were more than 100 times higher than this, reflecting the prior, which had a median of 0.067. Substantial prior-posterior overlap, with the prior distribution driving the posterior, suggests that the data have little influence on the results, and this non-identifiability can give misleading results [25]. One way to overcome this could be a more informative prior, although this could also exacerbate the problem by obscuring the fact that there was insufficient information in the data. We recommend examining prior-posterior overlap to check for identifiability problems, especially for short genes with low genetic diversity [26].
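
As a rough illustration of such a check, the overlap between prior and posterior samples can be approximated by histogramming both on a shared grid and summing the smaller probability mass per bin. This is a minimal sketch of the overlapping index (in the spirit of the cited `overlapping` R package), not the authors' code, and uses simulated samples in place of real MCMC output:

```python
import random

def overlap_coefficient(prior, posterior, bins=50):
    """Approximate the overlapping index between two samples:
    histogram both on a shared grid and sum the minimum probability
    mass per bin. 0 = disjoint distributions, 1 = identical."""
    lo = min(min(prior), min(posterior))
    hi = max(max(prior), max(posterior))
    width = (hi - lo) / bins or 1.0
    def mass(sample):
        counts = [0] * bins
        for x in sample:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        return [c / len(sample) for c in counts]
    p, q = mass(prior), mass(posterior)
    return sum(min(a, b) for a, b in zip(p, q))

random.seed(1)
prior = [random.gauss(0.0, 1.0) for _ in range(20000)]
posterior_uninformed = [random.gauss(0.0, 1.0) for _ in range(20000)]
posterior_informed = [random.gauss(10.0, 1.0) for _ in range(20000)]
print(overlap_coefficient(prior, posterior_uninformed))  # close to 1: prior drives the posterior
print(overlap_coefficient(prior, posterior_informed))    # close to 0: data moved the posterior
```

A coefficient near 1, as for short genes with no informative sites, flags exactly the non-identifiability discussed above.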

We found two scenarios where estimated dates were implausible: estimates that fell after the first positive test, and estimates that fell many weeks or months before the last negative test. Both are clearly problematic for accurately dating when infection took place, but neither is a methodological issue per se. The date of infection was estimated by the time to the most recent common ancestor in the tree. For infections dated after diagnosis, this suggests either that the common ancestor of the sampled sequences existed a considerable time after the point of infection or, possibly, that the evolutionary rate was overestimated. In both cases, increasing the sample size by adding more sequences at each time point, or more time points, would help to better capture the extant lineages and improve estimates. Over-estimates of the time to the most recent common ancestor were most commonly seen for participants infected with multiple founders. The phylogenies for these participants had long deep branches, reflecting evolution both in the current host and along ancestral branches in the previous host prior to transmission. The longer this previous host has been infected, the more mutations their viral population will have accumulated and the more diverse the variants they will likely pass on. To try to overcome this, we attempted to distinguish founder variants and re-run the analyses for each variant separately. This improved estimates substantially (Fig 7), though they were still often unrealistic compared to the known dates of the last negative and first positive HIV-1 RNA tests.
For four participants infected with multiple founders, at least one subpopulation estimate gave dates after diagnosis, suggesting that the common ancestor of those sequences occurred much more recently than the founding infection; given the small sample size (as discussed above), it is likely that this set of sequences represented a subpopulation that diverged recently (from a large viral population corresponding to a chronic infection). Four other participants had estimates more than a month before their last negative test. In the cases where phylogenetic analyses could be conducted on two founder subpopulations, all participants had at least one feasible estimate for the date of infection. These results suggest that with more sequence data available, it should be possible to identify with more confidence the estimated date of infection based on the main founder population.

Besides the small number of sequences per participant, recombination across sequences is also likely to negatively impact estimates. The combination of peak viral load and depletion of target CD4 cells in acute infection means that recombination is likely frequent. While it was not an issue for the homogeneous infections with single founders, we found evidence of recombination across the different populations in infections with multiple founders [27,28]. Recombination can distort branch lengths and cause false signals of exponential growth [29], which could explain the discrepancies between subpopulations (given the small number of sequences, we did not remove recombinants when defining populations, since sufficiently distant recombinant lineages might be expected to cluster into their own subpopulation). Fitting a relaxed clock model may help control these effects, effectively treating a recombination event on a branch as equivalent to that branch having a higher substitution rate than branches without recombination events. To reduce the possible effects of recombination, we recommend increasing the sequencing depth at each time point, to obtain a dataset large enough to capture and analyze each founder subpopulation separately after removing all putative recombinants.

However, even in the absence of recombination, accurately identifying distinct founder subpopulations is not necessarily straightforward. Here we used the Gap Procedure on the sequences from the first two time points to define founder subpopulations [19]. Various other methods for clustering sequences exist, although they are designed more for identifying epidemiologically linked cases in large, population-level databases (see, for example, [30–33]). The current standard for establishing whether an infection was founded by one lineage or several couples visual inspection of phylogenetic trees and Highlighter plots (www.hiv.lanl.gov) [10] with measures of sequence diversity. Recently we showed that the principal eigenvalue of the modified graph Laplacian, computed from the distance matrix of the phylogeny, can distinguish between trees from single- and multi-founder infections [11]. Using trees reconstructed in IQ-TREE under the same codon partitions and substitution models as the BEAST analyses, the principal eigenvalue classified participants as having single or multiple founder infections with increased accuracy for longer gene segments, showing that the length of the fragment analyzed is important for founder estimation (Fig 2). It could therefore be used to identify which participant data are more likely to come from a single founder before embarking on the more computationally intensive BEAST analysis.

Here we took the model with the maximum marginal likelihood estimated by stepping-stone sampling as the best-fitting [17]. However, there was often very little difference between the top few models, and there appeared, in general, to be a trade-off in overall complexity between the clock and population models. Because the marginal likelihoods are very similar, alternative approaches to choosing clock and tree models would be to take the model combination with the fewest parameters [34], or to conduct model averaging [35–37]. Model averaging allows several models to be included in the same analysis, with results weighted by the time the MCMC spends sampling from each model, perhaps representing a good compromise between the number of analyses performed and allowing for model selection. Of course, model selection also depends on the models fitted; it is possible that none of the models fitted was ideal. This could have been mitigated by including other flexible models, such as the uncorrelated gamma relaxed clock, or by using the skyride or skygrid population models, which require less user specification than the skyline model included here [38–40]. Both the birth-death and coalescent models employed here have vulnerabilities. The birth-death model is a forwards-in-time model designed for the speciation and extinction of lineages at the population level, and has traditionally been applied to between-host epidemiological models of virus dynamics rather than within-host. While the birth-death model could potentially offer a useful framework for the transition from a localized infection to a systemic one, we did not evaluate this here. Still, it performed surprisingly well, being selected in the best-fitting model combination for the majority of participants infected with a single founder for gag (62%), and typically giving a date of infection within one week prior to diagnosis (median: 5.2 days; range: 0.16-9.5 days prior to diagnosis). It was selected less frequently for env and pol, but with similar overall accuracy. This suggests it can be adapted to the within-host setting, but caution is recommended, especially in the choice of priors. It was recently reported that a multitude of alternative birth-death models are equally likely to explain a given time-calibrated phylogeny [41], and results are prone to bias if the sampling process is misspecified [42]. One suggested way to avoid potential problems is to always consider the full posterior distribution rather than emphasizing point estimates [43]. However, the full posterior distribution is high-dimensional, which may not be easy to incorporate into clinical trial inference. The deterministic coalescent models are more computationally efficient than birth-death models, looking retrospectively backwards in time from the tips, but can struggle with demographic stochasticity [44], which we might expect early in the establishment of infection.

In conclusion, we showed that molecular dating methods can estimate the date of infection for single-founder infections based on sequences sampled in acute and early infection. For the most accurate estimates, we suggest first identifying whether the infection was established by single or multiple founder variants, and then using the longest sequences available, preferably the full genome, env or gag, to help ensure sufficient temporal and genetic signal. We also recommend using relaxed clock and population models, alongside model selection where possible, to account for population size changes in acute/early infection and the accompanying substitution rate changes. More broadly, our results emphasize that further studies should be run in parallel across different HIV-1 genes to better understand how the choice of a specific gene may dictate certain conclusions.

Methods

Ethics statement

Samples were obtained from the RV217 cohort [8,9]. The protocol was approved by the Walter Reed Army Institute of Research and local ethics review boards: the Makerere University Walter Reed Project, Kampala, Uganda; the Walter Reed Project, Kericho, Kenya; the Mbeya Medical Research Centre, Mbeya, Tanzania; and the Armed Forces Research Institute of Medical Sciences, Bangkok, Thailand. Only adult participants were enrolled. Written informed consent was obtained from all participants. We received Institutional Review Board approval to use the samples and all samples were anonymized.

Participant selection

Participants from the RV217 prospective cohort were selected for sequencing according to the following criteria: 1) the proximity of the last negative visit to the date of diagnosis; 2) the availability of multiple time points on the viral load upslope; 3) the length of longitudinal follow-up after HIV-1 infection; and 4) sample availability [8]. For the purposes of this study, we also required that sequences were available at three sampling time points (around one week, one month and six months post-diagnosis) for the majority of the genome. For some participants, only half genomes were available for one of the sampling dates; in these cases, the genes with fewer than three time points were dropped from the overall analyses.

Sequence characterization

HIV-1 near full-length genomes were sequenced from plasma samples following the endpoint dilution strategy outlined in [8]. Hypermutated sequences, identified using the online Hypermut 2.0 tool (available at https://www.hiv.lanl.gov/content/sequence/HYPERMUT/hypermut.html [45]), were excluded from further analysis. Sequences were annotated with sampling dates and aligned with MAFFT; genes were extracted with Gene Cutter from the Los Alamos HIV sequence database (https://www.hiv.lanl.gov/content/sequence/GENE_CUTTER/cutter.html), and alignments were manually inspected in Mesquite [46,47]. The number of parsimony-informative sites was calculated using the pis function in the phyloch package in R [48,49]. This, together with the intra-host genetic diversity, visual inspection of Highlighter plots (available at www.hiv.lanl.gov [10]), and results from the modified graph Laplacian, was used to define infections with single or multiple founders, as previously described in [8]. In a previous study [22], participants 40061 and 40265 were found to have multiple minority variants at levels ≤3.7% using targeted deep sequencing. However, after removing hypermutated sequences, minority variants were not identified within the single genome amplification (SGA) sequences used in this study, and we therefore classified these participants as having single founders.
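
The parsimony-informative site count used here has a simple definition: a column is informative if at least two distinct bases each occur in at least two sequences. A minimal illustrative re-implementation (not the phyloch `pis` function itself; the toy alignment is hypothetical, and gaps/ambiguity codes are simply ignored):

```python
from collections import Counter

def informative_sites(alignment):
    """Count parsimony-informative sites in equal-length aligned sequences.

    A site is informative if at least two distinct bases each occur in
    at least two sequences; gaps and ambiguity codes are ignored here."""
    n = 0
    for column in zip(*alignment):
        counts = Counter(base for base in column if base in "ACGT")
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            n += 1
    return n

# Hypothetical 4-sequence, 5-site alignment.
aln = ["ACGTA",
       "ACGTC",
       "ATGTA",
       "ATGTC"]
print(informative_sites(aln))  # columns 2 (C/T) and 5 (A/C) qualify -> 2
```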

The best-fit partitioning schemes and models of evolution were assigned according to the Bayesian Information Criterion (BIC) implemented in PartitionFinder v2.1.1 [50]. For individual genes, codon positions were used as partition subsets; for the full genome, subsets included codon positions, gene, gene overlap and intron regions. The BIC favored partitions and substitution models were then implemented in BEAST.

Modified graph laplacian

The modified graph Laplacian was obtained for each participant using the RPANDA package in R [51,52]. For each graph we computed the principal (or maximum) eigenvalue, which has been shown to correlate positively with the multiplicity of the founder population of HIV-1 infections [11], and calculated thresholds to separate single from multiple founders using the median criterion (the median principal eigenvalue plus 0.5 x the squared standard deviation), the jump criterion (the position of the largest gap between consecutive ranked principal eigenvalues), and the partition criterion (clustering on the principal eigenvalue by partitioning around medoids [53]).
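
The median and jump criteria are simple enough to sketch directly; the following is an illustrative re-implementation (the eigenvalues are hypothetical per-participant values, and the partition-around-medoids criterion is omitted):

```python
import statistics

def median_criterion(eigenvalues):
    """Threshold = median principal eigenvalue + 0.5 x squared SD."""
    return statistics.median(eigenvalues) + 0.5 * statistics.stdev(eigenvalues) ** 2

def jump_criterion(eigenvalues):
    """Place the threshold at the largest gap between consecutive
    ranked eigenvalues; values above it suggest multiple founders."""
    ranked = sorted(eigenvalues)
    gaps = [b - a for a, b in zip(ranked, ranked[1:])]
    i = gaps.index(max(gaps))
    return (ranked[i] + ranked[i + 1]) / 2

# Hypothetical principal eigenvalues, one per participant.
eigs = [1.1, 1.2, 1.3, 1.4, 4.8, 5.2]
threshold = jump_criterion(eigs)
called_multiple = [e for e in eigs if e > threshold]
print(threshold, called_multiple)
```

Here the largest gap sits between 1.4 and 4.8, so the two highest eigenvalues are called multiple-founder infections.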

Maximum likelihood trees

Maximum likelihood (ML) trees for each participant were obtained with IQ-TREE [54] using the best partitioning scheme found by PartitionFinder. Polytomies were randomly resolved and zero-length branches were set to 1 x 10^-7. Temporal signal was assessed following TempEst [12]. We first used the rtt function in the ape package in R to root the phylogeny according to the maximum correlation between tip sampling time and distance to the root. We then fitted a linear regression of the root-to-tip distances against the tip sampling times to estimate the mutation rate [55]. Only datasets with a significant positive regression slope (that is, the null hypothesis of β1 = 0 was rejected at the 5% level in favor of the alternative, β1 > 0, as tested with a one-tailed t-test) were analyzed in BEAST.
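
The core of this root-to-tip diagnostic is an ordinary least-squares fit. A minimal sketch, with hypothetical sampling days and distances standing in for a real tree (the significance test is omitted; only the slope and x-intercept are shown):

```python
def root_to_tip_slope(times, distances):
    """Ordinary least-squares fit of root-to-tip distance against
    sampling time. The slope estimates the substitution rate; the
    x-intercept (-intercept/slope) dates the root."""
    n = len(times)
    mt = sum(times) / n
    md = sum(distances) / n
    sxx = sum((t - mt) ** 2 for t in times)
    sxy = sum((t - mt) * (d - md) for t, d in zip(times, distances))
    slope = sxy / sxx
    intercept = md - slope * mt
    return slope, intercept

# Hypothetical tips: sampling day vs. root-to-tip distance (subs/site).
days = [7, 7, 30, 30, 180, 180]
dist = [0.0002, 0.0003, 0.0008, 0.0007, 0.0040, 0.0042]
rate, intercept = root_to_tip_slope(days, dist)
print(rate)               # ~2e-5 substitutions/site/day
print(-intercept / rate)  # x-intercept: inferred day of the root (negative = pre-day 0)
```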

BEAST analyses

BEAST v1.8.3 [14] was used for the joint estimation of the molecular dates, rates of evolution and phylogeny as in [8]. We tested four tree priors–the constant population, exponential growth, and skyline coalescent models and the constant rate birth-death model [56,57]–with four clock models: strict clock, uncorrelated log-normal relaxed clock (UCLD), uncorrelated exponential relaxed clock (UCED) and the random local clock (RLC) [40,58].

The main concern was that these datasets were at the lower limit of information content, given that a majority of sequences were sampled in the first month after diagnosis, especially for the short genes in the HIV-1 genome. We therefore generally opted for non-informative priors, with the rationale that sufficient data will overwhelm a non-ideal choice of prior and still produce reasonable posterior distributions. In this way, we could determine instances where our sequence data were not informative for parameter estimation. The prior for the substitution rate was a normal distribution with a mean of 2.24 x 10^-5 substitutions per site per day and a standard deviation of 0.1, truncated to the bounds 0 and 1. The mean of this distribution was derived from a within-host estimate in the C2V5 region of the envelope by Lemey et al. [24]. For the constant coalescent model, a log-normal prior with log(mean) of 0 and log(standard deviation) of 1 was used for the effective population size; for the coalescent exponential and skyline models, a uniform prior between 1 and 1.0 x 10^100 was used. All other priors were left at the default values given in BEAUti v1.8.3. For the birth-death model, we used a uniform prior between 0 and 100,000 for the birth rate, a uniform prior between 0 and 1 for the relative death rate, a uniform prior between 0 and 100 for the rate of sampling through time, and a uniform prior between 1.0 and 1.0 x 10^100 for the time of lineage origin.
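
The prior median of 0.067 quoted in the Discussion follows from this specification: since the mean (2.24 x 10^-5) is negligible relative to the standard deviation (0.1), the truncated prior is nearly a half-normal, whose median is about 0.6745 x 0.1. A quick simulation sketch (rejection sampling; not the BEAST machinery itself):

```python
import random

random.seed(42)
MEAN, SD = 2.24e-5, 0.1  # substitution-rate prior (per site per day)

# Sample the normal prior truncated to [0, 1] by rejection.
draws = []
while len(draws) < 100_000:
    x = random.gauss(MEAN, SD)
    if 0.0 <= x <= 1.0:
        draws.append(x)

draws.sort()
median = draws[len(draws) // 2]
print(median)  # ~0.067: over 1000x larger than the intended mean rate
```

This makes concrete why, when the data carry little signal, posterior rate estimates drift toward values around 0.067 rather than the biologically plausible 10^-5 scale.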

BEAST was run for 250 million generations, with a thinning interval of 10,000 iterations. MCMC traces were verified using R and the coda package [49,59]. The first 10% of each run was discarded as burn-in, as standard, and the effective sample size was checked to be above 200 for all parameters. The date of infection was estimated as the root height of the phylogeny; unless otherwise stated, we used the posterior median for point estimates, alongside the 95% highest posterior density interval [60]. Plots were generated with ggplot2 in R [61].
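
The effective sample size (ESS) diagnostic discounts correlated MCMC draws as N / (1 + 2 Σ ρ_k), summing autocorrelations ρ_k until they first drop to zero. An illustrative re-implementation (a simplification of what coda/Tracer compute, using simulated chains, not real BEAST traces):

```python
import random

def ess(chain, max_lag=None):
    """Effective sample size: N / (1 + 2 * sum of autocorrelations),
    truncated at the first lag with non-positive autocorrelation."""
    n = len(chain)
    mean = sum(chain) / n
    dev = [x - mean for x in chain]
    var = sum(d * d for d in dev) / n
    acsum = 0.0
    for lag in range(1, max_lag or n // 2):
        rho = sum(dev[i] * dev[i + lag] for i in range(n - lag)) / ((n - lag) * var)
        if rho <= 0:
            break
        acsum += rho
    return n / (1 + 2 * acsum)

random.seed(7)
iid = [random.gauss(0, 1) for _ in range(2000)]
# An AR(1) chain with strong autocorrelation mimics poor MCMC mixing.
ar = [0.0]
for _ in range(1999):
    ar.append(0.95 * ar[-1] + random.gauss(0, 1))
e_iid, e_ar = ess(iid), ess(ar)
print(e_iid > e_ar)  # True: correlated chains carry fewer effective samples
```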

BEAST model selection

Model selection was performed using stepping-stone sampling to assess the relative goodness-of-fit between models based on the marginal likelihood [17]. Stepping-stone sampling estimates the marginal likelihood by using a set of power posteriors to bridge from the prior to the posterior. Following the recommendation of Xie and colleagues [17], subsequently adopted by Baele and colleagues [37,62], we selected the power values, β, on the path from the prior to the posterior using 100 evenly spaced quantiles of the Beta(α, 1.0) distribution with α = 0.3. This was implemented in BEAST v1.8.2, with a chain length of 1,000,000 [37,62]. The best-fitting model was chosen as the one with the highest estimated marginal likelihood. Bayes factors (BFs), which calculate the likelihood ratio between hypotheses, were derived to quantify the relative fit between models. Kass and Raftery [18] suggest that a BF > 3.2 gives substantial evidence in favor of the alternative hypothesis.
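
Two of these quantities are easy to make concrete. Since the Beta(α, 1.0) CDF is x^α, its quantile function is p^(1/α), giving the β path; and a Bayes factor is the exponentiated difference of log marginal likelihoods. A sketch with hypothetical log marginal likelihoods (the exact spacing convention for the 100 quantiles may differ slightly from BEAST's implementation):

```python
import math

ALPHA, K = 0.3, 100

# Power-posterior path: evenly spaced quantiles of Beta(alpha, 1.0),
# whose quantile function is p**(1/alpha). beta=0 is the prior,
# beta=1 the posterior; small alpha concentrates steps near the prior.
betas = [(k / (K - 1)) ** (1.0 / ALPHA) for k in range(K)]
print(betas[0], betas[-1])  # 0.0 (prior) ... 1.0 (posterior)

def bayes_factor(log_ml_a, log_ml_b):
    """BF of model A over model B from log marginal likelihoods."""
    return math.exp(log_ml_a - log_ml_b)

# Hypothetical log marginal likelihoods for two model combinations.
bf = bayes_factor(-2041.3, -2043.1)
print(bf > 3.2)  # True: 'substantial' support under Kass & Raftery
```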

Identifying subpopulations in multiple founders

For infections with multiple founders, sequences corresponding to different founder variants were also analyzed separately for the genome. For each participant, the Gap Procedure [19] was used to cluster the sequences from the first two time points and identify founders. A consensus was derived for each of these clusters. Sequences from the six-month time point were then added to clusters based on their p-distance from each consensus. Resulting founder datasets that had fewer than five sequences, sequences from only a single visit date, and/or no parsimony-informative sites were excluded from further analysis. The remaining datasets were processed following the BEAST procedure described above.
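
The consensus-and-assignment step can be sketched as follows; this is an illustrative re-implementation with toy sequences (the clusters themselves would come from the Gap Procedure, which is not reproduced here):

```python
from collections import Counter

def consensus(seqs):
    """Majority-rule consensus of equal-length aligned sequences
    (ties broken arbitrarily)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))

def p_distance(a, b):
    """Proportion of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def assign(seq, consensuses):
    """Assign a sequence to the cluster with the nearest consensus."""
    return min(consensuses, key=lambda name: p_distance(seq, consensuses[name]))

# Hypothetical founder clusters from the first two time points.
clusters = {
    "founder1": ["ACGTACGT", "ACGTACGA"],
    "founder2": ["TTGTACCC", "TTGTACCC"],
}
cons = {name: consensus(seqs) for name, seqs in clusters.items()}
six_month_seq = "ACGTACTT"
print(assign(six_month_seq, cons))  # founder1
```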

Supporting information

S1 Table. Summary of the estimated dates of HIV-1 infection for participants infected by single or multiple HIV-1 founders by gene.

n denotes the number of participant datasets. Negative estimates denote days prior to diagnosis.

(DOCX)

S1 Fig. The accumulation of diversity over time for infections with single founders.

At each sampling time point, we calculated the number of polymorphic sites (sites with at least two alleles) and informative sites (sites with multiple alleles found in at least two sequences) for sequences collected up to and including that time.

(TIF)

S2 Fig. Breakdown of datasets for the BEAST analysis, by founder type.

Two participants did not have samples for all three time points for gag and pol and were thus removed for these genes; participant 20368 was removed from the NFL genome analysis due to half genomes only being available. The category ‘informative sites’ refers to participants whose sequences had at least one informative site; these were then tested for significant temporal signal. Only those with at least one informative site and temporal signal were analyzed in BEAST.

(TIF)

S3 Fig. Summary of the estimated dates of HIV-1 infection for participants with single or multiple HIV-1 founders by gene.

(TIF)

S4 Fig. Comparison of estimates for participants infected with a single founder.

A) Point estimates of the date of infection for participants. The color of the points shows the number of informative sites in the within-host dataset for that participant and gene. Lines link gene and genome results for each participant. B) Boxplot of the overlap coefficient for posterior distributions between pairs of genes for each participant. The coefficient is defined between 0 and 1; 0 implies the curves are non-overlapping, and 1 complete overlap.

(TIF)

S5 Fig. Comparison of Bayes factors for each model combination across genes and participants with significant temporal signal.

The best-fitting model is shown in red. Shades of orange and yellow give the strength of evidence for the best-fitting model relative to the other model combinations fitted, that is, the darker the color, the smaller the improvement by the best-fitting model. Model combinations shown in gray could not be fitted.

(TIF)

S6 Fig. Comparison of clock priors for the best-fitting population model.

For each participant, the best-fitting estimate from BEAST is marked by a triangle, with circles showing the results from the other clock models under the same population model. The shaded blue area corresponds to the interval between the last negative and first positive HIV-1 RNA tests; where the blue bar is missing, the sequences were unavailable for that participant.

(TIF)

S7 Fig. Smaller genes have less signal and are more likely to be influenced by the prior.

The estimated substitution rates for each participant and gene (left) are plotted alongside the prior distribution (right). Horizontal dashed lines show the median and IQR of the prior.

(TIF)

Acknowledgments

We are indebted to the 3,173 individuals who participated in the RV217 study. We thank the RV217 study team: Jintanat Ananworanich, Sheila Peel, Linda Jagodozinski, Jennifer Malia, Mark Manak, Mark Milazzo, Qun Li, Steve Schech, Julie Dorsey Spitz, Peter Dawson, Prossy Sekiziyivu, Francis Kiweewa, Monica Millard, Doug N. Shaffer, Josphat Kosgei, Joseph Oundo, Nyanda Ntinginya, Cornelia Lueer, Abisai Kisinda, Inge Kroidl, Michael Hoelscher, Arne Kroidl, Rapee Trichavaroj, Siriwat Akapirat, Alex Schuetz, Eugene Kroon, Somchai Sriplienchan, Robert J. O’Connell, Mark DeSouza, Mary Marovich, Edith Swann. We wish to acknowledge former HJF employees who contributed to the HIV-1 sequencing: Daniel Silas, Sandra Mendoza-Guerrero, Adam Bates, Karishma Suchday, Amy Nguyen, Joann Harris, Anjali Bhatnagar, Tara Pinto, Stephanie Melton, Sevan Muhammad, Clinton Ogega, Michelle Lazzaro, Felix Tang, Celina Oropeza, Grace Ibitamuno, Joey Patterson.

Data Availability

Sequences are available on GenBank under accession numbers KY580473—KY580727 and MN791130—MN792579.

Funding Statement

This work was supported by a cooperative agreement between The Henry M. Jackson Foundation for the Advancement of Military Medicine, Inc., and the U.S. Department of the Army [W81XWH-07-2-0067 (NLM), W81XWH-11-2-0174 (NLM), W81XWH-18-2-0040 (RG)]. The views expressed are those of the authors and should not be construed to represent the positions of the U.S. Army, the Department of Defense, or the Department of Health and Human Services. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Fiebig EW, Wright DJ, Rawal BD, Garrett PE, Schumacher RT, Peddada L, et al. Dynamics of HIV viremia and antibody seroconversion in plasma donors: implications for diagnosis and staging of primary HIV infection. AIDS. 2003;17:1871–1879. doi:10.1097/00002030-200309050-00005
2. Kahn JO, Walker BD. Acute human immunodeficiency virus type 1 infection. The New England Journal of Medicine. 1998;339:33–39. doi:10.1056/NEJM199807023390107
3. McMichael AJ, Borrow P, Tomaras GD, Goonetilleke N, Haynes BF. The immune response during acute HIV-1 infection: clues for vaccine development. Nature Reviews Immunology. 2010;10:11–23. doi:10.1038/nri2674
4. Cohen MS, Gay CL, Busch MP, Hecht FM. The Detection of Acute HIV Infection. Journal of Infectious Diseases. 2010;202:S270–S277. doi:10.1086/655651
5. Delaney KP, Hanson DL, Masciotra S, Ethridge SF, Wesolowski L, Owen SM. Time Until Emergence of HIV Test Reactivity Following Infection With HIV-1: Implications for Interpreting Test Results and Retesting After Exposure. Clinical Infectious Diseases. 2017;64:53–59. doi:10.1093/cid/ciw666
6. Kouyos RD, Wyl V von, Yerly S, Böni J, Rieder P, Joos B, et al. Ambiguous nucleotide calls from population-based sequencing of HIV-1 are a marker for viral diversity and the age of infection. Clinical Infectious Diseases. 2011;52:532–539. doi:10.1093/cid/ciq164
7. Puller V, Neher R, Albert J. Estimating time of HIV-1 infection from next-generation sequence diversity. PLoS Computational Biology. 2017;13:e1005775. doi:10.1371/journal.pcbi.1005775
8. Rolland M, Tovanabutra S, Dearlove B, Li Y, Owen C, Sanders-Buell E, et al. Molecular dating and viral load growth rates suggested that the eclipse phase lasted about a week in HIV-1 infected adults in East Africa and Thailand. PLoS Pathogens. 2020;16:e1008179. doi:10.1371/journal.ppat.1008179
9. Robb ML, Eller LA, Kibuuka H, Rono K, Maganga L, Nitayaphan S, et al. Prospective Study of Acute HIV-1 Infection in Adults in East Africa and Thailand. The New England Journal of Medicine. 2016;374:2120–2130. doi:10.1056/NEJMoa1508952
10. Keele BF, Giorgi EE, Salazar-Gonzalez JF, Decker JM, Pham KT, Salazar MG, et al. Identification and characterization of transmitted and early founder virus envelopes in primary HIV-1 infection. Proceedings of the National Academy of Sciences of the United States of America. 2008;105:7552–7557. doi:10.1073/pnas.0802203105
11. Lewitus E, Rolland M. A non-parametric analytic framework for within-host viral phylogenies and a test for HIV-1 founder multiplicity. Virus Evolution. 2019;5:vez044. doi:10.1093/ve/vez044
12. Rambaut A, Lam TT, Carvalho LM, Pybus OG. Exploring the temporal structure of heterochronous sequences using TempEst (formerly Path-O-Gen). Virus Evolution. 2016;2:vew007. doi:10.1093/ve/vew007
13. Drummond AJ, Pybus OG, Rambaut A. Inference of viral evolutionary rates from molecular sequences. Advances in Parasitology. 2003;54:331–358. doi:10.1016/s0065-308x(03)54008-8
14. Drummond AJ, Suchard MA, Xie D, Rambaut A. Bayesian phylogenetics with BEAUti and the BEAST 1.7. Molecular Biology and Evolution. 2012;29:1969–1973. doi:10.1093/molbev/mss075
15. Pastore M, Calcagnì A. Measuring distribution similarities between samples: A distribution-free overlapping index. Frontiers in Psychology. 2019;10. doi:10.3389/fpsyg.2019.01089
16. Pastore M. Overlapping: a R package for Estimating Overlapping in Empirical Distributions. Journal of Open Source Software. 2018;3:1023. doi:10.21105/joss.01023
17. Xie W, Lewis PO, Fan Y, Kuo L, Chen M-H. Improving marginal likelihood estimation for Bayesian phylogenetic model selection. Systematic Biology. 2011;60:150–160. doi:10.1093/sysbio/syq085
18. Kass RE, Raftery AE. Bayes Factors. Journal of the American Statistical Association. 1995;90:773–795.
19. Vrbik I, Stephens DA, Roger M, Brenner BG. The Gap Procedure: for the identification of phylogenetic clusters in HIV-1 sequence data. BMC Bioinformatics. 2015;16. doi:10.1186/s12859-015-0791-x
20. Alizon S, Fraser C. Within-host and between-host evolutionary rates across the HIV-1 genome. Retrovirology. 2013;10. doi:10.1186/1742-4690-10-49
21. Herbeck JT, Rolland M, Liu Y, McLaughlin S, McNevin J, Zhao H, et al. Demographic processes affect HIV-1 evolution in primary infection before the onset of selective processes. Journal of Virology. 2011;85:7523–7534. doi:10.1128/JVI.02697-10
22. Kijak GH, Sanders-Buell E, Chenine A-L, Eller MA, Goonetilleke N, Thomas R, et al. Rare HIV-1 transmitted/founder lineages identified by deep viral sequencing contribute to rapid shifts in dominant quasispecies during acute and early infection. PLoS Pathogens. 2017;13:e1006510. doi:10.1371/journal.ppat.1006510
23. Goonetilleke N, Liu MKP, Salazar-Gonzalez JF, Ferrari G, Giorgi E, Vitaly V, et al. The first T cell response to transmitted/founder virus contributes to the control of acute viremia in HIV-1 infection. Journal of Experimental Medicine. 2009;206:1253–1272. doi:10.1084/jem.20090365
24. Lemey P, Rambaut A, Pybus OG. HIV Evolutionary Dynamics Within and Among Hosts. AIDS Reviews. 2006;8:125–140.
25. Garrett ES, Zeger SL. Latent Class Model Diagnosis. Biometrics. 2000;56:1055–1067. doi:10.1111/j.0006-341x.2000.01055.x
26. Gimenez O, Morgan BJ, Brooks SP. Weak Identifiability in Models for Mark-Recapture-Recovery Data. In: Thomson D, Cooch E, Conroy M, editors. Modeling demographic processes in marked populations. Boston, MA: Springer; 2009. pp. 1055–1067. doi:10.1007/978-0-387-78151-8_48
27. Song H, Giorgi EE, Ganusov VV, Cai F, Athreya G, Yoon H, et al. Tracking HIV-1 recombination to resolve its contribution to HIV-1 evolution in natural infection. Nature Communications. 2018;9:1928. doi:10.1038/s41467-018-04217-5
28. Martin DP, Murrell B, Golden M, Khoosal A, Muhire B. RDP4: Detection and analysis of recombination patterns in virus genomes. Virus Evolution. 2015;1:vev003. doi:10.1093/ve/vev003
29. Schierup MH, Hein J. Consequences of recombination on traditional phylogenetic analysis. Genetics. 2000;156:879–891.
30. Prosperi MCF, Ciccozzi M, Fanti I, Saladini F, Pecorari M, Borghi V, et al. A novel methodology for large-scale phylogeny partition. Nature Communications. 2011;2. doi:10.1038/ncomms1325
31. Ragonnet-Cronin M, Hodcroft EB, Hué S, Fearnhill E, Delpech V, Brown AJL, et al. Automated analysis of phylogenetic clusters. BMC Bioinformatics. 2013;14:317. doi:10.1186/1471-2105-14-317
32. Poon AFY. Impacts and shortcomings of genetic clustering methods for infectious disease outbreaks. Virus Evolution. 2016;2:vew031. doi:10.1093/ve/vew031
33. Tonkin-Hill G, Lees JA, Bentley SD, Frost SDW, Corander J. RhierBAPS: An R implementation of the population clustering algorithm hierBAPS. Wellcome Open Research. 2018;3. doi:10.12688/wellcomeopenres.14694.1
34. Bromham L, Duchêne S, Hua X, Ritchie AM, Duchêne DA, Ho SYW. Bayesian molecular dating: opening up the black box. Biological Reviews. 2018;93:1165–1191. doi:10.1111/brv.12390
35. Li WLS, Drummond AJ. Model averaging and Bayes factor calculation of relaxed molecular clocks in Bayesian phylogenetics. Molecular Biology and Evolution. 2012;29:751–761. doi:10.1093/molbev/msr232
36. Dearlove B, Wilson DJ. Coalescent inference for infectious disease: meta-analysis of hepatitis C. Philosophical Transactions of the Royal Society of London Series B, Biological Sciences. 2013;368:20120314. doi:10.1098/rstb.2012.0314
37. Baele G, Li WLS, Drummond AJ, Suchard MA, Lemey P. Accurate Model Selection of Relaxed Molecular Clocks in Bayesian Phylogenetics. Molecular Biology and Evolution. 2013;30:239–243. doi:10.1093/molbev/mss243
38. Gill MS, Lemey P, Faria NR, Rambaut A, Shapiro B, Suchard MA. Improving Bayesian Population Dynamics Inference: A Coalescent-Based Model for Multiple Loci. Molecular Biology and Evolution. 2013;30:713–724. doi:10.1093/molbev/mss265
39. Minin VN, Bloomquist EW, Suchard MA. Smooth Skyride through a Rough Skyline: Bayesian Coalescent-Based Inference of Population Dynamics. Molecular Biology and Evolution. 2008;25:1459–1471. doi:10.1093/molbev/msn090
40. Drummond AJ, Ho SYW, Phillips MJ, Rambaut A. Relaxed phylogenetics and dating with confidence. PLoS Biology. 2006;4:e88. doi:10.1371/journal.pbio.0040088
41. Louca S, Pennell MW. Extant timetrees are consistent with a myriad of diversification histories. Nature. 2020;580:502–505. doi:10.1038/s41586-020-2176-1
42. Volz EM, Frost SDW. Sampling through time and phylodynamic inference with coalescent and birth-death models. Journal of The Royal Society Interface. 2014;11:20140945. doi:10.1098/rsif.2014.0945
43. Barido-Sottani J, Vaughan TG, Stadler T. A Multitype Birth-Death Model for Bayesian Inference of Lineage-Specific Birth and Death Rates. Systematic Biology. 2020;syaa016. doi:10.1093/sysbio/syaa016
44. Stadler T, Vaughan TG, Gavryushkin A, Guindon S, Kühnert D, Leventhal GE, et al. How well can the exponential-growth coalescent approximate constant-rate birth-death population dynamics? Proceedings of the Royal Society B. 2015;282:20150420. doi:10.1098/rspb.2015.0420
  • 45.Rose PP, Korber BT. Detecting hypermutations in viral sequences with an emphasis on G→A hypermutation. Bioinformatics. 2000;16: 400–401. 10.1093/bioinformatics/16.4.400 [DOI] [PubMed] [Google Scholar]
  • 46.Maddison WP, Maddison DR. Mesquite: a modular system for evolutionary analysis. 2017. Available: http://mesquiteproject.org [Google Scholar]
  • 47.Katoh K, Standley DM. Sequence analysis A simple method to control over-alignment in the MAFFT multiple sequence alignment program. Bioinformatics. 2016;32: 1933–1942. 10.1093/bioinformatics/btw108 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Heibl C. PHYLOCH: R language tree plotting tools and interfaces to diverse phylogenetic software packages. 2008. [Google Scholar]
  • 49.R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria: 2013. Available: http://www.r-project.org/ [Google Scholar]
  • 50.Lanfear R, Frandsen PB, Wright AM, Senfeld T, Calcott B. PartitionFinder 2: new methods for selecting partitioned models of evolution for molecular and morphological phylogenetic analyses. Molecular Biology and Evolution. 2017;34: 772–3. 10.1093/molbev/msw260 [DOI] [PubMed] [Google Scholar]
  • 51.Lewitus E, Morlon H. Characterizing and comparing phylogenies from their Laplacian spectrum. Systematic Biology. 2016;65: 495–507. 10.1093/sysbio/syv116 [DOI] [PubMed] [Google Scholar]
  • 52.Morlon H, Lewitus E, Condamine FL, Manceau M, Clavel J, Drury J. RPANDA: an R package for macroevolutionary analyses on phylogenetic trees. Methods in Ecology and Evolution. 2016;7: 589–597. 10.1111/2041-210X.12526 [DOI] [Google Scholar]
  • 53.Reynolds AP, Richards G, Iglesia B de la, Rayward-Smith VJ. Clustering Rules: A Comparison of Partitioning and Hierarchical Clustering Algorithms. Journal of Mathematical Modelling and Algorithms. 2006;5: 475–504. 10.1007/s10852-005-9022-1 [DOI] [Google Scholar]
  • 54.Nguyen L-T, Schmidt HA, Haeseler A von, Minh BQ. IQ-TREE: A Fast and Effective Stochastic Algorithm for Estimating Maximum-Likelihood Phylogenies. Molecular Biology and Evolution. 2014;32: 268–274. 10.1093/molbev/msu300 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55.Paradis E, Claude J, Strimmer K. APE: Analyses of Phylogenetics and Evolution in R language. Bioinformatics. 2004;20: 289–290. 10.1093/bioinformatics/btg412 [DOI] [PubMed] [Google Scholar]
  • 56.Stadler T. Sampling-through-time in birth-death trees. Journal of theoretical biology. 2010;267: 396–404. 10.1016/j.jtbi.2010.09.010 [DOI] [PubMed] [Google Scholar]
  • 57.Drummond AJ, Rambaut A, Shapiro B, Pybus OG. Bayesian coalescent inference of past population dynamics from molecular sequences. Molecular Biology and Evolution. 2005;22: 1185–92. 10.1093/molbev/msi103 [DOI] [PubMed] [Google Scholar]
  • 58.Drummond AJ, Suchard MA. Bayesian random local clocks, or one rate to rule them all. BMC Biology. 2010;8 10.1186/1741-7007-8-114 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59.Plummer M, Best N, Cowles K, Vines K. CODA: Convergence Diagnosis and Output Analysis for MCMC. R News. 2006;6: 7–11. [Google Scholar]
  • 60.Meredith M, Kruschke J. HDInterval: Highest (Posterior) Density Intervals. 2018. [Google Scholar]
  • 61.Wickham H. ggplot2: Elegant Graphics for Data Analysis. New York: Springer-Verlag; 2016. Available: http://ggplot2.org [Google Scholar]
  • 62.Baele G, Lemey P, Bedford T, Rambaut A, Suchard MA, Alekseyenko AV. Improving the accuracy of demographic and molecular clock model comparison while accommodating phylogenetic uncertainty. Molecular Biology and Evolution. 2012;29: 2157–67. 10.1093/molbev/mss084 [DOI] [PMC free article] [PubMed] [Google Scholar]
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008537.r001

Decision Letter 0

Rob J De Boer, Roger Dimitri Kouyos

5 Jun 2020

Dear Dr Rolland,

Thank you very much for submitting your manuscript "Factors influencing estimates of HIV-1 acute infection timing using BEAST" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Roger Dimitri Kouyos

Associate Editor

PLOS Computational Biology

Rob De Boer

Deputy Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Dearlove et al use serially-sampled viral genome sequences from a group of recently-infected HIV patients whose plausible seroconversion dates are known with some precision. They analyse these to determine the level of genetic diversity present in samples at different time points. They then run BEAST with a wide variety of models on each within-host dataset and use model selection to recommend optimal settings. This is a useful study, taking advantage of an unusually comprehensive dataset to make recommendations for general practice. The analyses presented are sensible and well-conducted. I do, however, have a number of substantial concerns. Some concern choices of what to emphasise in the results, others undiscussed limitations.

The manuscript does not actually mention what value it uses as the estimated date of infection (and nor does the accepted Rolland et al. paper). I am assuming, in the absence of a description of any other method, that this is simply the time of the tree root. But it should be noted that this is not necessarily the same thing, and that results that the authors regard as implausible (“infection dates” after diagnosis and differing “infection dates” for multiple founder strains) are in fact perfectly reasonable if the common ancestor of a sample existed some time after the point of infection. TMRCA estimates that are more recent than would be expected from epidemiological data may not really be a problem. This issue certainly should appear in the discussion.

The manuscript also places too much emphasis on point estimates. It’s not really surprising or informative to learn that posterior medians for the time of infection are not consistent between genes or founder strains. More interesting would be how much overlap there is between the relevant posterior densities.
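The posterior overlap the reviewer asks about has a standard quantitative form: the overlapping coefficient, the integral of min(f, g) over the two densities. A minimal sketch, assuming the densities are estimated from raw MCMC samples on a shared histogram grid (the function name, bin count, and example data are illustrative, not from the manuscript):

```python
import numpy as np

def overlap_coefficient(samples_a, samples_b, bins=200):
    """Estimate the overlapping coefficient, the integral of min(f, g),
    between two posterior densities from their MCMC samples."""
    a = np.asarray(samples_a, dtype=float)
    b = np.asarray(samples_b, dtype=float)
    # Shared grid spanning both sample sets, so the two histogram
    # densities are directly comparable bin by bin.
    edges = np.linspace(min(a.min(), b.min()), max(a.max(), b.max()), bins + 1)
    fa, _ = np.histogram(a, bins=edges, density=True)
    fb, _ = np.histogram(b, bins=edges, density=True)
    width = edges[1] - edges[0]
    return float(np.sum(np.minimum(fa, fb)) * width)

# Two identical posteriors overlap almost completely; well-separated
# posteriors (e.g. TMRCAs of two founder lineages) overlap very little.
rng = np.random.default_rng(0)
same = overlap_coefficient(rng.normal(0, 1, 50_000), rng.normal(0, 1, 50_000))
apart = overlap_coefficient(rng.normal(0, 1, 50_000), rng.normal(10, 1, 50_000))
```

The coefficient lies between 0 (disjoint posteriors) and 1 (identical posteriors), which makes it easy to report alongside point estimates across genes or founder strains.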

It feels a little unkind to recommend additional BEAST analyses given the potential time and CPU investment involved, but I do feel that some important models are missing and warrant inclusion. Firstly, the BEAST development team has not recommended use of the original skyline for many years; the first version with a GMRF smoothing prior appeared in 2008. That the skyride and skygrid are appropriate and preferable in almost all circumstances where the skyline would be used seems to have been poorly communicated to the general community of users. I would be interested to see if results using one of them are meaningfully different, and it would make these results reflect current best practice. Secondly, a relaxed uncorrelated Gamma molecular clock was also added to BEAUti some time ago, and this is a more flexible model than even the lognormal as it can also take a shape with a mode of zero (as with the exponential clock) as well as the peaked shape.

The study has access to enviably detailed and precise data on possible time windows for transmission events, but the manuscript makes surprisingly little use of it. For example, how often does the best-fitting model from stepping-stone sampling estimate an infection date within this window? (Or a credible interval which considerably overlaps it?) Does it ever happen that a less well-fitting model gives a more consistent estimate? Which genes perform best in estimating dates that are consistent with the given intervals?

Finally, while I accept that within-host HIV recombination is a very difficult nut to crack and that it is reasonable to ignore it here, a brief discussion of the problem and how it might have affected the results is warranted.

Minor points

L86: I do not really care for the use of “genome” as a shortening of “whole genome”, even if it is explained here. I have to remind myself what is intended every time I see it.

L167: How was “significant slope” defined?

L191: The authors acknowledge that there was little difference between some of the models, could they not use Bayes factors to quantify this?

L199: Not really a guaranteed overestimate, surely, as you do not know the truth.

L200: When the strict clock was selected as the best model, did it still give earlier estimates than the others?

L220: The pol observation seems quite striking, and I would be interested to see possible explanations in the discussion.

L272: The number of datasets mentioned here do not sum to 11, nor is the change from 7 (L267) explained.

L280: Which BEAST model was used here?

L451: It should be noted that improper priors should not be used in combination with stepping-stone sampling. Were all those replaced with proper ones?

L453: Including the birth-death model at all is a questionable choice. It was, after all, developed as a between-host epidemiological model. I can see how it might be interpreted as a within-host demographic model, but there is also a problem of correlation between the three parameters (see Stadler et al, PNAS, 2013 and in particular the appendix). One of the priors needs to be informative for this analysis to give useful results, and this would have to be carefully chosen if moving the entire framework to a model of a within-host population. This did not happen here.

L463-466: There is a repeated sentence here.

Most of the figures need enlargement of the label text.

Reviewer #2: Overall comments

I was excited to read this paper based on the abstract. But, despite the obviously massive amount of work that went into this paper, I think that the analysis does not make sense and is misleading. I think that the authors need to re-evaluate what their objectives are and re-think what analysis best supports that objective. After reading the paper several basic things are not clear to me. The stated objective seems to be to provide some guidance on how to use multi-sample data (e.g. clones or NGS) to estimate (unknown) infection times, but the analysis seems to be nearly exclusively focused on poorly executed and misleading phylogenetic model selection. I’m also unclear on some basics, like how the samples were used: were all the time samples combined into a single tree (a single tree figure would have resolved this)? If that is the case, the authors are analyzing a situation that basically never occurs “in the field” (almost none of the many HIV sequences in the referenced studies will have multiple clones/NGS from multiple time points); if each time point was considered separately, then the results were not reported by their time strata. Some of the supplemental material also seems to suggest that the analysis was strange (env seems to have one of the lower median estimated evolutionary rates; the authors also use “substitution rate” when they mean evolutionary rate). The authors say that they expect evolutionary rates that are 100-fold lower than the median of the priors. This makes no sense.

General comments

It’s not at all clear how multiple infection was determined. The reference is to a paper in press and no other mention is made of how this was done, other than to say that alignment, phylogeny, and some kind of graph-theoretic method was involved.

Figure 2 shows results using methods from citation 10 I think, but citation 8 still seems to be the main citation here (which I cannot read because it’s not public). I’m also not clear where the time dimension went in this figure. The samples came from 4 to 170 days from infection but this does not seem to be referenced in the figure (are these just the first sample?).

In general, it’s not clear why the results are not presented as a function of time from infection. The strength of this study is that that time is known.

I don’t understand the point behind the “best-fitting model combination” and I don’t think that the technical execution makes sense. I’ll address the latter point first. Even if we assume that the stepping-stone estimator is unbiased and perfect, model selection by marginal likelihoods does not make sense in this case. The marginal likelihood will be highly dependent on the choice of model priors (the ML integrates the product of the likelihood and the prior over the parameter space). Given that some of the model formulations have tens of parameters, and include ones that have no real principled prior defaults, it’s almost impossible that the authors have set up a situation where every single prior was truly evaluated for what the most reasonable distribution should be for these datasets. And, in the unlikely situation where they did, those values were not reported in full. I think that it’s more likely that they used more-or-less the default priors in BEAST (for unreported values). In this case, this analysis tells us what the default priors of BEAST think about this data. Further, some of the reported values of the priors don’t make sense: the constant population size had a lognormal prior (log mean = 0, log sd = 1), which has an expectation of about 1.65, while the exponential population size was uniform between 1 and 1 googol (10^100). The marginal likelihood calculations therefore assume that the likelihood when the population is 1 billion billion billion is as important as when the population size is 100. To extend the critique, if we told two competent HIV phylogeneticists to do this analysis and to carefully think about the priors, we would get two very different sets of “best” models, not because either were wrong, but because the outcome is highly dependent on “reasonable guesses” (e.g. what is a reasonable prior on the exponential growth rate of an ideal population for these data?). This whole analysis does not make sense to me and I believe is highly misleading.
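The prior arithmetic in this critique can be checked directly. A minimal sketch (the parameter values are those quoted in the review, not the authors' exact BEAST XML):

```python
import math

# Expectation of a lognormal prior with log-mean mu = 0 and log-sd sigma = 1:
# E[X] = exp(mu + sigma^2 / 2)
mu, sigma = 0.0, 1.0
lognormal_mean = math.exp(mu + sigma**2 / 2)
print(f"lognormal(0, 1) expectation: {lognormal_mean:.4f}")  # 1.6487

# A Uniform(0, 1e100) prior on population size places almost all of its
# mass on astronomically large values: the fraction of prior mass below
# even a generous bound of one billion is vanishingly small.
mass_below_1e9 = 1e9 / 1e100
print(f"P(N < 1e9) under Uniform(0, 1e100): {mass_below_1e9:.1e}")
```

This is the sense in which the marginal likelihood under such a prior is dominated by population sizes no within-host HIV model would ever entertain.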

I also find it confusing that the authors seem to want to talk about which models are the best at recovering the known times of infection, but they use instead a (poor in my opinion) model fit statistic. The authors’ logic is not really stated so I can’t really say, but I suspect that they used marginal likelihoods because they naturally penalize overfitting by naturally spreading out the integral over a larger parameter space in more complex models. However, the whole phylogenetic model is just a nuisance parameter if the only goal is to recover the infection time. If there was a single model that had the lowest bias in the estimate of the infection time but that was only partially identifiable or had a bad marginal likelihood, would that matter? The authors are not trying to do biology but rather to determine the statistical properties of an estimator.

The vertical grid lines in Figure 3 make reading the boxplots hard. This figure also seems to suggest that using more data (whole genome) leads to worse and more variable estimates. It seems that this result alone suggests either that the models being used were not sufficient for the whole genome (i.e. not enough heterogeneity compared to individual gene models) or that recombination is causing a systematic bias towards longer trees.

Specific comments

(L60) This phrasing is weird, should be made clear that the unknown time is the interval from infection to diagnosis. Also, “often” is misleading; there is always a lag between infection and diagnosis.

(L70) I don’t think of PCR as being a serological assay, but this might be ignorance on my part.

(L73) The authors should pick a single term for the time interval between infection and diagnosis and stick to it (in the first 1.5 pages they use both age and date of infection).

(L104) Why are there 3 medians reported? Is this the median time for the first, second, and third samples?

(L106) What does “mainly identified” mean? Subtype A1 was 9/23, do you mean the most common subtype was A1?

(L108) The Thai woman mentioned here is omitted in the intro.

(L115) This seems tautological, as sequence alignment (and ostensibly the number of polymorphic sites) was used to determine multiplicity of infection.

(L126) How do the authors know which patients were actually multiply infected? Or do they just mean that each method was internally consistent?

Reviewer #3: Overall this is a very interesting paper dealing with a very common potential flaw in phylogenetic analysis -- using separate genes that may be more readily available rather than using full genomes that contain more information for meaningful phylogenetic analysis. These kinds of articles are extremely important, in particular due to the status quo in the field, where genes with limited information may be used to make conclusions on an epidemiological scale.

The dataset that is being used in this and some preceding papers by the same authors is unique and useful. Showing the use of computational methods to narrow down the date of suspected infection is a great result with major potential impact on the healthcare system, helping to control the spread of infection in real time.

However, there are two major concerns about the current version of the manuscript that I mention again later where they occur in the text. The first major concern is the fact that, per dataset of three sequences per patient and gene/genome, the authors ran 16 independent analyses. The number of analyses on one small dataset is large, and this inherently leads to some level of false positivity (e.g. models fitting "well" to the data while in fact the fit is not that good). Unfortunately, the authors do not mention how they counteract this or why it is not an issue.

The second major issue is the selection of priors on different parameters. The priors were obviously intended to be as uninformative of the true values as possible (e.g. very flat uniform distributions used for different parameters), however these kinds of distributions are far from uninformative for parameters with extremely low expected values. E.g. in cases when a major part of the posterior distribution lies between 0 and 0.1, whereas the prior sets a weight of 9999.9 on values larger than 0.1. Unsurprisingly, when the amount of information is low, the posterior is very close to the prior.

Overall, I do not think that the authors should redo any analyses in order for this paper to be scientifically useful and a helpful addition to the existing research body. Moreover, rerunning analyses under slightly different assumptions may further exacerbate the multiple testing issue brought up above.

However, at its current stage the manuscript is missing some crucial explanation justifying the choices made for the analyses. The prior selection is very specific but also extremely generic given the body of knowledge already present on HIV evolution. With the addition of justification for these choices, the manuscript should be ready for publication.

Additionally, I think that it would be very helpful if the authors also published their BEAST configuration files so that other researchers could analyse their datasets in a similar manner.

Major comments:

Line 39 and multiple others:

The paper reports IQRs for the posterior values. It would seem more reasonable to (also) report the standard 95% HPDs that most Bayesian analyses report (and in particular BEAST analyses), as this covers most of the values of the distribution and can show how precise the estimates are. For example, if the 95% HPD range is extremely wide, the precision of the estimate is low and few conclusions can be drawn from the median value. The reported IQRs are already quite wide so it would be interesting to see the 95% HPDs.
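For context, the 95% HPD the reviewer asks for is conventionally computed as the narrowest interval containing 95% of the posterior samples, as in the HDInterval package cited in the references. A minimal sketch of that estimator (illustrative code, not the authors'):

```python
import numpy as np

def hpd_interval(samples, cred=0.95):
    """Narrowest interval containing a `cred` fraction of the samples.

    Assumes a unimodal posterior; this is the same narrowest-interval
    logic used by HPD estimators such as R's HDInterval."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = len(x)
    k = int(np.ceil(cred * n))            # samples each candidate interval must cover
    widths = x[k - 1:] - x[: n - k + 1]   # width of every run of k consecutive samples
    i = int(np.argmin(widths))            # index of the narrowest run
    return x[i], x[i + k - 1]

# Example: for a standard normal posterior the 95% HPD should be close
# to the central interval (-1.96, 1.96).
rng = np.random.default_rng(1)
lo, hi = hpd_interval(rng.normal(size=100_000))
```

Unlike the IQR, which always discards 50% of the posterior mass, this interval widens and narrows with the actual precision of the estimate, which is exactly the property the reviewer wants reported.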

Line 103:

Were the differences in subtype evolution accounted for? E.g. it could be that different subtypes evolve at very different rates, which may bias the analysis results. In particular, when trying to estimate exact timescales good priors on evolution rates can be crucial for precise estimates.

Line 172:

The paper says that there were 9 datasets for a single gene and 25 datasets for full genomes. How is it possible that there were many more full genome sequences (that should include that single gene) than single gene sequences?

Line 176:

One major concern is that small datasets were re-analysed multiple times setting up different models. While a legitimate criterion was used to determine goodness of fit, I am concerned that the multiple analysis runs have introduced false good fits. Could the authors clarify how this was handled or why it is not an issue?

Line 180:

It is not exactly clear how the analyses are set up in BEAST. It is my understanding that individual patient data (single gene or genome) was analysed in a single analysis, separate from any other patient data. However, there are only 3 sequences per patient, which is incredibly little for reliable parameter inference. How do the authors counteract that? In particular given that the priors on different parameters such as substitution rates are uninformative (in fact, informative of values very different from the expected), and especially since per participant (i.e. per 3 sequences) 16 different model configurations were run.

Line 320:

This is a very true statement, analyses with little data do end up mainly reconstructing the priors. This is the primary reason for using more data and informative priors. This statement raises once more the question of prior selection described in the manuscript as a lot is already known about HIV phylogenetics and the priors selected here were extremely broad.

Line 324:

Here is an example of the prior explicitly putting emphasis on an unrealistic value. The only conclusion that can be drawn from this would be that there is not enough information in the data for successful inference.

Line 350:

Could it actually be the estimate of the infection time for the infection source for that patient? I.e. if we are estimating the time of coalescence of the founder variants that all come from a single other source, it would span within-host evolution over two patients -- the sampled patient and the source of infection. Of course, there's a possibility of multiple founders in the source too, but the exact same founders would be extremely unlikely to have been passed on to the sampled patient.

Line 452:

The prior choices described here need further justification.

"a uniform prior between 0 and 100,000 for the birth rate" -- This is in fact a prior that puts an incredibly high weight on any value above 1 (99,999 vs 1).

"a uniform prior between 0 and 100 for the rate of sampling through time" -- The sampling proportion and rate in this case are values that are relatively well known for the background population. It would make more sense to set them sensibly based on existing knowledge to not bias the analyses.

"a uniform prior between 1.0 and 1.0 x 10^100 was used for the time the lineage originated" -- in this case as well, the oldest human lineages are from the 20th century at the earliest and the prior puts a lot of weight on values a long time in the past.

Minor comments (typos, typesetting errors, etc):

Line 28: The abstract says "coalescence analyses" whereas in fact the paper presents analyses based on both the coalescent and the birth death models.

Line 181: The word "were" seems to not match the rest of the sentence structure (maybe it is redundant?).

Line 203: Missing a comma after "but when not".

Line 214: "emcompassing" --> "encompassing".

Line 271: The authors refer to 5 out of 11 datasets as the majority, which is not the case.

Line 361: "straight forward" --> "straightforward".

Line 463: The first sentence of the paragraph is repeated twice.

line 470: The parameter of the Beta distribution is typeset as a blank square.

Figure 1: The grey and the pink are a bit dull and therefore not easy to see.

Figure 2: The figures are too small to be able to read the sample numbers -- the fonts need to be bigger as well as the figures themselves.

Figure 3: The grey and the pink are a bit dull and therefore not easy to see.

Supplementary Figure 3: The plots are unreadable due to the small size.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: No: They have not been provided in this submission; it is promised that they will follow.

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: Yes: Jūlija Pečerska

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see http://journals.plos.org/compbiol/s/submission-guidelines#loc-materials-and-methods

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008537.r003

Decision Letter 1

Rob J De Boer, Roger Dimitri Kouyos

11 Sep 2020

Dear Dr Rolland,

Thank you very much for submitting your manuscript "Factors influencing estimates of HIV-1 infection timing using BEAST" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. 

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Roger Dimitri Kouyos

Associate Editor

PLOS Computational Biology

Rob De Boer

Deputy Editor

PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

[LINK]

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: My concerns have largely been addressed, but I have a few follow-up points:

L104: I did not spot this the first time, but ten sequences from three time points in 39 individuals would be 1170 sequences, not 1280.

L237: The word “skyline” here is extraneous and a little confusing.

L309-310: Could you calculate overlapping coefficients for the dates estimated using different founder populations from the same individual?

L370-371: I think this is not clearly worded enough, perhaps “the common ancestor of the sampled sequences existed a considerable time after the point of infection”? This issue is also not discussed in the context of multiple founder strains, where it is also an explanation for disagreement in estimates.

L393-394: What evidence of recombination was found, and how? Also, since recombination could make phylogenetic inference unreliable, simply adding more sequences will not solve the problem. It would need to be checked for explicitly. Could recombination potentially explain the too-distant TMRCA estimates for some founder strains in 10203, 20337, 30124 and 40123? If single-gene analyses for the same samples did not behave this way I would strongly suspect it. Or are those long branches themselves the evidence?

L427-431: I don’t think this is worded strongly enough. If the authors agree that the use of the birth-death model is inadvisable, I recommend removing it from the manuscript and adjusting the results accordingly.

Figure 4: This looks worse for the authors than I think it needs to. Perhaps add another shaded area representing plausible eclipse periods?

Reviewer #3: The authors have sufficiently addressed the comments and have made significant adjustments to the manuscript to reflect the answers.

With some very minor remaining comments I think that the manuscript is in a good shape for publication.

Minor comments:

Line 109: From what I understand now there are 30 NFL genomes for each of the 39 participants, which should come to 1,170 sequences in total. Where do the other 110 sequences come from? Or is it that there are not exactly 10 NFL genomes per time point per patient but sometimes randomly a few more?

Lines 231-233: This sentence is slightly different from the version in author answers and that small change makes it very confusing. Suggested change:

"All participants had at least one model combination including a skyline in the top three ranked models, with 14 out of 25 participants having the skyline population model for the top three models selected."

to

"All participants had at least one model combination including a skyline among the top three ranked models, while 14 out of 25 participants had the skyline population model for all the top three models selected."

Line 275: What do "better results" mean in this context? More precise, more accurate or more realistic maybe? Using a different comparator rather than "better" would make this title clearer.

Line 283: Could there be a typo in "17,127 days"? Or did the authors actually get point estimates for infection at 46.5 years prior to diagnosis?

Line 523: Here I would greatly appreciate the addition of a comment similar to the one given in answer to my question on priors. Leaving priors on default values is a very dangerous approach which should be discouraged, as mostly the priors make no sense and are only there such that the software would not crash. The authors selected such priors on purpose and I think that stating that extremely clearly will minimise the risk of someone copying these analyses without thinking this over. While this is mentioned in the discussion now, I think that reiterating the reasoning here is very important.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #3: Yes: Jūlija Pečerska

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.


PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008537.r005

Decision Letter 2

Rob J De Boer, Roger Dimitri Kouyos

13 Nov 2020

Dear Dr Rolland,

We are pleased to inform you that your manuscript 'Factors influencing estimates of HIV-1 infection timing using BEAST' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Roger Dimitri Kouyos

Associate Editor

PLOS Computational Biology

Rob De Boer

Deputy Editor

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: My comments have been addressed

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008537.r006

Acceptance letter

Rob J De Boer, Roger Dimitri Kouyos

25 Jan 2021

PCOMPBIOL-D-19-01987R2

Factors influencing estimates of HIV-1 infection timing using BEAST

Dear Dr Rolland,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Alice Ellingham

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Table. Summary of the estimated dates of HIV-1 infection for participants infected by single or multiple HIV-1 founders by gene.

    n denotes the number of participant datasets. Negative estimates denote days prior to diagnosis.

    (DOCX)

    S1 Fig. The accumulation of diversity over time for infections with single founders.

    At each sampling time point, we calculated the number of polymorphic sites (sites with at least two alleles) and informative sites (sites with multiple alleles found in at least two sequences) for sequences collected up to and including that time.

    (TIF)
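    The site counts described in the legend above can be sketched in a few lines; this is a minimal illustration of the definitions (polymorphic: at least two alleles at a site; informative: at least two alleles each present in two or more sequences), not the study's actual code, and the function name is hypothetical.

```python
from collections import Counter

def site_counts(alignment):
    """Count polymorphic and informative sites in a list of aligned sequences.

    A site (alignment column) is polymorphic if at least two alleles occur;
    it is additionally informative if at least two distinct alleles are each
    found in two or more sequences.
    """
    polymorphic = informative = 0
    for column in zip(*alignment):  # iterate over alignment columns
        counts = Counter(column)
        if len(counts) >= 2:
            polymorphic += 1
            # alleles supported by at least two sequences
            if sum(1 for c in counts.values() if c >= 2) >= 2:
                informative += 1
    return polymorphic, informative
```

    For example, in the toy alignment `["ACGT", "ACGA", "ACGA", "TCGT"]`, the first and last columns are polymorphic but only the last (T/T/A/A) is informative.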

    S2 Fig. Breakdown of datasets for the BEAST analysis, by founder type.

Two participants did not have samples for all three time points for gag and pol and were thus removed for these genes; participant 20368 was removed from the NFL genome analysis because only half genomes were available. The category ‘informative sites’ refers to participants whose sequences had at least one informative site; these were then tested for significant temporal signal. Only those with at least one informative site and temporal signal were analyzed in BEAST.

    (TIF)

    S3 Fig. Summary of the estimated dates of HIV-1 infection for participants with single or multiple HIV-1 founders by gene.

    (TIF)

    S4 Fig. Comparison of estimates for participants infected with a single founder.

    A) Point estimates of the date of infection for participants. The color of the points shows the number of informative sites in the within-host dataset for that participant and gene. Lines link gene and genome results for each participant. B) Boxplot of the overlap coefficient for posterior distributions between pairs of genes for each participant. The coefficient is defined between 0 and 1; 0 implies the distributions are non-overlapping, and 1 implies complete overlap.

    (TIF)
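    The overlap coefficient used in panel B can be approximated from two sets of posterior samples by binning both on a shared grid and integrating the pointwise minimum of the two density estimates. The sketch below is one common way to compute it, under the assumption of histogram-based density estimates; the paper does not specify its implementation, and the function name is illustrative.

```python
import numpy as np

def overlap_coefficient(samples_a, samples_b, bins=100):
    """Estimate the overlap coefficient between two posterior samples.

    Approximates OVL = integral of min(f, g) dx by binning both sample sets
    on a shared grid: 0 means the distributions are disjoint, 1 means they
    are identical.
    """
    samples_a = np.asarray(samples_a, dtype=float)
    samples_b = np.asarray(samples_b, dtype=float)
    lo = min(samples_a.min(), samples_b.min())
    hi = max(samples_a.max(), samples_b.max())
    edges = np.linspace(lo, hi, bins + 1)
    # density=True normalizes each histogram to integrate to 1
    f, _ = np.histogram(samples_a, bins=edges, density=True)
    g, _ = np.histogram(samples_b, bins=edges, density=True)
    width = edges[1] - edges[0]
    return float(np.sum(np.minimum(f, g)) * width)
```

    Feeding a posterior sample against itself returns 1, and two samples with non-overlapping supports return 0, matching the bounds stated in the legend.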

    S5 Fig. Comparison of Bayes factors for each model combination across genes and participants with significant temporal signal.

    The best-fitting model is shown in red. Shades of orange and yellow give the strength of evidence for the best-fitting model relative to the other model combinations fitted, that is, the darker the color, the smaller the improvement by the best-fitting model. Model combinations shown in gray could not be fitted.

    (TIF)

    S6 Fig. Comparison of clock priors for the best-fitting population model.

    For each participant, the best-fitting estimate from BEAST is marked by a triangle, with a circle showing the result from the other clock models under the same population model. The shaded blue area corresponds to the interval between the last negative and first positive HIV-1 RNA test; where the blue bar is missing, the sequences were unavailable for that participant.

    (TIF)

    S7 Fig. Smaller genes have less signal and are more likely to be influenced by the prior.

    The estimated substitution rates for each participant and gene (left) are plotted alongside the prior distribution (right). Horizontal dashed lines show the median and IQR of the prior.

    (TIF)

    Attachment

    Submitted filename: Dearlove_PLoSCompBio_ResponseToReviewers.5aug20.pdf

    Attachment

    Submitted filename: 20200928_Dearlove_PLoSCompBio_ResponsetoReviewers.pdf

    Data Availability Statement

    Sequences are available on GenBank under accession numbers KY580473–KY580727 and MN791130–MN792579.

