2017 Jun 1;31(9):1211–1222. doi: 10.1097/QAD.0000000000001470

Defining HIV-1 transmission clusters based on sequence data

Amin S Hassan a,b, Oliver G Pybus c, Eduard J Sanders a,d, Jan Albert e,f, Joakim Esbjörnsson b,d,f
PMCID: PMC5482559  PMID: 28353537


Understanding HIV-1 transmission dynamics is relevant to both screening and intervention strategies for HIV-1 infection. Commonly, HIV-1 transmission chains are determined based on sequence similarity, assessed either directly from a sequence alignment or by inferring a phylogenetic tree. This review is aimed at both nonexperts interested in understanding and interpreting studies of HIV-1 transmission, and experts interested in finding the most appropriate cluster definition for a specific dataset and research question. We start by introducing the concepts and methodologies by which HIV-1 transmission clusters have usually been defined. We then present the results of a systematic review of 105 HIV-1 molecular epidemiology studies, summarizing the most common methods and definitions in the literature. Finally, we offer our perspectives on how HIV-1 transmission clusters can be defined and provide some guidance based on examples from real-life datasets.

Keywords: HIV-1, molecular epidemiology, phylogeny, transmission clusters


The classification and clustering of biological organisms and entities has been fundamental to understanding their origins, relationships and evolution [1,2]. Mechanisms of reproductive isolation prevent animals and plants of different species from producing fertile offspring, thereby maintaining species integrity over time, such that biological diversity typically falls into discrete categories or clusters [3]. Reproductive isolation mechanisms are absent or less distinct for viruses. Together with a high mutation rate and the ability to adapt swiftly to environmental changes, this means that the genetic diversity of many viruses, such as HIV-1, exists as much more of a continuum. This makes the definition of discrete clusters at the inter-host and intra-host level particularly challenging [4,5]. However, the rapid evolution leaves measurable footprints in viral genomes that can be associated with transmission dynamics and epidemiology. An HIV-1 transmission cluster can be described as a set of HIV-1 sequences that are aggregated in a nonrandom manner linked to their epidemiology. Over the last two decades, evolutionary theory and sequence analysis have contributed significantly to our understanding of HIV-1 epidemiology, for example by providing information about the time and geographical location of HIV-1 origins [6,7]. Detailed analyses of viral sequences can provide useful information about HIV-1 epidemics by identifying transmission linkages and by elucidating differences in transmission within and between populations [6,8]. Well characterized transmission chains have been compared with sequence-based phylogenies and are often in close agreement [9–14].

Phylogenetic analysis has been used successfully to identify and dissect HIV-1 transmission clusters. When combined with detailed epidemiological and clinical data, the results of such analyses can be of public health relevance, for example by identifying how virus lineages are restricted to, or mix among, different demographic and behavioural subgroups [15–21]. Typically, each HIV-1-infected individual under study is sampled once and represented by a single HIV-1 sequence obtained by bulk Sanger sequencing. The sequences are used to construct a bifurcating evolutionary tree (a phylogeny) in which each virus sequence (taxon) is positioned at a tree tip. Pairs of tree branches share a node that represents the most recent common ancestor (MRCA) of the taxa that have descended from that node. Individuals that share an MRCA are usually considered to be epidemiologically linked, that is, to represent a transmission cluster. The lengths of the tree branches usually represent the genetic relatedness between the different ancestors and their descendants. Genetic distances can be linked to time by assuming a so-called molecular clock model [22]. Early molecular clock models assumed that all phylogeny branches evolved at the same rate. However, a constant evolutionary rate is often unrealistic, and alternative, more flexible molecular clock models have been developed [23–26]. In essence, phylogenetic inference relies on an alignment of genetic sequences, an underlying substitution model to model the process of evolutionary sequence change, an approach or algorithm for inferring the tree, and some measure of statistical support for the relationships given in the tree.

The selection of an appropriate definition for a ‘transmission cluster’, that is a shared MRCA, is complex and needs to take into account both the research question and characteristics of the sequence data, such as the selected genomic region, sequence length, the range of sample collection dates, the distribution of sampling locations, the mode of transmission, the diversity of the genetic variants sampled (within and between HIV-1 subtypes and circulating recombinant forms), the number of sequences, the proportion of the population under study that is sampled, and the degree to which sampling is representative [27]. Hence, it is not surprising that there is no clear consensus on how transmission clusters should be defined. Nevertheless, there is a need for a common strategy among researchers for determining appropriate cluster definitions for typical datasets and research questions. A common rationale would contribute to a better understanding of the HIV-1 pandemic by increasing the comparability between studies [28,29].

Genetic distance, tree building algorithms and node support

Pairwise genetic distances can be either calculated directly from the sequences (the so-called p-distance, or Hamming distance, which equals the observed number of nucleotide differences between two sequences) or computed using a nucleotide substitution model (the expected genetic distance, or so-called d-distance). If genetic distances are computed as the sum of the branch length between two tips in a tree, then they are known as patristic distances [30]. Simple p-distances do not employ a substitution model that describes the evolutionary process and therefore do not account for multiple changes or back mutations at the same site. Consequently, they typically underestimate the true genetic distance between two sequences [31].
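The difference between an observed and a model-corrected distance can be illustrated with a minimal Python sketch. It assumes two aligned sequences of equal length, expresses the p-distance as a proportion of compared sites (so that it is on the same scale as substitutions per site), and uses the Jukes–Cantor model as the simplest possible correction. The function names and the toy sequences are hypothetical and are not taken from any of the reviewed studies.

```python
import math

def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of compared sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    compared = differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in valid or b not in valid:
            continue  # assumption: gaps and ambiguity codes are skipped
        compared += 1
        differing += a != b
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared

def jc69_distance(p: float) -> float:
    """Jukes-Cantor corrected distance (expected substitutions per site)."""
    if p >= 0.75:
        raise ValueError("p-distance is saturated under the Jukes-Cantor model")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Toy example: 2 differences over 10 compared sites.
p = p_distance("ACGTACGTAC", "ACGTTCGTAA")   # 0.2
d = jc69_distance(p)                          # larger than p
```

The corrected distance is always at least as large as the p-distance, which is the underestimation described above; for closely related sequences the two values are nearly identical.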

Pairwise genetic distances within a transmission cluster of more than two sequences can be summarized in different ways, for example by using the mean, median or the maximum pairwise distance [29,32,33]. Another approach is to associate a sequence to a specific cluster if the distance from that sequence to any other sequence in that cluster is lower than a threshold value – irrespective of the distances to other sequences in the cluster [18,34]. There are advantages and disadvantages to the different approaches. For example, the maximum genetic distance has been suggested to be less sensitive to cluster size than cluster definitions relying on mean genetic distances, in which one or a few ‘unlinked’ sequences may be erroneously included in large clusters because they have minimal influence on the mean distance [35]. Moreover, maximum genetic distance approaches are fast to compute and have been suggested to correlate with the time of the MRCA of clusters in molecular clock phylogenies [33].
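The following Python sketch contrasts these summaries on a toy example. It assumes that pairwise distances are already available in a dictionary keyed by unordered sequence pairs; the sequence names, distances and the 0.015 threshold are illustrative assumptions only. The second function implements the 'linked to any other member' rule as connected components, which is one possible reading of that definition.

```python
from itertools import combinations
from statistics import mean, median

def summarize_cluster(members, dist):
    """Mean, median and maximum pairwise distance within a cluster."""
    values = [dist[frozenset(pair)] for pair in combinations(members, 2)]
    return {"mean": mean(values), "median": median(values), "max": max(values)}

def single_linkage_clusters(names, dist, threshold):
    """Group sequences that are within `threshold` of at least one member."""
    unassigned = list(names)
    clusters = []
    while unassigned:
        seed = unassigned.pop(0)
        cluster, frontier = {seed}, [seed]
        while frontier:
            current = frontier.pop()
            linked = [n for n in unassigned
                      if dist[frozenset((current, n))] < threshold]
            for n in linked:
                unassigned.remove(n)
                cluster.add(n)
                frontier.append(n)
        clusters.append(cluster)
    return clusters

# Hypothetical distances in substitutions per site.
names = ["A", "B", "C", "D"]
dist = {frozenset(pair): 0.050 for pair in combinations(names, 2)}
dist[frozenset(("A", "B"))] = 0.010
dist[frozenset(("B", "C"))] = 0.012
dist[frozenset(("A", "C"))] = 0.020

clusters = single_linkage_clusters(names, dist, threshold=0.015)
summary = summarize_cluster(sorted(clusters[0]), dist)
```

In this toy example A, B and C are grouped together because each is within 0.015 of at least one other member, and the mean distance of the group is below 0.015, yet its maximum pairwise distance (0.020) exceeds the threshold. A definition based on the maximum distance would therefore not accept the same group as one cluster.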

Both maximum-likelihood and Bayesian tree building approaches use probability models to evaluate the relative plausibility of different phylogenetic topologies, whereas the neighbour-joining approach uses a deterministic tree-building algorithm that generates only a single phylogenetic topology (Table 1 and described in detail in [31]). Traditionally, statistical node support for the relationships in a phylogenetic tree has been evaluated by a statistical technique called bootstrapping [36]. During phylogenetic bootstrapping, site positions in the original alignment are randomly resampled with replacement to produce a set of pseudo-replicate alignments. The tree building approach is then applied to each of these alignments. Clusters of related taxa that are present in a low percentage of the bootstrap trees are weakly supported and vice versa. However, the exact interpretation of bootstrap values is difficult. Higher values are of course better, but what is a reasonable cut-off? It has been suggested that bootstrap values of more than 70% indicate strong support for a group, based on the conclusion that bootstrap supports are conservative measurements [37]. Two other types of statistical tests, which are substantially faster than the bootstrap approach, are the approximate likelihood-ratio test and the zero-branch length test [38–40]. In essence, these test whether the length of each branch in a tree is significantly greater than zero (i.e. whether the branch exists), and cut-off probabilities of more than 0.9 have been suggested to be conservative and to correspond relatively well to bootstrap values of more than 70% [39–41].
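The resampling step of the bootstrap can be sketched in a few lines of Python. The example below only generates the pseudo-replicate alignments; the tree building step that would be applied to each replicate is deliberately left out. The alignment is represented as a dictionary of equal-length strings, and the function names, toy sequences and the fixed random seed are assumptions made for illustration.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the length."""
    n_sites = len(next(iter(alignment.values())))
    sampled_sites = [rng.randrange(n_sites) for _ in range(n_sites)]
    # The same sampled columns are used for every sequence, so that each
    # column of the replicate is a complete column of the original.
    return {name: "".join(seq[i] for i in sampled_sites)
            for name, seq in alignment.items()}

def bootstrap_replicates(alignment, n_replicates, seed=0):
    """Generate a list of pseudo-replicate alignments."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]

toy_alignment = {
    "seq1": "ACGTACGTAC",
    "seq2": "ACGTTCGTAA",
    "seq3": "ACGAACGTAC",
}
replicates = bootstrap_replicates(toy_alignment, n_replicates=100)
```

The bootstrap support of a group would then be the percentage of trees, one inferred from each replicate, in which that group appears.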

Table 1.

Components of HIV-1 transmission cluster definitions based on phylogenetic node support.

Phylogenetic tree reconstruction (examples of commonly used methods)
Substitution model (examples of commonly used substitution models)
 Jukes–Cantor (JC)
 Tamura–Nei (TN)
 General time reversible (GTR)
Node support tests (examples of commonly used tests)
 Bootstrap test
 Approximate likelihood-ratio test (aLRT)
 Zero-branch length test

Instead of relying on one ‘best tree’ or a set of bootstrap pseudo-replicates, Bayesian phylogenetic approaches use Markov chain Monte Carlo sampling to infer a full posterior probability distribution of plausible trees, which should contain all the different tree topologies that are well supported by the data. This set of trees can be used to produce a consensus tree [called a maximum clade credibility (MCC) tree] in which each branch and cluster has an associated probability. In an MCC tree, this probability is the proportion of trees in the posterior probability distribution in which the cluster of interest exists. Bayesian posterior probabilities have been suggested to be a generally less biased predictor of phylogenetic accuracy than bootstrapping [42].
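As a minimal sketch of how such probabilities are obtained, the Python example below represents each sampled tree simply as the set of clades (groups of taxa) it contains. The posterior probability of a clade is the fraction of sampled trees containing it, and the MCC tree is taken to be the sampled tree with the highest product of clade probabilities. The five toy trees and all names are hypothetical; reading trees from the output of a Bayesian phylogenetics program is not shown.

```python
import math
from collections import Counter

def clade_posteriors(sampled_trees):
    """Fraction of sampled trees in which each clade occurs."""
    counts = Counter(clade for tree in sampled_trees for clade in tree)
    n_trees = len(sampled_trees)
    return {clade: count / n_trees for clade, count in counts.items()}

def maximum_clade_credibility(sampled_trees):
    """Return the sampled tree with the highest product of clade posteriors."""
    posteriors = clade_posteriors(sampled_trees)

    def log_credibility(tree):
        # Sum of logs is used instead of a product for numerical stability.
        return sum(math.log(posteriors[clade]) for clade in tree)

    return max(sampled_trees, key=log_credibility), posteriors

# Hypothetical posterior sample of rooted trees on taxa A-D; only the
# non-trivial clades of each tree are listed.
AB, CD = frozenset("AB"), frozenset("CD")
BC, ABC = frozenset("BC"), frozenset("ABC")
sampled_trees = [
    frozenset({AB, CD}),
    frozenset({AB, CD}),
    frozenset({AB, CD}),
    frozenset({AB, ABC}),
    frozenset({BC, ABC}),
]
mcc_tree, posteriors = maximum_clade_credibility(sampled_trees)
```

Here the clade containing A and B occurs in four of the five sampled trees, so its posterior probability is 0.8, and the tree containing the clades AB and CD is selected as the MCC tree.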

Systematic literature review

We systematically reviewed the scientific literature on HIV-1 molecular epidemiology with the aim of exploring current definitions of HIV-1 transmission clusters. Preliminary explorative analyses showed that the majority of both the available HIV-1 sequences in Genbank (77%) and the HIV-1 transmission network studies (80%) were published after 2005. We therefore limited our review to HIV-1 specific literature published between 2005 and 2016.

A literature search of the PubMed database was undertaken on 11 April 2016 using the following search and MeSH terms: [‘2005’(PDAT): ‘2016’(PDAT)] & (hiv OR ‘human immunodeficiency virus’) & transmission & (cluster OR network) & (molecular OR phylogenetic). Previous reviews and opinions were excluded from this review as our aim was to explore cluster definitions used in primary research studies (Fig. 1). Strictly methodological, simulation and case studies were excluded for similar reasons. Non-English articles were excluded for simplicity (in total 18 articles). Two researchers (A.H. and J.E.) independently assessed the eligibility of articles from the literature search. The articles were manually screened, first by title, then by abstract, to assess relevance based on our eligibility criteria. Any discordance between the two reviewers in the list of shortlisted publications was flagged, and discussions were held until a consensus on eligibility was reached. Shortlisted articles were imported into EndNote X7 (Thomson Reuters, Philadelphia, Pennsylvania, USA) for further management, and duplicate articles were deleted. After screening, 105 articles remained for full text review [12,17,18,20,33,43–142]. The articles were stratified into three main categories based on the approach that was used to define transmission clusters (Tables 2 and 3).

Fig. 1.

Flow chart showing results from the literature search and inclusion of articles considered in the review.


We employed the PubMed search engine for the literature search strategy. Previous reviews and opinions, strictly methodological articles, simulation work, case studies and non-English papers were excluded from the final full text review.

Table 2.

Three main types of cluster definition.

Pure phylogenetic transmission cluster definitions based solely on phylogenetic node support
Pure distance-based transmission cluster definitions based solely on pairwise genetic distances
Combined transmission cluster definitions based on both phylogenetic node support and pairwise genetic distances

Table 3.

Results of the systematic review of 105 articles employing different strategies to define HIV-1 transmission clusters.

Categories of cluster definition | Number of articles | Median (a) number of sequences (IQR (b)) | Median study period in years (IQR (b)) | Median sequence length (IQR (b)) | Most analysed genetic region (proportion of articles) (c) | Tree building model used (proportion of articles) (c,d) | Substitution model used (proportion of articles) (c,e) | Branch support approach (proportion of articles) (c,d) | Median cut-off for determining clusters (IQR (b))
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Phylogenetic | 53 | 219 (96–562) | 6 (2–12) | 1100 (895–1497) | pol (88%) | ML (60%) | GTR (73%) | Bootstrap (71%) | 90% (75–90%)
Distance-based | 7 | 2747 (179–40950) | 11 (6–16) | 900 (500–1800) | pol (100%) | NA | TN (100%) | NA | 0.015 substitutions/site (0.014–0.019)
Distance-based & phylogenetic | 45 | 534 (131–1413) | 7 (2–12) | 1150 (915–1308) | pol (98%) | ML (76%) | GTR (74%) | Bootstrap (86%) | 90% (90–98%); 0.015 substitutions/site (0.015–0.038)

(a) Pairwise comparisons using the Mann–Whitney U test: phylogenetic vs. distance-based, P = 0.036; phylogenetic vs. distance-based & phylogenetic, P = 0.030; distance-based vs. distance-based & phylogenetic, P = 0.13.

(b) Interquartile range.

(c) The most commonly used methodology is presented with the proportion of articles in which this methodology was used.

(d) ML, maximum likelihood; NA, not applicable.

(e) GTR, general time reversible; TN, Tamura–Nei.

The most common approach was to use a pure phylogenetic definition (50% of the articles), followed by the combined approach (43%) and the pure distance-based approach (7%). No clear difference in study aims was found among the three approaches. Among the seven articles that relied purely on a distance-based cluster definition, the most common approach was to use the Tamura–Nei substitution model, whereas the general time reversible (GTR) model was most popular in phylogenetic-based and combined cluster definitions (Table 3) [143]. Some studies employed more than one tree building methodology. However, the most common was maximum-likelihood (used in 67% of the 98 studies that used a pure or combined phylogenetic cluster definition), followed by neighbour-joining (46%) and Bayesian tree building methodology (28%). Bootstrapping was the most commonly used statistic for branch support, with the most common cut-off being 90%. Studies defining HIV-1 transmission clusters by distance-based methodologies most often used a threshold of 0.015 substitutions/site (Table 3).

Analysis of publication year suggested that the interest in performing cluster analyses of HIV-1 sequence data increased through time, with 17 articles published between 2005 and 2010 compared with 88 articles published between 2011 and 2016. There was a tendency towards increased popularity of the combined approach during the latter period, compared with the pure phylogenetic approach that was previously more popular (Fig. 2). Analysis of future publications will be required to determine whether this is a random fluctuation or a true shift in the most popular approach. The median number of analysed sequences was greatest in articles employing pure distance-based cluster definitions (2747 sequences) and lowest in articles employing pure phylogenetic cluster definitions (219 sequences, Table 3).

Fig. 2.

Number of articles stratified by strategy of HIV-1 transmission cluster determination in relation to availability and number of analysed sequences during the study period.


The articles were reviewed for strategy of HIV-1 transmission cluster determination. The lines represent moving averages (per 2 year average) over the study period. The articles were stratified into three categories based on cluster definitions (solid lines): phylogenetic (blue), distance-based (red) and distance-based & phylogenetic (green). The average number of analysed HIV-1 sequences per study is indicated by a dashed purple line. The cumulative number of HIV-1 sequences submitted to Genbank is indicated by bars (data collected from the Los Alamos HIV Sequence Database, /HIV/mainpage.html).

The average number of analysed HIV-1 sequences per study increased from 41 to 5389 sequences between 2005 and 2015 (Fig. 2). Phylogenetic analysis can be associated with a high computational burden, in particular for large sequence datasets. It is possible that the increasing number of available HIV-1 sequences has favoured the generally faster and less parameter-rich distance-based methods in the analysis of large sequence datasets (i.e. datasets with several thousands of sequences). Most articles (98 of 105) focused on sequences from a single country, and half of those represented studies in European countries. Only 8% of the studies focused on African countries. Considering that approximately 70% of all HIV-1 infected individuals live in Africa, this highlights the need for additional studies from the African continent [144]. This is further emphasized by the fact that HIV-1 epidemics in MSM populations in Africa have been recognized only in the last 10 years [145].

All 105 articles based their analyses on sequences produced by Sanger sequencing. The most commonly analysed HIV-1 genetic region was the polymerase gene (pol), and the sequence length was generally around 1000 nucleotides (Table 3). This is likely because pol is used for routine testing of antiretroviral resistance and is the most common HIV-1 genetic sequence that is available in public databases. Although it has been argued that pol has some limitations in giving high enough phylogenetic resolution, it has been used extensively in studies of HIV-1 molecular epidemiology and has been reported to contain sufficient information for analyses of HIV-1 transmission [10,32,132].

Taken together, the literature review showed that the most common phylogenetic methodology was a maximum likelihood approach using the GTR substitution model, with transmission clusters defined by tree nodes with bootstrap support values of more than 90%.

Real-life examples

Analyses of datasets with high coverage [when a large fraction (>30%) of the total number of infected individuals in a population is represented in the dataset] and longitudinal sampling over extensive periods of time (>10 years) – sometimes comprising sequences from more than one country – can be challenging. It is therefore important to carefully consider the main study aim before determining the cluster definition [28].

To explore and exemplify the effects of different genetic distance thresholds, we performed a comparative analysis of two previously described transmission clusters with different topologies and sequence sampling durations [41]. Both clusters contain HIV-1 pol sequences that have diverged more than 4.5% (0.045 substitutions per site, Fig. 3). Consider, for example, the relatively long and statistically well supported branch that divides the Danish MSM cluster in two (highlighted by an arrow in Fig. 3a). The length of this branch could be due to: (1) no transmissions of this viral strain for a few years, or (2) unsampled transmissions of this viral strain. However, in this particular example, 57% of all newly detected HIV-1 infections in Denmark were sequenced during the study period. Hence, it is perhaps unlikely that the 41 more terminally located Danish sequences would not stem from the same epidemiological introduction as the 14 more basally placed Danish sequences. In addition, the branch ancestral to this cluster was both well supported and relatively long, further supporting this scenario [41]. Moreover, a large number of non-Danish reference sequences selected by genetic similarity (from both Genbank and a large nonpublic sequence database representing surveillance programmes in most European countries) were analysed together with the Danish sequences to maximize the chances of picking up non-Danish links in the original report of this transmission cluster [41,146].

Fig. 3.

Comparison between phylogenetic cluster definitions employing a branch support criterion with or without different distance thresholds.


Statistically supported branches are indicated with an asterisk (estimates ≥0.90, as determined by approximate likelihood ratio test with the Shimodaira–Hasegawa-like procedure) [40]. The scale bar indicates the genetic distance in substitutions per site and applies to both panels. Branches highlighted in colour represent sequences found in clusters according to the respective cluster definition (as detailed for each tree in the figure). All examples of clusters are defined by a threshold of maximum pairwise distance in substitutions per site between any sequence pair in a cluster. Clusters (a) and (b) were cut out of larger phylogenies analysed in an extensive multinational HIV-1 transmission study with an overall coverage of more than 50% of newly HIV-1 infected individuals in the studied region [41]. (a) Transmission cluster of HIV-1 subtype B infected Danish MSM. The 16.2% distance threshold is the level at which all sequences will be included in the cluster. The relatively long and statistically supported branch dividing the Danish MSM cluster in two parts, discussed in the main text, is indicated by an arrow. (b) Transmission cluster of HIV-1 CRF01_AE infected Finnish intravenous drug users.

Another example that illustrates the effects of a relatively small distance-based threshold is presented in Fig. 3b. This cluster represents a set of late presenters from a well characterized Finnish HIV-1 outbreak among intravenous drug users (IDUs) in 1998–1999 and is linked to a large Swedish IDU outbreak that occurred in 2005–2007 (as established by epidemiological data and discussed in previous publications) [41,108,147]. The longer terminal branches (i.e. higher diversity) observed in this Finnish cluster reflect the fact that these individuals were infected by HIV-1 several years before being sampled (it is likely that the majority of patients were infected during the outbreak in 1998–1999, but the sampling period of the Finnish dataset in this study only started in 2003). The Finnish cluster would be reduced to a small cluster of two sequences if the most commonly used distance-based threshold of 1.5% was employed (Table 3 and Fig. 3b). Thus, distance-based thresholds that are too small may reduce large and long-lived transmission clusters to multiple smaller subclusters. The same principle applies to the Danish MSM cluster discussed above, in which a lower genetic distance threshold would result in several smaller independent, more recent, and potentially active clusters/transmission pairs, instead of the larger cluster identified by a higher genetic distance threshold (Fig. 3a). These examples highlight the continuous evolution of HIV-1, which results in an increasing divergence from a founder strain (e.g. the MRCA of a transmission cluster).

Moreover, if the main aim of a study is to determine the number of active transmission clusters, it may be important to also consider nongenetic epidemiological information (e.g. known date of infection or Recent Infection Testing Algorithms) and the social context of the population(s) under study [148,149]. In contrast, a higher distance threshold results in only one or a few larger transmission clusters, implying long-standing and continuous HIV-1 transmission problems in this population. If the aim is to identify both long-lasting and active transmission chains, a stepwise procedure may be useful. A higher threshold could be applied in the first step to identify long-lasting clusters; in a second step, these clusters could then be stratified into active and nonactive clusters based on the existence of subclusters as defined by a lower threshold. This could be particularly useful in datasets that cover a large proportion of the infected population. An alternative sequence-based approach that might reduce the risk of including transmission clusters with important missing links is one that determines clusters using the maximum length of internal branches, instead of mean, median or maximum pairwise genetic distances.
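The stepwise procedure can be sketched as follows in Python. This is a simplified illustration under stated assumptions: clusters are formed purely from a precomputed distance dictionary by linking pairs closer than a threshold, rather than from supported clades in a phylogeny as in Fig. 3, and the thresholds of 4.5% and 1.5% are used only because they are the values discussed in the text. Sequence names and distances are hypothetical.

```python
from itertools import combinations

def linked_groups(names, dist, threshold):
    """Groups of two or more sequences linked by distances below threshold."""
    group_of = {name: {name} for name in names}
    for a, b in combinations(names, 2):
        if dist[frozenset((a, b))] < threshold and group_of[a] is not group_of[b]:
            merged = group_of[a] | group_of[b]
            for member in merged:
                group_of[member] = merged
    unique = []
    for name in names:
        if group_of[name] not in unique:
            unique.append(group_of[name])
    return [group for group in unique if len(group) > 1]

def stepwise_clusters(names, dist, high, low):
    """Step 1: long-lasting clusters (high). Step 2: subclusters (low)."""
    results = []
    for cluster in linked_groups(names, dist, high):
        subclusters = linked_groups(sorted(cluster), dist, low)
        results.append({
            "members": cluster,
            "subclusters": subclusters,
            "active": bool(subclusters),  # active if any subcluster exists
        })
    return results

# Hypothetical distances in substitutions per site.
names = ["S1", "S2", "S3", "S4", "S5"]
dist = {frozenset(pair): 0.100 for pair in combinations(names, 2)}
dist[frozenset(("S1", "S2"))] = 0.010
dist[frozenset(("S1", "S3"))] = 0.035
dist[frozenset(("S2", "S3"))] = 0.040
dist[frozenset(("S4", "S5"))] = 0.040

report = stepwise_clusters(names, dist, high=0.045, low=0.015)
```

In this toy dataset both groups are identified as long-lasting clusters at the higher threshold, but only the first contains a closely related pair (S1 and S2) and would be labelled as potentially active.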

Tools for identification of HIV-1 transmission clusters

Several tools and software packages have been developed for the identification of transmission clusters from HIV-1 sequence data [29,35,150–154]. Two popular and freely available tools are PhyloPart and Cluster Picker [29,35]. Both rely on a predetermined phylogenetic tree and allow the user to determine thresholds for either genetic distance, phylogenetic branch support or both. The main difference is that PhyloPart uses a distance threshold that is a user-specified percentile of median pairwise distances, whereas Cluster Picker employs a user-specified maximum pairwise distance threshold. In contrast to PhyloPart and Cluster Picker, the recently developed ‘Gap Procedure’ does not depend on a phylogenetic tree [150]. Instead, pairwise distances are estimated directly from the sequence alignment and sorted by size to identify relatively larger ‘gaps’ between subsets or aggregations of similar distance estimates. By this procedure, the authors argue that there is no requirement for a user-defined and potentially poorly justified a-priori threshold to identify clusters.
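The idea of looking for a gap in sorted distances, rather than fixing a threshold in advance, can be conveyed with a deliberately simplified Python sketch. It is not an implementation of the published Gap Procedure: it assumes a single global list of pairwise distances and simply places a data-driven cut-off in the middle of the widest gap between consecutive sorted values. The function name and the example distances are hypothetical.

```python
def largest_gap_threshold(distances):
    """Midpoint of the widest gap between consecutive sorted distances."""
    ordered = sorted(distances)
    if len(ordered) < 2:
        raise ValueError("at least two distances are required")
    # Each entry is (gap width, lower value, upper value).
    gaps = [(upper - lower, lower, upper)
            for lower, upper in zip(ordered, ordered[1:])]
    _, lower, upper = max(gaps)
    return (lower + upper) / 2.0

# Hypothetical pairwise distances: three closely related pairs and three
# clearly more distant pairs.
pairwise = [0.052, 0.004, 0.061, 0.009, 0.006, 0.055]
cutoff = largest_gap_threshold(pairwise)
close_pairs = [d for d in pairwise if d < cutoff]
```

With these values the widest gap lies between 0.009 and 0.052, so the three smallest distances fall below the derived cut-off without any threshold having been specified by the user.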

Jacka et al. [154] recently used sample collection dates to infer molecular clock phylogenies and then defined clusters based on lineages existing at a particular point in time (the analysis can be done with the freely available software ClusterByTime). This definition is clearly related to distance-based cluster definitions, because under a strict clock the genetic distance is linearly proportional to time. If rates of evolution vary among lineages, one could argue that time is superior to genetic distance since the same schedule of transmission events would result in clusters with significantly different levels of genetic distance. Further developments also allow for the addition of discrete and continuous traits linked to viral sequences and infections. For example, phylogeographic and Markov jump models can be used to infer the directionality and number of transitions between different traits (e.g. geographic locations or transmission groups) [6,155–157]. Finally, more complex cluster definitions based on simultaneous analysis of epidemiological information and viral sequence data have also been proposed to improve the reconstruction of accurate HIV-1 transmission networks [151,152].
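Both points can be illustrated with a short Python sketch that does not attempt to reproduce ClusterByTime. It assumes a time-scaled tree stored as nested dictionaries in which every node carries a calendar time, defines a cluster as the set of tips descending from each lineage that exists at a chosen cut time, and converts time to expected genetic distance under a strict clock. The tree, the cut times and the rate of 0.002 substitutions per site per year are hypothetical values chosen for illustration.

```python
def tips(node):
    """Names of all tips descending from a node."""
    if "children" not in node:
        return [node["name"]]
    return [name for child in node["children"] for name in tips(child)]

def clusters_at_time(node, cut_time):
    """Tips grouped by the lineage they descend from at `cut_time`."""
    # A node at or after the cut is reached by a branch crossing the cut,
    # so all its descendant tips form one cluster. Tips sampled before the
    # cut are returned as singletons in this sketch.
    if node["time"] >= cut_time or "children" not in node:
        return [tips(node)]
    return [cluster for child in node["children"]
            for cluster in clusters_at_time(child, cut_time)]

def strict_clock_distance(rate, mrca_time, tip_time_a, tip_time_b):
    """Expected distance between two tips under a strict molecular clock."""
    return rate * ((tip_time_a - mrca_time) + (tip_time_b - mrca_time))

tree = {"time": 1995, "children": [
    {"time": 2001, "children": [
        {"name": "A", "time": 2005},
        {"name": "B", "time": 2006},
    ]},
    {"time": 1998, "children": [
        {"name": "C", "time": 2004},
        {"time": 2003, "children": [
            {"name": "D", "time": 2007},
            {"name": "E", "time": 2008},
        ]},
    ]},
]}

recent = clusters_at_time(tree, 2000)   # [['A', 'B'], ['C'], ['D', 'E']]
older = clusters_at_time(tree, 1997)    # [['A', 'B'], ['C', 'D', 'E']]
expected = strict_clock_distance(0.002, 2000, 2005, 2005)
```

Moving the cut time further back merges clusters, in the same way that raising a genetic distance threshold does. Under the assumed rate, two tips sampled five years after their MRCA are expected to differ by about 0.02 substitutions per site.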

Future research directions

When we assessed studies that have analysed sequence datasets covering a relatively large proportion of the infected population at national or regional scales, it became clear that there is no common strategy to define transmission clusters [12,33,41,45,77,78,131,136,158]. The increasing number of available HIV-1 sequences will make it increasingly difficult to infer phylogenetic trees to determine transmission clusters. A future challenge will therefore be to estimate the level of sequence coverage (i.e. the fraction of the total number of infected individuals in a population) at which the current methods of determining branch support become impractical or even uninformative.

Recent developments in sequencing strategies have not only resulted in an increased number of sequences, but also in a wide variety in the quality and accuracy of viral sequences submitted to public databases. Next-generation sequencing (NGS) is superior to Sanger sequencing in detecting low-level variants, but some NGS methodologies suffer from relatively higher error rates, and one of the major challenges has been to distinguish technical and analytical errors from true viral diversity [159]. Eshleman et al. [160] analysed HIV-1 sequences from eight index-partner pairs with unlinked HIV-1 sequences (as previously determined by analysis of bulk Sanger sequences) and reported that one of the eight couples was in fact linked when the virus populations were reanalysed by NGS. This indicates that although the correspondence between Sanger and NGS sequences generally seems high, there may be occasions in which bulk Sanger sequences will not adequately represent the entire virus population within an individual. In our literature review, none of the studies used NGS sequences to study HIV-1 transmission dynamics in a geographic region or country. However, with the increasing number of NGS sequences generated in recent years, there will be a need to study both the impact of analysing Sanger versus NGS sequences on a larger population-based scale and the effects of combining both types of sequences in the same analysis of transmission clusters.

The HIV-1 evolutionary dynamics and population genetic forces differ substantially between intra-host and inter-host levels, and by transmission route. Another topic that needs further investigation is therefore how the inclusion of several sequences per patient (either longitudinally collected or multiple clonal sequences from one time-point) impacts the identification of transmission clusters in large sequence datasets [161]. Similarly, further studies are needed on the effects of mixing sequences from individuals infected through different transmission routes. An outbreak among IDUs can for example look very different compared with a transmission cluster with sequences predominantly from MSM or heterosexuals [41,162]. It has been suggested that such differences may be linked to rapid HIV-1 transmissions and lack of transmission bottlenecks in IDU outbreaks [162].

Conclusion and selecting an appropriate cluster definition

HIV-1 infected patients are connected by transmission history and HIV-1 populations accumulate genetic distance over time. Therefore, the genetic distances in a transmission cluster will depend on how long ago it was established. The most suitable definition of an HIV-1 transmission cluster will depend on the hypothesis being tested and the composition of the HIV-1 sequence dataset under study. Consequently, no single method or cut-off will suit all research purposes. However, an approach that combines a genetic distance threshold with a phylogenetic branch support criterion seems to fit most hypotheses and datasets. Moreover, and as exemplified in Fig. 3, a loosely set genetic distance threshold (e.g. larger than the commonly used thresholds of 1.5 or 4.5%) allows inclusion of clusters that span longer time periods. This seems appropriate for datasets with high sequence coverage of populations followed over long time periods if the main aim is to understand the long-standing transmission dynamics. A higher threshold will, however, increase the likelihood of including transmission clusters with missing links (i.e. unsampled sequences), and a stricter genetic distance threshold or a molecular clock analysis may be more appropriate when the aim is to determine recent and epidemiologically active transmission clusters (recently formed clusters have a higher likelihood of still being active).

Studies of viral transmission based on sequence data can provide critical information that would be difficult to obtain through traditional epidemiological methodology and will likely be an increasingly important component in population-based surveillance of infectious diseases. Further developments of accessible and flexible software will be important in future analyses of the increasing number of publicly available HIV-1 sequences and to compare results between studies.


A.H. and J.E. performed the literature review. J.E. outlined the review and wrote the article. J.A. and J.E. designed the cluster analysis. All authors read and provided substantial input to the concepts presented in the review. J.E. is supported by the Swedish Research Council (350-2012-6628; 2016-01417) and the Swedish Society for Medical Research. The Kenya Medical Research Institute/Wellcome Trust Research Programme (KWTRP) at the Centre for Geographical Medicine Research-Kilifi is supported by core funding from the Wellcome Trust (#077092). A.H. and E.S. are supported in part by the International AIDS Vaccine Initiative (IAVI), which receives generous support of the American people through the United States Agency for International Development (USAID). A.H. was also supported by funding from the African Research Excellence Fund (AREF, grant # MRF-157-0002-F-HASSA). This work was also supported through the Sub-Saharan African Network for TB/HIV Research Excellence (SANTHE), a DELTAS Africa Initiative [grant # DEL-15-006]. The DELTAS Africa Initiative is an independent funding scheme of the African Academy of Sciences (AAS)'s Alliance for Accelerating Excellence in Science in Africa (AESA) and supported by the New Partnership for Africa's Development Planning and Coordinating Agency (NEPAD Agency) with funding from the Wellcome Trust [grant # 107752/Z/15/Z] and the UK government. The views expressed in this publication are those of the author(s) and not necessarily those of USAID, or the United States Government, AAS, NEPAD Agency, Wellcome Trust or the UK government. This report was published with permission from the Kenya Medical Research Institute (KEMRI).

Conflicts of interest

There are no conflicts of interest.


