Abstract
Background
To stop tuberculosis (TB), the leading infectious cause of death globally, we need to better understand transmission risk factors. While many studies have identified associations between individual-level covariates and pathogen genetic relatedness, few have identified characteristics of transmission pairs or explored how closely covariates associated with genetic relatedness mirror those associated with transmission.
Methods
We simulated a TB-like outbreak with pathogen genetic data and estimated odds ratios (ORs) for the association between each covariate and genetic relatedness. We used a naive Bayes approach to modify the genetic links and nonlinks to resemble the true links and nonlinks more closely and estimated modified ORs with this approach. We compared these two sets of ORs with the true ORs for transmission. Finally, we applied this method to TB data in Hamburg, Germany, and Massachusetts, USA, to find pair-level covariates associated with transmission.
Results
Using simulations, we found that associations between covariates and genetic relatedness had the same relative magnitudes and directions as the true associations with transmission, but biased absolute magnitudes. Modifying the genetic links and nonlinks reduced the bias and increased the confidence interval widths, more accurately capturing error. In Hamburg and Massachusetts, pairs were more likely to be probable transmission links if they lived in closer proximity, had a shorter time between observations, or had shared ethnicity, social risk factors, drug resistance, or genotypes.
Conclusions
We developed a method to improve use of genetic relatedness as a proxy for transmission, and aid in understanding TB transmission dynamics in low-burden settings.
Keywords: whole genome sequencing, cluster analysis, infectious disease transmission, naive Bayes
INTRODUCTION
Tuberculosis (TB) is the leading infectious cause of death globally (1). Achieving the WHO's End TB goals requires a greater understanding of TB transmission dynamics, including transmission risk factors, to inform targeted interventions (2,3). Many studies have identified risk factors for recent transmission by determining recent transmission clusters using classical genotyping methods, with and without contact investigation data (4-18). Recent work, however, suggests that these genotyping methods do not provide the granularity necessary to definitively identify recent transmission events (19,20). To address this, whole genome sequencing (WGS) is becoming the preferred method to cluster cases (21-24).
Most studies examining covariates associated with transmission have identified individual-level risk factors for being part of these recent transmission clusters. While this approach is useful in identifying the types of people involved in recent transmission, these analyses do not directly identify factors associated with transmitting TB. Some studies have identified risk factors for being an infector in the cluster (25,26) or for having a source case in the cluster (27), which more directly identifies risk factors for transmission, but few studies identify pair-level risk factors for transmission (20,28).
Regardless of the method used to identify risk factors of transmission, no genetic method can definitively identify transmission between two cases; not all genetically related pairs are transmission pairs and, depending on the single nucleotide polymorphism (SNP) threshold used, some transmission pairs will not be classified as recent transmission (29,30). The relationship between genetic distance and transmission is affected by pathogen characteristics including strain diversity, presence of mixed infection, and outbreak duration. Studies that use genetic clustering as a proxy for recent transmission assume that the risk factors associated with genetic relatedness are the same as the risk factors associated with transmission. However, to the best of our knowledge, no one has explored how closely the association between covariates and close genetic relatedness mirrors the association with transmission.
Recently we developed a method that uses probable transmission events defined by WGS and/or contact investigations to estimate the transmission probability between cases using pair-level covariates (31). The contribution of each covariate to the probabilities represents another proxy for the relationship between covariates and transmission, but one that accounts for the uncertainty of transmission defined by WGS and/or contact investigations using an iterative estimation process. Here, we use simulations to determine how similar the association between covariates and close genetic relatedness is to the true association with transmission. We then explore if these estimates could be improved by the iterative naive Bayes process. Finally, we estimate the contribution of various covariates to the naive Bayes transmission probabilities in two different low-burden settings: outbreak data from Hamburg, Germany, and surveillance data from Massachusetts.
METHODS
Association of Covariates with Genetic Relatedness
Since transmission events are generally unobservable, a common approach to understanding the association between covariates and transmission (ORT in Figure 1A) is to identify factors associated with close genetic relatedness (ORG in Figure 1A), assuming that it is a good proxy for recent transmission. As Figure 1A demonstrates, we expect ORG to approximate ORT because transmission is associated with close genetic relatedness; however, transmission and genetic relatedness are not the same.
Figure 1.

Diagram of the associations with covariates. A) Diagram of the association between covariates, unobserved transmission (ORT), and observed genetic relatedness (ORG). B) Diagram of the relationship between the covariates, transmission (ORT), training links (i.e., genetic relatedness), and the naive Bayes modified transmission links (ORM), which are then used to estimate transmission probabilities.
Naive Bayes Transmission Method Odds Ratios
We aim to improve upon the estimates of ORG through an iterative estimation procedure that attempts to modify the dataset of genetic links to more closely resemble the dataset of true transmission links. Previously we developed a method using the naive Bayes classification algorithm to predict the relative probability of transmission with multiple data sources (31). Using a training set of linked pairs, we estimate the association between pair-level covariates and linkage. Then, using Bayes' rule, we estimate the probability that all ordered case-pairs are linked based on their covariates. Ideally, the training set would contain only true links. However, in practice the training set defines probable links such as pairs that are closely related genetically and/or have a confirmed contact.
The method uses an iterative estimation procedure that corrects some characteristics of the initial training set. For example, if the probable links are defined by close genetic relationships, one case could have probable links with multiple infectors. In this case, at the start of each iteration one probable link for each infectee is randomly chosen as the “true” link for that iteration, based on the assumption that each case has one infector. Once a pair is designated as a link, all other appropriately timed pairs with the same infectee as the link (whether they were originally denoted as links, nonlinks, or unknown) are also included in the training set as nonlinks. Therefore, at each iteration, the final training set is similar to the original set of probable links and nonlinks, but modified as described above to make the training set more closely resemble a set of true transmission links and nonlinks (see eAppendix Section 1 and eFigure1; Supplementary Digital Content for more details).
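The relabeling step for a single iteration can be sketched as follows. This is a minimal illustration and not the nbTransmission implementation: the function name, the dictionary representation, and the toy cases are hypothetical, and the input `pairs` is assumed to already contain only appropriately timed ordered pairs.

```python
import random
from collections import defaultdict


def build_iteration_training_set(pairs, probable_links, probable_nonlinks, rng):
    """Label ordered (infector, infectee) pairs for one iteration.

    Returns a dict mapping each training pair to True (link) or False (nonlink).
    Pairs that are neither relabeled nor probable nonlinks are left out (unknown).
    """
    # Collect the candidate infectors of every infectee that has a probable link.
    candidates = defaultdict(list)
    for infector, infectee in sorted(probable_links):
        candidates[infectee].append(infector)

    # Assumption of one infector per case: keep one probable link at random.
    chosen = {infectee: rng.choice(infectors)
              for infectee, infectors in candidates.items()}

    labels = {}
    for infector, infectee in pairs:
        if infectee in chosen:
            # Every other pair with this infectee becomes a nonlink,
            # whatever its original status was.
            labels[(infector, infectee)] = chosen[infectee] == infector
        elif (infector, infectee) in probable_nonlinks:
            labels[(infector, infectee)] = False
    return labels


# Toy example with four cases; all ordered pairs are treated as appropriately timed.
cases = "ABCD"
pairs = [(i, j) for i in cases for j in cases if i != j]
links = {("A", "C"), ("B", "C"), ("A", "B")}  # case C has two candidate infectors
nonlinks = {("D", "A")}
labels = build_iteration_training_set(pairs, links, nonlinks, random.Random(1))
print(sorted(pair for pair, is_link in labels.items() if is_link))
```

Repeating this with a fresh random draw at each iteration gives a sequence of training sets in which every infectee has at most one link.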
Figure 1B illustrates how the naive Bayes iteration process modifies the original dataset of training links resulting in a modified association between the covariates and the training variable (ORM). This association is estimated at each iteration and used to determine the contribution of each covariate to the naive Bayes transmission probabilities. To demonstrate that odds ratios are an appropriate measure of the contribution of the covariates to the probabilities, consider a simple example with two covariates Z1 and Z2. The un-scaled predicted probabilities from naive Bayes (π′ij) would be written as
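$$\pi'_{ij} = \frac{P(L=1)\,P(Z_1=z_1 \mid L=1)\,P(Z_2=z_2 \mid L=1)}{P(L=1)\,P(Z_1=z_1 \mid L=1)\,P(Z_2=z_2 \mid L=1) + P(L=0)\,P(Z_1=z_1 \mid L=0)\,P(Z_2=z_2 \mid L=0)} \tag{1}$$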
where P(L = 1) and P(L = 0) are the probabilities a pair in the training set is linked and unlinked respectively and P(Zk = zk ∣ L = 1) and P(Zk = zk ∣ L = 0) are the probabilities that Zk = zk in the training set among linked and unlinked pairs, respectively. This formula can be rewritten in terms of odds as
$$\pi'_{ij} = \frac{O_{1z_1}\,O_{2z_2}}{O_{1z_1}\,O_{2z_2} + P(L=1)/P(L=0)} \tag{2}$$
where O1z1 and O2z2 are the odds of being a training link in the iteration for a pair with Z1 = z1 and Z2 = z2 respectively (see eAppendix Section 2; Supplementary Digital Content). Therefore, the odds ratios for different levels of Z1 and Z2 are meaningful representations of the contribution of the different levels of those covariates to the probability estimates.
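The equivalence between the two forms can be checked numerically. The sketch below assumes the standard naive Bayes expression for the un-scaled probability and defines each covariate's odds as P(L = 1 | Zk = zk) / P(L = 0 | Zk = zk); the function names and the input probabilities are hypothetical and chosen only for illustration.

```python
def nb_probability_from_conditionals(p_link, cond_link, cond_nonlink):
    """Un-scaled naive Bayes probability from P(L=1) and P(Z_k=z_k | L)."""
    numerator = p_link
    nonlink_term = 1.0 - p_link
    for p_z_given_link, p_z_given_nonlink in zip(cond_link, cond_nonlink):
        numerator *= p_z_given_link
        nonlink_term *= p_z_given_nonlink
    return numerator / (numerator + nonlink_term)


def covariate_odds(p_link, p_z_given_link, p_z_given_nonlink):
    """Odds of being a link given Z_k = z_k, obtained with Bayes' rule."""
    return (p_z_given_link * p_link) / (p_z_given_nonlink * (1.0 - p_link))


def nb_probability_from_odds(p_link, odds):
    """Same probability written with the per-covariate odds O_k.

    With K covariates the expression is prod(O_k) / (prod(O_k) + O**(K - 1)),
    where O = P(L=1) / P(L=0); for K = 2 the second term is simply O.
    """
    prior_odds = p_link / (1.0 - p_link)
    product = 1.0
    for o in odds:
        product *= o
    return product / (product + prior_odds ** (len(odds) - 1))


# Hypothetical training-set quantities for two covariates Z1 and Z2.
p_link = 0.1
cond_link = [0.6, 0.5]       # P(Z_k = z_k | L = 1)
cond_nonlink = [0.2, 0.25]   # P(Z_k = z_k | L = 0)

odds = [covariate_odds(p_link, a, b) for a, b in zip(cond_link, cond_nonlink)]
direct = nb_probability_from_conditionals(p_link, cond_link, cond_nonlink)
via_odds = nb_probability_from_odds(p_link, odds)
print(direct, via_odds)
```

Because the probability depends on the covariates only through the product of their odds, a ratio of odds between two levels of a covariate describes how much switching levels moves the estimate.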
Simulation Structure
To assess how accurately ORG estimates ORT and whether ORM is a better approximation of ORT, we use a simulation method that has been previously described (31-33). Using R v3.6.0 (34) and the TransPhylo v1.2.3 package (35) we simulate 1000 TB-like outbreaks with phylogenetic trees representing multiple transmission chains. Each chain starts from one case and progresses according to a reproductive number and generation interval distribution. We use the phangorn v2.5.5 (36) package to generate genetic sequences corresponding to the phylogenetic trees. We then simulate six individual-level covariates, Xj, j = 1, …, 6 with pair-level analogs, Zj, that are associated with transmission (ORT ≠ 1). eTable1 describes the characteristics of these covariates that were used to simulate them. Though the covariates were arbitrary, we designed them to represent different structures (numbers of levels and directionality) and have different strengths of associations with transmission. We also include the time between infection dates for each case-pair with categories (in years): ≤1, 1-≤2, 2-≤3, 3-≤4, 4-≤5, and >5. See eAppendix Section 3; Supplementary Digital Content for more detail about the simulation structure.
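As an illustration of the time-between-infections covariate, the sketch below bins the gap for each ordered case-pair into the six categories listed above. The function names and the rule that the possible infector must be infected strictly before the infectee are assumptions made for this example, not details taken from the simulation code.

```python
from bisect import bisect_left
from itertools import permutations

UPPER_BOUNDS = [1, 2, 3, 4, 5]  # inclusive upper limits, in years
LABELS = ["<=1", "1-<=2", "2-<=3", "3-<=4", "4-<=5", ">5"]


def time_category(years_between):
    """Map a non-negative gap in years to its category label."""
    if years_between < 0:
        raise ValueError("gap must be non-negative")
    # bisect_left keeps a gap equal to an upper limit in the lower category.
    return LABELS[bisect_left(UPPER_BOUNDS, years_between)]


def ordered_pair_time_categories(infection_years):
    """Return {(infector, infectee): category} for pairs ordered in time."""
    categories = {}
    for (i, t_i), (j, t_j) in permutations(infection_years.items(), 2):
        if t_i < t_j:  # assumption: the possible infector is infected first
            categories[(i, j)] = time_category(t_j - t_i)
    return categories


# Hypothetical infection times (years since the start of the outbreak).
example = {"a": 0.0, "b": 0.5, "c": 3.0, "d": 7.2}
for pair, category in sorted(ordered_pair_time_categories(example).items()):
    print(pair, category)
```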
We assume we have WGS data for all cases and define pairs of close genetic relatedness as pairs that differ by fewer than two SNPs (31,37-39). In sensitivity analyses, we consider other thresholds: fewer than three, four, and five SNPs. For each outbreak we estimate ORG and ORT with 95% confidence intervals from the contingency table using standard methods. We average ORT across the 1000 simulated outbreaks giving the true association between the covariates and transmission.
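A minimal sketch of this step is shown below. It assumes that "standard methods" means the cross-product odds ratio with a Wald interval on the log scale, which is one common choice and may differ from what the authors used; the function names and the toy counts are hypothetical.

```python
import math


def two_by_two(pairs, snp_threshold=2):
    """Tally (a, b, c, d) from (snp_distance, has_covariate_level) tuples.

    a: link with the level, b: link without it,
    c: nonlink with the level, d: nonlink without it.
    A pair is a genetic link when its SNP distance is below the threshold.
    """
    a = b = c = d = 0
    for snp_distance, has_level in pairs:
        is_link = snp_distance < snp_threshold
        if is_link and has_level:
            a += 1
        elif is_link:
            b += 1
        elif has_level:
            c += 1
        else:
            d += 1
    return a, b, c, d


def odds_ratio_wald(a, b, c, d, z=1.959964):
    """Cross-product odds ratio with a Wald 95% interval on the log scale."""
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or), math.exp(log_or - z * se), math.exp(log_or + z * se)


# Hypothetical pairs: (SNP distance, pair has the covariate level of interest).
toy_pairs = ([(0, True)] * 20 + [(1, False)] * 10
             + [(5, True)] * 30 + [(20, False)] * 60)
estimate, lower, upper = odds_ratio_wald(*two_by_two(toy_pairs))
print(estimate, lower, upper)
```

Changing `snp_threshold` to 3, 4, or 5 reproduces the kind of sensitivity analysis described above on the same toy input.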
We then apply the naive Bayes transmission method to each of the 1000 outbreaks defining probable links in the initial training set as genetically related pairs (fewer than two SNPs) and probable nonlinks as pairs with more than 12 SNPs (31,37-39). At each iteration we estimate the odds ratios between each covariate level and being a training link and average the values across all iterations for the final estimates of ORM. We use Rubin's rules (40) to obtain standard errors and 95% confidence intervals for ORM. We compare our estimates of ORG and ORM to the average ORT (the truth) for each covariate level by calculating the mean absolute percentage error and mean squared error. We also compare the confidence interval width and coverage (what percentage of the time ORT is within the confidence bounds) for ORM and ORG.
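Pooling the per-iteration estimates with Rubin's rules can be sketched as follows. The example works on the log odds ratio scale and, as a simplifying assumption, uses a normal quantile for the interval rather than the t reference distribution that Rubin's rules also provide; the function name and the inputs are hypothetical.

```python
import math
from statistics import mean, variance


def pool_rubin(log_ors, variances, z=1.959964):
    """Pool per-iteration log odds ratios and their variances.

    Returns (pooled log OR, pooled standard error, (lower, upper)),
    with the interval limits on the log scale.
    """
    m = len(log_ors)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates with matching variances")
    pooled = mean(log_ors)
    within = mean(variances)          # average within-iteration variance
    between = variance(log_ors)       # sample variance across iterations
    total = within + (1 + 1 / m) * between
    se = math.sqrt(total)
    return pooled, se, (pooled - z * se, pooled + z * se)


# Hypothetical estimates from three iterations.
pooled, se, (lower, upper) = pool_rubin([1.0, 1.2, 0.8], [0.04, 0.05, 0.03])
print(math.exp(pooled), math.exp(lower), math.exp(upper))
```

The between-iteration term is what widens the interval relative to a single contingency-table estimate, since it carries the variability introduced by re-drawing the training links.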
Application to TB in Two Low-burden Settings
We identify the contribution of covariates to the TB transmission probabilities previously estimated in two low-burden settings: Hamburg, Germany, and Massachusetts, USA. The estimated ORM reflects the magnitude of association of each covariate with whether a pair is a probable transmission link. These odds ratios also illuminate which covariates contributed to the reproductive number and serial interval estimates which use these probabilities (31,33). For both studies we formulated pair-level versions of each individual-level covariate in a way that made sense clinically based on which combinations and directions of the covariate levels could be associated with transmission.
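As an illustration of how individual-level covariates might be turned into pair-level ones, the sketch below builds a nationality covariate and a directional sex covariate whose levels mirror those reported in Table 1. The function names and level strings are hypothetical and are not taken from the analysis code.

```python
def nationality_pair(country_infector, country_infectee, study_country="Germany"):
    """Combine two countries of origin into one pair-level category."""
    infector_local = country_infector == study_country
    infectee_local = country_infectee == study_country
    if infector_local and infectee_local:
        return "both from study country"
    if infector_local or infectee_local:
        return "one from study country, one foreign"
    if country_infector == country_infectee:
        return "same foreign country"
    return "different foreign countries"


def sex_pair(sex_infector, sex_infectee):
    """Directional covariate: the order of the pair is kept in the label."""
    return f"{sex_infector} to {sex_infectee}"


print(nationality_pair("Poland", "Poland"))
print(sex_pair("male", "female"))
```

The nationality covariate is symmetric in the two cases, while the sex covariate is directional, so swapping infector and infectee changes the level.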
The first setting is a TB outbreak in Hamburg and Schleswig–Holstein, Germany, analyzed in Roetzer et al. 2013 (41). The outbreak includes 86 individuals from the largest strain cluster in a long-term surveillance study conducted between 1997 and 2010. We estimate ORM to determine how the covariates (sex, age, nationality, city, smear result, HIV status, substance abuse, residence, association with a certain alcohol-consuming street scene, and time between observations) contribute to the estimation of these probabilities when the training links are defined with either WGS or contact investigations.
The second setting is TB surveillance data from Massachusetts, USA, from 2010 to 2016. The Massachusetts Department of Public Health conducts surveillance of all active TB cases statewide including demographic, clinical, and pathogen genotyping data. Here, the covariates were sex, age, country of birth, county of residence, smear result, immune-suppression status, shared drug resistance, and the time between observations. We also included GENType, a genotype classification method used by the United States' Centers for Disease Control and Prevention based on variable-number tandem repeats of mycobacterial interspersed repetitive units (MIRU-VNTR) and spoligotype (42). Again, we estimate ORM to assess the contribution of each covariate to the probabilities when the training links were defined by contact investigations. No ethical review was required for analysis of these data.
The naive Bayes transmission method, including the estimation of ORM, is implemented in the R package nbTransmission, available from https://github.com/sarahleavitt/nbTransmission. The code used to produce the simulations and our results is also available at https://github.com/sarahleavitt/nbSimulation and https://github.com/sarahleavitt/nbPaper3.
RESULTS
Simulation results
The true relationships between the simulated covariates and transmission (ORT) averaged across the 1000 simulated TB-like outbreaks are shown in Figure 2 and eTable2; Supplementary Digital Content. Covariates Z1 (ORT=1.46) and Z3 (ORT=1.31 and 0.89) were weakly associated with transmission, while Z2 (ORT=4.46), Z5 (ORT=3.42 and 3.76), and Z6 (ORT=5.14) were strongly associated with transmission. The Z4 levels (ORT=0.38 and 0.041) and the time between cases (ORT ranging from 0.026 to 0.87) were differentially associated with transmission. The log ORG estimates followed the same pattern of association as log ORT, i.e., covariate levels that had a stronger association with true transmission also had a stronger association with close genetic relatedness (Figure 2). However, the absolute magnitude of log ORG was biased towards the null.
Figure 2.

Bias of covariate association estimation. Plot of the mean estimated log odds ratio across 1000 simulated outbreaks representing the relationship between the simulated covariate and close genetic relatedness (log ORG, light grey) or naive Bayes modified close genetic relatedness (log ORM). The black dot represents the true log odds ratio (log ORT) calculated as the average of the log odds ratio for the relationship between the covariates and true transmission across the 1000 simulations.
The modified log odds ratios of close genetic relatedness (log ORM) estimated with the iterative naive Bayes algorithm also followed the same pattern as log ORT. Furthermore, the log ORM estimates across the simulations were closer to log ORT than log ORG, though still biased (Figure 2). Increasing the SNP threshold defining probable links from two to three, four, and five increased the bias of both log ORG and log ORM (see eFigure2; Supplementary Digital Content). The mean absolute percentage error and mean squared error across the simulations were consistently lower for log ORM than for log ORG. Additionally, the coverage of the estimated confidence intervals was substantially higher for log ORM than for log ORG, both because the ORM estimates were less biased and because their confidence intervals were much wider than those for ORG (eFigure3).
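The three performance measures used in this comparison can be written compactly. The sketch below assumes the usual definitions (errors taken relative to the true value, coverage as the percentage of intervals containing it); the function names and the numbers are hypothetical and are not simulation results.

```python
def mean_absolute_percentage_error(estimates, truth):
    """Average of |estimate - truth| / |truth|, expressed as a percentage."""
    return 100 * sum(abs((e - truth) / truth) for e in estimates) / len(estimates)


def mean_squared_error(estimates, truth):
    """Average squared distance between the estimates and the truth."""
    return sum((e - truth) ** 2 for e in estimates) / len(estimates)


def coverage(intervals, truth):
    """Percentage of (lower, upper) intervals that contain the truth."""
    hits = sum(1 for lower, upper in intervals if lower <= truth <= upper)
    return 100 * hits / len(intervals)


# Hypothetical log odds ratio estimates from three simulated outbreaks.
true_log_or = 2.0
estimates = [1.0, 1.5, 2.5]
intervals = [(0.5, 1.5), (1.0, 2.2), (1.9, 3.0)]
print(mean_absolute_percentage_error(estimates, true_log_or))
print(mean_squared_error(estimates, true_log_or))
print(coverage(intervals, true_log_or))
```

A wide interval around a biased estimate can still cover the truth, which is why coverage is read alongside interval width rather than on its own.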
Application results
Tables 1 and 2 detail the Hamburg and Massachusetts covariates respectively, stratifying by the training datasets (contact tracing and WGS data for Hamburg; contact tracing only for Massachusetts). The contribution of each covariate to the naive Bayes transmission probabilities (ORM) is shown in Figures 3 and 4. In both settings, the sex and ages of the cases in each pair did not add information about the odds of probable transmission (defined by contact and/or genetic links). However, sharing a nationality or country of birth that was not the study country (USA or Germany) was associated with increased odds of probable transmission. Living in the same city in Hamburg, and in the same county in Massachusetts, was associated with increased odds of probable transmission. As expected, the longer the time between cases, the lower the odds of probable transmission. No ethical review was required for the use of data in this analysis.
Table 1.
Pair-level Demographic and Clinical Characteristics for the Hamburg Outbreak Stratified by whether the Pair is Linked by Pathogen Genetics or Contact Investigations, n (%).
| Covariate Level | All Pairs (n = 3633) | Genetic Nonlinks^a (n = 385) | Genetic Links^b (n = 796) | Contact Nonlinks^c (n = 408) | Contact Links^d (n = 51) |
|---|---|---|---|---|---|
| Female to female | 120 (3) | 23 (6) | 19 (2) | 10 (3) | 0 (0) |
| Male to male | 2401 (66) | 211 (55) | 613 (77) | 278 (68) | 42 (82) |
| Male to female | 757 (21) | 90 (23) | 82 (10) | 83 (20) | 8 (16) |
| Female to male | 355 (10) | 61 (16) | 82 (10) | 37 (9) | 1 (2) |
| Different age group | 2938 (81) | 311 (81) | 639 (80) | 331 (81) | 38 (75) |
| Same age group | 695 (19) | 74 (19) | 157 (20) | 77 (19) | 13 (26) |
| One German, one from a foreign country | 1315 (36) | 185 (48) | 137 (17) | 126 (31) | 3 (6) |
| Both German | 2129 (59) | 152 (40) | 622 (78) | 276 (68) | 44 (86) |
| From different foreign countries | 170 (5) | 48 (13) | 32 (4) | 6 (2) | 0 (0) |
| From the same foreign country | 19 (1) | 0 (0) | 5 (1) | 0 (0) | 4 (8) |
| Live in different cities | 1485 (41) | 162 (42) | 308 (39) | 127 (31) | 3 (6) |
| Live in the same city | 2148 (59) | 223 (58) | 488 (61) | 281 (69) | 48 (94) |
| Infector smear− | 2339 (64) | 309 (80) | 511 (64) | 275 (67) | 32 (63) |
| Infector smear+ | 1294 (36) | 76 (20) | 285 (36) | 133 (33) | 19 (37) |
| Infector HIV− | 3463 (95) | 380 (99) | 730 (92) | 380 (93) | 49 (96) |
| Infector HIV+ | 170 (5) | 5 (1) | 66 (8) | 28 (7) | 2 (4) |
| Different substance abuse patterns | 1735 (48) | 184 (48) | 281 (35) | 194 (48) | 22 (43) |
| Neither abuse substances | 526 (15) | 85 (22) | 102 (13) | 49 (12) | 6 (12) |
| Both abuse substances | 1372 (38) | 116 (30) | 413 (52) | 165 (40) | 23 (45) |
| Different residence status | 1055 (29) | 89 (23) | 259 (33) | 79 (19) | 4 (8) |
| Both have permanent residences | 2473 (68) | 288 (75) | 501 (63) | 327 (80) | 46 (90) |
| Both are homeless | 105 (3) | 8 (2) | 36 (5) | 2 (1) | 1 (2) |
| Different affiliation with local drinking scene | 1361 (38) | 150 (39) | 209 (26) | 58 (14) | 0 (0) |
| Neither affiliated | 210 (6) | 26 (7) | 35 (4) | 0 (0) | 1 (2) |
| Both affiliated | 2062 (57) | 209 (54) | 552 (69) | 350 (86) | 50 (98) |
| <1 year between cases | 546 (15) | 18 (5) | 180 (23) | 85 (21) | 31 (61) |
| 1-2 years between cases | 485 (13) | 19 (5) | 134 (17) | 72 (18) | 6 (12) |
| 2-3 years between cases | 374 (10) | 17 (4) | 86 (11) | 42 (10) | 7 (14) |
| 3-4 years between cases | 305 (8) | 26 (7) | 53 (7) | 12 (3) | 0 (0) |
| >4 years between cases | 1923 (53) | 305 (79) | 343 (43) | 197 (48) | 7 (14) |
^a Pairs whose isolates had more than 12 single nucleotide polymorphisms (SNPs) between them.
^b Pairs whose isolates had fewer than 2 SNPs between them.
^c Pairs who were both involved in contact investigations but were not linked.
^d Pairs who were linked by contact investigation or could both be linked to a common contact.
Table 2.
Pair-level Demographic and Clinical Characteristics for Massachusetts TB Surveillance Data Stratified by whether the Pair is Linked by Contact Investigations, n (%).
| Covariate Level | All Pairs (n = 220,758) | Contact Nonlinks^a (n = 2058) | Contact Links^b (n = 26) |
|---|---|---|---|
| Female to female | 41097 (19) | 195 (10) | 2 (8) |
| Male to male | 71170 (32) | 979 (48) | 12 (46) |
| Male to female | 51067 (23) | 491 (24) | 5 (19) |
| Female to male | 56602 (26) | 393 (19) | 7 (27) |
| Different age group | 158125 (72) | 1373 (67) | 17 (65) |
| Same age group | 62633 (28) | 685 (33) | 9 (35) |
| One US born, one born in a foreign country | 55448 (25) | 663 (32) | 9 (35) |
| Both US born | 5350 (2) | 91 (4) | 5 (19) |
| Born in different foreign countries | 147618 (67) | 1167 (57) | 3 (12) |
| Born in the same foreign country | 11373 (5) | 137 (7) | 9 (35) |
| Residing in distant counties | 81919 (37) | 1011 (49) | 1 (4) |
| Residing in neighboring counties | 98082 (45) | 733 (36) | 3 (12) |
| Residing in the same county | 40524 (18) | 314 (15) | 22 (85) |
| Infector smear− | 89364 (43) | 551 (30) | 4 (17) |
| Infector smear+ | 117055 (57) | 1297 (70) | 20 (83) |
| Infector not immune suppressed | 158188 (72) | 1535 (75) | 22 (85) |
| Infector immune suppressed | 62570 (28) | 523 (25) | 4 (15) |
| Both drug susceptible | 157469 (71) | 1265 (62) | 15 (58) |
| No shared resistance | 59005 (27) | 690 (34) | 1 (4) |
| Shared resistance to 1 drug | 3607 (2) | 52 (3) | 1 (4) |
| Shared resistance to 2 drugs | 569 (0) | 31 (2) | 5 (19) |
| Shared resistance to 3+ drugs | 108 (0) | 20 (1) | 4 (15) |
| Different CDC GENType^c | 215610 (100) | 2046 (99) | 6 (23) |
| Same CDC GENType | 363 (0) | 12 (1) | 20 (77) |
| <1 year between cases | 60762 (28) | 549 (27) | 23 (89) |
| 1-2 years between cases | 47722 (22) | 503 (24) | 3 (12) |
| 2-3 years between cases | 39416 (18) | 403 (20) | 0 (0) |
| 3-4 years between cases | 30803 (14) | 288 (14) | 0 (0) |
| >4 years between cases | 42055 (19) | 315 (15) | 0 (0) |
^a Pairs of a random subset of cases who were both involved in contact investigations but were not linked.
^b Pairs who were linked by contact investigation or could both be linked to a common contact.
^c Defined as matching on spoligotype and all 24 MIRU-VNTR loci.
Figure 3.

Covariate contribution for Hamburg. Plot of the naive Bayes modified odds ratios (ORM) of the relationship between various covariates and close genetic relatedness (dark grey) and having a confirmed contact (light grey) with 95% confidence intervals for a small TB outbreak in Hamburg, Germany. These odds ratios represent the contribution of each covariate value to the transmission probabilities estimated for the outbreak where a higher odds ratio indicates a higher probability of a probable transmission link.
Figure 4.

Covariate contribution for Massachusetts. Plot of the naive Bayes modified odds ratios (ORM) of the relationship between various covariates and having a confirmed contact with 95% confidence intervals for TB surveillance data in Massachusetts from 2010 to 2016. (For this analysis we did not have genetic data, so there is no comparison with a training dataset defined by close genetic relatedness.) These odds ratios represent the contribution of each covariate value to the transmission probabilities estimated for the surveillance data, where a higher odds ratio indicates a higher probability of a probable transmission link.
In Hamburg, shared substance abuse was not associated with probable transmission, whereas pairs in which both cases were homeless, or in which both cases were either associated or not associated with a certain local drinking scene, were more likely to be probable transmission links. These associations were consistent across the two different training datasets, though there were some discrepancies in the association magnitude due to small sample sizes. In Massachusetts, shared resistance to two or more drugs and matching GENType had the strongest association. The infector’s HIV status in Hamburg and the infector’s immune-suppression status in Massachusetts were not associated with probable transmission. In Massachusetts, if the infector was smear positive, a pair was more likely to be a probable transmission link; however, this was not observed in Hamburg.
DISCUSSION
In a TB-like simulated outbreak, the association between covariates and close genetic relatedness (ORG) was consistent with the association between covariates and transmission (ORT) in the direction and the relative (but not absolute) magnitude of association. These results suggest that methods using genetic linkages as a proxy for recent transmission correctly identify transmission risk factors and their direction of association. However, the absolute magnitude of the estimated association does not equal that of the true association with transmission. The iterative process of modifying the training dataset used in the naive Bayes transmission method produces modified estimates (ORM) that are closer to the truth. However, it cannot override the fact that close genetic relatedness is not equivalent to transmission. The modified estimates also had much wider confidence intervals capturing measurement error, which ORG ignores, because genetic distance does not equate to transmission.
We aimed to explore the association between pair-level covariates and whether cases are linked by direct transmission. This differs from the commonly reported association between individual-level covariates and being in a recent transmission cluster. Cluster analysis seeks to identify individual characteristics associated with recent transmission. This can inform public health officials about types of people more likely to be part of local disease outbreaks, potentially guiding interventions. However, a pair-level analysis, as we present here, seeks to retrospectively identify which case characteristics make them more likely to be a transmission pair, but still without comment on directionality, unless the pair-level covariates are chosen to have a directional form (e.g., the possible infector is smear positive and the infectee is smear negative). Therefore, both cluster and pair-level analyses produce measures of association that are useful for transmission analysis, but which often answer different questions and inform different courses of action. Knowing what pair-level covariates are associated with transmission can help to inform better surveillance databases. If key covariates are collected along with case counts, then they can be used to identify transmission chains more accurately and in various methods to estimate transmission parameters such as the reproductive number and the serial interval.
As both measures are based upon identifying recent transmission, the effect of using genetics as a transmission proxy on the accuracy of resulting estimates is relevant to both measures. Although we found that ORM does not perfectly estimate ORT, understanding the contribution of each covariate to the naive Bayes transmission probabilities helps to understand which covariates are driving transmission probabilities. In both Hamburg and Massachusetts, pairs where both cases were from the same foreign country were more likely to be probable transmission links than pairs from different countries or both from the study country. This result apparently contradicts studies that find that native-born individuals are more likely to be part of recent transmission clusters (4,6,8,10,14,15,18,21-24,43). However, if foreign-born individuals transmit TB, they are more likely to transmit to someone of the same ethnicity or country of birth through household or close community transmission, consistent with Wyllie et al. (20), who also found through a pair-level analysis that same ethnicity was associated with genetic relatedness. There are also far more individuals in an outbreak setting from the study country, so being born in that country does not contribute much information regarding whether a pair is linked. However, if two individuals are from the same foreign country, this greatly increases the probability that they are linked.
Another illustration of the difference between cluster and pair-level analyses is drug resistance profiles. Though many studies found that drug resistance was not associated with clustering (6,9,12,17,27), in Massachusetts, a setting with low prevalence of drug resistance, we found that sharing the same resistance pattern to two or more drugs was highly informative of whether a pair was a probable transmission link. Someone who has drug resistant TB is not necessarily more likely to be part of a recent transmission cluster, but looking retrospectively, if two cases shared resistance to the same drugs, they were more likely to be a transmission link than a pair of cases with drug susceptible TB. If an infector has drug-resistant TB, the mutations causing resistance would be transmitted forward to the secondary case. Therefore, if two cases are found to both be resistant to the same set of drugs, then there is a higher probability that they are linked. This is particularly true for the settings we studied, where drug susceptible TB is much more common than drug resistant TB, and therefore shared resistance contributes more information to the probability of a link than shared susceptibility.
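The point about shared resistance carrying more information than shared susceptibility can be illustrated with the crude counts in Table 2. The sketch below computes the ratio P(Z = z | link) / P(Z = z | nonlink) from the unmodified contact-investigation training set; this is the factor by which a covariate level multiplies the odds of a link in a naive Bayes calculation. It is a back-of-the-envelope illustration, not the ORM reported in Figure 4, and the function name is hypothetical.

```python
def likelihood_ratio(level_links, total_links, level_nonlinks, total_nonlinks):
    """Ratio of how common a covariate level is among links versus nonlinks."""
    return (level_links / total_links) / (level_nonlinks / total_nonlinks)


# Counts taken from Table 2 (contact links n = 26, contact nonlinks n = 2058).
shared_two_drugs = likelihood_ratio(5, 26, 31, 2058)
both_susceptible = likelihood_ratio(15, 26, 1265, 2058)

print(f"Shared resistance to 2 drugs: {shared_two_drugs:.2f}")
print(f"Both drug susceptible:        {both_susceptible:.2f}")
```

A ratio well above 1 for shared resistance and close to 1 for shared susceptibility matches the argument in the text: the rare shared pattern shifts the odds of a link substantially, while the common one barely moves them.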
There were other covariates where the difference between pair-level and individual-level covariates is not as pronounced. In Hamburg, pairs where both cases were homeless or affiliated with a local drinking scene were more likely to be probable transmission links. These results are consistent with previous studies showing an association between recent transmission and homelessness and alcohol use (5,6,8,10,14,16,18,28). These results are more consistent because these populations have a high level of social contact thus increasing both the probability that they will be part of a recent transmission cluster and that pairs of cases who share these characteristics will be transmission pairs. These populations also experience limited access to care and poor health seeking behavior often leading to longer time before treatment and thus more time to transmit the infection.
In Massachusetts, the most influential covariate of transmission links was GENType (composed of MIRU-VNTR and spoligotype). This is not surprising, as these genotyping methods have been used classically to help identify transmission (20). We also found that in Massachusetts, pairs with infectors who had a positive smear were more likely to be probable transmission links. This is consistent with evidence that cases who have a positive smear are more infectious and likely to transmit (10,13,15,27,44). In Hamburg, the infector’s smear status was not associated with whether a pair was a transmission link.
One limitation is that Hamburg and Massachusetts represent low TB burden settings. Therefore, the results may not generalize to moderate or high burden settings. However, as we seek to end TB around the world, understanding transmission drivers in both high and low burden settings is important, especially if these drivers are different. An additional limitation is that the ORM estimates from the naive Bayes transmission method are univariable odds ratios, and therefore are not adjusted for confounding. Unlike ORG, which can be interpreted as the odds of close genetic relatedness of one covariate value compared to another, ORM is difficult to interpret. It roughly describes the association between close genetic relatedness (or confirmed contact) and the covariates, but the outcome is modified to more accurately resemble true transmission. Therefore, an interpretation of ORM should focus on the relative contribution to the transmission probabilities, i.e., how much specific covariate values increase or decrease the probability that a pair is a probable transmission link.
Another limitation is sampling bias. If the cases sampled for a study, or those that have genetic or contact investigation data, are not representative of all TB cases in that context, then this could further bias the associations between covariates and close genetic relatedness (or confirmed contact), and therefore they may not represent the relative magnitude or direction of the true association with transmission. Our simulation scheme also considered only one isolate per case, even though infection with multiple strains is possible and would change the apparent genetic relatedness of case-pairs. Furthermore, our simulation results are for TB, which is caused by a slowly mutating pathogen. It is possible that for other, more rapidly mutating pathogens, ORG might better approximate ORT.
The associations between close genetic relatedness and covariates can inform the associations between transmission and the covariates, but they are not equivalent. However, these relationships can still be used to better understand the relative associations of different covariates with transmission and to estimate highly predictive transmission probabilities with naive Bayes. Our study has shown that information from a transmission proxy such as genetic distance, though imperfect, can aid in understanding transmission dynamics of TB in low-burden settings. It is also important to consider the different implications of using individual-level or pair-level covariates. As we develop more sophisticated genetic tools such as deep sequencing for transmission analyses and are better able to link cases (30), we will continue to improve our understanding of critical risk factors of transmission, and potentially better manage or even prevent future outbreaks.
Data and Code Availability
The methods described in this paper are implemented in the R package nbTransmission, available on CRAN and GitHub (https://github.com/sarahleavitt/nbTransmission). The code used to produce the simulations and all results in this paper is also available on GitHub at https://github.com/sarahleavitt/nbSimulation and https://github.com/sarahleavitt/nbPaper3. The Hamburg dataset has been previously published by Roetzer et al. (PLoS Medicine, 2013, https://doi.org/10.1371/journal.pmed.1001387) and is available as part of the supplementary material for that paper. Details on the contact investigation data, however, are not publicly available and were obtained directly from the authors. The Massachusetts dataset is owned by the Massachusetts Department of Health and therefore cannot be shared.
Supplementary Material
Sources of Funding
This work was supported by NIHGMS R01GM122876, NIHGMS T32GM074905, and NIH K01AI102944 from the US National Institutes of Health, P30AI042853 from Providence/Boston Center for AIDS Research, U19AI111276 from Boston University/Rutgers Tuberculosis Research Unit, fellowship MFE-152448 from the Canadian Institutes of Health Research, and the U.S.-India Vaccine Action Program (VAP) Initiative on Tuberculosis.
Footnotes
Conflict of Interest
None declared.
REFERENCES
1. World Health Organization. Global Tuberculosis Report 2019. Geneva; 2019.
2. Trauer JM, Dodd PJ, Gomes MGM, et al. The importance of heterogeneity to the epidemiology of tuberculosis. Clin Infect Dis. 2019;69(1):159–66.
3. Mathema B, Andrews JR, Cohen T, et al. Drivers of Tuberculosis Transmission. J Infect Dis. 2017;216(S6):S644–53.
4. Anderson LF, Tamne S, Brown T, et al. Transmission of multidrug-resistant tuberculosis in the UK: a cross-sectional molecular and epidemiological study of clustering and contact tracing. Lancet Infect Dis. 2014;14(5):406–15.
5. Diel R, Schneider S, Meywald-Walter K, et al. Epidemiology of Tuberculosis in Hamburg, Germany: Long-Term Population-Based Analysis Applying Classical and Molecular Epidemiological Techniques. J Clin Microbiol. 2002;40(2):532–9.
6. Diel R, Niemann S, Nienhaus A. Risk of tuberculosis transmission among healthcare workers. ERJ Open Res. 2018;4(2):00161–2017. doi:10.1183/23120541.00161-2017
7. Fenner L, Gagneux S, Helbling P, et al. Mycobacterium tuberculosis Transmission in a Country with Low Tuberculosis Incidence: Role of Immigration and HIV Infection. J Clin Microbiol. 2012;50(2):388–95.
8. Fok A, Numata Y, Schulzer M, FitzGerald MJ. Risk factors for clustering of tuberculosis cases: A systematic review of population-based molecular epidemiology studies. Int J Tuberc Lung Dis. 2008;12(5):480–92.
9. Franzetti F, Codecasa L, Matteelli A, et al. Genotyping analyses of tuberculosis transmission among immigrant residents in Italy. Clin Microbiol Infect. 2010;16(8):1149–54.
10. Hamblion EL, Menach A Le, Anderson LF, et al. Recent TB transmission, clustering and predictors of large clusters in London, 2010–2012: results from first 3 years of universal MIRU-VNTR strain typing. Thorax. 2016;71:749–56.
11. Lalor MK, Anderson LF, Hamblion EL, et al. Recent household transmission of tuberculosis in England, 2010-2012: retrospective national cohort study combining epidemiological and molecular strain typing data. BMC Med. 2017;15(105). doi:10.1186/s12916-017-0864-y
12. Lim LK, Sng LH, Win W, et al. Molecular Epidemiology of Mycobacterium tuberculosis Complex in Singapore, 2006-2012. PLoS One. 2013;8(12):e84487. doi:10.1371/journal.pone.0084487
13. Moonan PK, Ghosh S, Oeltmann JE, et al. Using Genotyping and Geospatial Scanning to Estimate Recent Mycobacterium tuberculosis Transmission, United States. Emerg Infect Dis. 2012;18(3):458–65.
14. Vynnycky E, Keen AR, Evans JT, et al. Mycobacterium tuberculosis transmission in an ethnically-diverse high incidence region in England, 2007–11. BMC Infect Dis. 2019;19(1):26. doi:10.1186/s12879-018-3585-8
15. Oeltmann JE, Click E, Moonan PK. Using tuberculosis patient characteristics to predict future cases with matching genotype results. Public Health Action. 2014;4(1):47–52.
16. Rodwell TC, Kapasi AJ, Barnes RFW, Moser KS. Factors associated with genotype clustering of Mycobacterium tuberculosis isolates in an ethnically diverse region of southern California, United States. Infect Genet Evol. 2012;12(8):1917–25.
17. Nava-Aguilera E, Andersson N, Harris E, et al. Risk factors associated with recent transmission of tuberculosis: Systematic review and meta-analysis. Int J Tuberc Lung Dis. 2009;13(1):17–26.
18. Yuen CM, Kammerer JS, Marks K, Navin TR, France AM. Recent Transmission of Tuberculosis - United States, 2011-2014. PLoS One. 2016;11(4):e0153728. doi:10.1371/journal.pone.0153728
19. Meehan CJ, Moris P, Kohl TA, et al. The relationship between transmission time and clustering methods in Mycobacterium tuberculosis epidemiology. EBioMedicine. 2018;37:410–6.
20. Wyllie DH, Davidson JA, Grace Smith E, et al. A Quantitative Evaluation of MIRU-VNTR Typing Against Whole-Genome Sequencing for Identifying Mycobacterium tuberculosis Transmission: A Prospective Observational Cohort Study. EBioMedicine. 2018;34:122–30.
21. Izumi K, Murase Y, Uchimura K, et al. Transmission of tuberculosis and predictors of large clusters within three years in an urban setting in Tokyo, Japan: A population-based molecular epidemiological study. BMJ Open. 2019;9(5):e029295. doi:10.1136/bmjopen-2019-029295
22. Lapadula G, Zanini F, Codecasa L, et al. Influence of hospitalization upon diagnosis on the risk of tuberculosis clustering. Mediterr J Hematol Infect Dis. 2013;5(1):e2013071. doi:10.4084/mjhid.2013.071
23. Jiang Q, Liu Q, Ji L, et al. Citywide Transmission of Multidrug-resistant Tuberculosis Under China’s Rapid Urbanization: A Retrospective Population-based Genomic Spatial Epidemiological Study. Clin Infect Dis. 2020;71(1):142–51.
24. Stucki D, Ballif M, Egger M, et al. Standard Genotyping Overestimates Transmission of Mycobacterium tuberculosis among Immigrants in a Low-Incidence Country. J Clin Microbiol. 2016;54(7):1862–70.
25. Melsew YA, Doan TN, Gambhir M, et al. Risk factors for infectiousness of patients with tuberculosis: A systematic review and meta-analysis. Epidemiol Infect. 2018;146(3):345–53.
26. Xu Y, Cancino-Munoz I, Torres-Puente M, et al. High-resolution mapping of tuberculosis transmission: Whole genome sequencing and phylogenetic modelling of a cohort from Valencia Region, Spain. PLoS Med. 2019;16(10):e1002961. doi:10.1371/journal.pmed.1002961
27. Guerra-Assuncao JA, Crampin AC, Houben RMGJ, et al. Large-scale whole genome sequencing of M. tuberculosis provides insights into transmission in a high prevalence area. Elife. 2015;4:e05166. doi:10.7554/eLife.05166
28. Cronin WA, Golub JE, Lathan MJ, et al. Molecular Epidemiology of Tuberculosis in a Low- to Moderate-Incidence State: Are Contact Investigations Enough? Tuberc Genotyping Netw. 2002;8(11):1271–9.
29. Campbell F, Strang C, Ferguson N, Cori A, Jombart T. When are pathogen genome sequences informative of transmission events? PLoS Pathog. 2018;14(2):e1006885. doi:10.1371/journal.ppat.1006885
30. Lee RS, Proulx J-F, McIntosh F, Behr MA, Hanage WP. Previously undetected superspreading of Mycobacterium tuberculosis revealed by deep sequencing. Elife. 2020;9:e53245. doi:10.7554/eLife.53245
31. Leavitt S V, Lee RS, Sebastiani P, et al. Estimating the Relative Probability of Direct Transmission between Infectious Disease Patients. Int J Epidemiol. 2020;49(3):764–75.
32. Stimson J, Gardy J, Mathema B, et al. Beyond the SNP Threshold: Identifying Outbreak Clusters Using Inferred Transmissions. Mol Biol Evol. 2019;36(3):587–603.
33. Leavitt S V, Jenkins HE, Sebastiani P, et al. Estimation of the generation interval using pairwise relative transmission probabilities. Biostatistics. 2021;kxaa059.
34. R Core Team. R: A Language and Environment for Statistical Computing [Internet]. Vienna, Austria: R Foundation for Statistical Computing; 2019. https://www.r-project.org/
35. Didelot X, Fraser C, Gardy J, Colijn C. Genomic Infectious Disease Epidemiology in Partially Sampled and Ongoing Outbreaks. Mol Biol Evol. 2017;34(4):997–1007.
36. Schliep KP. phangorn: Phylogenetic analysis in R. Bioinformatics. 2011;27(4):592–3.
37. Walker TM, Ip CLC, Harrell RH, et al. Whole-genome sequencing to delineate Mycobacterium tuberculosis outbreaks: a retrospective observational study. Lancet Infect Dis. 2013;13:137–46.
38. Walker TM, Lalor MK, Broda A, et al. Assessment of Mycobacterium tuberculosis transmission in Oxfordshire, UK, 2007-12, with whole pathogen genome sequences: an observational study. Lancet Respiratory Med. 2014;2:285–92.
39. Lee RS, Radomski N, Proulx J, et al. Population genomics of Mycobacterium tuberculosis in the Inuit. PNAS. 2015;112(44):13609–13614.
40. Barnard J, Rubin DB. Small-Sample Degrees of Freedom with Multiple Imputation. Biometrika. 1999;86(4):948–55.
41. Roetzer A, Diel R, Kohl TA, et al. Whole Genome Sequencing versus Traditional Genotyping for Investigation of a Mycobacterium tuberculosis Outbreak: A Longitudinal Molecular Epidemiological Study. PLoS Med. 2013;10(2):e1001387. doi:10.1371/journal.pmed.1001387
42. Centers for Disease Control and Prevention. GENType: New Genotyping Terminology to Integrate 24-locus MIRU-VNTR [Internet]. Atlanta, Georgia; 2012. http://www.cdc.gov/tb/publications/factsheets/statistics/genotypingterminology.pdf
43. Stucki D, Ballif M, Bodmer T, et al. Tracking a Tuberculosis Outbreak Over 21 Years: Strain-Specific Single-Nucleotide Polymorphism Typing Combined With Targeted Whole-Genome Sequencing. J Infect Dis. 2015;211:1306–16.
44. Ribeiro FKC, Pan W, Bertolde A, et al. Genotypic and spatial analysis of mycobacterium tuberculosis transmission in a high-incidence urban setting. Clin Infect Dis. 2015;61(5):758–66.
