ABSTRACT
Most people infected with Mycobacterium tuberculosis (Mtb) are believed to be in a state of latent tuberculosis (TB) infection (LTBI). Although LTBI is asymptomatic and not infectious, there is a risk of developing active disease even decades after infection. Here, to characterize mutations acquired during LTBI, we collected and analyzed Mtb genomes from seven Japanese patient pairs, each pair consisting of two active TB patients whose starting dates of developing active disease were >3 years apart; one had a high suspicion of LTBI before developing active disease, whereas the other did not. Thereafter, we compared these genomes with those of longitudinal sample pairs within a host of chronic active TB infections combined with public data. The bacterial populations in patients with LTBI were genetically more homogeneous and accumulated single nucleotide polymorphisms (SNPs) slower than those from active disease. Moreover, the lower proportion of nonsynonymous SNPs indicated weaker selective pressures during LTBI than active disease. Finally, the different mutation spectrums indicated different mutators between LTBI and active disease. These results suggest that the likelihood of the acquisition of mutations responsible for antibiotic resistance and increased virulence was lower in the Mtb population from LTBI than active disease.
IMPORTANCE
Controlling latent tuberculosis (TB) infection (LTBI) activation is an effective strategy for TB elimination, where understanding Mycobacterium tuberculosis (Mtb) dynamics within the host plays an important role. Previous studies on chronic active disease reported that Mtb accumulated genomic mutations within the host, possibly resulting in acquired drug resistance and increased virulence. However, several reports suggest that fewer mutations accumulate during LTBI than during the active disease, but the associated risk is largely unknown. Here, we analyzed the genomic dynamics of Mtb within the host during LTBI. Our results statistically suggest that Mtb accumulates mutations during LTBI, but most mutations are under low selective pressures, which induce mutations responsible for drug resistance and virulence. Thus, we propose that LTBI acts as a source for new TB disease rather than as a period for in-host genome evolution.
KEYWORDS: latent infection, Mycobacterium tuberculosis, adaptive mutations
INTRODUCTION
Tuberculosis (TB) remains the second leading cause of death from a single infectious agent; one-quarter of the world’s population is estimated to be infected with Mycobacterium tuberculosis (Mtb) (1, 2). Most people infected with Mtb are in a state called latent TB infection (LTBI), which is asymptomatic and noninfectious, while viable Mtb remains within hosts and can multiply to develop active disease (3, 4). Approximately 10% of LTBIs are estimated to develop into active TB disease in an individual’s lifetime (5). In countries with low TB incidence, most new active TB cases are predicted to be reactivations of LTBI (6, 7). Thus, preventing LTBI activation has become a significant strategy for TB elimination in these countries.
Understanding Mtb dynamics during LTBI is the key to controlling LTBI activation. However, in-host Mtb dynamics is more frequently analyzed in chronic active TB disease than in LTBI. Several studies of chronic active disease reported that Mtb and other Mycobacteria proportionately accumulate single nucleotide polymorphisms (SNPs) with disease duration (8–10). Typically, selective pressures, such as antibiotic treatments and host immune responses, increase mutations in specific genome positions responsible for the resistance to pressures and lead to the high ratio of nonsynonymous to synonymous mutation. Mutations detected in active disease show these characteristics, indicating that some mutations are under selective pressures (8–12). Furthermore, the Mtb population during active disease becomes genetically diverse, increasing their chances to adapt to environmental changes and survive within the host (13, 14). In addition to the scenario under active disease, previous studies also demonstrated the accumulation of SNPs and other genomic mutations in the Mtb genome during LTBI. Interestingly, while similar SNP accumulation rates were observed in macaques for LTBI and active disease, the former had a slower rate than the latter in humans (7, 15–18). However, whether other characteristics of genomic mutations, including population diversification and signs of selective pressures, are unique to LTBI or similar to active disease remains unknown.
We analyzed the genomic characteristics of Mtb from seven Japanese TB patient pairs with a strong suspicion of LTBI combined with public data. Using genetic homogeneity, SNP accumulation rate, and the proportion of nonsynonymous SNPs, changes in in-host genome evolution were compared and assessed between the Mtb population from LTBI and active disease.
RESULTS
Collection of genomes of the Mtb populations from LTBI and active disease
To characterize the Mtb genome during LTBI, we collected Mtb isolates from seven pairs of TB patients in Japan. Each pair consisted of two active TB patients, whose starting date of active disease differed by >3 years. The transmission between the two or simultaneous infection from another patient was implied since both patients lived in the same household and their isolates were genetically related (Table 1; Fig. 1A). Patients developing active TB earlier and later in each pair were defined as first and second patients, and Mtb isolates from first and second patients were named first and second samples, respectively. In these pairs, the second patient was suspected of having LTBI that reactivated to cause TB disease (Fig. 1A). Two out of the seven second patients had records of receiving drug treatment to prevent LTBI activation; this information was unavailable for others. The genomes of 14 isolates from the seven pairs were analyzed by PacBio HiFi long-read (hereafter referred to as HiFi) and Illumina short-read DNA sequencing. The assembled genomes from HiFi reads included genome sizes and gene numbers close to the reference strain H37Rv, suggesting complete genome constructions (Tables S1 and S2) (19). HiFi and Illumina reads were aligned to the assembled genomes, and their abilities to detect SNPs were compared (Fig. S1). Most SNPs were shared between HiFi and Illumina data regardless of the repetitiveness levels of genomic regions (Fig. S1A). The only exception was an SNP in a mobile genetic element present in multiple copies in the Mtb genome, for which the Illumina reads were incorrectly aligned (Fig. S1B). The same SNPs were detected when Illumina reads were aligned to the assembled genomes and H37Rv (data not shown). Thus, we considered that the SNPs detected by aligned Illumina reads to H37Rv were of enough quality for further analysis. Then, we searched publicly available Illumina data of Mtb genomes collected like our seven sample pairs and obtained sequencing data from 29 additional patient pairs (Fig. 1B; Table S3) (15, 16, 18, 20). We selected 19 pairs whose dates of sample collection differ by ≥2 years since most patients develop active TB within 2 years after infection, and LTBI becomes a quiescent state after an asymptomatic state ≥2 years (18, 21). To effectively compare these pairs with our sample pairs, whose durations were 3.5–6.8 years, we excluded two more pairs; the in-pair differences in dates of developing active disease were >33 years apart (15). We added the 17 data pairs to our seven sample pairs, resulting in 24 LTBI pairs.
TABLE 1.
Collection dates of Mtb for seven household patient pairs with high suspicion of LTBI infectiona
| Patient pair | Primary sample | Secondary sample | Duration (years) | ||
|---|---|---|---|---|---|
| Name | Date of collection | Name | Date of collection | ||
| 1 | JPN-OP-1-1 | January 2010 | JPN-OP-1-2 | August 2015 | 5.6 |
| 2 | JPN-OP-2-1 | June 2010 | JPN-OP-2-2* | August 2016 | 6.2 |
| 3 | JPN-OP-3-1 | June 2012 | JPN-OP-3-2 | April 2019 | 6.8 |
| 4 | JPN-OP-4-1 | March 2012 | JPN-OP-4-2* | October 2018 | 6.6 |
| 5 | JPN-OP-5-1 | January 2012 | JPN-OP-5-2 | July 2017 | 5.5 |
| 6 | JPN-OP-6-1 | December 2008 | JPN-OP-6-2 | May 2012 | 3.5 |
| 7 | JPN-OP-7-1 | June 2010 | JPN-OP-7-2 | May 2014 | 3.9 |
The Mtb collection dates from seven TB patient pairs are shown. Duration represents differences between dates in each pair. The diagnosis was made 1–3 months before the collection. Asterisks indicate available records of drug treatment for the prevention of LTBI activation. Note that it is unknown whether the other secondary patients received preventative treatment.
Fig 1.
Analyzed Mtb samples with high suspicion of LTBI. (A) Schematic diagram of Mtb samples from patient pairs with strong LTBI suspicion and their time consideration in this study. Gray circles and red stars represent the dates of infection and the sample collection, respectively. Gray dotted lines represent the asymptomatic term after the infection and were described as 1Asymptomatic and 2Asymptomatic for first and second patients. Red lines represent symptomatic terms after developing active disease and were described as 1Symptomatic and 2Symptomatic for first and second patients. With high LTBI suspicion, 2Asymptomatic sustained, and this period is regarded as LTBI period (LTBItrue), described as blue lines, but could not be measured. Instead, differences in the sample collection dates, shown as the blue double lines, were used to infer the LTBI period (LTBIexamined). If transmission occurs within a patient pair, LTBItrue might differ from LTBIexamined by 2Symptomatic − 1Symptomatic (left panel). If transmission occurs simultaneously to patients in a pair, LTBItrue might differ from LTBIexamined by 2Symptomatic − 1Symptomatic − 1Asymptomatic (right panel). (B) Genome data sets analyzed in this study. Mtb genome data collected once from each of seven patient pairs with high LTBI suspicion and twice longitudinally from eight active disease patients, respectively, were obtained. These data were combined with publicly available Mtb genome data collected by equivalent method and filtered by the collection date differences within pairs and used for the analysis.
To characterize the Mtb genomes of the 24 LTBI pairs, we compared them with sample pairs of chronic active disease; in each pair, two Mtb samples were longitudinally collected after a period of ≥2 years (Table 2). We collected eight Mtb isolate pairs, in which date differences were 2.2–7.9 years, and obtained Illumina reads from them (Fig. 1B). These eight patients included those that were both treated and untreated with antibiotics during the treatment period (Table 2). According to the LTBI pairs, we defined Mtb samples collected earlier and later as first and second samples, respectively. We also retrieved publicly available Illumina sequencing data of 91 sample pairs of chronic active disease and excluded 31 pairs consisting of genetically unrelated strains (Fig. 1B; Table S3) (22–26). Since the pairs whose collection date differences were <2 years were filtered, SNP accumulation rates between the group including and excluding these pairs were compared, but no significant differences were observed (Fig. S2). Therefore, the 60 data pairs of active diseases collected ≥2 years apart were combined with our eight sample pairs and used for further analysis (Fig. 1B).
TABLE 2.
Collection dates of Mtb and treatment history of the eight patients with chronic infectiona
| Patient | Date of collection | Duration (years) | Drug treatment | |
|---|---|---|---|---|
| Primary sample | Secondary sample | |||
| CHR1 | December 2002 | March 2008 | 5.3 | Partial |
| CHR2 | March 2002 | May 2004 | 2.2 | Yes |
| CHR3 | October 2002 | July 2010 | 7.7 | Partial |
| CHR4 | June 2003 | February 2008 | 4.7 | Partial |
| CHR5 | January 2009 | January 2015 | 6 | No |
| CHR6 | July 2009 | June 2012 | 2.4 | No |
| CHR7 | April 2013 | November 2015 | 2.6 | No |
| CHR8 | July 2008 | July 2016 | 7.9 | Partial |
Mtb collection dates from the eight patients with chronic TB and drug treatment histories. The duration represents the differences between collection dates. Drug treatment during the interval is grouped into three categories: yes, no, and partial, which indicates continuous drug treatment, no drug treatment, and a combination of periods with and without drug treatment, respectively.
Each sample of LTBI pairs was collected from different patients, while that of active disease pairs was collected from the same patient. Thus, in LTBI pairs, individuals in the Mtb population of first patients can have minor mutations, which can be transmitted to second patients (Fig. S3A). In such cases, the mutations are not detected, or low allele frequencies (AFs) in the first sample are detected, but high AFs in the second sample are detected, leading to the wrong interpretation that the Mtb population accumulated the mutations within the second patients by in-host evolution. To evaluate this situation, two genomes from each pair were compared and AFs were calculated. AFs of SNPs in the first samples were compared between pairs if these were ≤0.2 in the first samples and ≥0.9 in the second samples (Fig. S3B and C). AFs of SNPs in the first sample were scattered with the peak at zero in LTBI, while most of these were zero in active disease. SNPs in each of the first LTBI samples tended to have similar AFs (Fig. S3D). These results implied that transmissions of minor SNPs occurred between two samples of several LTBI pairs. To examine in-host evolutions and remove transmissions, SNPs within each pair of LTBI and active disease were used for further analysis, if AFs in the first sample are ≤0.01.
Mtb populations from LTBI are genetically more homogeneous than those from active disease
To characterize the genetic diversity of Mtb populations during LTBI, the SNP ratios detected with high and low AF thresholds were compared with the active disease (Fig. 2A). The high and low AF thresholds were set at ≥0.9 and ≥0.2, respectively, and referred to detected SNPs as fixed SNPs (fSNPs) and all SNPs (aSNPs). aSNPs also include fSNPs, as shown in thresholds. AF ≥ 0.2 was chosen as an AF > 0.19 suffices for maintaining SNPs in the Mtb population, and all sequence data have enough depth to detect AF ≥ 0.2 (Table S4) (9). The Mtb population during LTBI had a higher proportion of fSNPs to aSNPs than the Mtb population during active disease (Fig. 2A). To assess the diversity, we further defined the SNP diversity index, an index to examine the ratio of fSNPs and aSNPs (see Materials and Methods), and these significantly differed in LTBI and active disease (Fig. 2B). Both the ratios and SNP diversity index correlated with the number of aSNPs but not with that of fSNPs (Fig. 2C and D). These results indicated that Mtb populations from LTBI had fewer aSNPs compared to fSNPs than those from active disease and were, therefore, genetically homogeneous.
Fig 2.
Homogeneity of the Mtb population in LTBI (A and B). (A) Ratio of fSNPs to aSNPs and (B) SNP diversity index in the Mtb population from active disease and LTBI. P-values of the Wilcoxon rank sum test between active disease and LTBI are shown at the top of the graph. Blue and red points represent data from active and LTBI, respectively. (C and D) Correlation of (C) the ratio of fSNPs to aSNPs and (D) the SNP diversity index with fSNPs and aSNPs in the Mtb population from active disease and LTBI. The size of the points represents the number of corresponding samples. The Spearman’s rank correlation coefficient is shown on the top right. Asterisks indicate significant correlation at P-values of <0.01 (**) and 0.001 (***), while N.S. indicates not significant. Samples from active disease (n = 34) and LTBI (n = 15), which produced ≥1 fSNP(s), were used for analysis.
The seven LTBI pairs and eight active disease pairs were treated according to the same processes, including sputum collection, culturing in media, DNA extraction, and sequencing. To examine the possibility of changes in the AFs in these processes, the AF distribution of the 15 samples was plotted (Fig. S4). All isolates from the active disease pairs had low AFs compared to those from the LTBI, as indicated in Fig. 2.
Mtb genomes during LTBI accumulate aSNPs slower than those during active disease
Due to fewer aSNPs in LTBI, we hypothesized that Mtb accumulates aSNPs slower than active disease. Therefore, SNP accumulation rates per year were estimated and compared between the Mtb population from LTBI and active disease (Fig. 3). Actual LTBI periods were unknown since the infection dates of the second patient were unclear (Fig. 1A). Instead, collection date differences of two samples in each pair were applied, which facilitated actual LTBI periods on the assumption that the duration of active disease before the sample collection (diagnostic delay) was similar between the two patients. In active disease, collection date differences were actual durations between two samples. A statistically significant linear regression of accumulation rates for aSNPs and fSNPs was found during active disease (Fig. 3A and B). Conversely, during LTBI, the SNP accumulation rates were not significantly estimated, as reported in a previous study (Fig. 3C and D) (18). Comparing the estimated slopes, it was observed that the Mtb population accumulated aSNPs slower during LTBI than during the active disease, while no difference was detected in fSNPs (Fig. 3E and F).
Fig 3.
Comparison of SNP accumulation rates between the Mtb population from LTBI and active disease (A–D). Estimated accumulation rates of (A and C) aSNPs and (B and D) fSNPs in the Mtb population from (A and B) active disease or (C and D) LTBI. Dotted lines represent the estimated slopes. α and R2 indicate the estimated slope and Pearson’s correlation coefficient of linear regression, respectively. Asterisks indicate that coefficients are significantly different from zero at P-values of <0.01 (**) and 0.001 (***), while N.S. indicates no significant estimation. (E and F) Comparing the accumulation rate of (E) aSNPs and (F) fSNPs between the Mtb population from active disease (blue) and LTBI (red). Effects of the two infection types on the estimated slopes were tested by two-way ANOVA, and the P-values are represented on the top left. Dotted lines represent the estimated slopes with colors corresponding to points. Samples from active disease (n = 68) and LTBI (n = 24) were used for the analysis.
Synonymous SNPs accumulated during LTBI correlate with latent periods
Compared to the estimated line, several pairs of LTBI have more SNPs consisting of more nonsynonymous SNPs than others (Fig. 4A). Deviations of observed fSNP numbers from estimated fSNP numbers were correlated with nonsynonymous fSNPs but not with synonymous fSNPs (Fig. 4B and C). While accumulation rates of nonsynonymous SNPs fluctuate depending on the selective pressures, synonymous SNPs are relatively neutral; hence, only synonymous SNPs were proposed to be used to estimate a mutation clock (27). Then, the accumulation rate of synonymous SNPs was estimated in the Mtb population from LTBI and active disease (Fig. 4D through G). Coefficients in linear regressions of accumulation rates of synonymous SNPs significantly differed from zero in LTBI (Fig. 4D and E). No significant difference was seen in SNP accumulation rates for synonymous aSNPs and fSNPs between LTBI and active disease (Fig. 4F and G).
Fig 4.
Estimated accumulation rates of synonymous SNPs in the Mtb population from LTBI. (A) Ratio of nonsynonymous (red) and other (white) fSNPs in the Mtb population from LTBI. The ratio is presented as pie charts and plotted with the estimated accumulation rate of fSNPs (black line). (B and C) Correlation between the deviation of observed fSNP numbers from the estimated fSNP numbers and observed number of (B) synonymous or (C) nonsynonymous fSNPs in the Mtb population from LTBI. The Spearman’s rank correlation coefficient is shown on the top right. Asterisks and N.S. indicate that the coefficient was significantly different from zero at a P-value of <0.001 (***) and not significantly different, respectively. (D and E) Estimated accumulation rates of synonymous (D) aSNPs and (E) fSNPs in the Mtb population from LTBI. Dotted lines represent the estimated slopes. α and R2 indicate the estimated slope and Pearson’s correlation coefficient of linear regression, respectively. Asterisks indicate that coefficients are significantly different from zero at P-values of <0.05 (*) and 0.01 (**). (F and G) Comparison of accumulation rate of synonymous (E) aSNPs and (F) fSNPs in the Mtb population from active disease (blue) and LTBI (red). Dotted lines represent the estimated slopes. P-values for the effects of two infection types on the estimated slopes by two-way ANOVA were represented. Samples from active disease (n = 68) and LTBI (n = 24) were used for the analysis.
Mtb populations during LTBI contain fewer nonsynonymous SNPs than those during active disease
There are two selective pressures for genetically homogeneous populations (Fig. S5A and B). Under positive selections, selective pressures remove most of the population but keep individuals having mutations resistant to the pressures. Under purifying selections, emerging mutations are removed since most of the mutations are disadvantageous. To evaluate possible causes of homogeneity of the Mtb population from LTBI, selective pressures were assessed using nonsynonymous SNPs. Although the dN/dS is a widely used index calculated based on synonymous and nonsynonymous SNPs, it can be inefficient in this study since it is used to compare SNPs within independently evolved populations, and many sample pairs differ in only one or a few SNPs. Therefore, we instead used the percentage of nonsynonymous SNPs among genic SNPs and nonsynonymous index (see Materials and Methods) to assess the selective pressure (Fig. 5). Mtb populations from LTBI had fewer nonsynonymous SNPs than those from active disease for both aSNPs and fSNPs (Fig. 5A through D), indicating different selective pressures between them.
Fig 5.
Proportion of nonsynonymous SNPs in the Mtb population from LTBI (A–D). (A and B) Ratio of nonsynonymous to genic SNPs and (C and D) nonsynonymous index of (A and C) aSNPs or (B and D) fSNPs in the Mtb population from active disease (blue) and LTBI (red). Samples from active disease (n = 34) and LTBI (n = 14), which produced ≥1 genic fSNP(s), were used for analysis. (E and F) Codon adaptation index (CAI) caused by synonymous (E) aSNPs or (F) fSNPs in the Mtb population from active disease (blue) and LTBI (red). The total numbers of synonymous aSNPs were 118 and 15 in samples from active disease (n = 29) and LTBI (n = 10), respectively. The total numbers of synonymous fSNPs were 35 and 13 in samples from active disease (n = 22) and LTBI (n = 10), respectively. P-values of the Wilcoxon rank sum test between active disease and LTBI were represented at the top of the graphs.
Patients with active disease should be treated with appropriate drugs according to guidelines (28); these treatments possibly induce and select nonsynonymous SNPs responsible for antibiotic resistance. Next, SNPs responsible for the resistance were compared according to the WHO mutation catalog (Table S5) (29). The number of SNPs matched to the reference mutation catalog and SNPs inside the genes found in the mutation catalog did not significantly differ between LTBI and active disease.
Recent studies indicated that synonymous SNPs contribute to fitness in the host environment; thus, synonymous, but not nonsynonymous, SNPs can be responsible for fitness during LTBI (30). Next, we examined the differences in the effects of synonymous SNPs on fitness using the codon adaptation index (CAI, Fig. 5E and F). CAI during LTBI decreased more than active disease regarding aSNPs, while these were not significant about fSNPs, indicating less contribution of synonymous aSNPs to fitness during LTBI.
Mtb genomes during LTBI and active disease differ in the mutation spectrum
Differences in the accumulation rate of aSNPs and the proportion of nonsynonymous SNPs raised the possibility that mutators differ between the Mtb population from LTBI and active disease. To examine this possibility, we compared the mutation spectrums affected by mutators’ identities (31) (Fig. 6). The mutation spectrum consisting of all detected SNPs significantly differed regarding aSNPs but not fSNPs (Fig. 6A and B). Some sample pairs have more SNPs than others, possibly skewing the spectrum if pairs did not weigh SNPs. To eliminate bias from spectrum comparisons, we compared the ratio of transition mutations (Ti) in each isolate (Fig. 6C and D). As before, the proportion of Ti differed significantly concerning aSNPs.
Fig 6.
Mutation spectrum of SNPs in the Mtb population from LTBI (A and B). Mutation spectrum of (A) aSNPs and (B) fSNPs in the Mtb population from active disease and LTBI. P-values of the chi-square test are shown at the top of the graph, and N.S. represents non-significant differences (P > 0.05). (C and D) Transition (Ti) ratio of (C) aSNPs and (D) fSNPs in the Mtb population from active disease and LTBI. P-values of the Wilcoxon rank sum test between active disease (blue) and LTBI (red) were represented on top of graphs, and N.S. represents nonsignificant differences. Mtb populations from active disease (n = 68) and LTBI (n = 24) were used for the analysis.
Mtb genomes during LTBI accumulated structural variants and displayed an altered variable number of tandem repeats genotype
Structural variants between seven LTBI sample pairs were detected using the complete genomes assembled by HiFi reads (Table S6). Four structural variants were detected in two of the seven pairs. The relation of structural variant numbers to SNP numbers and infection duration could not be analyzed because of the small number of variants detected. Three of the four structural variants were found in loci used in variable number of tandem repeats (VNTR) analysis, which detects genetic relationships from the repeat sequence number across multiple loci (Table S6) (32, 33). Among the multiple loci, several are considered hypervariables that change more often than others. All three variants were located in loci classified as hypervariables, supporting the high occurrence of loci alterations. Another variant was found within the pe-pgrs28 gene locus, which has repetitive sequences and, therefore, is considered prone to recombination. All structural variants detected from LTBI isolates occurred in recombination-prone regions.
DISCUSSION
In bacteria, the frequency of mutations is positively correlated with the likelihood of acquiring drug resistance. Although Mtb also accumulates mutations during LTBI, the homogeneity (Fig. 2), slower mutation rate (Fig. 3 and 4), and lower proportion of nonsynonymous mutations (Fig. 5) suggested the existence of weaker selective pressures and a lower risk of acquiring mutations related to drug resistance and virulence during LTBI. Thus, we propose that LTBI acts as a source of new TB disease rather than as a period for in-host genome evolution.
There are three possible explanations regarding the Mtb population from LTBI pairs, which were more homogeneous than that from active disease but had a similar number of fSNPs. First, Mtb populations become diverse after the infection, and then selective pressures emerge and remove Mtb excluding individuals having adaptive mutations (Fig. S5C). Second, an individual in an initially small population accumulates mutations and grows with purifying selection (Fig. S5D). Third, individuals having mutations grow to occupy most of the population by random stochastic processes called genetic drift. We consider that the contribution of genetic drift is low since random stochastic processes should be similar between LTBI and active disease. We speculate that during LTBI, initial small Mtb populations accumulate mutations, as weak selective pressures are expected by a low nonsynonymous SNP ratio (Fig. 5). This correlates with the expected Mtb dynamics of TB activation after LTBI, wherein Mtb proliferates and accumulates mutations, but the host immune system sterilizes the Mtb, thereby keeping the population small. After LTBI activation, a few individuals within the small Mtb population proliferate, which includes the mutations accumulated during LTBI.
No significant difference was observed between fSNP accumulation rates in LTBI and active disease (Fig. 3F), but it might not be derived from the same mechanisms since the difference in selective pressures was indicated. Moreover, while a lower proliferation rate resulted in genetic homogeneity, different rates of aSNP accumulation between LTBI and active disease cannot be simply explained by the proliferation rate and should also be affected by selective pressures (Fig. 3E). One possible selective pressure is antibiotic treatment. Patients with chronic active disease should be treated with drugs according to the guidelines (28), while the 24 LTBI patient pairs in this study were a mix of untreated, treated to prevent LTBI activation, or no information about the treatments. Although the number of mutations responsible for the resistance during the period was not increased in active disease (Table S5), drug treatment could act as selective pressures to promote unknown resistance mutations and other nonsynonymous SNPs. This scenario can be assessed by comparing the Mtb population from LTBI with preventative drug treatment and active disease. In this study, two out of the seven second patients had received drug treatments to prevent LTBI activation after the first patients had developed active disease (Table 1), and for the others, the treatment histories were unknown. Although the two received drug treatment, neither had high mutation rates or high nonsynonymous mutations (data not shown), and additional samples are required to evaluate this. The existence of different selective pressures during LTBI and active disease was also indicated by the mutation spectrum (Fig. 6). GC > AT is putatively induced by oxidative stress and is the most common mutation in Mtb regardless of LTBI and active disease (9, 17). This was also observed in the study except for aSNPs during active disease, where AT > GC and CG > GC were increased (Fig. 6A and B).
One possible biological interpretation of the different AFs between the Mtb from LTBI and active disease may be determined from the lesion types (Fig. 2). A previous study reported that bacterial compositions differ between individual lesions within a host, regardless of LTBI or active disease (13, 14, 17). Mtb isolates collected from respiratory samples, such as sputum, are a mixture of bacilli produced from multiple lesions. They also reported a smaller number of lesions during LTBI (17). Another interpretation of our data is that the Mtb in each lesion acquires homogeneous mutations regardless of LTBI or active disease, and subsequently, bacilli from fewer lesions are expelled in the sputum of LTBI than active disease. The accumulation rates of all detected SNPs were not estimated by linear regression from LTBI (Fig. 3C and D). The results are consistent with previous reports in which the mutation rates could not be estimated, and different mutation rates between pairs less and more than 2 years apart were considered to be a major cause of the estimation fault (18). However, we could not estimate rates although the pairs whose collection dates differed ≥2 years were selected. Instead, synonymous SNPs could estimate the SNP accumulation rate from LTBI (Fig. 4D and E), as from active disease. Since synonymous SNPs are less affected by selective pressures than nonsynonymous mutations, they are more neutral and used to estimate molecular clocks (31). Indeed, samples with large deviations in the number of SNPs from the estimated line tend to have a high nonsynonymous SNP ratio (Fig. 4A through C), indicating nonsynonymous SNP fluctuated between samples, and thus, selective pressures differ between patients during LTBI. No significant difference was found in the estimated accumulation rates of synonymous SNPs between LTBI and active disease (Fig. 4F and G), while that of all aSNPs was slower in LTBI and active disease. It was reported that the mutation rate of all detected SNPs during LTBI is slower than that during active disease in humans, whereas it was equivalent in macaques (7, 15–18). Although our results from synonymous SNPs and all aSNPs comprised reports in macaques and humans, respectively, the sample size is statically small to reach any conclusion.
Regarding the limitations of this study, Mtb isolates were collected from sputum after developing active TB (Fig. 1A). Thus, isolates with high suspicion of LTBI might have experienced active disease and were not truly latent. To mitigate this problem, we applied a comparison between sample pairs; both samples in each pair experienced active disease, but the difference between them is expected to reflect what happened during the LTBI duration since the duration for developing disease is expected to be similar. We compared the difference in each pair between the Mtb population from LTBI and active disease, which enabled us to characterize mutations occurring during LTBI. However, it should be stressed that we reported genomic traits of LTBI compared with active disease, not absolute values; some mutations were accumulated during the terms of active disease after LTBI.
Another limitation of this study is that genomes were analyzed from the Mtb population grown in in vitro culture, where the composition of the Mtb population can alter. However, in active disease, most of the first samples were homogeneous, while the second samples were heterogeneous (Fig. 3B). Moreover, 15 sample pairs from LTBI and active disease grown under the same conditions represented the tendency of homogeneous and diverse populations, respectively (Fig. S4). These results exclude the possibility that the difference in homogeneity between the Mtb population from LTBI and active disease occurred in the culture process.
Finally, it must be emphasized that the seven sample pairs were “highly suspected of LTBI.” The two patients lived in the same household, their dates of developing active disease differed by more than 3 years, and their isolates differed by ≤5 fSNPs, thus evoking a high suspicion of LTBI (Table 1; Fig. 1). However, the possibility that they were independently infected with a related strain cannot be denied. In the latter case, our results should be interpreted quite differently.
Collecting and identifying Mtb isolates during or after LTBI are challenging because they require tracking for more than a few years after infection. In this study, seven patient pairs highly suspected of having LTBI were analyzed. Although persistent efforts to collect isolates and perform genotyping for >20 years enabled the collection of multiple sample pairs, a larger sample size would be preferable for statistical analysis. Molecular epidemiology using whole-genome sequencing has accelerated the detection of the genetic relatedness of isolates, and it will also accelerate the detection of numerous isolate pairs with high LTBI suspicion indicated by pairs with epidemiological contact, differences in developing dates of TB, and genetic relation by whole-genome sequencing. Isolates from sample pairs 2 and 5 were genetically related in whole-genome sequencing, but with a conventional genotyping method, VNTR, some loci diverged (Table S6). Similar TB cases where two isolates that were genetically related in whole-genome sequencing but differed in another genotyping method using mobile genetic elements were reported (7). Future large-scale analyses of Mtb isolates newly identified as LTBI by whole-genome sequencing could bring new insights.
MATERIALS AND METHODS
Growth conditions and DNA extraction of Mtb
Overall, 30 clinical isolates of Mtb were collected from sputum in Japan, out of which 14 were classified into seven pairs containing two isolates from a given household. The isolates in each pair were collected >3 years apart (Table 1). Since each pair had <12 SNPs, they were genetically related and were regarded as most likely LTBI infections. The remaining 16 isolates were grouped into eight pairs containing two isolates collected from the same patient with chronic active disease at intervals of ≥2 years. The Mtb isolates were cultured on 1% Ogawa medium, and grown isolates were stored at −30°C or −80°C. Mtb stocks were inoculated into Middlebrook 7H10 agar (Difco, Becton-Dickinson and Co., USA) supplemented with 10% Middlebrook OADC (oleic acid, albumin, dextrose, and catalase; Difco, Becton-Dickinson and Co., USA) and 0.5% (vol/vol) glycerol (Fujifilm Wako Pure Chem Co., Japan) and incubated at 37°C for 2 weeks. DNA was extracted from bacteria clumps in one volume of inoculation loops (4 mm) using the conventional phenol–chloroform method after bead beating (0.2 mm glass beads; Vortex Mixer GENIE2 with Microtube Attachment at max speed for 3 and 7 min for HiFi and Illumina sequencing analysis, respectively) (34). The quality and molecular weight of the genomic DNA samples were assessed by pulsed-field gel electrophoresis if these were analyzed by HiFi DNA sequencing.
Whole-genome sequencing
To obtain HiFi reads, high molecular weight DNA samples >50 kb were sent to Macrogen Japan Co., Japan. HiFi SMRTbell libraries were prepared, and PacBio CCS subreads were obtained on the PacBio Sequel II System under a complete long reads mode (Pacific Biosciences of California Inc., USA) according to the standard protocol. To obtain Illumina reads, in our laboratory, TruSeq libraries were prepared and sequenced on the MiSeq II System (Illumina Inc., USA) with 300 bp paired ends according to the standard protocol.
De novo genome assembly and variant calling from HiFi reads
The raw PacBio CCS subreads were converted into HiFi reads by ccs command v6.4.0 from the pbccs package of pbbioconda (Pacific Biosciences of California, Inc., USA) with default parameters. HiFi reads were assembled by flye v2.8.3 (35) to obtain the genomes. Genes in the assembled genomes were predicted by prokka v1.14.6 and compared by panaroo v1.2.8 (36, 37).
HiFi reads were aligned to the assembled genomes by minimap2 v2.24 (38). SNPs and structural variants were called by Deepvariant v1.1.0 and Sniffles v1.0.12, respectively, and filtered by the column FILTER “PASS” (39, 40). Repetitive regions were defined as those whose annotation by prokka included “PE family,” “IS,” “phage,” or “transposase.”
SNP calling from Illumina reads
Sequencing data from Mtb sample pairs collected as our 15 sample pairs were obtained from NCBI Sequence Read Archive (SRA) (Table S3). Since the SNP accumulation rate was previously estimated as 0.56 SNPs/year (9), sample pairs with the number of fSNPs ≤ 0.56 × 5 and aSNPs ≤ 0.56 ×25 multiplied by duration years were regarded as the same strain-derived pairs. Sample pairs obtained from SRA are listed in Table S3 (n = 29 and 179 for LTBI and active disease, respectively), and the same strain-derived pairs of collection date difference ≥2 years were analyzed (n = 17 and 60 for LTBI and active disease, respectively). The Illumina reads from our samples and public data were aligned to the reference H37Rv genome (GenBank accession number: GCA_000195955.2) by BWA mem v0.7.17 (41). SNPs were called for each pair using bcftools mpileup tools and filtered by mapping quality >30 and depths of high-quality bases >20 in the FORMAT column and QUAL column >25 (42). AFs against H37Rv were calculated from allelic read depths of high-quality bases and converted into AFs against the major allele of first sample; if AFs against H37Rv were 0.80 in the first sample and 0.15 in the second sample, they were replaced by 0.20 and 0.85, respectively. aSNPs and fSNPs were defined as SNPs with AFs ≤ 0.01 in first sample and AFs ≥ 0.2 and ≥0.9 in second samples, respectively.
Indexes to specify characteristics of SNPs
To assess population diversity, two indexes from fSNPs and aSNPs were calculated for each pair whose number of fSNPs ≥ 1. First, the ratio of fSNPs to aSNPs was calculated by . It ranged from 0 to 1, decreased when the population was diverse, and was not affected by the number of SNPs. Second, the SNP diversity index was calculated by . It ranged >0, increased when the population was diverse, and increased when the total number of SNPs increased.
To assess the ratio of nonsynonymous SNPs, two indexes from synonymous and nonsynonymous SNPs were also calculated for each pair whose number of genic fSNPs ≥ 1. The ratio of nonsynonymous to genic SNPs was calculated by . It ranged from 0 to 1, increased when the ratio of nonsynonymous SNPs increased, and was not affected by the number of SNPs. The nonsynonymous index was calculated by . It ranged from negative to positive and increased when the ratio and number of nonsynonymous SNPs increased.
The CAI was calculated for each synonymous SNP using the corresponding gene as a reference using Python v3.8.3 with CAI package v1.0.3 (43). For the collected LTBI samples, the assembled genomes were used as a reference for the analysis.
Visualization and statistical analysis
All graphs were visualized using R v4.1.2 with ggplot2 packages. Wilcoxon rank sum tests, Spearman’s rank correlation tests, and chi-square tests were also conducted in R v4.1.2. SNP accumulation rates were estimated by linear regression using the lm function in R, and the difference of estimated slopes was tested by two-way ANOVA using the ANOVA function of the car package.
ACKNOWLEDGMENTS
We thank Professor Tetsuya Hayashi (Graduate School of Medical Sciences, Kyushu University, Japan) for insightful comments and helpful discussions.
This work was partly supported by funding from the Japan Agency for Medical Research and Development (AMED; JP21fk0108090).
The funders had no role in study design, data collection, interpretation, or the decision to submit the work for publication.
A.O. contributed to conceptualization, data curation, formal analysis, investigation, project administration, methodology, visualization, writing of original draft, and review and editing. A.T. contributed to conceptualization, investigation, resources, and review and editing. T.Y. contributed to investigation, resources, and review and editing. T.I. contributed to conceptualization and review and editing. S.M. contributed to conceptualization, supervision, and review and editing. Y.M. contributed to conceptualization, funding acquisition, project administration, supervision, and review and editing. The authors declare no conflict of interest.
Contributor Information
Asami Osugi, Email: osugi@jata.or.jp.
Yoshiro Murase, Email: ymurase@jata.or.jp.
Florence Claude Doucet-Populaire, University Paris-Saclay, Clamart, France.
DATA AVAILABILITY
Sequence data supporting the findings of this study have been deposited in the NCBI SRA database with accession numbers PRJNA1019874 and PRJNA1087460.
SUPPLEMENTAL MATERIAL
The following material is available online at https://doi.org/10.1128/spectrum.04289-23.
Fig. S1 to S5.
Tables S1 to S6.
ASM does not own the copyrights to Supplemental Material that may be linked to, or accessed through, an article. The authors have granted ASM a non-exclusive, world-wide license to publish the Supplemental Material files. Please contact the corresponding author directly for reuse.
REFERENCES
- 1. Houben R, Dodd PJ. 2016. The global burden of latent tuberculosis infection: a re-estimation using mathematical modelling. PLoS Med 13:e1002152. doi: 10.1371/journal.pmed.1002152 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2. Bagcchi S. 2023. WHO’s global tuberculosis report 2022. Lancet Microbe 4:e20. doi: 10.1016/S2666-5247(22)00359-7 [DOI] [PubMed] [Google Scholar]
- 3. World Health Organization . 2015. Guidelines on the management of latent tuberculosis infection. https://www.who.int/publications/i/item/9789241548908. [PubMed]
- 4. World Health Organization . 2018. Latent tuberculosis infection: updated and consolidated guidelines for programmatic management. https://www.who.int/publications/i/item/9789241550239. [PubMed]
- 5. Getahun H, Matteelli A, Chaisson RE, Raviglione M. 2015. Latent Mycobacterium tuberculosis infection. N Engl J Med 372:2127–2135. doi: 10.1056/NEJMra1405427 [DOI] [PubMed] [Google Scholar]
- 6. Shea KM, Kammerer JS, Winston CA, Navin TR, Horsburgh CR. 2014. Estimated rate of reactivation of latent tuberculosis infection in the United States, overall and by population subgroup. Am J Epidemiol 179:216–225. doi: 10.1093/aje/kwt246 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Comín J, Cebollada A, Samper S, Aragonese Working Group on Molecular Epidemiology of Tuberculosis (EPIMOLA) . 2022. Estimation of the mutation rate of Mycobacterium tuberculosis in cases with recurrent tuberculosis using whole genome sequencing. Sci Rep 12:16728. doi: 10.1038/s41598-022-21144-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Bryant JM, Brown KP, Burbaud S, Everall I, Belardinelli JM, Rodriguez-Rincon D, Grogono DM, Peterson CM, Verma D, Evans IE, Ruis C, Weimann A, Arora D, Malhotra S, Bannerman B, Passemar C, Templeton K, MacGregor G, Jiwa K, Fisher AJ, Blundell TL, Ordway DJ, Jackson M, Parkhill J, Floto RA. 2021. Stepwise pathogenic evolution of Mycobacterium abscessus. Science 372:eabb8699. doi: 10.1126/science.abb8699 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9. Vargas R, Freschi L, Marin M, Epperson LE, Smith M, Oussenko I, Durbin D, Strong M, Salfinger M, Farhat MR. 2021. In-host population dynamics of Mycobacterium tuberculosis complex during active disease. Elife 10:e61805. doi: 10.7554/eLife.61805 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10. Ley SD, de Vos M, Van Rie A, Warren RM. 2019. Deciphering within-host microevolution of Mycobacterium tuberculosis through whole-genome sequencing: the phenotypic impact and way forward. Microbiol Mol Biol Rev 83:e00062-18. doi: 10.1128/MMBR.00062-18 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11. Chiner-Oms Á, López MG, Moreno-Molina M, Furió V, Comas I. 2022. Gene evolutionary trajectories in Mycobacterium tuberculosis reveal temporal signs of selection. Proc Natl Acad Sci U S A 119:e2113600119. doi: 10.1073/pnas.2113600119 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Coll F, Phelan J, Hill-Cawthorne GA, Nair MB, Mallard K, Ali S, Abdallah AM, Alghamdi S, Alsomali M, Ahmed AO, et al. 2018. Genome-wide analysis of multi- and extensively drug-resistant Mycobacterium tuberculosis. Nat Genet 50:307–316. doi: 10.1038/s41588-017-0029-0 [DOI] [PubMed] [Google Scholar]
- 13. Trauner A, Liu Q, Via LE, Liu X, Ruan X, Liang L, Shi H, Chen Y, Wang Z, Liang R, Zhang W, Wei W, Gao J, Sun G, Brites D, England K, Zhang G, Gagneux S, Barry CE, Gao Q. 2017. The within-host population dynamics of Mycobacterium tuberculosis vary with treatment efficacy. Genome Biol 18:71. doi: 10.1186/s13059-017-1196-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14. Lieberman TD, Wilson D, Misra R, Xiong LL, Moodley P, Cohen T, Kishony R. 2016. Genomic diversity in autopsy samples reveals within-host dissemination of HIV-associated Mycobacterium tuberculosis. Nat Med 22:1470–1474. doi: 10.1038/nm.4205 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15. Lillebaek T, Norman A, Rasmussen EM, Marvig RL, Folkvardsen DB, Andersen ÅB, Jelsbak L. 2016. Substantial molecular evolution and mutation rates in prolonged latent Mycobacterium tuberculosis infection in humans. Int J Med Microbiol 306:580–585. doi: 10.1016/j.ijmm.2016.05.017 [DOI] [PubMed] [Google Scholar]
- 16. Colangeli R, Arcus VL, Cursons RT, Ruthe A, Karalus N, Coley K, Manning SD, Kim S, Marchiano E, Alland D. 2014. Whole genome sequencing of Mycobacterium tuberculosis reveals slow growth and low mutation rates during latent infections in humans. PLoS One 9:e91024. doi: 10.1371/journal.pone.0091024 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17. Ford CB, Lin PL, Chase MR, Shah RR, Iartchouk O, Galagan J, Mohaideen N, Ioerger TR, Sacchettini JC, Lipsitch M, Flynn JL, Fortune SM. 2011. Use of whole genome sequencing to estimate the mutation rate of Mycobacterium tuberculosis during latent infection. Nat Genet 43:482–486. doi: 10.1038/ng.811 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18. Colangeli R, Gupta A, Vinhas SA, Chippada Venkata UD, Kim S, Grady C, Jones-López EC, Soteropoulos P, Palaci M, Marques-Rodrigues P, Salgame P, Ellner JJ, Dietze R, Alland D. 2020. Mycobacterium tuberculosis progresses through two phases of latent infection in humans. Nat Commun 11:4870. doi: 10.1038/s41467-020-18699-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19. Chitale P, Lemenze AD, Fogarty EC, Shah A, Grady C, Odom-Mabey AR, Johnson WE, Yang JH, Eren AM, Brosch R, Kumar P, Alland D. 2022. A comprehensive update to the Mycobacterium tuberculosis H37Rv reference genome. Nat Commun 13:7068. doi: 10.1038/s41467-022-34853-x [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20. Hakamata M, Takihara H, Iwamoto T, Tamaru A, Hashimoto A, Tanaka T, Kaboso SA, Gebretsadik G, Ilinov A, Yokoyama A, Ozeki Y, Nishiyama A, Tateishi Y, Moro H, Kikuchi T, Okuda S, Matsumoto S. 2020. Higher genome mutation rates of Beijing lineage of Mycobacterium tuberculosis during human infection. Sci Rep 10:17997. doi: 10.1038/s41598-020-75028-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21. Behr MA, Edelstein PH, Ramakrishnan L. 2018. Revisiting the timetable of tuberculosis. BMJ 362:k2738. doi: 10.1136/bmj.k2738 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22. Bryant JM, Harris SR, Parkhill J, Dawson R, Diacon AH, van Helden P, Pym A, Mahayiddin AA, Chuchottaworn C, Sanne IM, Louw C, Boeree MJ, Hoelscher M, McHugh TD, Bateson ALC, Hunt RD, Mwaigwisya S, Wright L, Gillespie SH, Bentley SD. 2013. Whole-genome sequencing to establish relapse or re-infection with Mycobacterium tuberculosis: a retrospective observational study. Lancet Respir Med 1:786–792. doi: 10.1016/S2213-2600(13)70231-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23. Casali N, Broda A, Harris SR, Parkhill J, Brown T, Drobniewski F. 2016. Whole genome sequence analysis of a large isoniazid-resistant tuberculosis outbreak in London: a retrospective observational study. PLoS Med 13:e1002137. doi: 10.1371/journal.pmed.1002137 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24. Guerra-Assunção JA, Houben RMGJ, Crampin AC, Mzembe T, Mallard K, Coll F, Khan P, Banda L, Chiwaya A, Pereira RPA, McNerney R, Harris D, Parkhill J, Clark TG, Glynn JR. 2015. Recurrence due to relapse or reinfection with Mycobacterium tuberculosis: a whole-genome sequencing approach in a large, population-based cohort with a high HIV infection prevalence and active follow-up. J Infect Dis 211:1154–1163. doi: 10.1093/infdis/jiu574 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25. Walker TM, Ip CLC, Harrell RH, Evans JT, Kapatai G, Dedicoat MJ, Eyre DW, Wilson DJ, Hawkey PM, Crook DW, Parkhill J, Harris D, Walker AS, Bowden R, Monk P, Smith EG, Peto TEA. 2013. Whole-genome sequencing to delineate Mycobacterium tuberculosis outbreaks: a retrospective observational study. Lancet Infect Dis 13:137–146. doi: 10.1016/S1473-3099(12)70277-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26. Witney AA, Bateson ALE, Jindani A, Phillips PPJ, Coleman D, Stoker NG, Butcher PD, McHugh TD, RIFAQUIN Study Team . 2017. Use of whole-genome sequencing to distinguish relapse from reinfection in a completed tuberculosis clinical trial. BMC Med 15:71. doi: 10.1186/s12916-017-0834-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27. Bromham L, Penny D. 2003. The modern molecular clock. Nat Rev Genet 4:216–224. doi: 10.1038/nrg1020 [DOI] [PubMed] [Google Scholar]
- 28. World Health Organization . 2022. WHO consolidated guidelines on tuberculosis: module 4: treatment: drug-susceptible tuberculosis treatment. Geneva: World Health Organization. https://www.who.int/publications/i/item/9789240048126. [PubMed] [Google Scholar]
- 29. World Health Organization . 2021. Catalogue of mutations in Mycobacterium tuberculosis complex and their association with drug resistance. Geneva: World Health Organization. https://www.who.int/publications/i/item/9789240028173. [Google Scholar]
- 30. Shen X, Song S, Li C, Zhang J. 2022. Synonymous mutations in representative yeast genes are mostly strongly non-neutral. Nature 606:725–731. doi: 10.1038/s41586-022-04823-w [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31. McGrath M, Gey van Pittius NC, van Helden PD, Warren RM, Warner DF. 2014. Mutation rate and the emergence of drug resistance in Mycobacterium tuberculosis. J Antimicrob Chemother 69:292–302. doi: 10.1093/jac/dkt364 [DOI] [PubMed] [Google Scholar]
- 32. Allix-Béguec C, Wahl C, Hanekom M, Nikolayevskyy V, Drobniewski F, Maeda S, Campos-Herrero I, Mokrousov I, Niemann S, Kontsevaya I, Rastogi N, Samper S, Sng LH, Warren RM, Supply P. 2014. Proposal of a consensus set of hypervariable mycobacterial interspersed repetitive-unit-variable-number tandem-repeat loci for subtyping of Mycobacterium tuberculosis Beijing isolates. J Clin Microbiol 52:164–172. doi: 10.1128/JCM.02519-13 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33. Iwamoto T, Yoshida S, Suzuki K, Tomita M, Fujiyama R, Tanaka N, Kawakami Y, Ito M. 2007. Hypervariable loci that enhance the discriminatory ability of newly proposed 15-loci and 24-loci variable-number tandem repeat typing method on Mycobacterium tuberculosis strains predominated by the Beijing family. FEMS Microbiol Lett 270:67–74. doi: 10.1111/j.1574-6968.2007.00658.x [DOI] [PubMed] [Google Scholar]
- 34. Russell J, Green M, Sambrook J. 2001. Extraction of bacteriophage λ DNA from large-scale cultures using proteinase K and SDS. In Molecular cloning: a laboratory manual. Cold Spring Harbor Laboratory Press, New York. [Google Scholar]
- 35. Kolmogorov M, Yuan J, Lin Y, Pevzner PA. 2019. Assembly of long, error-prone reads using repeat graphs. Nat Biotechnol 37:540–546. doi: 10.1038/s41587-019-0072-8 [DOI] [PubMed] [Google Scholar]
- 36. Seemann T. 2014. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30:2068–2069. doi: 10.1093/bioinformatics/btu153 [DOI] [PubMed] [Google Scholar]
- 37. Tonkin-Hill G, MacAlasdair N, Ruis C, Weimann A, Horesh G, Lees JA, Gladstone RA, Lo S, Beaudoin C, Floto RA, Frost SDW, Corander J, Bentley SD, Parkhill J. 2020. Producing polished prokaryotic pangenomes with the panaroo pipeline. Genome Biol 21:180. doi: 10.1186/s13059-020-02090-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38. Li H. 2021. New strategies to improve minimap2 alignment accuracy. Bioinformatics 37:4572–4574. doi: 10.1093/bioinformatics/btab705 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39. Poplin R, Chang PC, Alexander D, Schwartz S, Colthurst T, Ku A, Newburger D, Dijamco J, Nguyen N, Afshar PT, Gross SS, Dorfman L, McLean CY, DePristo MA. 2018. A universal SNP and small-indel variant caller using deep neural networks. Nat Biotechnol 36:983–987. doi: 10.1038/nbt.4235 [DOI] [PubMed] [Google Scholar]
- 40. Sedlazeck FJ, Rescheneder P, Smolka M, Fang H, Nattestad M, von Haeseler A, Schatz MC. 2018. Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods 15:461–468. doi: 10.1038/s41592-018-0001-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41. Li H, Durbin R. 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25:1754–1760. doi: 10.1093/bioinformatics/btp324 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42. Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, Li H. 2021. Twelve years of SAMtools and BCFtools. Gigascience 10:giab008. doi: 10.1093/gigascience/giab008 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43. Lee BD. 2018. Python implementation of codon adaptation index. JOSS 3:905. doi: 10.21105/joss.00905 [DOI] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Fig. S1 to S5.
Tables S1 to S6.
Data Availability Statement
Sequence data supporting the findings of this study have been deposited in the NCBI SRA database with accession numbers PRJNA1019874 and PRJNA1087460.






