Significance
This manuscript is a systematic evaluation of phase variation and its adaptive capacity in M. tuberculosis (Mtb) and includes large-scale computational work across a globally and genetically diverse sample of 31,428 sequenced Mtb isolates, and experimental validation of putative adaptive phase variants. We focus on phase variation in the homopolymer tracts (HTs) of Mtb, in which site-specific mispairing can lead to rapid changes in both regulatory and coding regions. We estimate the in vitro frameshift rate in a neutral HT at 100× the neutral substitution rate and find that phase variation, previously underappreciated in Mtb, plays a substantial role in the emergence of INDELs (insertions and deletions), with 12.4% of INDELs occurring in HTs even though HTs make up just 0.02% of the genome.
Keywords: genomics, tuberculosis, phase variation, microbiology
Abstract
Phase variation induced by insertions and deletions (INDELs) in genomic homopolymeric tracts (HT) can silence and regulate genes in pathogenic bacteria, but this process is not characterized in MTBC (Mycobacterium tuberculosis complex) adaptation. We leverage 31,428 diverse clinical isolates to identify genomic regions including phase-variants under positive selection. Of 87,651 INDEL events that emerge repeatedly across the phylogeny, 12.4% are phase-variants within HTs (0.02% of the genome by length). We estimated the in-vitro frameshift rate in a neutral HT at 100× the neutral substitution rate at frameshifts/HT/year. Using neutral evolution simulations, we identified 4,098 substitutions and 45 phase-variants to be putatively adaptive to MTBC (P < 0.002). We experimentally confirm that a putatively adaptive phase-variant alters the expression of espA, a critical mediator of ESX-1-dependent virulence. Our evidence supports the hypothesis that phase variation in the ESX-1 system of MTBC can act as a toggle between antigenicity and survival in the host.
Tuberculosis (TB), caused by pathogens of the Mycobacterium tuberculosis complex (MTBC), is a major public health threat causing an estimated 10 million new cases of disease per year (1). Human TB is primarily caused by seven major phylogenetic lineages (L1 to L7): most belong to M. tuberculosis sensu stricto, while the two more distant human-adapted lineages, L5 and L6, are also known as Mycobacterium africanum (2). More recently, studies have revealed two new lineages: L8 in Uganda and Rwanda (3) and L9 in East Africa (4).
MTBC genomes show no evidence for recombination or horizontal gene transfer. Genomic diversity, including more ancient divergence from the MTBC ancestor and between lineage members, is instead driven predominantly by DNA damage and replication error resulting in chromosomal point mutations. A different mechanism with 100- to 1,000-fold faster kinetics is the development of insertions and deletions in short sequence repeats (SSRs) of 1 to 7 bp through mispairing (5). This slipped-strand mispairing (SSM) occurs when repeats on the mother and daughter strands misalign during DNA synthesis, resulting in an increase or decrease in the number of repeat units in the newly synthesized strand (5). These changes can result in frameshifts or alteration of a transcriptional regulatory region, leading to phase-variable expression of a protein. Repeats of a single nucleotide, or homopolymeric tracts (HTs), are the simplest form of SSR. SSM within these regions was recently observed in the MTBC, resulting in antibiotic tolerance or resistance (6–8).
Of the variants generated by mutation or SSM, the vast majority do not reach appreciable population allele frequencies. The allele frequency spectrum in MTBC supports a high proportion of low-frequency variants, especially singletons, consistent with background and/or purifying selection on average across the genome (2, 9). In specific regions, variants may arise more than once in parallel (i.e., among bacterial strains that do not share an immediate common ancestor). This is rare under neutral theory or purifying selection but can be observed due to population demographic shifts or due to positive selection (10, 11). Parallel evolution has been commonly observed in antibiotic resistance genes and specifically variants that allow the organism to withstand antibiotic killing (2, 11, 12). More recently, parallel evolution has been observed in connection with enhanced virulence and transmission (10, 13–15).
Here, we leverage a sample of 31,428 geographically diverse clinical isolates that have undergone whole-genome sequencing (WGS), belonging to M. tuberculosis sensu stricto (N = 31,234) and M. africanum (N = 194), and are representative of the genetic diversity found within the MTBC causative of human TB. These isolates represent more than 30,000 natural evolution experiments of MTBC infecting humans and transmitting to the next host. Using data on these isolates, we infer the number of times each genetic variant has evolved in a parallel fashion within and outside of HTs in the MTBC genome. With simulations, we determine which variants are likely under positive selection. Using precise genome engineering, we functionally validate HT variants measured to be under positive selection that occur in a regulatory region of the MTBC virulence factor espA, a gene essential for type VII ESX-1-mediated secretion. The results support that MTBC continues to tune virulence phenotypes as it evolves genetically in host.
Results
Genetic Diversity in 31,428 MTBC Clinical Isolates.
We curated and processed 33,873 publicly available genomes (16). For quality control, we excluded 1,663 isolates with inadequate sequencing data at of variable sites curated across the full dataset (SI Appendix, Figs. S1A and S2 and Materials and Methods). We excluded an additional 290 isolates because they could not be typed into an MTBC major lineage based on an SNV (single-nucleotide variant) barcode (most commonly because of missing calls at lineage-defining sites, Materials and Methods and SI Appendix, Fig. S2) (17); excluded 35 isolates because they belonged to L7, which was otherwise not well represented; and excluded 457 isolates because they were typed into L4 but not an L4 sublineage, the latter being needed for computational efficiency of the phylogeny estimation (SI Appendix, Fig. S2). In the remaining 31,428 isolates, we detected 836,901 SNVs occurring at 782,565 genomic sites across the 4.4-Mb MTBC genome (17.7%) (SI Appendix, Figs. S1A and S2 and Materials and Methods). Of the 782,565 SNV sites, 422,891 (54.04%) were singletons, i.e., only a single isolate harbored a minor allele at that site. Additionally, we detected 47,425 INDELs (insertions and deletions) with 27,937 (58.9%) being singletons (SI Appendix, Fig. S1A and Materials and Methods).
For computational efficiency, the 31,428 isolates’ phylogeny was constructed separately for L1, L2, L3, L4 (split into three subgroups L4A, B, C), L5, and L6 (SI Appendix, Figs. S1A, S2, and S3 and Materials and Methods) (18). We built a multiple sequence alignment of SNV sites and used maximum-likelihood phylogenetic estimation. The phylogenies represented well the global M. tuberculosis sensu stricto diversity: spanning 2,815 isolates from L1, 8,090 L2, 3,398 L3, 5,839 L4A, 6,958 L4B, 4,134 L4C; M. africanum was represented by 98 L5 and 96 L6 isolates. The SNV barcode misclassified only 14/31,428 isolates compared with the full phylogenetic reconstruction (Materials and Methods). Given the size of the phylogeny that challenged visualization, we computed t-Distributed Stochastic Neighbor Embeddings (t-SNE) of the matrix of pairwise SNP distances (SI Appendix, Fig. S4A and Materials and Methods). We visualized isolates in this t-SNE embedding space labeling isolates by lineage and confirmed good separation between sublineages especially at short scales (SI Appendix, Figs. S3 and S4 B–I). Within-lineage diversity was congruent with expected diversity, including highest diversity within L1, L4, and L6 and lowest diversity within L2 (SI Appendix, Fig. S4J) (19).
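As an illustration of the input to such an embedding, the following minimal sketch computes a matrix of pairwise SNP distances from a toy alignment of SNV sites. The function name, the isolate names, and the rule of skipping positions with missing calls are assumptions made for this example and are not taken from the study's pipeline.

```python
from itertools import combinations

MISSING = "N-"  # assumed symbols for missing calls in this sketch


def pairwise_snp_distances(alignment):
    """Return (names, matrix) of pairwise SNP (Hamming) distances.

    `alignment` maps an isolate name to its alleles at the SNV sites.
    Sites where either isolate has a missing call are skipped.
    """
    names = sorted(alignment)
    n = len(names)
    dist = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        a, b = alignment[names[i]], alignment[names[j]]
        d = sum(
            1
            for x, y in zip(a, b)
            if x != y and x not in MISSING and y not in MISSING
        )
        dist[i][j] = dist[j][i] = d
    return names, dist


# Hypothetical three-isolate alignment of six SNV sites.
toy_alignment = {
    "isolate_A": "ACGTAC",
    "isolate_B": "ACGTTC",
    "isolate_C": "TCGTTN",
}
names, dist = pairwise_snp_distances(toy_alignment)
```

A symmetric matrix of this kind can be passed to any t-SNE implementation that accepts precomputed distances.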
Parallel Evolution.
Using maximum likelihood ancestral reconstruction, we computed the number of parallel/repeated arisals of minor allele SNV mutations (homoplasy score or Hs) across the eight phylogenies (SI Appendix, Figs. S3 and S5 A and B and Materials and Methods). As ancestral reconstruction methods cannot infer INDEL events simultaneously with SNVs, we developed an alternative method (TopDis) to assess separately for INDEL parallel evolution (16). TopDis relies on observing monophyletic groups harboring the derived allele that are separated in the tree by isolates harboring the reference state (SI Appendix, Fig. S5C and Materials and Methods). We confirmed the accuracy of the TopDis approach by computing TopDis Hs for SNVs and showing they are equal to Hs computed using ancestral reconstruction for most variants (SI Appendix, Figs. S1 B and C and S6).
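The counting idea behind a topology-based homoplasy score can be sketched as follows. This is not the TopDis implementation; it is a simplified reading of the description above, in which each maximal clade whose tips all carry the derived allele is counted as one independent arisal. The nested-tuple tree encoding and all names are assumptions made for this example.

```python
def count_derived_clades(tree, derived):
    """Return (all_derived, n_groups) for a tree given as nested tuples.

    all_derived: True if every tip below this node carries the derived allele.
    n_groups:    number of maximal all-derived clades below this node.
    Tips are strings; `derived` is the set of tips carrying the derived allele.
    """
    if isinstance(tree, str):  # a tip
        is_derived = tree in derived
        return is_derived, (1 if is_derived else 0)
    results = [count_derived_clades(child, derived) for child in tree]
    if all(flag for flag, _ in results):
        # Every child clade is fully derived, so they merge into one group.
        return True, 1
    # Otherwise the derived groups below are separated by reference-state tips.
    return False, sum(n for _, n in results)


# Hypothetical seven-tip tree; s1, s2, s5, s6, and s7 carry the derived allele.
toy_tree = ((("s1", "s2"), "s3"), (("s4", "s5"), ("s6", "s7")))
derived_isolates = {"s1", "s2", "s5", "s6", "s7"}
_, toy_hs = count_derived_clades(toy_tree, derived_isolates)
```

In this toy tree the derived allele forms three separated groups: (s1, s2), s5 alone, and (s6, s7).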
Putatively Adaptive SNVs.
The distribution of Hs for SNVs was strongly right skewed; 102 SNVs were acquired >100 times (Materials and Methods, Fig. 1A, SI Appendix, Table S1, and Dataset S1) (12, 16). Population bottlenecks can increase the rate of parallel evolution observable in a phylogeny, but estimates of effective population size for M. tuberculosis over similar time and geographic scales, which have been modeled with constant and exponential growth priors, did not identify evidence for population contraction (20). M. tuberculosis molecular clock rate estimates have also been robust to assumptions of constant vs. exponential population growth under a coalescent model (21). Here, to simulate the expected rate of parallel mutation acquisition under neutral evolution, we ran simulations using a range of estimated molecular clock rates for M. tuberculosis assuming a constant population size (SI Appendix, Materials and Methods) (21). We estimated SNVs to arise with Hs with probability <0.002 under these assumptions. In our data, Hs was observed for 4,980 (0.49%) of SNV sites (Fig. 1). Of the subset of 1,525/4,980 with a minor allele frequency >0.1% (Fig. 1 C and D), 470 (30.8%) were coding synonymous, 738 (48.4%) were coding nonsynonymous, 308 (20.2%) were intergenic, and 9 (0.59%) were in noncoding RNA regions. Sites in genomic regions associated with antibiotic resistance represented 13 of the top 30 sites by Hs (>222) (Fig. 1D, SI Appendix, Table S1, and Dataset S1).
Fig. 1.
Parallel evolution of SNVs and INDELs. (A) The distribution of homoplasy scores for 834,981 SNVs and 46,306 INDELs. 0.49% of SNVs have a homoplasy score and 3.01% of INDELs have a homoplasy score . (B) Proportion of INDELs with for varying values of x, split into sets according to whether the INDEL occurs within an HT, SSR, or other region of the genome. (C and D) Homoplasy score (Hs) for 1,525 SNVs and 655 INDELs with homoplasy score and minor (SNVs)/alternate (INDELs) allele frequency among 31,428 isolates, plotted against position on the genome. Bubble size corresponds to Hs. (C) INDELs broken down by whether they occur within an HT, SSR, or other region of the genome. HTs with a cumulative Hs score (across INDELs occurring within HT) are indicated by blue bars. (D) Variants colored in green occur within genes that have been associated with antibiotic resistance.
Homopolymer Tracts Demonstrate a High Concentration of INDELs.
Because INDELs can be generated by SSM or other mutational processes depending on the genetic sequence context, we divided the 46,306 observed INDELs into the following groups: 1) INDELs in HT regions (n = 330 variants in 121 unique HTs), 2) INDELs in more complex SSR of a pattern of 2 to 6 base-pairs (bp, n = 2,077 variants in 17,689 unique SSRs), and 3) INDELs in other regions of the genome (n = 43,899) (Fig. 1B, SI Appendix, Fig. S7, and Materials and Methods). We confirmed the quality of variant calls in repetitive HT and SSR regions by simulating frameshifts in 33 different M. tuberculosis complete genome assemblies, simulating Illumina reads from these modified assemblies, and then mapping the reads back to H37Rv for variant calling (SI Appendix, Materials and Methods and Fig. S10). Using our read mapping pipeline and variant calling criteria, we were able to correctly recall variants in all assemblies for 46/54 HT/SSR regions, in >90% of assemblies for 51/54 HT/SSR regions, and in ≥75% of assemblies for all 54 HT/SSR regions (Dataset S9). Further, we called no false positives across these simulations. In HTs, the INDEL acquisition rate across the phylogeny normalized by aggregate region length was 9,339.7/kbp, compared to 61.8/kbp in other SSR regions and 16.9/kbp elsewhere on the genome ( across three tests for difference between Poisson rates). For comparison, the SNV acquisition rate across the genome was 242.6/kbp aggregated from 834,981 SNVs detected genome-wide. Further, 75.2% of the INDELs in HT regions were homoplastic at a score Hs > 1 compared to 25.9% and 10.3% of INDELs called in all SSR regions and non-HT-SSR, respectively (Fig. 1 B and C and SI Appendix, Fig. S7).
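The length-normalized rates and the comparison between Poisson rates can be illustrated with a short sketch. The text does not state which test for a difference between Poisson rates was used, so the exact conditional (binomial) test below is an assumption, and the event counts and region lengths are hypothetical toy values rather than the study's data.

```python
from math import comb


def rate_per_kbp(n_events, region_length_bp):
    """Events per kilobase of aggregate region length."""
    return n_events / (region_length_bp / 1000)


def poisson_rate_pvalue(x1, t1, x2, t2):
    """One-sided exact conditional test that rate 1 exceeds rate 2.

    Under equal rates, x1 given (x1 + x2) follows
    Binomial(x1 + x2, t1 / (t1 + t2)), where t1 and t2 are the exposures
    (here, aggregate region lengths).
    """
    n = x1 + x2
    p = t1 / (t1 + t2)
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x1, n + 1))


# Hypothetical toy values: 12 events in 1 kbp versus 20 events in 10 kbp.
toy_rate_a = rate_per_kbp(12, 1_000)
toy_rate_b = rate_per_kbp(20, 10_000)
toy_p = poisson_rate_pvalue(12, 1_000, 20, 10_000)
```

Normalizing by aggregate length is what makes the comparison meaningful here, because HTs cover a far smaller share of the genome than the other two categories.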
Putatively Adaptive INDELs.
Of the 46,306 total INDELs observed, the majority, 32,883 (71%), caused frameshifts within open reading frames, with a median allele frequency across the sample of 0.003%. The distribution of INDEL acquisitions across the phylogeny was strongly right skewed with 59 mutations acquired independently ≥100 times (Fig. 1A, SI Appendix, Table S2, and Dataset S2) (16). Compared with SNVs, a higher relative proportion of INDELs demonstrated Hs (1,393/46,306, 3.01%, P-value < 1 × 10−5, Fisher exact test) (Fig. 1 A and C). Of the 655/1,393 INDELs with allele frequency >0.1%, 132/655 (20.1%) were in HT regions and 94/352 (26.7%) of the subset that resulted in frameshifts occurred in HT regions (Fig. 1C and Dataset S2). A lower proportion of INDELs was found in known antibiotic resistance–associated genes compared to SNVs (16/655 vs. 162/1525, P-value = 7 × 10−11, Fisher exact test) (Fig. 1D and Datasets S1 and S4). Among the 30 INDELs with the highest Hs (>187), only three occurred in genes associated with antibiotic resistance: gid 103delC (Hs = 202), which did not occur within an SSR or HT region and is known to confer streptomycin resistance (SI Appendix, Table S2) (12, 22); glpK nt565-572insC (Hs = 261), located within an HT region and previously implicated in multidrug tolerance (Fig. 1D and SI Appendix, Table S2) (6, 7); and ponA1 nt1878insCCGCCGCCT (Hs = 397), located within an SSR region in a gene that contributes to peptidoglycan biosynthesis and alters sensitivity to the antibiotic rifampicin (11).
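The comparison of 16/655 INDELs with 162/1,525 SNVs in antibiotic resistance–associated genes can be set up as a 2 × 2 table. The sketch below implements a two-sided Fisher exact test with the standard library only; it sums the probabilities of all tables no more likely than the observed one, which is one common two-sided definition and is assumed here rather than taken from the Methods. Function names are illustrative.

```python
from math import exp, lgamma


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def log_prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return log_comb(row1, x) + log_comb(row2, col1 - x) - log_comb(total, col1)

    observed = log_prob(a)
    p_value = 0.0
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        lp = log_prob(x)
        if lp <= observed + 1e-9:  # tables at least as extreme as observed
            p_value += exp(lp)
    return min(1.0, p_value)


# Rows: INDELs, SNVs. Columns: in resistance-associated genes, elsewhere.
p_indel_vs_snv = fisher_exact_two_sided(16, 655 - 16, 162, 1525 - 162)
```

Working in log space keeps the hypergeometric terms from overflowing at these table sizes.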
Given differences in mutational processes and rates at SSR vs. other sites, we studied potentially adaptive INDELs separately by whether or not they occur in SSRs. We used a Hs cutoff of , similar to SNVs above. Of the 43,899 non-SSR INDELs, 993 (2.3%) demonstrated an Hs (Fig. 1B). The INDEL with the highest Hs was a three amino acid insertion in the putative antigenic protein Rv2823c that was acquired independently 1,534 times affecting 5,093 isolates across members of the six lineages we evaluated (SI Appendix, Table S2). The INDELs with Hs were more likely to affect intergenic regions than INDELs with Hs (257/993, 26% vs. 6,941/42,906, 16%, P-value = 1.5 × 10−14, Fisher exact test).
While intragenic SSM often introduces frameshifts and disrupts ORFs, phase variation at intergenic sites can also have important effects on gene expression (5). We compared the general features of intragenic phase variation with INDELs that putatively alter gene expression based on their occurrence within 50 bp upstream of MTBC transcriptional start sites (TSSs) (23) and within regulatory noncoding RNAs (24) (2,077 SSR INDELs and 330 HT INDELs). Overall, we identified frameshift INDELs in HT and other SSR (294/330, 89.1% of HT INDELs and 1,190/2,077, 57.3% of other SSR INDELs) in open reading frames. Of non-HT SSR INDELs, 6.2% (128/2,077) putatively affect gene expression, and 47.2% (981/2,077) introduce translational frameshifts (Fig. 1 B and C and SI Appendix, Fig. S8). A greater proportion of INDELs in HT regions were found in likely regulatory regions (7.6%, 25/330) and open reading frames (69.7%, 230/330) compared to other SSR INDELs (Fig. 1 B and C and SI Appendix, Fig. S7). The majority of frameshifting INDELs incur a premature stop codon within the first three quarters of a gene (570/981, 58.1% for SSR INDELs and 117/230, 50.9% for HT INDELs) (SI Appendix, Fig. S7).
Given the measured high rate of frameshift INDELs in HT regions, the expected rapid kinetics of SSM, and the high rate of INDEL homoplasy across the genome, we experimentally measured the neutral rate of +1 frameshifting in a 7G HT derived from the glpK gene in Mycobacterium smegmatis. The measured rate was frameshifts/generation (Materials and Methods). Assuming that MTBC doubles once per day on average, this corresponds to a rate of frameshifts/HT/year (Materials and Methods). To identify potentially adaptive INDELs that should demonstrate more extreme homoplasy than observed under neutral evolution, we ran simulations of HT evolution respecting the 8 observed M. tuberculosis phylogenies (SI Appendix, Fig. S8 and Materials and Methods). We estimated the probability of HT accumulating >45 INDELs across the phylogeny at <0.002 under the neutral rate (SI Appendix, Fig. S8). Forty-five HTs had a (Fig. 1C, SI Appendix, Table S3, and Dataset S3). These putatively adaptive HTs occurred in one aforementioned gene associated with antibiotic resistance, glpK, and the remaining were in other genes spanning a range of functions. Two of the three HT regions with the highest Hs occurred in the 3′ end of ppe13 ( and ), and located 15 bp from the stop codon on the 1,332 bp ORF (SI Appendix, Tables S2 and S5). Of the 3,088 mutation arisals within these adjacent HTs, 49.5% (1,529/3,088) resulted in a premature stop codon while 50.5% (1,559/3,088) resulted in an aberration of the stop codon in the annotated H37Rv gene sequence. Further, 10/45 (22%) of the putatively adaptive HTs occurred in intergenic regions and of these 3/10 occurred within 50 bp upstream of a TSS (Rv3848-espR, vapC2-Rv0302, espA-ephA).
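The two calculations in this paragraph, converting a per-generation rate to a per-year rate and deriving a homoplasy cutoff at P < 0.002, can be illustrated under simplifying assumptions. The study used simulations over the observed phylogenies; the sketch below instead treats the number of neutral arisals in one HT as a single Poisson count with a given expected value. The rate and expected count used are hypothetical placeholders, not the measured values.

```python
from math import exp

GENERATIONS_PER_YEAR = 365  # one doubling per day, as assumed in the text


def per_year_rate(per_generation_rate):
    """Convert a per-generation frameshift rate to a per-year rate."""
    return per_generation_rate * GENERATIONS_PER_YEAR


def poisson_upper_tail(k, lam):
    """P(X > k) for X ~ Poisson(lam)."""
    term = exp(-lam)
    cdf = term
    for i in range(1, k + 1):
        term *= lam / i
        cdf += term
    return max(0.0, 1.0 - cdf)


def homoplasy_cutoff(lam, alpha=0.002):
    """Smallest k such that P(X > k) < alpha under the neutral expectation."""
    k = 0
    while poisson_upper_tail(k, lam) >= alpha:
        k += 1
    return k


# Hypothetical placeholders for illustration only.
toy_rate_per_year = per_year_rate(1e-6)
toy_cutoff = homoplasy_cutoff(lam=1.0)
```

An HT whose observed number of arisals exceeds the cutoff would be flagged as unlikely under neutrality in this simplified model.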
Recency Estimation of Putative Adaptive Variants.
We hypothesized that if positive selection is driving parallel evolution of an allele then the ratio of homoplasic instances of that allele divided by the number of isolates carrying the same allele can capture the recency of positive selection. We separated genes into four nonredundant categories: antigen genes, antibiotic resistance genes, PE/PPE, and other genes (SI Appendix, Materials and Methods). We compared other categories to antibiotic resistance genes, as the selection pressure on variants in the latter only commenced with the introduction of antibiotics for M. tuberculosis treatment 70 to 80 y ago (25). We computed a recency ratio (RcR) for the 1,208 homoplastic SNVs in coding regions. The RcR displayed a strongly right-skewed distribution as most SNVs have very few independent arisals relative to the number of isolates that harbor the minor allele indicating older selection (Fig. 2 A and B, SI Appendix, Table S1, and Dataset S1). As expected, RcR values were highest (indicating more recent evolution) for SNVs in antibiotic resistance regions ( , Mann–Whitney U test between antibiotic resistance and every other gene category) (Fig. 2C).
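The recency ratio as defined above is a simple quotient, sketched below. The function name is illustrative. The worked value uses the Rv2823c insertion reported earlier (1,534 independent acquisitions among 5,093 carrying isolates) only as an arithmetic example.

```python
def recency_ratio(homoplasy_score, n_isolates_with_allele):
    """RcR = independent arisals / isolates carrying the allele.

    Values near 1 mean almost every carrier acquired the allele
    independently (recent arisals); values near 0 mean a few old arisals
    expanded into large clades.
    """
    if n_isolates_with_allele <= 0:
        raise ValueError("allele must be carried by at least one isolate")
    return homoplasy_score / n_isolates_with_allele


rcr_example = recency_ratio(1534, 5093)  # about 0.30
```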
Fig. 2.
Recency ratio for SNVs and HT INDELs. (A and B) The distribution of the ratio of (homoplasy score) to (# of isolates harboring the minor allele) for 1,208/1,525 SNVs (Fig. 1C) that occur in coding regions. (C) Breaking these SNV recency ratios down by gene category reveals higher ratios overall for antibiotic resistance genes when compared to other gene categories. (D and E) The distribution of the ratio of (homoplasy score) to (# of isolates harboring the alternate allele) for 100/655 INDELs (Fig. 1C) that occur in HT and coding regions. (F) Breaking these INDEL ratios down by gene category reveals higher ratios overall for antibiotic resistance genes when compared to other gene categories; however, the only two INDELs in this gene category were found in the HT of glpK. N = number of alleles, M = median RcR.
The RcR for the 388 coding non-HT INDELs (grouping non-HT SSR and non-SSR INDELs together) closely resembled that for SNVs (SI Appendix, Fig. S9 A and B). Similar to SNVs, RcR values for non-HT INDELs were higher in antibiotic resistance regions ( , Mann–Whitney U test between antibiotic resistance and every other gene category) and median RcR values within gene categories mirrored those observed for SNVs (SI Appendix, Fig. S9C).
The RcR distribution for the 100 coding HT INDELs demonstrated a shift toward higher values than SNVs or non-HT INDELs in every gene category (Fig. 2 D–F, SI Appendix, Table S2, and Dataset S2). As INDELs in SSR are uniquely prone to revert to the ancestral sequence, this observation may be related to recent selection for the derived allele, recent selection for reversion to the ancestral allele, or both. Regardless, this observation implies recent selection for INDELs in HTs.
Frameshifts in a HT Upstream espA Alter Transcription.
To assess the functional consequence of the variation we observed in HTs (Fig. 1 B and C and SI Appendix, Table S3), we carried out a genome-wide association analysis against resistance phenotypes for 15 antibiotics to uncover any previously unknown associations between frameshift mutations in HTs and antibiotic resistance (n = 101 to 14,537, Materials and Methods). Of the 145 HTs studied, 17 were significantly associated with resistance to at least one antibiotic, including the previously known association between convergent frameshifts in the HT of glpK and multidrug resistance (Fig. 3 A and B and Table 1). Together with glpK, frameshifts in the HTs of Rv2264c (Hs = 138) and lysX-infC (Hs = 29) made up the three HTs most strongly associated with multidrug resistance. The majority of HTs (128/145, 88%) do not, however, appear to potentiate antibiotic resistance. We hypothesized that these regions may be mediating a different form of pathogenic adaptation.
Fig. 3.
Genetic map confirms homoplastic variants. (A) The t-SNE plot serves as a genetic similarity map; isolates are colored according to the group they belong to (L1, L2, L3, L4A, L4B, L4C, L5, L6). (B–D) Isolates are labeled if they harbor a given mutant allele (N = # of isolates that harbor the mutant allele). These mutations within HTs (glpK nt565-572insC, delT upstream espA nt−105/−112, insT upstream espA nt−105/−112, espK nt797-803insC and espK nt797-803delC) are detected in isolates belonging to different clusters, confirming that these mutations must have arisen independently in different genetic backgrounds.
Table 1.
HTs with significant association with antibiotic resistance
Gene symbol | HT H37Rv coordinates | Hs | Drug | S (FS/WT) | R (FS/WT) | OR (95% CI), Fisher exact test | −log10(Bonferroni P-value) | Other antibiotics* |
---|---|---|---|---|---|---|---|---|
Rv2264c | 2536625–2536632 | 138 | STR | 4,341 (968/3,373) | 2,101 (985/116) | 29.59 (24.1–36.33) | 90.3 | AMK, CAP, EMB, INH, KAN, MXF, OFX, PZA, RFB, RIF |
lysX-infC | 1852176–1852183 | 29 | MXF | 3,243 (22/3,221) | 338 (49/289) | 24.82 (14.8–41.64) | 67.6 | AMK, CAP, CYS, EMB, ETA, INH, KAN, OFX, PZA, RFB, RIF, STR |
glpK | 4139183–4139190 | 282 | RIF | 10,890 (50/10,840) | 3,868 (172/3,696) | 10.09 (7.35–13.85) | 66.6 | EMB, ETA, INH, PZA, RFB, RIF, STR |
Rv3413c | 3832356–3832363 | 39 | KAN | 3,077 (5/3,072) | 577 (35/542) | 39.68 (15.48–101.72) | 34.0 | AMK, CAP, EMB, INH, RIF, STR |
Rv2177c-aroG | 2440187–2440194 | 69 | PZA | 9,018 (174/8,844) | 1,804 (121/1,683) | 3.65 (2.88–4.64) | 27.5 | CAP, EMB, ETA, INH, RIF |
PE_PGRS25 | 1572680–1572687 | 68 | PZA | 9,018 (345/8,673) | 1,804 (178/1,626) | 2.75 (2.28–3.32) | 25.2 | EMB, INH, RIF |
bioF2 | 36470–36477 | 140 | EMB | 9,307 (1,395/7,912) | 2,394 (558/1,836) | 1.72 (1.54–1.93) | 19.8 | INH, PZA, RFB, RIF |
vapC2-Rv0302 | 364498–364505 | 216 | PZA | 9,018 (153/8,865) | 1,804 (97/1,707) | 3.29 (2.54–4.27) | 18.8 | EMB, ETA, INH, RIF |
Rv1373 | 1546465–1546472 | 58 | RIF | 10,890 (204/10,686) | 3,868 (21/3,847) | 0.29 (0.18–0.45) | 6.2 | CFX, EMB, INH, MXF, RIF |
Rv3192 | 3559990–3559997 | 55 | EMB | 9,307 (25/9,282) | 2,394 (31/2,363) | 4.87 (2.87–8.27) | 8.1 | RIF |
Rv0759c-Rv0760c | 854252–854261 | 776 | EMB | 9,307 (8,483/824) | 2,394 (2,272/122) | 1.81 (1.49–2.2) | 6.8 | AMK, CAP, INH, KAN, PZA, RIF |
lipR | 3450182–3450189 | 40 | EMB | 9,307 (18/9,289) | 2,394 (21/2,373) | 4.57 (2.43–8.58) | 4.7 | INH |
Rv1894c | 2141408–2141415 | 72 | RFB | 431 (5/426) | 607 (43/564) | 6.5 (2.55–16.54) | 3.3 | PZA |
Rv0694-Rv0695 | 794672–794679 | 8 | MXF | 3,243 (0/3,243) | 338 (2/336) | N/A | 3.0 | |
espK | 4358979–4358986 | 192 | PZA | 9,018 (177/8,841) | 1,804 (65/1,739) | 1.87 (1.4–2.49) | 2.8 | INH, MXF |
PE_PGRS31 | 2001789–2001796 | 57 | INH | 9,844 (79/9,765) | 4,693 (72/4,621) | 1.93 (1.4–2.66) | 2.4 | |
mshD-phoT | 912694–912701 | 27 | CAP | 3,611 (0/3,611) | 663 (3/660) | N/A | 2.4 | |
Associations between frameshift variants in HTs and resistance to antibiotics. Variants in 17 HT regions were significantly associated with resistance to at least one antibiotic at the Bonferroni corrected threshold (Methods). S: number of isolates susceptible, R: number of isolates resistant, FS: number of isolates that harbor a frameshift, WT: number of isolates with wild-type state.
*For HTs associated with resistance to more than one antibiotic, details for the most significant association are reported while other antibiotics are listed in the last column. AMK: Amikacin, CAP: Capreomycin, CYS: Cycloserine, EMB: Ethambutol, ETA: Ethionamide, INH: Isoniazid, KAN: Kanamycin, MXF: Moxifloxacin, OFX: Ofloxacin, PZA: Pyrazinamide, RFB: Rifabutin, RIF: Rifampicin, STR: Streptomycin.
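The odds ratios in Table 1 can be recomputed from the FS/WT counts in the S and R columns. The sketch below does this for the glpK row (S: 50/10,840; R: 172/3,696). The table does not state how the 95% CI was obtained; the Woolf (log odds ratio) interval used here is an assumption, although it happens to reproduce the reported values for this row. Function names are illustrative.

```python
from math import exp, log, sqrt

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def odds_ratio_woolf(fs_r, wt_r, fs_s, wt_s, z=Z_95):
    """Odds ratio of frameshift (FS) in resistant vs. susceptible isolates.

    Returns (odds_ratio, ci_low, ci_high) using the Woolf log-scale interval.
    All four counts must be nonzero; rows with a zero cell are listed as
    N/A in the table.
    """
    odds_ratio = (fs_r / wt_r) / (fs_s / wt_s)
    se_log_or = sqrt(1 / fs_r + 1 / wt_r + 1 / fs_s + 1 / wt_s)
    ci_low = exp(log(odds_ratio) - z * se_log_or)
    ci_high = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, ci_low, ci_high


# glpK row of Table 1: R (172/3,696), S (50/10,840).
glpk_or, glpk_low, glpk_high = odds_ratio_woolf(172, 3696, 50, 10840)
```

For this row the sketch gives an odds ratio of about 10.09 with an interval of roughly 7.35 to 13.85, in agreement with the tabulated entry.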
As mentioned above, our top HT and non-HT INDEL hits occurred in PPE13 and in a putatively antigenic protein, respectively, suggesting that they mediate adaptation at the immune or host–pathogen interface. The PPE13 HT frameshifts are predicted to shorten the protein product by ~5 amino acids (AA) and hence were difficult to evaluate experimentally. We noted that other HTs with high Hs appeared in or near ESX-1-related genes (SI Appendix, Table S5). These regions include 1) the HT between espA and ephA, a poly-A stretch that is optimally suited to act as an UP element (26): it is found ~48 bp upstream of one of two putative TSSs of the espACD operon (Fig. 4A), which encodes ESX-1 components that control the rate of secretion (Fig. 3C); 2) an intragenic HT that disrupts the open reading frame of the ESX-1-associated espK gene (Fig. 3D); and 3) an HT in the 5′ UTR of the ESX-1 regulator espR (SI Appendix, Table S3).
Fig. 4.
A single basepair deletion within the espA homopolymer results in decreased espA expression. (A) Schematic showing the location of the 7 basepair homopolymer upstream of Rv3616c. A highly variable 7 basepair adenine repeat lies 105 basepairs upstream of the translational start site for Rv3616c (espA), which forms an operon with downstream genes espCD. Upstream of Rv3616c, two transcriptional start sites have been identified. The longer of which sits along the homopolymeric stretch; the other is found another 41 basepairs downstream of the homopolymer. A single basepair deletion in the poly-A tract results in a ~twofold decrease in espACD expression. (B) A volcano plot highlighting the results of an RNAseq experiment comparing a recombineered espA homopolymer mutant to WT H37Rv. Results are pooled from 2 independent experiments consisting of at least 3 biological replicates each. espA (green), espC (red), and espD (blue) are highlighted. Also highlighted are Rv3612c (purple) and Rv3613c (pink), two genes immediately downstream of espACD. (C) Relative expression levels of the espACD operon in the mutant espA strain compared to WT H37Rv.
To assess the phenotypic consequence of these mutations, we engineered the most abundant +A (8A) HT variant upstream of the espACD operon into the H37Rv genome and assessed the effect of this variant on gene regulation during exponential growth in 7H9 broth (Fig. 4A). Comparing the transcriptome of this mutant to its isogenic parent, we found only a small number (22) of significantly differentially expressed genes, most prominently a decrease in the expression of espA, espC, and espD (by approximately 40%, log2-fold-changes = −0.7) (Fig. 4 B and C), along with the downstream genes Rv3613c and Rv3612c (Dataset S4). These data verify the functional effect of this intergenic HT INDEL and suggest positive selection for decreased ESX-1 activity. The 1,359 bp intergenic region upstream of espA is the target of several well-characterized regulators of the espACD operon (27). The poly-A HT is positioned between −112 bp and −106 bp upstream of the espA start codon, ~48 bp upstream of the first TSS that is mapped to −65 bp, and immediately downstream of the second TSS of espACD that is mapped to −112 bp. The transcription factor EspR positively regulates the espACD operon by binding to several sites upstream of espA and the first espA TSS (27). EspR-binding sites include two regions separated by 19 bp located between −506 bp and −444 bp, another centered between −857 and −695 bp, and another between −1,214 and −1,113 bp relative to the espA start codon (28). The MprAB two-component system represses expression of espA and has two binding sites, one close to the −112 TSS, starting at −149 bp, and one at −303 bp (29). The predicted binding motif for MprA does not contain the 7A HT. Hence, it is unlikely that the observed effect of the HT INDELs on espA transcription is directly mediated by altered EspR or MprA binding.
Given that the location of the HT region is immediately downstream of the second TSS, we hypothesize that the observed effect of the HT INDELs may relate to RNA-polymerase binding affecting the rate of initiation. Future work can help confirm this hypothesis.
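The relationship between the reported log2 fold change of −0.7 and the approximately 40% decrease in espACD expression follows from the definition of a log2 fold change. The short sketch below makes the conversion explicit; the function name is illustrative.

```python
def percent_change_from_log2fc(log2_fold_change):
    """Percent change in expression implied by a log2 fold change."""
    return (2.0**log2_fold_change - 1.0) * 100.0


# log2 fold change of -0.7 corresponds to roughly a 38% decrease.
espacd_change = percent_change_from_log2fc(-0.7)
```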
Gene-Wide Mutational Density Reveals Variable ESX and PE/PPE Genes.
Given the apparent convergence of HT variants on ESX-1 function, we aggregated independent variant arisals at the gene-level to better understand the adaptive landscape of genomic variants in MTBC. Specifically, we aggregated Hs for all variants found within each gene (regardless of frequency) and normalized the resulting score by gene length to obtain the mutational density (SI Appendix, Materials and Methods). We separated this analysis by SNVs (Fig. 5A and Dataset S5) and INDELs (Fig. 5B and Dataset S6) because Hs were computed differently for each (Materials and Methods), and because of the different mechanisms at play in generating each type of diversity. We simulated the number of arisals that occur on each gene using a modified molecular clock rate normalized by gene length to obtain a neutral mutation rate for each gene (SI Appendix, Materials and Methods). We found that a gene has an estimated neutral mutational density ≥0.45 with probability <0.002 under these assumptions.
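The gene-level aggregation described above can be sketched as a sum of homoplasy scores per gene divided by gene length. The gene names, lengths, and scores below are hypothetical toy values; only the 0.45 threshold comes from the text, and the actual normalization is detailed in the Materials and Methods.

```python
from collections import defaultdict

NEUTRAL_DENSITY_THRESHOLD = 0.45  # density reached with P < 0.002 per the text


def mutational_density(variants, gene_lengths_bp):
    """Sum Hs over all variants in each gene and divide by gene length.

    `variants` is an iterable of (gene, homoplasy_score) pairs;
    `gene_lengths_bp` maps gene name to length in bp.
    """
    totals = defaultdict(int)
    for gene, hs in variants:
        totals[gene] += hs
    return {gene: totals[gene] / length for gene, length in gene_lengths_bp.items()}


# Hypothetical toy input.
toy_variants = [("geneA", 250), ("geneA", 300), ("geneA", 50), ("geneB", 2), ("geneB", 1)]
toy_lengths = {"geneA": 1000, "geneB": 1500}

density = mutational_density(toy_variants, toy_lengths)
outlier_genes = sorted(g for g, d in density.items() if d >= NEUTRAL_DENSITY_THRESHOLD)
```

In this toy input only geneA, with 600 aggregated arisals over 1,000 bp, exceeds the threshold.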
Fig. 5.
SNV and INDEL mutational density per gene. (A) The homoplasy scores for all SNVs within each gene were aggregated to approximate all SNV mutation events (independent arisals) that occurred within the gene body then normalized by the gene length (Materials and Methods). Dataset S5 contains the calculations for each gene as well as columns for # SNVs, synonymous homoplasy score, and nonsynonymous homoplasy score. (B) A similar computation was carried out for INDELs in which homoplasy scores for all INDELs within each gene were aggregated and normalized by gene length (blue denotes genes containing an HT, orange denotes genes containing an SSR, black denotes genes containing neither an HT nor an SSR) (Materials and Methods). Dataset S6 contains the calculations for each gene as well as # INDELs, in-frame homoplasy score, and frameshift homoplasy score. (C) Homoplasy scores for all SNVs were aggregated at the level of pathways then normalized by the gene lengths for each gene set (Dataset S7 and Materials and Methods). (D) Homoplasy scores for all INDELs were aggregated at the level of pathways then normalized by the gene lengths for each gene set (Dataset S8 and Materials and Methods).
Among the calculations for SNVs (Fig. 5A and Dataset S5), several outlier genes are involved in the acquisition of antibiotic resistance (gyrA, rpoB, rpsL, gid, katG, pncA, embB) (11, 12). Additionally, several outliers belonged to the ESX protein family (esxL, esxO, esxN, esxM, esxW), which is involved in host–pathogen interactions (30), and the PE/PPE protein family (PPE18, PPE19, PPE59, PPE60), which includes antigenic proteins (31). For INDELs (Fig. 5B and Dataset S6), outliers included the antibiotic resistance loci pncA and gid (12, 22) and additional members of the PE/PPE family (PPE13, PE-PGRS15, PPE57). Further, we observed that 86.7% of genes contain an HT and/or SSR within their sequence, and that the outliers for INDEL density were most commonly genes containing an HT or SSR sequence (Fig. 5B). Next, we extended this analysis for SNVs and INDELs to the pathway level by aggregating Hs across different gene sets belonging to 410 pathways (SI Appendix, Materials and Methods). The pathway with the highest SNV mutational density was a Mycobacterium virulence operon with Esat6-like proteins (Fig. 5C and Dataset S7), while the pathway with the highest INDEL mutational density was the CRISPR-associated cluster that contains the aforementioned putative antigen Rv2823c (Fig. 5D and Dataset S8).
Discussion
As MTBC evolved into a professional pathogen from a saprophytic mycobacterium, it underwent step-wise adaptation to the intracellular environment. This adaptation is thought to comprise genome contraction, expansion of specific gene families especially toxin–antitoxin systems, the type VII secretion systems, and the PE-PPE gene family, as well as gene modification through mutation (2). Population genetic studies of MTBC have largely concluded that the modern MTBC genome is under purifying selection with most newly fixed diversity attributable to antibiotic selection pressure (10, 13–15). It has thus been suggested that MTBC has reached a pathogenic fitness peak (32). Here, we update this view by analyzing the largest to date collection of MTBC genome sequences characterizing the timing and pattern of genetic variation acquisition across the phylogeny. We find 4,980 SNVs, 993 non-SSR-related INDELs, and 45 HT regions to have evolved in a parallel manner with high frequencies suggestive of an adaptive role. Although a subset of this variation can be linked to resistance based on known genetic determinants, the majority has no known association with resistance. Among the highest scoring variants, we find proteins that encode putative antigens (esxL, esxW, Rv2823c) (33), other PE/PPE proteins (PPE54 and PPE18) (15), toxin–antitoxin bicistrons (vapC2, mazF6) and ESX-1 system (espK, espA, espR) strongly suggestive of a role in virulence (34). The highest scoring variants also heavily overrepresent intergenic regions (20%, 22%, and 26% of putatively adaptive SNVs, non-SSM INDELs, and HTs, respectively) even though intergenic regions constitute only 10% of the genome by length. Putatively adaptive transcriptional variants appear to converge with protein variants in impacting ESX-1 function. 
We identify a substantial proportion of putatively adaptive variation to be acquired recently and on par with acquisition of resistance-related variants, suggesting that modern MTBC continues to refine its virulence strategies likely in the context of a dynamic host environment.
Phase variation was recently recognized to mediate MTBC drug-tolerance through frameshifts in the glycerol kinase gene glpK that likely act by altering the metabolic state of the cell (6, 7). In other bacterial pathogens, phase variation can alter antibiotic efficacy and the immunogenicity of cell surface proteins through altered transcription, translation and/or the creation of protein diversity (5). Here, we take a genome-wide approach to assess the frequency and impact of phase variation in MTBC. We measure the frequency of INDEL acquisition in HTs at 38× the rate observed for SNVs in clinical isolates. Based on in vitro measurements, we estimate the frameshift rate under expected neutral conditions at frameshifts/HT/year, ~100× the rate previously reported for MTBC SNV acquisitions (35). The discrepancy between the in vitro and observed event rate in HTs in clinical isolates is likely attributable to INDEL reversions. Remarkably, despite the undercounting of INDEL events in HTs, more than 12% of all INDEL events observed in the MTBC clinical isolate phylogeny occur in an HT region. We find a few examples of frequent SSM in non-HT SSR regions, e.g., in ponA1, a gene previously identified to modulate growth in the presence of the drug rifampicin (11). However, we measure a substantially lower rate of INDELs in the latter regions compared with HTs (Fig. 1B and SI Appendix, Fig. S7). Using a GWAS approach, we find a subset of frameshifts in HTs to be associated with antibiotic resistance. These include genes of unknown function Rv3413c and Rv2264c as well as an HT upstream of lysyl-tRNA synthetase lysX. This gene is conditionally essential for bacterial growth in vivo, its higher expression correlates positively with virulence in clinical isolates, and in Mycobacterium avium hominis lysX mutants associate with resistance to cationic antimicrobials and increased inflammatory response after macrophage infection (36–38). 
Hence, the frameshifts in the HT upstream of lysX may plausibly affect both antibiotic resistance and virulence in MTBC.
Multiple different pressures may differentially select for variants related to ESX-1 activity. This secretion system influences virulence and antigenicity in MTBC (34, 39) by controlling the secretion of the immunodominant antigens ESAT-6 (esxA) and CFP-10 (esxB) (40–42), stimulating the innate immune response and cytokine secretion (43, 44), and promoting the intracellular growth of the pathogen (45, 46). Previous work has implicated the ESX-1 secretion system in driving host–pathogen interactions that affect TB severity (47). Through modulating the immune response, as well as cellular permeability (34), ESX-1 function may also influence antibiotic activity or resistance (48). Indeed, we identified phase variants that truncate espK, an ESX-1-associated gene that when disrupted in vitro promotes bacterial growth (49), to associate with resistance. In contrast, INDELs that reduce the expression of the espACD operon were not associated with the resistant phenotype, suggesting that another host-derived pressure may be responsible for selecting these variants. These INDELs might be expected to reduce bacterial fitness, as deletion of espA abrogates secretion of ESAT-6 and CFP-10 and attenuates growth in mice to a similar degree as deletion of the ESX-1 locus (50). However, more subtle reduction in ESX-1 function could also result in reduced antigen presentation and/or cytokine production, thus aiding immune evasion (51). A previous study of 5,977 clinical isolates reported a high incidence of reversible frameshift-related scars (i.e., two sequential frameshifts that result in only a local change in frame) in genes belonging to the ESX-1 secretion system, including the espR upstream region, similar to our results (52); espI had the largest number of unique scars and is implicated in negative regulation of the ESX-1 secretion system, providing further support that frequently occurring INDELs may be a mechanism for reversible gene silencing and adaptation in MTBC (52, 53). 
We thus hypothesize that multiple modes of phase variation tune ESX-1 activity to optimize growth, survival, or transmission. These states may influence antibiotic susceptibility through modulation of growth and membrane permeability, or by altering the local environment. These hypotheses are testable in in vivo experimental systems.
We evaluated conservation of the HT length and position upstream of espA in other mycobacteria, focusing on Mycobacterium bovis (AF2122/97), Mycobacterium leprae (TN), and Mycobacterium marinum (M), as all three have orthologues of espA. All three genomes contained HTs at a distance of ~100 bp upstream of espA, similar to the distance observed in M. tuberculosis. More specifically, M. bovis had a 6A HT between −111 bp and −106 bp, M. leprae had a 7T HT between −115 bp and −109 bp, and M. marinum had a 7A HT between −113 bp and −107 bp, upstream of the respective espA orthologues. This high degree of sequence conservation suggests that phase variation in these other mycobacterial species may also have an adaptive role or be otherwise important for pathogen survival; however, experiments demonstrating transcriptional changes in these species are beyond the scope of this study and will have to be examined in future studies.
This analysis is not without limitations. First is our inability to functionally validate all associations due to the time and resources needed to manipulate M. tuberculosis genetically in vitro. Instead, we provide a proof-of-concept validation of transcriptional regulation for one HT candidate in the TSS' of espA. Second is our inability to assess adaptive INDELs in non-HT SSR regions as they vary in their sequence composition and the expected rate of SSM, thus challenging our ability to simulate neutral evolution in these regions. Similarly, it is difficult to account for the reversibility of INDELs in SSR regions, and it is possible that some homoplasic variants represent a combination of mutation and reversion, as opposed to two distinct arisals. Regardless, the reported Hs values still represent the number of independent mutational events observable at a site.
Third is the possibility of sampling bias that may impact variants identified in our analysis. We used public M. tuberculosis WGS data in this study that consisted largely of clinical isolates selected for sequencing based on their phenotypic antibiotic resistance or to investigate disease outbreaks and transmission (54). Although we have aggregated a geographically and genetically diverse sample of M. tuberculosis (SI Appendix, Fig. S12), our analysis may miss rare variants of interest that are more common in unsampled settings, including those with low capacity for WGS. Accurate RcR estimation for antibiotic resistance variants assumes some representation of antibiotic-susceptible strains, which is met for most antibiotics in public data (25). In this work, we also make the assumption that SNV mutation rates are homogeneous outside of SSR regions. We recognize that many forces likely determine the neutral mutation rate across the genome, including GC content, repetitive sequence, and transcription-coupled repair, to name a few factors. Driving both extremes of evolutionary rates are forces of positive and purifying selection, respectively, that shape the genome. The approach we take in simulating neutral evolution is only a useful approximation to gauge the very extreme rates of evolution. It is likely that regions with seemingly borderline rates of Hs may also have functional consequences, and at the other extreme are genes under purifying selection that are beyond the scope of this work.
In summary, in this work we present evidence that MTBC genomes are strongly and regionally shaped by positive selection not only to modulate the resistance phenotype but likely also virulence mechanisms. We hypothesize that phase variation in the ESX-1 system of MTBC can act as a toggle between antigenicity and survival in the host. The ongoing regional evolution of MTBC suggests that the host environment in MTBC infection is dynamic, including potentially opposing forces that shape transmissibility and survival in the host. Overall, the insights gained in this analysis can inform vaccine design and host- and pathogen-directed therapy against MTBC that have recently been expanded to include ESX-1 targeting compounds (55).
Materials and Methods
Sequence Data.
We initially downloaded raw Illumina sequence data for 33,873 clinical isolates from National Center for Biotechnology Information (NCBI) (56). We identified the BioSample for each isolate and downloaded all of the associated Illumina sequencing runs. Isolates had to meet the following quality control measures for inclusion in our study: i) At least 90% of the reads had to be taxonomically classified as belonging to MTBC after running the trimmed FASTQ files through Kraken (57) and ii) at least 95% of bases had to have coverage of at least 10× after mapping the processed reads to the H37Rv reference genome (Genbank accession: NC_000962).
Illumina Sequencing FastQ Processing and Mapping to H37Rv.
The raw sequence reads from all sequenced isolates were trimmed with version 0.20.4 Prinseq (settings: -min_qual_mean 20) (58) and then aligned to H37Rv with version 0.7.15 of the BWA mem algorithm using the -M settings (59). The resulting SAM files were then sorted (settings: SORT_ORDER = coordinate), converted to BAM format, and processed for duplicate removal with version 2.8.0 of Picard (http://broadinstitute.github.io/picard/) (settings: REMOVE_DUPLICATES = true, ASSUME_SORT_ORDER = coordinate). The processed BAM files were then indexed with Samtools (60). We used Pilon (settings: --variant) on the resulting BAM files to generate VCF files that contained calls for all reference positions corresponding to H37Rv from pileup (61).
Variant Calling.
SNP calling.
To prune out low-quality base calls that may have arisen due to sequencing or mapping error, we retained only base calls that met all of the following criteria: i) the call was flagged as Pass by Pilon, ii) the mean base quality at the locus was >20, iii) the mean mapping quality at the locus was >30, iv) none of the reads aligning to the locus supported an INDEL, v) the position had a minimum coverage of 20 reads, and vi) at least 75% of the reads aligning to that position supported 1 allele (using the INFO.QP field, which gives the proportion of reads supporting each base weighted by the base and mapping quality of the reads, BQ and MQ, respectively, at the specific position). A base call that did not meet all filters (i) to (vi) was inferred to be low-quality/missing (SI Appendix, Fig. S2).
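The six base-call filters can be expressed as a single predicate. The following is a minimal sketch, assuming the relevant values have already been parsed from a Pilon VCF record into a small record type; the class and field names are hypothetical and do not correspond to Pilon's own field names.

```python
from dataclasses import dataclass

@dataclass
class BaseCall:
    # Hypothetical field names for values parsed from one VCF record.
    passed: bool                 # (i) call flagged as Pass
    mean_base_quality: float     # (ii) mean base quality at the locus
    mean_mapping_quality: float  # (iii) mean mapping quality at the locus
    indel_reads: int             # (iv) reads supporting an INDEL
    depth: int                   # (v) read coverage at the position
    top_allele_fraction: float   # (vi) largest quality-weighted allele proportion

def is_high_quality(call: BaseCall) -> bool:
    """Return True only if the call meets all filters (i) to (vi)."""
    return (
        call.passed
        and call.mean_base_quality > 20
        and call.mean_mapping_quality > 30
        and call.indel_reads == 0
        and call.depth >= 20
        and call.top_allele_fraction >= 0.75
    )

good = BaseCall(True, 35.0, 58.0, 0, 40, 0.98)
shallow = BaseCall(True, 35.0, 58.0, 0, 19, 0.98)
```

In this sketch, a call for which `is_high_quality` returns False would be treated as low-quality/missing.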
INDEL calling.
To prune out low-quality INDEL variant calls, we retained only INDELs that met all of the following criteria: i) the call was flagged as Pass by Pilon, ii) the maximum length of the variant was 10 bp, iii) the mean mapping quality at the locus was >30, iv) the position had a minimum coverage of 20 reads, and v) at least 75% of the reads aligning to that position supported the INDEL allele (determined by calculating the proportion of total reads TD aligning to that position that supported the insertion or deletion, IC and DC respectively). A variant call that met filters (i), (iii), and (iv) but not (ii) or (v) was inferred as a high-quality call that did not support the INDEL allele. Any variant call that did not meet all filters (i), (iii), and (iv) was inferred as low-quality/missing.
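The three possible outcomes for an INDEL call can be sketched as a small function that returns the genotype codes used in the INDEL genotype matrix (1, 0, 9). This is an illustrative sketch only: the function and parameter names are hypothetical, and `indel_fraction` is assumed to be the precomputed proportion of reads supporting the insertion or deletion (IC/TD or DC/TD).

```python
def code_indel_call(passed, indel_length, mean_mapping_quality, depth, indel_fraction):
    """Return 1 (INDEL allele supported), 0 (high-quality call without the
    INDEL allele), or 9 (low-quality/missing)."""
    # Filters (i), (iii), and (iv) decide whether the call is usable at all.
    if not (passed and mean_mapping_quality > 30 and depth >= 20):
        return 9
    # Filters (ii) and (v) decide whether the INDEL allele itself is supported.
    if indel_length <= 10 and indel_fraction >= 0.75:
        return 1
    return 0

supported = code_indel_call(True, 1, 55.0, 60, 0.92)
not_supported = code_indel_call(True, 1, 55.0, 60, 0.40)
missing = code_indel_call(True, 1, 25.0, 60, 0.92)
```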
Lineage Typing and Classifying Isolates into Groups.
After excluding 1,663/33,873 isolates that had missing calls >10% SNP sites, we determined the global lineage of each isolate ( ) using base calls from Pilon-generated VCF files and a 95-SNP lineage-defining diagnostic barcode (SI Appendix, Fig. S2) (17). We further excluded 290 isolates that had no lineage call or more than one lineage call (low-quality calls at lineage-defining SNP sites or a rare SNP call characterized as monophyletic for another lineage in the SNP barcode), and 35 isolates that had L7 lineage calls (SI Appendix, Fig. S2). Our remaining 31,885 isolates were typed as: L1 (2,815), L2 (8,090), L3 (3,398), L4 (17,388), L5 (98), and L6 (96). We aimed to cluster isolates into groups of no more than 8,000 isolates based on lineage & sublineage to achieve feasible phylogeny construction runtimes so we further divided L4 isolates based on sublineage calls. We excluded 457 isolates that were typed as L4 but did not have any sublineage calls. We analyzed the sublineage calls of the remaining 16,931 L4 isolates and grouped isolates according to sublineages that were located next to each other on the L4 phylogeny (17). We grouped the L4 isolates into three groups: L4A (sublineages 4.1.x & 4.2.2.x, ), L4B (sublineage 4.2.1.2.x, ), and L4C (sublineage 4.2.1.1.x, ) where .x is a place-holder for any further resolution on the sublineage call under the hierarchical lineage typing scheme (17).
SNP Genotype Matrix.
A schematic diagram outlining the following steps is given in SI Appendix, Fig. S2. First, we detected SNP sites at 899,035 H37Rv reference positions (of which 64,950 SNPs were not biallelic) among our global sample of 33,873 isolates. We constructed a 899,035 × 33,873 genotype matrix (coded as 0:A, 1:C, 2:G, 3:T, 9:Missing) and filled in the matrix for the allele supported at each SNP site (row) for each isolate, according to the SNP Calling filters outlined above. If a base call at a specific reference position for an isolate did not meet the filter criteria, that allele was coded as Missing. We excluded 20,360 SNP sites that had an empirical base-pair recall (EBR) score <0.90 (SI Appendix, Materials and Methods), another 9,137 SNP sites located within mobile genetic element regions (e.g., transposases, integrases, phages, or insertion sequences) (15, 62), then 31,215 SNP sites with missing calls in >10% of isolates, and 2,344 SNP sites located in overlapping genes (coding sequences). These filtering steps yielded a genotype matrix with dimensions 835,979 × 33,873. Next, we excluded 1,663 isolates with missing calls in >10% of SNP sites, yielding a genotype matrix with dimensions 835,979 × 32,210 (22). We used an expanded 96-SNP barcode to type the global lineage of each isolate in our sample (17). We further excluded 325 isolates that either were not assigned a global lineage, were assigned to more than one global lineage, or were typed as lineage 7. We then excluded 41,760 SNP sites from the filtered genotype matrix in which the minor allele count = 0, which resulted in a 794,219 × 31,885 matrix. To provide further MTBC lineage resolution on the lineage 4 isolates, we required an MTBC sublineage call for each lineage 4 isolate. We excluded 457 isolates that were typed as global lineage 4 but had no further sublineage calls and then again excluded 11,654 SNP sites from the filtered genotype matrix in which the minor allele count = 0. 
The genotype matrix used for downstream analysis had dimensions 782,565 × 31,428, representing 782,565 SNP sites across 31,428 isolates (SI Appendix, Fig. S2). The global lineage (L) breakdown of the 31,428 isolates was: L1 = 2,815, L2 = 8,090, L3 = 3,398, L4 = 16,931, L5 = 98, L6 = 96.
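As a bookkeeping check, the sequence of exclusions reported above can be tallied to confirm the final matrix dimensions. The sketch below uses only the counts stated in the text; the function name and the step labels are illustrative.

```python
def apply_exclusions(sites, isolates, steps):
    """Subtract each exclusion from the matching axis and return the final dimensions."""
    for axis, removed, _reason in steps:
        if axis == "sites":
            sites -= removed
        else:
            isolates -= removed
    return sites, isolates

steps = [
    ("sites", 20_360, "EBR score < 0.90"),
    ("sites", 9_137, "mobile genetic elements"),
    ("sites", 31_215, "missing calls in >10% of isolates"),
    ("sites", 2_344, "overlapping genes"),
    ("isolates", 1_663, "missing calls in >10% of SNP sites"),
    ("isolates", 325, "no lineage, multiple lineages, or lineage 7"),
    ("sites", 41_760, "minor allele count = 0"),
    ("isolates", 457, "lineage 4 without sublineage call"),
    ("sites", 11_654, "minor allele count = 0"),
]

final_sites, final_isolates = apply_exclusions(899_035, 33_873, steps)
```

The tally reproduces the 782,565 × 31,428 matrix used for downstream analysis.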
INDEL Genotype Matrix.
We detected 53,167 unique INDEL variants within 50,576 H37Rv reference positions (INDELs called at the same site with the same base but with different lengths were treated as different INDEL variants, additionally we distinguished between INDEL) among our global sample of 33,873 isolates. We constructed a 53,167 × 33,873 genotype matrix (coded as 1:high-quality call for the INDEL allele, 0:high-quality call not for the INDEL allele, 9:Missing) and filled in the matrix according to whether the INDEL allele was supported for each INDEL variant (row) for each isolate, according to the INDEL Calling filters outlined above. If a variant call at the reference position for an INDEL variant did not meet the filter criteria that call was coded as Missing. We excluded 2,006 INDELs that had an EBR score <0.90, another 694 INDELs located within mobile genetic element regions, then 207 INDELs located in overlapping genes (coding sequences). These filtering steps yielded a genotype matrix with dimensions 50,260 × 33,873. Next, we excluded any isolate that was dropped while constructing the SNP genotype matrix to retain the same 31,428 isolates as described above. Finally, we excluded 2,835 INDELs in which the alternate allele count=0. The genotype matrix used for downstream analysis had dimensions 47,425 × 31,428 (SI Appendix, Fig. S1A).
Phylogeny Construction.
To generate the phylogenies, we first merged the VCF files of the isolates in each group (L1, L2, L3, L4A, L4B, L4C, L5, L6) with bcftools (60) using only SNPs detected within the VCF files. We then removed repetitive, antibiotic resistance, and low-coverage regions (17). We generated a multisequence FASTA alignment from the merged VCF file with vcf2phylip (version 1.5, https://doi.org/10.5281/zenodo.1257057) to create a SNP concatenate of all samples, in addition to a sequence of Mycobacterium canettii, which we used as the outgroup. We constructed the phylogenetic trees with IQ-TREE (63). For all groups, we used the mset option to restrict model selection to GTR models (-mset GTR), and specified 1,000 bootstrap replicates for both ultrafast bootstrap and SH-aLRT algorithms to compute support values (-bb 1000 -alrt 1000). To construct phylogenies for groups L1, L2, L3, L4A, L4B & L4C, we specified the substitution model as GTR+F+I+R (-m GTR+F+I+R). To construct phylogenies for groups L5 & L6, we implemented the automatic model selection with ModelFinder Plus (-m MFP) (64). The runtimes to construct the phylogenies were: L1 (2 d, 1.5 h), L2 (63 d, 9 h), L3 (11 d, 20 h), L4A (6 d, 11 h), L4B (6 d, 18 h), L4C (2 d, 18 h), L5 (4 min), and L6 (2.5 min). Upon closer inspection of the phylogenies, we observed that a handful of isolates (14/31,428) were misclassified based on the SNP barcode. The misclassified isolates belonged to the following groups: L1 (3), L2 (4), L3 (2), L4A (1), L4B (0), L4C (4), L5 (0), and L6 (0). The small number of mistyped isolates did not affect our inferences, so we kept these phylogenies for downstream analyses.
Assessment of Parallel Evolution for SNVs.
To quantify the number of independent arisals for each SNV, we used the SNP genotype matrix in conjunction with the phylogenies for each isolate group (SI Appendix, Fig. S1B). We used an ancestral reconstruction approach to quantify the number of times each SNV arose independently within each phylogeny using SNPPar (SI Appendix, Fig. S5B) with options: --sorting intermediate --no_all_calls --no_homoplasic (18). We parsed the SNPPar output files all_mutation_events.tsv and node_sequences.fasta to check each mutation reported in the mutation events table against the inferred sequences at the nodes of the phylogeny and the isolate sequences. Mutations that were not found in the sequences were discarded; the number of reported mutation events not located between inferred node/isolate sequences is broken down by phylogeny as follows: L1 (447), L2 (2,472), L3 (392), L4A (839), L4B (1,177), L4C (559), L5 (2), and L6 (3). We then parsed the filtered mutation events tables corresponding to each isolate group and counted the number of times each unique SNV in our dataset was inferred to have arisen, counting only the number of times that the major allele (ancestor call) mutated toward the minor allele (derived call) for each SNV (SI Appendix, Fig. S5B). This yielded an Hs, or an estimate for the number of independent arisals, for each SNV across all 31,428 isolates (SI Appendix, Table S1 and Dataset S1). We note that 1,920/836,901 SNVs in our SNP genotype matrix had an Hs = 0; this was likely due to error in the ancestral reconstructions, or may have been the result of subsetting isolates into groups before running ancestral reconstruction (i.e., if an SNV is fixed in isolates belonging to one of the phylogenies but not called in any other isolates, no mutation event would be reported). These SNVs were dropped from downstream analysis.
Assessment of Parallel Evolution for INDELs.
To quantify the number of independent arisals for each INDEL, we developed a simple method to count the number of times a given allele “breaks” the phylogenies (SI Appendix, Fig. S5C). If a given minor/alternate allele is observed in two separate parts of a phylogeny, then we can assume that this allele arose twice in the pool of isolates used to construct the tree. If the minor/alternate allele is observed in three separate parts of the phylogeny, then we assume that the allele arose independently three times. We extended this idea to count the total number of times a given minor/alternate allele arises within a phylogeny. To do this, we specify a minor/alternate allele of interest and code the phylogeny tips (according to whether the corresponding isolates harbor the allele) as follows: minor/alternate allele = 1, major/reference allele = 0, low quality call = 9. We create a vector from the coded phylogeny tips and then count the number of times each consecutive string of 1s appears in the vector. These consecutive 1s (“1 blocks”) must be separated by 0s on either side, and the number of 0s required in between the strings of 1s is controlled by the spacer parameter. If spacer = 1, then only one 0 is required in between 1 blocks to count different arisals. If spacer = 2, then two 0s are required between 1 blocks to count them as separate arisals (SI Appendix, Fig. S5C). We allowed the presence of 9s in the 1 blocks as long as a 1 was present in the block. As an example, suppose that a phylogeny of 15 isolates had tips coded as [0,0,1,1,0,1,0,0,0,1,1,1,0,0,0] for a given allele. If spacer = 1, then [0,0,1,1,0,1,0,0,0,1,1,1,0,0,0] would correspond to three 1 blocks (or three sets of consecutive bolded numbers), and we would infer three independent arisals or an Hs = 3. If spacer = 2, then [0,0,1,1,0,1,0,0,0,1,1,1,0,0,0] would correspond to two 1 blocks (or two sets of consecutive bolded numbers), and we would infer two independent arisals or an Hs = 2. 
Higher values of the spacer parameter yield more conservative estimates for Hs calculations.
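The block-counting idea can be sketched in a few lines. This is an illustrative reimplementation, not the authors' TopDis code, and it rests on two labeled assumptions: tips with low-quality calls (9) are simply ignored before counting, and a 1 block is counted only if it is flanked by at least `spacer` 0s on both sides (so blocks touching either end of the vector are not counted). The function name `count_arisals` is hypothetical.

```python
from itertools import groupby

def count_arisals(tips, spacer=1):
    """Count 1 blocks in a coded tip vector (1 = allele, 0 = no allele, 9 = missing)."""
    # Assumption: low-quality calls (9) are dropped before counting.
    calls = [t for t in tips if t != 9]
    # Run-length encode the vector as [value, length] pairs.
    runs = [[value, len(list(group))] for value, group in groupby(calls)]

    # Merge 1 blocks separated by fewer than `spacer` zeros.
    merged = []
    for i, (value, length) in enumerate(runs):
        internal = 0 < i < len(runs) - 1
        if value == 0 and internal and length < spacer:
            value = 1  # gap too short to separate two arisals
        if merged and merged[-1][0] == value:
            merged[-1][1] += length
        else:
            merged.append([value, length])

    # Assumption: count only blocks with >= `spacer` zeros on both flanks.
    count = 0
    for i, (value, _length) in enumerate(merged):
        if value != 1:
            continue
        left_ok = i > 0 and merged[i - 1][1] >= spacer
        right_ok = i < len(merged) - 1 and merged[i + 1][1] >= spacer
        if left_ok and right_ok:
            count += 1
    return count

example = [0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0]
hs_spacer1 = count_arisals(example, spacer=1)
hs_spacer2 = count_arisals(example, spacer=2)
```

Applied to the 15-tip example from the text, this sketch gives three blocks with spacer = 1 and two blocks with spacer = 2. How the original method treats 9s outside of 1 blocks and blocks near the ends of the vector may differ from the assumptions made here.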
We calculated an Hs by counting these topology disruptions (TopDis) or “blocks” for SNVs using the SNP genotype matrix in conjunction with the phylogenies for each isolate group to assess the number of independent arisals for each mutation observed, coding the tips as 1 if they carried the minor allele for each SNV (SI Appendix, Figs. S1C and S5C). We computed these Hs for different values of the spacer parameter (1-6) to assess the congruence of these estimates with the Hs computed from the ancestral reconstructions (SI Appendix, Fig. S6). The results were concordant between both methods, although TopDis appeared to overestimate the Hs for some SNVs with spacer = 1 and spacer = 2 (SI Appendix, Fig. S6 A and B). These results validated our approach for computing Hs using TopDis. To compute the Hs for INDELs, we conservatively chose spacer = 4, at which point the Hs for each SNV computed from TopDis appeared to be equal to or less than the Hs computed from SNPPar (SI Appendix, Fig. S6D). To quantify the number of independent arisals for each INDEL, we used the INDEL genotype matrix in conjunction with the phylogenies for each isolate group as input to TopDis with spacer = 4 (SI Appendix, Fig. S1D), coding the tips as 1 if they carried the alternate allele for each INDEL (SI Appendix, Fig. S5C). We note that 1,119/47,425 INDELs had Hs = 0; this may have been the result of subsetting isolates into groups before running TopDis (i.e., if an INDEL is fixed in isolates belonging to one of the phylogenies but not called in any other isolates, no “block” would be observed) or of the INDEL alternate allele being present only at the ends of the coded phylogeny tips vector. These INDELs were dropped from downstream analysis.
Media.
M. tuberculosis H37Rv and M. smegmatis were both grown in 7H9 broth with 0.05% Tween 80, 0.2% glycerol, and OADC (oleic acid-albumin-dextrose-catalase; Becton, Dickinson); transformants were selected on 7H10 plates with 0.5% glycerol and OADC. When needed, the following supplements were added: kanamycin (25 μg/mL), hygromycin (50 μg/mL), and anhydrotetracycline (aTc).
Recombineering Single-Nucleotide espA Mutant.
M. tuberculosis harboring pKM402 (65) and pKM427 (66) were grown in 30 mL 7H9 media containing OADC, 0.2% glycerol, 0.05% Tween 80, and 25 μg/mL kanamycin. aTc was added to a final concentration of 500 ng/mL at OD ~0.4. Electroporation was performed as described (66) using 2 μg espA target oligo (GGCCTACAGTCTGGCTGTCATGCTTGGCCGATGTCAACAGTTTTTTCATGCTAAGCAGATCGTCAGTTTTGAGTTCGTGAAGACGG) and 200 ng hygR repair oligo (CGGTCCAGCAGCCGGGGCGAGAGGTAGCCCCACCCGCGGTGGTCCTCGACGGTCGCCGCG). Candidate clones were expanded into 4 mL 7H9-OADC-Tween with 50 μg/mL hygromycin. The upstream region of Rv3616 was amplified by PCR using the following primers: GACCGGGATGTAGGTCAGGTC and GCTAGGTGTTTAGCGGACGCG. The PCR product was sequenced with GCTAGGTGTTTAGCGGACGCG as a primer to confirm the presence of the mutation.
RNA Extraction.
Here, 10 mL of WT and mutant H37Rv were grown at 37 °C in 7H9-OADC-Tween to an OD ~0.6. Immediately prior to harvesting RNA, 40 mL of guanidine isothiocyanate buffer was added to the culture (5 M guanidine isothiocyanate, 0.5% N-lauryl sarcosine, 25 mM trisodium citrate, 0.1 M beta-mercaptoethanol, 0.5% Tween, pH = 7.0). Bacteria were collected by centrifugation at 4,000 rpm for 10 min at 4 °C, resuspended in 500 μL of Trizol reagent, and lysed with lysing matrix B (MP Bio) by bead beating three times at 6.5 m/s for 45 s. After centrifugation, 100 μL of chloroform was added to the supernatant, and the samples were inverted several times and incubated at room temperature for 3 min. Samples were then centrifuged at 10,000 × g at 4 °C for 15 min. RNA was then extracted using the Zymo Research Direct-zol RNA Miniprep Plus Kit. Samples were processed according to the manufacturer’s instructions, including a 15 min on-column DNase digestion. After eluting in 50 µL, an additional DNase digestion was performed using NEB RNase-free DNase I, bringing the total volume of the reaction up to 100 µL. Samples were incubated at 37 °C for 2 h. Then, 100 µL of the reaction was added to 400 µL of Trizol reagent, to which 500 µL of ethanol was then added. RNA extraction using the Direct-zol kit was repeated as before, but this time skipping the on-column DNase digestion. Samples were eluted in 50 µL of water and the concentration of each sample was determined via NanoDrop.
The RNA extracted from the espA mutant was sequenced on an Illumina HiSeq 4000 in paired-end mode, with a read length of 2 × 150 bp. Two runs were performed, and 3 replicates each for the espA mutant (+1 bp insertion in homopolymer) and wild type (M. tuberculosis H37Rv) were collected on each run. The reads were mapped to the H37Rv genome (Genbank accession NC_000962.2) using BWA (v0.7.12), and read counts for each ORF (open reading frame) were tabulated. The R package DeSeq (67) was used to analyze the counts and identify differentially expressed genes as genes with an adjusted P-value < 0.05 (after multiple-testing correction). DeSeq internally normalizes the count data by computing scaling factors for each dataset. The model was fit with 2 covariates, strain and run, and the statistical analysis was based on the strain coefficient (as contrast), to evaluate the average effect of the espA mutant on the counts for each gene relative to the WT samples from the same run.
RNAseq Library Preparation.
Using the Illumina Ribo-Zero Plus rRNA Depletion Kit and NEBNext® Ultra II Directional RNA Library Prep Kit for Illumina, 250 ng of total RNA was processed. Adaptor-ligated DNA was PCR enriched for 9 cycles according to the protocol using indexed primers from NEBNext Multiplex Oligos for Illumina. Samples were purified using SPRIselect Beads at each clean-up step. Prepared libraries were diluted to equal concentrations and pooled at a concentration of 30 nM. Samples were processed on an Illumina HiSeq 4000 machine with a 2 × 150 basepair sequencing configuration.
Plasmid Construction.
A homopolymer frameshifting reporter was constructed from a hygromycin-resistant pDE43-MCtH vector, which is a version of pDE43-MCK with a swapped antibiotic marker [Addgene plasmid #49523; (68)]. Using Gibson assembly, the homopolymer sequence from glpK (Rv3696c) along with 79 basepairs of flanking sequence (40 basepairs preceding, and 39 basepairs following, the homopolymer) was fused to an out-of-frame kanamycin resistance cassette, such that the addition of a single-nucleotide insertion would produce an in-frame kanamycin resistance gene, all of which is driven by a P16 mycobacterial-specific promoter (a gift from Dirk Schnappinger).
Fluctuation Analysis.
M. smegmatis harboring the homopolymer-frameshifting reporter was thawed from glycerol stock and grown in 4 mL 7H9-OADC-Tween to an OD of ~1.0. This culture was split and diluted into 20 parallel cultures, each at an OD of 0.01. These cultures were rotated at 37 °C for ~20 h. Total bacterial numbers were determined by plating on 7H10 plates with OADC and 0.5% glycerol. To enumerate frameshifted mutants, the entirety of each culture was plated on 7H10 with 25 µg/mL kanamycin. Plates were incubated at 37 °C for 4 to 5 d prior to counting colonies. In a subset of kanamycin-resistant colonies, frameshifts were verified by PCR with the following primers: GCTCGAATTCACTGGCCATGCATC and GATCCTGGTATCGGTCTGCGATTC. The PCR product was then sequenced using GATCCTGGTATCGGTCTGCGATTC as a primer. After accounting for the proportion of kanamycin-resistant colonies that contained a frameshifted homopolymer (20/28), the mutation rate was calculated as described in ref. 69.
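To make the logic of a fluctuation analysis concrete, the following Python 3 sketch uses the classical p0 (null-class) estimator, which is a simpler stand-in for the bz-rates estimator of ref. 69 and is not the calculation performed in this study. The function name, its arguments, and the way the confirmed fraction (e.g., 20/28) is applied as a simple multiplier are assumptions made for illustration.

```python
import math

def p0_mutation_rate(mutant_counts, cells_per_culture, confirmed_fraction=1.0):
    """Estimate a per-cell mutation rate with the p0 method.

    mutant_counts: resistant colonies counted in each parallel culture.
    cells_per_culture: final number of cells per culture (from total plating).
    confirmed_fraction: share of resistant colonies verified to carry the
        frameshift; applied as a multiplier (an approximation).
    """
    n = len(mutant_counts)
    zero = sum(1 for c in mutant_counts if c == 0)
    if zero == 0 or zero == n:
        raise ValueError("p0 method needs some, but not all, cultures without mutants")
    # Poisson null class: P(no mutation events) = exp(-m), so m = -ln(p0).
    m = -math.log(zero / n)
    return confirmed_fraction * m / cells_per_culture
```

The p0 method only works when some cultures yield no mutants, which is one reason estimators that use the full distribution of mutant counts are generally preferred.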
Homopolymeric Tract and SSR Regions.
We used the H37Rv reference genome to search for positions that corresponded to homopolymeric tracts and SSRs in the M. tuberculosis genome. As phase variation has been documented with repeat units of between 1 and 7 nucleotides (5), we first classified regions in which a single nucleotide is repeated at least 7 times as homopolymeric tracts (HT), given their recent association with antibiotic tolerance (7, 8).
We scanned the genome for HTs at least 7 bp in length and found 145 HT regions covering 1,024 bp, or 0.023% of the genome (Dataset S3). Next, we searched the genome for regions in which a repeat unit of 2 to 6 bp, with any combination of nucleotides, was repeated at least 3 consecutive times (enumerating every four-nucleotide permutation of a 7 bp unit yields too many possibilities to hold in memory). We classified these regions as SSRs and found them to cover 99,665 bp, or 2.26% of the genome. While we observed 145 HT regions and 18,316 SSR regions across H37Rv, we were only able to ascertain 121 HT and 17,689 SSR regions, as only these contained no positions excluded during our quality control steps.
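The two scans described above can be sketched with regular expressions in Python 3. This is a minimal illustration under stated assumptions, not the authors' script: the pattern names, function names, and 0-based half-open coordinates are hypothetical, and the sketch uses backreferences rather than enumerating repeat units.

```python
import re

# One base repeated 7 or more times (homopolymeric tract).
HT_PATTERN = re.compile(r"([ACGT])\1{6,}")
# A 2 to 6 bp unit followed by at least 2 more copies (3+ consecutive repeats).
SSR_PATTERN = re.compile(r"([ACGT]{2,6}?)\1{2,}")

def find_homopolymer_tracts(seq):
    """Return (start, end, base) for each HT; 0-based, end-exclusive."""
    return [(m.start(), m.end(), m.group(1))
            for m in HT_PATTERN.finditer(seq.upper())]

def find_ssrs(seq):
    """Return (start, end, unit) for each SSR; 0-based, end-exclusive."""
    hits = []
    for m in SSR_PATTERN.finditer(seq.upper()):
        unit = m.group(1)
        # Skip units such as 'CC' that are just a homopolymer in disguise.
        if len(set(unit)) == 1:
            continue
        hits.append((m.start(), m.end(), unit))
    return hits
```

Because `finditer` returns non-overlapping matches, a repeat that overlaps an earlier match can be missed, so this sketch would need refinement before its region counts could be compared with those reported here.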
Association between Frameshifts in HTs and Antibiotic Resistance.
In order to study the potential associations between the presence/absence of frameshift INDELs (relative to H37Rv) in specific HTs and antibiotic resistance, we used a publicly available dataset of antibiotic resistance phenotypic data (https://github.com/farhat-lab/resdata-ng/blob/master/resistance_data/summary_tables/resistance_summary.txt) (70). We determined the associations using a linear mixed model as implemented in GEMMA (71), allowing a maximum missingness of 1% (-miss parameter) and a minimum minor allele frequency of 1% (-maf parameter). In order to correct for population structure, we used a matrix of all SNP differences between the isolates tested. Finally, P-values were corrected for multiple testing using the Bonferroni method. For each test between frameshifts in a particular HT and antibiotic, we ensured that we had resistant isolates to that antibiotic in our sample.
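The variant filters and the multiple-testing correction described above can be sketched in Python 3 as follows. This is a minimal illustration, not GEMMA's code or the analysis script used here: the function names and the 0/1/None genotype encoding are assumptions, and the thresholds simply mirror the 1% missingness and 1% minor allele frequency settings stated in the text.

```python
def passes_filters(genotypes, max_missing=0.01, min_maf=0.01):
    """Apply missingness and minor-allele-frequency filters to one variant.

    genotypes: one entry per isolate; 1 = frameshift present, 0 = absent,
    None = missing call (hypothetical encoding for this sketch).
    """
    n = len(genotypes)
    called = [g for g in genotypes if g is not None]
    if not called or (n - len(called)) / n > max_missing:
        return False
    freq = sum(called) / len(called)
    return min(freq, 1 - freq) >= min_maf

def bonferroni(p_values):
    """Bonferroni correction: multiply each P-value by the number of tests, capped at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]
```

For example, with three tests, a raw P-value of 0.0001 becomes 0.0003 after correction, while a raw P-value of 0.5 is capped at 1.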
Supplementary Material
Appendix 01 (PDF)
Dataset S01 (XLSX)
Dataset S02 (XLSX)
Dataset S03 (XLSX)
Dataset S04 (XLSX)
Dataset S05 (XLSX)
Dataset S06 (XLSX)
Dataset S07 (XLSX)
Dataset S08 (XLSX)
Dataset S09 (XLSX)
Acknowledgments
We thank the members of the Farhat lab for helpful discussions and comments on the research project and manuscript. R.V. was supported by the NSF Graduate Research Fellowship under Grant No. DGE1745303. M.R.F. was supported by NIH NIAID R01 AI55765. E.A.C. was supported by NIH R01 GM114450. Portions of this research were conducted on the O2 High Performance Compute Cluster, supported by the Research Computing Group, at Harvard Medical School. Portions of the paper were developed from the doctoral dissertation of R.V.
Author contributions
R.V., E.A.C., T.R.I., C.M.S., and M.R.F. designed research; R.V., M.J.L., L.F., R.F., and M.R.F. performed research; R.V. and K.C.M. contributed new reagents/analytic tools; R.V., M.J.L., L.F., M.M., R.F., K.C.M., E.A.C., T.R.I., C.M.S., and M.R.F. analyzed data; and R.V. and M.R.F. wrote the paper.
Competing interests
The authors declare no competing interest.
Footnotes
Preprint: https://www.biorxiv.org/content/10.1101/2022.06.10.495637v1 (CC-BY-NC-ND 4.0 International license).
This article is a PNAS Direct Submission.
Contributor Information
Roger Vargas, Jr., Email: roger.vargas.1793@gmail.com.
Maha Reda Farhat, Email: Maha_Farhat@hms.harvard.edu.
Data, Materials, and Software Availability
M. tuberculosis sequencing data were collected from NCBI and are publicly available (Materials and Methods). All packages and software used in this study are noted in the Materials and Methods. Custom scripts written in Python 2 were used to conduct all analyses and were run via Jupyter Notebooks. The Jupyter Notebooks and scripts written for data processing and analysis can be found in the following GitHub repository: https://github.com/farhat-lab/phase-variation-in-Mtbc (72).
References
- 1. Shaw G. B., “Practical uses of litmus paper in Möbius strips” (Tech. Rep. CUCS-29-82, Columbia University, NY, 1982).
- 2. Gagneux S., Ecology and evolution of Mycobacterium tuberculosis. Nat. Rev. Microbiol. 16, 202–213 (2018).
- 3. Ngabonziza J. C. S., et al., A sister lineage of the Mycobacterium tuberculosis complex discovered in the African Great Lakes region. Nat. Commun. 11, 1–11 (2020).
- 4. Coscolla M., et al., Phylogenomics of Mycobacterium africanum reveals a new lineage and a complex evolutionary history. Microb. Genomics 7, 000477 (2021).
- 5. Van Der Woude M. W., Bäumler A. J., Phase and antigenic variation in bacteria. Clin. Microbiol. Rev. 17, 581–611 (2004).
- 6. Bellerose M. M., et al., Common variants in the glycerol kinase gene reduce tuberculosis drug efficacy. mBio 10, e00663-19 (2019).
- 7. Safi H., et al., Phase variation in Mycobacterium tuberculosis glpK produces transiently heritable drug tolerance. Proc. Natl. Acad. Sci. U.S.A. 116, 19665–19674 (2019).
- 8. Vargas R., Farhat M. R., Antibiotic treatment and selection for glpK mutations in patients with active tuberculosis disease. Proc. Natl. Acad. Sci. U.S.A. 117, 3910–3912 (2020).
- 9. Pepperell C., et al., Bacterial genetic signatures of human social phenomena among M. tuberculosis from an Aboriginal Canadian population. Mol. Biol. Evol. 27, 427–440 (2010).
- 10. Brynildsrud O. B., et al., Global expansion of Mycobacterium tuberculosis lineage 4 shaped by colonial migration and local adaptation. Sci. Adv. 4, eaat5869 (2018).
- 11. Farhat M. R., et al., Genomic analysis identifies targets of convergent positive selection in drug-resistant Mycobacterium tuberculosis. Nat. Genet. 45, 1183–1183 (2013).
- 12. Manson A. L., et al., Genomic analysis of globally diverse Mycobacterium tuberculosis strains provides insights into the emergence and spread of multidrug resistance. Nat. Genet. 49, 395–402 (2017).
- 13. Chiner-Oms Á., et al., Genomic determinants of speciation and spread of the Mycobacterium tuberculosis complex. Sci. Adv. 5, eaaw3307 (2019).
- 14. Holt K. E., et al., Frequent transmission of the Mycobacterium tuberculosis Beijing lineage and positive selection for the EsxW Beijing variant in Vietnam. Nat. Genet. 50, 849–849 (2018).
- 15. Vargas R., et al., In-host population dynamics of Mycobacterium tuberculosis complex during active disease. Elife 10, e61805 (2021).
- 16. Vargas R. Jr., “A genomics love story: Tying the knot between micro-and macroevolution in the Mycobacterium tuberculosis complex”, Doctoral dissertation, Harvard University Graduate School of Arts and Sciences (2021).
- 17. Freschi L., et al., Population structure, biogeography and transmissibility of Mycobacterium tuberculosis. Nat. Commun. 12, 6099 (2021).
- 18. Edwards D. J., Duchêne S., Pope B., Holt K. E., SNPPar: Identifying convergent evolution and other homoplasies from microbial whole-genome alignments. Microb. Genom. 7, 000694 (2020).
- 19. Coscolla M., Gagneux S., Consequences of genomic diversity in Mycobacterium tuberculosis. Semin. Immunol. 26, 431–444 (2014).
- 20. O’Neill M. B., et al., Lineage specific histories of Mycobacterium tuberculosis dispersal in Africa and Eurasia. Mol. Ecol. 28, 3241–3256 (2019).
- 21. Menardo F., Duchêne S., Brites D., Gagneux S., The molecular clock of Mycobacterium tuberculosis. PLoS Pathog. 15, e1008067 (2019).
- 22. Coll F., et al., Genome-wide analysis of multi-and extensively drug-resistant Mycobacterium tuberculosis. Nat. Genet. 50, 307 (2018).
- 23. Shell S. S., et al., Leaderless transcripts and small proteins are common features of the mycobacterial translational landscape. PLoS Genet. 11, e1005641 (2015).
- 24. Gerrick E. R., et al., Small RNA profiling in Mycobacterium tuberculosis identifies MrsI as necessary for an anticipatory iron sparing response. Proc. Natl. Acad. Sci. U.S.A. 115, 6464–6469 (2018).
- 25. Ektefaie Y., Dixit A., Freschi L., Farhat M. R., Globally diverse Mycobacterium tuberculosis resistance acquisition: A retrospective geographical and temporal analysis of whole genome sequences. Lancet Microbe 2, e96–e104 (2021).
- 26. Estrem S. T., et al., Bacterial promoter architecture: Subsite structure of UP elements and interactions with the carboxy-terminal domain of the RNA polymerase α subunit. Genes Dev. 13, 2134–2147 (1999).
- 27. Hunt D. M., et al., Long-range transcriptional control of an operon necessary for virulence-critical ESX-1 secretion in Mycobacterium tuberculosis. J. Bacteriol. 194, 2307–2320 (2012).
- 28. Blasco B., et al., Virulence regulator EspR of Mycobacterium tuberculosis is a nucleoid-associated protein. PLoS Pathog. 8, e1002621 (2012).
- 29. Pang X., et al., MprAB regulates the espA operon in Mycobacterium tuberculosis and modulates ESX-1 function and host cytokine response. J. Bacteriol. 195, 66–75 (2013).
- 30. Uplekar S., Heym B., Friocourt V., Rougemont J., Cole S. T., Comparative genomics of esx genes from clinical isolates of Mycobacterium tuberculosis provides evidence for gene conversion and epitope variation. Infect. Immun. 79, 4042–4049 (2011).
- 31. Brennan M. J., The enigmatic PE/PPE multigene family of mycobacteria and tuberculosis vaccination. Infect. Immun. 85, e00969-16 (2017).
- 32. Pepperell C. S., et al., The role of selection in shaping diversity of natural M. tuberculosis populations. PLoS Pathog. 9, e1003543 (2013).
- 33. Tak U., Dokland T., Niederweis M., Pore-forming Esx proteins mediate toxin secretion by Mycobacterium tuberculosis. Nat. Commun. 12, 1–17 (2021).
- 34. Garces A., et al., EspA acts as a critical mediator of ESX-1-dependent virulence in Mycobacterium tuberculosis by affecting bacterial cell wall integrity. PLoS Pathog. 6, e1000957 (2010).
- 35. Walker T. M., et al., Whole-genome sequencing to delineate Mycobacterium tuberculosis outbreaks: A retrospective observational study. Lancet Infect. Dis. 13, 137–146 (2013).
- 36. Kirubakar G., Schäfer H., Rickerts V., Schwarz C., Lewin A., Mutation on lysX from Mycobacterium avium hominissuis impacts the host–pathogen interaction and virulence phenotype. Virulence 11, 132–144 (2020).
- 37. Montoya-Rosales A., et al., lysX gene is differentially expressed among Mycobacterium tuberculosis strains with different levels of virulence. Tuberculosis 106, 106–117 (2017).
- 38. Sassetti C. M., Rubin E. J., Genetic requirements for mycobacterial survival during infection. Proc. Natl. Acad. Sci. U.S.A. 100, 12989–12994 (2003).
- 39. Lim Z. L., Drever K., Dhar N., Cole S. T., Chen J. M., Mycobacterium tuberculosis EspK has active but distinct roles in the secretion of EsxA and EspB. J. Bacteriol. 204, e00060-22 (2022).
- 40. Covert B. A., Spencer J. S., Orme I. M., Belisle J. T., The application of proteomics in defining the T cell antigens of Mycobacterium tuberculosis. Proteomics 1, 574–586 (2001).
- 41. Guinn K. M., et al., Individual RD1-region genes are required for export of ESAT-6/CFP-10 and for virulence of Mycobacterium tuberculosis. Mol. Microbiol. 51, 359–370 (2004).
- 42. Hsu T., et al., The primary mechanism of attenuation of bacillus Calmette-Guerin is a loss of secreted lytic function required for invasion of lung interstitial tissue. Proc. Natl. Acad. Sci. U.S.A. 100, 12420–12425 (2003).
- 43. Pandey A. K., et al., NOD2, RIP2 and IRF5 play a critical role in the type I interferon response to Mycobacterium tuberculosis. PLoS Pathog. 5, e1000500 (2009).
- 44. Stanley S. A., Johndrow J. E., Manzanillo P., Cox J. S., The type I IFN response to infection with Mycobacterium tuberculosis requires ESX-1-mediated secretion and contributes to pathogenesis. J. Immunol. 178, 3143–3152 (2007).
- 45. Lewis K. N., et al., Deletion of RD1 from Mycobacterium tuberculosis mimics bacille Calmette-Guerin attenuation. J. Infect. Dis. 187, 117–123 (2003).
- 46. Stanley S. A., Raghavan S., Hwang W. W., Cox J. S., Acute infection and macrophage subversion by Mycobacterium tuberculosis require a specialized secretion system. Proc. Natl. Acad. Sci. U.S.A. 100, 13001–13006 (2003).
- 47. Sousa J., et al., Mycobacterium tuberculosis associated with severe tuberculosis evades cytosolic surveillance systems and modulates IL-1β production. Nat. Commun. 11, 1–14 (2020).
- 48. Torres Ortiz A., et al., Genomic signatures of pre-resistance in Mycobacterium tuberculosis. Nat. Commun. 12, 1–13 (2021).
- 49. DeJesus M. A., et al., Comprehensive essentiality analysis of the Mycobacterium tuberculosis genome via saturating transposon mutagenesis. MBio 8, e02133-16 (2017).
- 50. Fortune S., et al., Mutually dependent secretion of proteins required for mycobacterial virulence. Proc. Natl. Acad. Sci. U.S.A. 102, 10676–10681 (2005).
- 51. Clemmensen H. S., et al., An attenuated Mycobacterium tuberculosis clinical strain with a defect in ESX-1 secretion induces minimal host immune responses and pathology. Sci. Rep. 7, 1–13 (2017).
- 52. Gupta A., Alland D., Reversible gene silencing through frameshift indels and frameshift scars provide adaptive plasticity for Mycobacterium tuberculosis. Nat. Commun. 12, 1–11 (2021).
- 53. Zhang M., et al., EspI regulates the ESX-1 secretion system in response to ATP levels in Mycobacterium tuberculosis. Mol. Microbiol. 93, 1057–1065 (2014).
- 54. Dixit A., et al., Estimation of country-specific tuberculosis antibiograms using genomic data. medRxiv [Preprint] (2021). 10.1101/2021.09.23.21263991. Accessed 24 April 2022.
- 55. Cole S. T., Inhibiting Mycobacterium tuberculosis within and without. Philos. Trans. R. Soc. B Biol. Sci. 371, 20150506 (2016).
- 56. Benson D. A., Karsch-Mizrachi I., Lipman D. J., Ostell J., Sayers E. W., GenBank. Nucleic Acids Res. 28, 15–18 (2000).
- 57. Wood D. E., Salzberg S. L., Kraken: Ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15, 1–12 (2014).
- 58. Schmieder R., Edwards R., Quality control and preprocessing of metagenomic datasets. Bioinformatics 27, 863–864 (2011).
- 59. Li H., Durbin R., Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
- 60. Li H., et al., The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
- 61. Walker B. J., et al., Pilon: An integrated tool for comprehensive microbial variant detection and genome assembly improvement. PloS One 9, e112963 (2014).
- 62. Comas I., et al., Human T cell epitopes of Mycobacterium tuberculosis are evolutionarily hyperconserved. Nat. Genet. 42, 498–503 (2010).
- 63. Nguyen L.-T., Schmidt H. A., Von Haeseler A., Minh B. Q., IQ-TREE: A fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32, 268–274 (2015).
- 64. Kalyaanamoorthy S., Minh B. Q., Wong T. K., Von Haeseler A., Jermiin L. S., ModelFinder: Fast model selection for accurate phylogenetic estimates. Nat. Methods 14, 587–589 (2017).
- 65. Ioerger T. R., et al., Identification of new drug targets and resistance mechanisms in Mycobacterium tuberculosis. PloS One 8, e75245 (2013).
- 66. Murphy K. C., “Oligo-mediated recombineering and its use for making SNPs, knockouts, insertions, and fusions in Mycobacterium tuberculosis” in Mycobacteria Protocols, (Springer, 2021), pp. 301–321.
- 67. Love M. I., Huber W., Anders S., Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 1–21 (2014).
- 68. Kim J.-H., et al., A genetic strategy to identify targets for the development of drugs that prevent bacterial persistence. Proc. Natl. Acad. Sci. U.S.A. 110, 19095–19100 (2013).
- 69. Gillet-Markowska A., Louvel G., Fischer G., bz-rates: A web tool to estimate mutation rates from fluctuation analysis. G3 (Bethesda) 5, 2323–2327 (2015).
- 70. Gröschel M. I., et al., GenTB: A user-friendly genome-based predictor for tuberculosis resistance powered by machine learning. Genome Med. 13, 1–14 (2021).
- 71. Zhou X., Stephens M., Genome-wide efficient mixed-model analysis for association studies. Nat. Genet. 44, 821–824 (2012).
- 72. Vargas R. Jr., Phase variation in Mtbc. GitHub. https://github.com/farhat-lab/phase-variation-in-Mtbc. Accessed 30 May 2023.