Large genomic deletions delineate Mycobacterium tuberculosis L4 sublineages in South American countries

Andres Baena; Felipe Cabarcas; Juan C Ocampo; Luis F Barrera; Juan F Alzate

doi:10.1371/journal.pone.0285417

PLoS One. 2023 May 19;18(5):e0285417. doi: 10.1371/journal.pone.0285417

Editor: Li Xing⁶

PMCID: PMC10198500 PMID: 37205685

Abstract

Mycobacterium tuberculosis (Mtb) remains one of the primary human pathogens and is the causative agent of tuberculosis (TB). Mtb comprises nine well-defined phylogenetic lineages with biological and geographical disparities. Lineage L4 is the most globally widespread of all lineages and was introduced to the Americas with European colonization. Taking advantage of the many genome projects available in public repositories, we undertook an evolutionary and comparative genomic analysis of 522 L4 Latin American Mtb genomes. Initially, we performed careful quality control of public read datasets and applied several thresholds to filter out low-quality data. Using a de novo genome assembly strategy and phylogenomic methods, we identified novel South American clades that had not been described before. Additionally, we describe the genomic deletion profiles of these strains from an evolutionary perspective and report signature-like gene deletions for Mycobacterium tuberculosis L4 sublineages, some of them novel. One is a 6.5 kbp deletion that is present only in sublineage 4.1.2.1. This deletion affects a complex group of 10 genes whose putative products are annotated, among others, as a lipoprotein, a transmembrane protein, and toxin/antitoxin system proteins. The second novel deletion spans 4.9 kbp, is specific to a particular clade of the 4.8 sublineage, and affects 7 genes. The last novel deletion affects 4 genes, extends for 4.8 kbp, and is specific to some strains within the 4.1.2.1 sublineage that are present in Colombia, Peru, and Brazil.

Introduction

Mycobacterium tuberculosis (Mtb) is a major human pathogen, infecting 10 million people worldwide and killing 1.5 million in 2021 [1, 2]. Mtb belongs to the Mycobacterium tuberculosis complex (MTBC) and is classified into nine main lineages (L1 to L9) [3, 4]. Lineage L4 is the most geographically widespread of all lineages and is also the most prevalent in South America [4, 5]. Compared with other lineages, L4 has shown increased virulence in macrophages infected in vitro and in animal models, although there is variability among L4 Mtb strains [6]. In Latin America, the current distribution of L4 Mtb sublineages has been shaped mainly by European colonial movements, recent immigration, and population stratification [6].

Nowadays, Mtb strain discrimination is performed using methods that analyze specific SNPs (barcoding) and whole genome sequencing (WGS) technologies. Although SNP barcoding does not offer the same resolution as WGS, it provides rapid and valuable insights into the population structure of circulating strains. The first well-known SNP barcode initiative used 60 loci. It was later broadened to a 90-SNP barcode panel that places Mtb strains within clades inside the seven human lineages and defines 86 sublineages [4].

As observed in other bacterial species, Mtb has evolved mainly through single nucleotide mutations and genome indels [7]. Single nucleotide mutations are the primary source of variation in Mtb genomes, in addition to short indels (<50 bp) that can occur throughout the whole genome but are more commonly found in the PE-PPE genes [8–10]. On the other hand, large deletions, previously called large-sequence polymorphisms (LSPs), are also common and important because they usually disrupt CDSs but can also act as junction points between truncated genes. This phenomenon has been shown to affect a range of metabolic pathways and putative virulence factors [10]. Some of these indels may cause attenuation, as in the proven case of RD1 loss, or may cause a hyper-virulent phenotype, as in the kdpDE two-component system [11].

In this study, we assessed the quality of 866 publicly available genome projects (WGS) of the Mtb L4 lineage, isolated in 9 different Latin American countries. We then discerned the evolutionary history of 522 isolates of the L4 lineage from seven Central and South American countries. With this comprehensive phylogenetic analysis, we discovered modern country-specific lineages that thrive within South American countries and that have not been described so far. Thanks to the genome assembly strategy, we characterized the large genomic deletion profiles of these strains within an evolutionary framework and report signature-like gene deletions for certain South American L4 sublineages. We also present what is, to date, the most comprehensive analysis of large genomic deletions in the Mtb L4 lineage in Latin America.

Materials and methods

Bacterial culture and DNA extraction for the Colombian genomes

Mycobacterium tuberculosis was isolated from the sputum of HIV-negative, recently diagnosed pulmonary TB patients (PTB, n = 88). Of these PTB patients, 53% were women and 47% were men, aged between 25 and 35 years. The decontaminated sputum was grown in Ogawa Kudoh medium and then transferred to 10 mL of 7H9 liquid medium supplemented with OADC (10%), Tyloxapol (0.05%), and glycerol (0.05%). The samples were verified using the SD BIOLINE TB Ag MPT64 rapid test (Abbott, Illinois, USA). Then, the bacteria were cultured to an OD₆₀₀ of 0.5 and harvested by centrifugation (3000 rpm), and the pellet was stored at -20°C. For genomic DNA extraction, we combined previously described protocols [12, 13].

The study procedure was approved by the Universidad de Antioquia human ethical committee, Medellin, Colombia. All methods were performed in accordance with the Declaration of Helsinki. All participants were informed of the risks involved in the study and voluntarily signed written informed consent granting access to their sputum samples.

Genome sequencing of Colombian genomes

The Mycobacterium tuberculosis WGS experiments were performed by Novogene (Sacramento, CA) using an Illumina NovaSeq 6000 instrument. One TruSeq Nano DNA library was prepared for each isolate, and for each library we aimed for 1 Gb of raw bases as paired-end (PE) reads of 150 bases. All sequenced libraries showed at least 90% of Q30 bases. Raw reads were deposited at the NCBI SRA database under the bioproject PRJNA867148 (https://dataview.ncbi.nlm.nih.gov/object/PRJNA867148?reviewer=okjkdkaevi75kfu8phla1a30su).

Latin American Mycobacterium tuberculosis L4 genomic data downloaded from the NCBI SRA

We searched the NCBI SRA database for M. tuberculosis genome projects sequenced using Illumina platforms (PE reads) that involved Latin American isolates and separated them according to country of origin. We excluded all isolates labeled as M. bovis or classified into lineages other than L4. For countries with hundreds or thousands of genomes, we selected those labeled as lineage L4 whose raw read bases summed to at least 500 Mb. If the list still exceeded 500 projects, we raised the threshold to a minimum of 900 Mb of raw reads. From this repository we selected the bioprojects PRJNA755956, PRJEB30933, PRJEB29069, PRJEB44165, PRJEB27366, PRJNA670836, PRJNA671770, PRJEB8689, PRJEB7669, PRJEB41837, PRJEB29408, PRJEB23681, PRJNA749058, PRJNA595834, PRJNA454477, PRJNA300846 and PRJNA422870.

As an additional control, we classified all the isolates using the TB-profiler tool and excluded the isolates assigned to a lineage other than L4. A recent research paper [14] presents several Mtb L4 genomes from Ecuador. This genome dataset is not available in the SRA database; we requested it from the authors, who kindly shared the raw reads with us. A flow chart that depicts the genome selection strategy can be found in the supplementary material (S1 Fig).

Latin American L4 Mycobacterium tuberculosis genome analysis

We applied the same strategy to all isolates, starting from WGS paired-end reads. First, we filtered low-quality reads using CUTADAPT v2.10 [15]. Read ends with bases below the Q30 quality threshold were removed, and reads shorter than 70 bases were then excluded. Singleton reads were also excluded from further analysis.
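The filtering rules described above can be illustrated with a minimal Python sketch. It is not CUTADAPT's own quality-trimming algorithm, which is more elaborate; it simply strips bases from each end while their Phred score is below 30, applies the 70-base minimum, and discards the whole pair when either mate fails (so no singletons remain). The function names and the in-memory read representation (a sequence string plus a list of Phred scores) are assumptions made for illustration.

```python
MIN_QUAL = 30  # Phred threshold applied at read ends
MIN_LEN = 70   # minimum read length kept after trimming


def trim_ends(seq, quals, min_qual=MIN_QUAL):
    """Strip bases from both ends while their Phred score is below min_qual."""
    start, end = 0, len(seq)
    while start < end and quals[start] < min_qual:
        start += 1
    while end > start and quals[end - 1] < min_qual:
        end -= 1
    return seq[start:end], quals[start:end]


def filter_pair(read1, read2, min_len=MIN_LEN):
    """Trim both mates of a pair given as (sequence, qualities) tuples.

    Returns the trimmed pair, or None when either mate ends up shorter
    than min_len. Dropping the whole pair is what removes singletons.
    """
    trimmed1 = trim_ends(*read1)
    trimmed2 = trim_ends(*read2)
    if len(trimmed1[0]) < min_len or len(trimmed2[0]) < min_len:
        return None
    return trimmed1, trimmed2
```

In practice the equivalent behavior would be obtained through the trimming tool's own options rather than a hand-written loop; the sketch only makes the three rules explicit.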

Filtered reads were assembled using SPADES v3.14.1 [16]. The assemblies’ descriptive statistics were calculated with an in-house Python script.
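The in-house script is not part of this text, so the following is only a sketch of how such descriptive statistics are commonly computed from a list of scaffold lengths. The function name and the returned keys are hypothetical. N50 is taken here as the length of the scaffold at which the cumulative sum of lengths, sorted from longest to shortest, first reaches half of the total assembled bases.

```python
def assembly_stats(scaffold_lengths):
    """Return scaffold count, total bases, largest scaffold and N50."""
    lengths = sorted(scaffold_lengths, reverse=True)
    total = sum(lengths)
    n50 = 0
    running = 0
    for length in lengths:
        running += length
        # First scaffold at which the cumulative sum reaches half the total
        if running * 2 >= total:
            n50 = length
            break
    return {
        "scaffolds": len(lengths),
        "total_bases": total,
        "largest": lengths[0] if lengths else 0,
        "n50": n50,
    }
```

For example, scaffolds of 100, 80, 60, 40 and 20 bases total 300 bases; the cumulative sum reaches 150 at the second scaffold, so the N50 is 80.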

Average sequencing depth was calculated using the SAMTOOLS [17] coverage tool, while the alternate allele count was obtained by counting the number of variants in the VCF file created with BCFTOOLS mpileup [18].

Phylogenomic analysis

We used 2,726 conserved single-copy M. tuberculosis genes for the phylogenomic analysis. Individual genes were searched within each assembly using BLASTN [19]. Each CDS was then extracted and aligned with its respective homologues using MAFFT [20]. The aligned individual CDSs were then merged using the program catsequences (https://github.com/ChrisCreevey/catsequences). As outgroups, the M. tuberculosis lineage L2 genomes T67 and T85 were included.
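As an illustration of the merging step, the sketch below concatenates per-gene alignments into a single supermatrix and records the column range of each gene so that partitions can be defined later. It is a generic example and does not reproduce the behavior or file formats of catsequences; the function name, the dictionary-based input and the choice to pad absent taxa with gap characters are assumptions.

```python
def concatenate_alignments(gene_alignments, missing="-"):
    """Merge per-gene alignments into one supermatrix.

    gene_alignments: list of dicts {taxon: aligned_sequence}, one per gene.
    Returns (supermatrix, partitions), where partitions holds the 1-based
    inclusive column range of each gene in the concatenated matrix.
    """
    taxa = sorted({taxon for aln in gene_alignments for taxon in aln})
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    offset = 0
    for aln in gene_alignments:
        widths = {len(seq) for seq in aln.values()}
        if len(widths) != 1:
            raise ValueError("sequences within one gene alignment must share a length")
        width = widths.pop()
        for taxon in taxa:
            # Taxa lacking this gene are padded with the missing-data symbol
            pieces[taxon].append(aln.get(taxon, missing * width))
        partitions.append((offset + 1, offset + width))
        offset += width
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions
```

Keeping the column ranges alongside the matrix is what allows each gene, or group of genes, to receive its own substitution model in a partitioned analysis.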

A maximum-likelihood (ML) tree was computed using IQTREE2 v. 2.1.32 [21]. The matrix was organized into partitions with different substitution models selected according to the Bayesian Information Criterion (BIC) [22]. The matrix included 540 taxa, including references and outgroups (L2), with 56 partitions, 2,895,445 total sites, and 8,339 parsimony-informative sites. One thousand pseudoreplicates of SH-aLRT and ultra-fast bootstrap were performed, and the resulting consensus tree was edited in FigTree v1.4.4 (http://tree.bio.ed.ac.uk/software/figtree/). Tree terminals that exhibited excessively long branches were removed to reduce noise in the phylogeny.

Genome deletion analysis

To determine structural deletions, each genome was aligned to H37Rv using the MUMMER v2 DNADIFF [23] script and the ASSEMBLYTICS tool [24]. The list of deletion events (coordinates and size) for each genome was obtained using an in-house Python script that processed the Assemblytics output files. Using the H37Rv annotation, we mapped the coordinates of each deletion to the gene or genes it affects and plotted the deletion frequency in each lineage using matplotlib. Figures of selected deletions were prepared using the program Artemis [25], with the M. tuberculosis H37Rv genome downloaded from GenBank (accession GCF_000195955.2) as reference.
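The mapping of deletion coordinates to annotated genes amounts to an interval-overlap test. The sketch below shows one simple way to do it, assuming that deletions and genes are already available as 1-based inclusive reference coordinates; parsing of the Assemblytics output and of the H37Rv annotation is left out, and the function name is hypothetical.

```python
def genes_hit(deletion_start, deletion_end, genes):
    """Return the names of genes overlapped by a deletion.

    genes: iterable of (name, start, end) tuples in 1-based inclusive
    reference coordinates, with start <= end regardless of strand.
    """
    return [
        name
        for name, gene_start, gene_end in genes
        # Two closed intervals overlap when each starts before the other ends
        if gene_start <= deletion_end and deletion_start <= gene_end
    ]
```

A linear scan is adequate for a genome with a few thousand genes; for many deletions across hundreds of genomes, sorting the annotation and using binary search or an interval tree would be a natural refinement.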

Statistical and graphical analysis

Statistical and graphical analyses, such as boxplots, were performed in the R environment unless stated otherwise (R version 4.0.4, x86_64-apple-darwin17.0 (64-bit)) and RStudio (Version 1.2.1335). The scaffold and alternate allele count quality thresholds were calculated in R using the box plot upper limit formula Q3 + 1.5 × IQR.
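The upper limit formula can be sketched in Python as follows. This is an illustrative equivalent, not the R code used in the study: quartiles are computed with the standard library's inclusive method, whereas R's boxplot relies on Tukey's hinges, so the two can differ slightly on small samples. The function names are hypothetical.

```python
import statistics


def upper_fence(values):
    """Box plot upper limit: Q3 + 1.5 * IQR, with IQR = Q3 - Q1."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 + 1.5 * (q3 - q1)


def upper_outliers(values):
    """Values lying above the upper fence."""
    fence = upper_fence(values)
    return [v for v in values if v > fence]
```

For the values 1 to 9, Q1 is 3 and Q3 is 7, so the IQR is 4 and the upper fence is 7 + 1.5 × 4 = 13.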

Availability of data and materials

Raw reads were deposited at the NCBI SRA database under the bioproject PRJNA867148. There is also a reviewer link that will be active during the manuscript review process: https://dataview.ncbi.nlm.nih.gov/object/PRJNA867148?reviewer=okjkdkaevi75kfu8phla1a30su.

Results

Genome quality assessment of the Latin American M. tuberculosis L4 isolates

We searched the NCBI SRA database for Illumina WGS data of Latin-American Mtb isolates of the lineage 4 (L4), the most prevalent lineage in the Latin American region. We found WGS reads of public projects of Mtb isolates from Mexico, Guatemala, Costa Rica, Panama, Colombia, Peru, Brazil, and Argentina. We also included six Mtb L4 genomes from Ecuador whose reads are unavailable in the SRA database. In the case of Colombia, we generated 88 new genomes from infected patients that live in the Andean city of Medellin. In the case of countries like Argentina, Brazil, Peru, and Mexico, hundreds of genome projects are publicly available. For the rest, a limited number of genome projects are available, in some cases below 10.

The first analysis aimed to detect and remove genome projects that exhibited poor-quality data, based on the filtering of low-quality reads or poor assembly performance. With this goal, reads were trimmed with a Q30 threshold at both ends, and reads shorter than 70 bases were filtered out. Singleton reads were also excluded. In Fig 1, boxplots depict the general assembly statistics of the scaffolds, including scaffold average read coverage (Fig 1, panel A), total assembled bases (Fig 1, panel C), N50 value (Fig 1, panel D), largest assembled scaffold (Fig 1, panel E), and scaffold count (Fig 1, panel F). Complementarily, we analyzed the presence of alternate alleles using a self-mapping approach followed by alternate allele calling and counting for each isolate (Fig 1, panel B). Since the Mtb genome is haploid, the detection of any alternate allele indicates a mixture of genomes. Low alternate allele counts are normal even in clonal strains, but a larger number might indicate a mixture of distant strains.

As shown in the Fig 1 boxplots, the assembly metrics showed a wide dispersion in all tested variables, especially in the sequencing coverage and assembly N50 measurements. The average sequencing depth ranged between 8.89X and 1358X, with a median value of 136X. The N50 value ranged from a minimum of 188 bp to a maximum of 205,097 bp, with a median of 108,926 bp. By contrast, the total assembled bases and scaffold count showed the narrowest dispersion ranges. The 25th and 75th percentiles for the total assembled bases were 4,348,965 and 4,380,025 bp, with a median value of 4,373,122 bp. In the case of the scaffold count, the 25th and 75th percentiles were 141 and 238, with a median of 168 scaffolds. Outlier values observed in all tested variables indicated the presence of low-quality assemblies; for instance, five assemblies were below 2 Mb, 18 were above 5 Mb, and three were above 10 Mb (indicating possible severe contamination in these three genomes). On the other hand, assembly fragmentation could also indicate noisy genomic reads that lead to less accurate scaffold models. The median value of 168 scaffolds per assembly coincides with the expected performance for short-read technology. This scaffold count agrees even with the older 454 WGS strategy. The narrow interquartile range (Q1 = 141, Q3 = 238) across such a large number of analyzed genomes might indicate what can be expected from a well-performing WGS experiment for this bacterium. According to the boxplot analysis, assemblies with more than 332 scaffolds are considered upper-limit outliers. Notably, 43 assemblies were highly fragmented, with more than 1,000 scaffolds. One final metric allowed us to detect possible genome mixtures in the WGS experiments: the presence of high alternate allele counts. The median value for this metric was 111 alternate alleles in the tested genomes, corresponding to 0.0025% of the median genome assembly size. One conclusion that can be drawn from this result is that most genomes are highly clonal, and the expected accuracy of the scaffold models presented in this work should be very high for most Latin American Mtb L4 genome assemblies.

Integrating these results, we can infer that, although most of the genomes have acceptable assembly quality metrics, several genomes are of poor quality and must be removed to reduce noise in subsequent phylogenetic and comparative genomic analyses. We therefore filtered out low-quality genomes using three thresholds: i) average sequencing depth, ii) scaffold count, and iii) alternate allele count. Combining these three thresholds allows the removal of low-coverage, highly fragmented, or mixed genomes. The average sequencing depth threshold was set to 100X, as is recommended for most Illumina short-read WGS experiments. The scaffold and alternate allele count thresholds were set using the box plot upper limit formula Q3 + 1.5 × IQR. In the case of the scaffold count, the threshold was set at 332, while for the alternate alleles, it was set at 186. Using these three filters, of the 866 isolates initially included, only 522 were retained for phylogenetic and comparative genomic analysis, as follows: Argentina (n = 210), Brazil (n = 86), Colombia (n = 88), Ecuador (n = 6), Guatemala (n = 10), Mexico (n = 10), and Peru (n = 112). In Fig 1, panels G to H, we present boxplots that depict the assembly metrics of the filtered genomes.
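The combined filter can be written as a short predicate. The three threshold values below are those reported above; the field names, the function names and whether each comparison includes the threshold value itself are assumptions, since the text does not state them.

```python
MIN_DEPTH = 100.0      # average sequencing depth (X)
MAX_SCAFFOLDS = 332    # upper fence of the scaffold count
MAX_ALT_ALLELES = 186  # upper fence of the alternate allele count


def passes_qc(sample):
    """True when a genome clears all three thresholds.

    sample: dict with the keys 'depth', 'scaffolds' and 'alt_alleles'
    (hypothetical field names). Inclusive comparisons are an assumption.
    """
    return (
        sample["depth"] >= MIN_DEPTH
        and sample["scaffolds"] <= MAX_SCAFFOLDS
        and sample["alt_alleles"] <= MAX_ALT_ALLELES
    )


def filter_genomes(samples):
    """Keep only the genomes that pass every filter."""
    return [sample for sample in samples if passes_qc(sample)]
```

Requiring all three conditions at once means that a genome is discarded as soon as it is low-coverage, highly fragmented, or apparently mixed, which matches the intent of combining the thresholds.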

Phylogenomic analysis of the M. tuberculosis Latin-American L4 lineage

We wanted to gain insights into the sublineage assignment of the 522 filtered Latin American Mtb L4 genomes. Using the TB-profiler tool, we found that L4.1.1, L4.3.2, and L4.8 are the most widespread sublineages, being present in at least five different Latin American countries (Fig 2). However, L4.1.2.1 was the most prevalent sublineage, accounting for nearly 47% of the filtered genomes. Overall, L4.1 is the most successful sublineage in Latin America, representing around 57% of the analyzed (filtered) Mtb genomes. Peru showed the most diverse population of Mtb genomes, with 10 different sublineages, and Argentina was the least diverse, with just 2 sublineages (Fig 2). In a few cases, the TB-profiler tool reported two different L4 sublineages for a single isolate.

Fig 2 — Argentina (n = 210), Brazil (n = 86), Colombia (n = 88), Ecuador (n = 6), Guatemala (n = 10), Mexico (n = 10), and Peru (n = 112).

A maximum-likelihood phylogenomic tree was constructed using 2,726 parsimony-informative single-copy genes to confirm and complement the TB-profiler barcoding classification of the 522 genomes with an evolutionary perspective (Fig 3 and S2 Fig). The phylogenomic tree recovers the previously described topology of the Mtb L4 sublineages, with 100% UF-bootstrap support for all basal nodes. At the tips of the tree, the lowest branch support was 95, suggesting good confidence in the evolutionary history represented in the tree for the different Latin American isolates. Notably, the L4 lineage is divided into two major clades. The first contains only the L4.1 sublineage, and the second encompasses the remaining sublineages (4.3 to 4.9). Furthermore, the tree shows that sublineages 4.1.1.1, 4.3.2, and 4.8 each split into two lineages that have not been reported for this subcontinent. Interestingly, while sublineages 4.3 to 4.9 show a more widespread distribution across South American countries (for instance, Colombia, Brazil, and Peru), three more modern lineages of the 4.1.2.1 sublineage show a well-supported country-specific pattern in Colombia, Peru, and Argentina: 4.1.2.1Col1, 4.1.2.1Peru1, and 4.1.2.1.1Arg, respectively (Fig 3).

Fig 3 — Maximum-likelihood phylogenomic tree, constructed using 2,726 parsimony-informative single-copy genes, depicting the phylogenetic relationships among the *M. tuberculosis* Latin-American L4 QC filtered genomes. At the bottom, in brown lines, is the L2 outgroup. The terminals were collapsed to reduce the size of the figure. The colored collapsed branches indicate new lineages. Green stacked bars at the center represent the antibiotic resistance profile based on the *TB-profiler* tool. At the leftmost end of the figure, the colored stacked bars represent the frequency of each lineage within Latin-American countries.

Latin American L4 M. tuberculosis drug resistance profiles

Using the TB-profiler tool, we detected gene mutations that render the bacteria resistant to commonly used antibiotics (S3 Fig). We found a wide distribution of antibiotic-resistant strains across all lineages and sublineages. We observed isoniazid-resistant strains (Hr-TB), rifampicin-resistant strains (RR-TB), multidrug-resistant strains (MDR-TB), pre-extensively drug-resistant strains (pre-XDR-TB), and sensitive strains. Nevertheless, the majority of Mtb strains are sensitive across all sublineages.

Genome deletions analysis in Latin American L4 M. tuberculosis strains

Genomic deletions are common in M. tuberculosis genomes. To gain insights into the genome deletion profiles of the L4 Latin American isolates, and taking advantage of having the genome assemblies, we focused on large deletion events, i.e., >30 bases. Large deletions are usually overlooked by typical read mapping strategies. Compared to the H37Rv reference genome, the genomic deletions observed in the 522 filtered genomes ranged between 1 and 10,039 bases; as already reported, the most common were those involving only 1 base (n = 44,366) (Fig 4) [7]. Large deletions (≥ 30 bases) were also common, with a total count of 23,882 events. Notably, we also observed 4,111 deletion events that exceeded 1,000 lost bases.

Fig 4 — Bar plot depicting the frequency of the genomic deletions found in the 522 *M. tuberculosis* L4 QC filtered genomes, grouped by their length in bases.

Next, we wanted to study whether there is any specific signal in the deletion profiles within the Latin American L4 lineages. To do so, we performed boxplot analyses of the deletion event count, the accumulated lost bases per isolate, and the largest deletion event per isolate, grouping the genomes according to their respective evolutionary lineage (Fig 5). The number of deletion events detected per isolate varied from 9 to 142, with some lineages, like 4.1.1.1b and 4.1.2.1Col1, showing higher deletion event counts. By contrast, sublineage 4.9 accumulated fewer deletion events, with a median of 41. We calculated the total lost bases by adding the individual lengths of each deletion event. This analysis allowed us to see how many bases can be lost within each lineage. Sublineage 4.9 showed the lowest genome loss, with a median of 585 bases, while the 4.1.1.1b lineage showed the highest, with a median of 10,875 lost bases. As observed for the previous variable, total genome loss showed different patterns depending on the analyzed lineage. Lastly, the largest deletion event of each isolate was identified and compared among the Mtb L4 Latin American lineages. This boxplot analysis allowed us to differentiate, based on the median size of the largest genome deletion, several lineages or groups of related lineages that share a particular deletion, such as 4.1.2.1 (4.1.2.1.1Arg, 4.1.2.1Col1, 4.1.2.1cpb, and 4.1.2.1Peru1), 4.3.2a/b, 4.3.3, and 4.3.4 (4.3.4.1 and 4.3.4.2) (Fig 5). Remarkably, a large deletion of 6,479 bases appears to be a signature of all 4.1.2.1 strains. This deletion affects 10 contiguous genes.
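The three per-isolate quantities compared in this analysis (event count, accumulated lost bases, and largest event) and their per-lineage medians can be sketched as follows. The function names and the input layout, a list of deletion lengths per isolate paired with its lineage label, are assumptions for illustration; the deletion lengths used in any example are toy values, not study data.

```python
from collections import defaultdict
from statistics import median


def summarize_isolate(deletion_lengths):
    """Event count, total lost bases and largest deletion for one isolate."""
    return {
        "events": len(deletion_lengths),
        "lost_bases": sum(deletion_lengths),
        "largest": max(deletion_lengths, default=0),
    }


def median_by_lineage(isolates, key):
    """Median of one summary metric per lineage.

    isolates: iterable of (lineage, deletion_lengths) pairs.
    key: 'events', 'lost_bases' or 'largest'.
    """
    grouped = defaultdict(list)
    for lineage, lengths in isolates:
        grouped[lineage].append(summarize_isolate(lengths)[key])
    return {lineage: median(values) for lineage, values in grouped.items()}
```

When every isolate of a lineage carries the same large deletion, the median of the "largest" metric collapses onto that deletion's size, which is how a shared signature deletion becomes visible in this kind of summary.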

Fig 5 — Boxplots depict quantitative characteristics of deletion events observed in the 522 Latin-American *M. tuberculosis* L4 QC filtered genomes grouped according to their evolutionary lineage. The top panel presents the count of deletion events per isolate, the middle panel presents the total accumulated lost bases per isolate, and the bottom panel shows the distribution of the largest deletion event in bp.

Given the importance of gene deletions as a source of loss of gene function, we wanted to annotate and compare, from an evolutionary perspective, the genes affected by these large deletions. Thus, we positioned the coordinates of the large genomic deletions on the reference chromosome of Mtb H37Rv and identified the affected genes. We performed this analysis by grouping the deletion events according to their respective L4 lineages, and we also quantified the relative frequency of each deletion event per lineage in the form of a heatmap (Fig 6). Most of the deletion events affected a single gene, and according to the genomic coordinates of the affected genes, these losses occurred along the whole bacterial chromosome. Conversely, the largest deletion events sometimes spanned several genes. Nucleotide losses in PPE and PE_PGRS genes were the most common.
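The values behind such a heatmap are, for each gene and lineage, the proportion of genomes in the lineage that carry a deletion in that gene. A minimal sketch of that calculation is given below; the function name and the input layout, one set of affected genes per genome, are assumptions.

```python
from collections import defaultdict


def deletion_frequencies(genomes):
    """Proportion of genomes per lineage with a deletion in each gene.

    genomes: iterable of (lineage, genes) pairs, where genes is the
    collection of genes hit by at least one deletion in that genome.
    Returns {gene: {lineage: proportion}}.
    """
    totals = defaultdict(int)
    hits = defaultdict(lambda: defaultdict(int))
    for lineage, genes in genomes:
        totals[lineage] += 1
        # A gene counts once per genome even if hit by several deletions
        for gene in set(genes):
            hits[gene][lineage] += 1
    return {
        gene: {lineage: hits[gene].get(lineage, 0) / totals[lineage] for lineage in totals}
        for gene in hits
    }
```

A gene with a proportion close to 1 in one lineage and close to 0 elsewhere behaves like a lineage signature, whereas a gene with intermediate proportions across many lineages points to recurrent, independent losses.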

Fig 6 — Genes affected by deletions are ranked according to lineages. The lineages are shown on the x-axis, while the y-axis shows the genes. The intensity of the color represents the frequency of the deletion in a particular gene within each lineage (the proportion of genomes that harbor a deletion in the gene). When one deletion affects several genes, the list of all affected genes is written on the y-axis.

While some of the genes affected by deletions showed patterns related to the phylogenetic background of the isolates and look like synapomorphies, others do not follow this scheme and appear to behave more like polyphyletic characters. In this sense, it is noteworthy that genes like ppe24, ppe8, rv2277c, and accE5 accumulated mutations in most lineages, following a non-vertically driven inheritance pattern. By contrast, some deletion events served as signatures for some lineages. For instance, a single deletion event that spans genes rv3083, lipR, and rv3085 tends to act as a marker for the 4.8 sublineage in the Latin American M. tuberculosis genomes. On the other hand, a deletion that spans the genes eccD2, rv3888c, and espG2 seems specific to lineage 4.1.1.1b.

We detected the 3.6 kb genomic deletion known as RD174, which is specific to lineage 4.3.4 and extends over the genes ctpG, rv1993c, cmtR, rv1995, rv1996, and ctpF.

We also found two novel deletion events that are specific to lineage 4.1.2.1. One of these deletions affects gene PPE69, and the other is a larger deletion that spans 6,479 nucleotides and affects the genes lppN, rv2271, rv2272, rv2273, mazF8, mazE8, rv2275, cyp121, rv2277c, and rv2280 (Fig 7, panel A). Additionally, several strains of the sister clades 4.1.2.1Col1, 4.1.2.1Peru1, and 4.1.2.1cpb share a specific large deletion that spans 4,752 bases and affects genes rv1353c, rv1354c, moeY, and rv1356c (Fig 7, panel B). Finally, we observed another novel deletion that affects several strains in the lineage 4.8.a. This deletion spans 4,886 base pairs and affects genes rv0762c, rv0763c, cyp51, rv0765c, cyp123, rv0767c, and aldA (Fig 7, panel C).

Fig 7 — The top half of each figure depicts the plus DNA strand of the *M. tuberculosis* H37Rv reference genome with its nucleotide coordinates and its respective reading frames, while the bottom half depicts the minus strand similarly. Gene CDS sequences are depicted as white rectangles, while blue rectangles depict predicted peptide sequences. The deletion event on the chromosome is labeled as a yellow rectangle while the affected peptide regions are labeled with a pink rectangle. **Panel A**. Genomic deletion of 6,479 bp that is specific to the Latin American L4.1.2.1 lineage and is present in 249 genomes. This deletion affects the genes *lppN*, *rv2271*, *rv2272*, *rv2273*, *mazF8*, *mazE8*, *rv2275*, *cyp121*, *rv2277c*, and *rv2280*. **Panel B**. Genomic deletion of 4,752 bp that is common in the Latin American lineages 4.1.2.1Col1, 4.1.2.1Peru1, and 4.1.2.1cpb, and is present in 30 genomes. This deletion affects genes *rv1353c*, *rv1354c*, *moeY*, and *rv1356c*. **Panel C**. Genomic deletion of 4,886 bp that is present in 3 genomes of the Latin American lineage 4.8.a. This deletion affects genes *rv0762c*, *rv0763c*, *cyp51*, *rv0765c*, *cyp123*, *rv0767c*, and *aldA*.

Discussion

Current genomic technologies have pushed microbiology into a new era, and M. tuberculosis is probably one of the bacterial models where the progress is most notable. In this sense, phylogenomic analysis revealed that what we considered a single organism for nearly two centuries is, in fact, a complex mixture of nine different lineages with evident biological, geographical, and pathogenic dissimilarities [6, 26, 27].

Any reliable genomic analysis starts with good-quality sequencing reads, but the typical read quality metrics based on PHRED scores alone appear insufficient to filter out poor-quality genomic libraries. To tackle this situation, in the first part of this work we assessed the quality of Illumina shotgun reads of hundreds of Latin-American Mtb genomes deposited in the SRA database. We relied on commonly used descriptive statistics to compare assembly performance and used a complementary strategy to detect mixtures of different Mtb strains. Although most genomes displayed a suitable profile of raw read quality metrics, our analysis showed that nearly 40% of the read datasets were of poor quality and exhibited low coverage, contamination, or a noisy DNA sequence signal. We advise researchers to perform a similar analysis to check Mtb read datasets used for de novo assembly or for evolutionary or comparative genomic analysis. Our assembly metric thresholds might be a good starting point for Mtb genome quality filtering.

M. tuberculosis lineage 4 proved to be diverse within the Latin-American countries; some lineages were widespread among several countries, while others showed a narrower distribution [6]. We observed that Mtb sublineages L4.1.1, L4.1.2, L4.3.3, and L4.3.2 dominated in Latin-American human populations. However, we must stress that most of the strains included in this study were collected by other researchers, some in TB outbreak studies. This condition might introduce bias into the strain frequencies. Nevertheless, it must also be mentioned that we included a large number of strains covering a geographical range from Mexico to Argentina, an area of nearly 20 million km².

Our high-resolution phylogenomic analysis included 2,726 CDSs that proved to be parsimony-informative, allowing us to portray a well-supported evolutionary history of Mtb L4 in Latin America. The tree coincided with previous reports and showed good concordance with the TB-profiler tool [4–6, 27, 28]. Nevertheless, the program reported two different, sometimes unrelated, sublineages for a few strains. Our phylogenetic analyses placed each of these conflicting samples in a single lineage in the tree with 100% support. While the TB-profiler tool uses a small number of SNPs to classify the genomes (<100), our phylogenomic approach included 8,339 parsimony-informative sites.

The phylogeny also revealed that sublineages 4.8, 4.3.2, and 4.1.1.1 split into two well-supported internal clades. One additional observation is that sublineage 4.1.2.1 showed a more complex evolutionary past and a wide distribution in most of the studied South American countries. This sublineage proved to be quite successful and has spread throughout the whole subcontinent, from Colombia (the northwesternmost nation in South America) to one of the southernmost countries, Argentina. Furthermore, it comprises several internal lineages, three of them with country-specific distributions: 4.1.2.1Col1 (Colombia), 4.1.2.1Peru1 (Peru), and 4.1.2.1.1Arg (Argentina). Similar observations regarding the geographical structure of this mycobacterium have been made in comparable investigations of other M. tuberculosis lineages and continents [6, 29].

Regarding the resistance profiles of the Latin American L4 Mtb strains, it is clear that the antibiotic resistance genotype does not show a vertically inherited pattern associated with specific lineages. On the contrary, most sublineages exhibited a mixture of resistant genotypes. In Mtb and other bacterial pathogens, convergent evolution has been associated with the phenomenon of antibiotic resistance [5, 30, 31].

Genomic deletions are sculpting the Mtb L4 genomes of the future. Large deletions were observed in all Latin American L4 Mtb genomes but, interestingly, some lineages accumulated a higher number of deletions, with correspondingly larger genome losses. Deletions spanned from 1 base to 10 kb, and it was common to find events where more than 1,000 bases were lost. Ten percent of the analyzed Latin American Mtb L4 strains have lost at least 0.25% of their genome. Most large deletions affected coding sequences, which is explained in part by the high gene density of the Mtb genome [32]. Recent works on Mtb genomic deletions reported findings congruent with our results for the Mtb lineages studied [33, 34]. Nonetheless, as discussed below, our focus on Latin American genomes allowed us to detect novel large deletions that have not been reported so far.

The deletion events occurred along the whole chromosome but, as already described, there is a bias towards the repetitive gene families PE_PGRS and PPE [8, 10, 12]. Single-base losses were the most common, and they are described as a reversible phenomenon that adds genetic plasticity to this pathogenic bacterium [7]. Large DNA deletions, by contrast, seem unlikely to be reversed and, in some cases, become genomic signatures for certain sublineages. Our evolutionary analysis of the large deletions showed that some of these events serve as signatures for several South American L4 sublineages, for instance 4.1.2.1, 4.3.2, 4.3.3, and 4.3.4; these lineages occur at higher frequencies in the South American nations studied.
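The idea of a signature deletion can be sketched in a few lines of Python. The sketch below uses a deliberately strict criterion, which is an assumption for illustration and not necessarily the procedure used in this study: a deletion counts as a signature if it is present in every strain of a sublineage and absent from every other strain. The strain names and all deletion labels other than RD174 are hypothetical.

```python
from collections import defaultdict


def signature_deletions(strain_sublineage, strain_deletions):
    """Map each sublineage to deletions found in all of its strains and in no other strain."""
    members = defaultdict(set)
    for strain, sublineage in strain_sublineage.items():
        members[sublineage].add(strain)

    signatures = {}
    for sublineage, strains in members.items():
        # Deletions shared by every strain of the sublineage.
        shared = set.intersection(
            *(set(strain_deletions.get(s, ())) for s in strains)
        )
        # Deletions seen in any strain outside the sublineage.
        outside = set()
        for strain, deletions in strain_deletions.items():
            if strain not in strains:
                outside.update(deletions)
        signatures[sublineage] = shared - outside
    return signatures


# Hypothetical toy data set.
strain_sublineage = {
    "s1": "4.3.4", "s2": "4.3.4", "s3": "4.3.3",
    "s4": "4.1.2.1", "s5": "4.1.2.1",
}
strain_deletions = {
    "s1": {"RD174", "del_A"},
    "s2": {"RD174"},
    "s3": {"del_A"},
    "s4": {"del_6.5kb", "del_B"},
    "s5": {"del_6.5kb"},
}
for sublineage, dels in sorted(signature_deletions(strain_sublineage, strain_deletions).items()):
    print(sublineage, sorted(dels))
```

In real data, sequencing gaps and assembly errors would probably call for a tolerance (for example, presence in most rather than all strains) instead of this all-or-nothing rule.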

The genes ppe24, ppe8, rv2277c, and accE5 tended to accumulate deletions regardless of evolutionary lineage. PPE24 and PPE8 belong to the multicopy PPE family; they are proteins of 1,051 and 3,300 amino acids, respectively, and their function is still unknown. Gene rv2277c encodes a 301-amino-acid putative glycerolphosphodiesterase of unknown function. The gene accE5 encodes a probable bifunctional acetyl-/propionyl-coenzyme A carboxylase (epsilon chain) of 177 amino acids involved in the synthesis of long-chain fatty acids.

The signature deletion of sublineage 4.8 affects the genes rv3083, lipR, and rv3085, whose products are annotated as FAD-containing monooxygenase MymA, acetyl-hydrolase LipR, and oxidoreductase SadH, respectively. The signature deletion of sublineage 4.1.1.1b affects two proteins related to the ESX-2 type VII secretion system and a probable membrane protein.

In the case of lineage 4.3.4, the 3,650-base deletion is a signature scar in its genome. We previously reported this deletion in 2012 in a Colombian Mtb strain (UT205); it extends over the genes ctpG, rv1993c, cmtR, rv1995, rv1996, and ctpF, which are part of the DosR regulon [12]. This specific deletion has been labelled RD174. These results agree with the works recently published by Liu et al. and Bespiatykh et al. and confirm that this deletion is a signature of the 4.3.4 lineage [33, 34]. The deletion was inherited by the two branches of this lineage, 4.3.4.1 and 4.3.4.2. It is noteworthy that this deletion differentiates 4.3.4 from its sister clade 4.3.3. These two lineages have been demonstrated to have biological and virulence disparities [27].

One of the most striking lineage-specific deletions spans 6.5 kb and is specific to the 4.1.2.1 lineage. Despite previous efforts to describe large Mtb genome deletions, this deletion was overlooked in the works published by Bespiatykh et al. [34] and Liu et al. [33] and is reported for the first time in this paper. It affects a complex group of 10 genes located on both DNA strands. At the 5’ end of the deletion, on the sense strand, it affects four genes organized in an operon-like structure and annotated as lipoprotein LppN, a hypothetical protein, and two transmembrane proteins. Next, on the antisense strand, the toxin-antitoxin (TA) operon mazE8/mazF8 [35] is completely lost, as are two neighboring genes that partially overlap mazE8: rv2275 and cyp121. These two genes encode a probable cyclo(L-tyrosyl-L-tyrosyl) synthase and a protein annotated as cytochrome P450, respectively. Toxin-antitoxin loci are commonly lost in host-associated bacteria [35]. The next lost gene, rv2277c, is also located on the antisense strand and encodes a glycerolphosphodiesterase. Lastly, this large deletion ends at its 3’ end in a repetitive DNA element annotated as an IS6110 transposase. All the genes described above that are affected by this large deletion have been described as non-essential for in vitro growth [36, 37].

Two further novel deletions observed in the Latin American genomes were each shared by only part of their respective sublineages: one by 4.8a, and the other by 4.1.2.1Col1, 4.1.2.1Peru1, and 4.1.2.1cpb. The former (present in sublineage 4.8a) disrupts the genes rv0762c, rv0763c, cyp51, rv0765c, cyp123, rv0767c, and aldA. Rv0762 and Rv0767 are annotated as conserved hypothetical proteins. Rv0763 is annotated as a possible ferredoxin. Cyp51 and Cyp123 are annotated as non-essential cytochromes P450. Rv0765 is annotated as a non-essential probable oxidoreductase. Finally, AldA is annotated as a probable NAD-dependent aldehyde dehydrogenase that is also non-essential. The latter deletion partially disrupts the genes rv1353c and rv1356c and entirely removes the genes rv1354c and moeY. Rv1354c and Rv1356c are annotated as hypothetical proteins, while Rv1353c and MoeY are annotated as a probable transcriptional regulatory protein and a possible molybdopterin biosynthesis protein, respectively. All of these genes are considered non-essential on the basis of transposon mutagenesis experiments [36].
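The distinction between genes that are partially disrupted and genes that are entirely removed follows directly from comparing coordinates. The following Python sketch classifies genes relative to a single deletion interval; the coordinates and gene names are hypothetical placeholders chosen only to illustrate the three outcomes and are not the real positions of the genes named above.

```python
def classify_disruption(deletion, genes):
    """Classify each gene as 'entire', 'partial', or 'intact' relative to a deletion.

    Coordinates are 1-based and inclusive: deletion = (start, end),
    genes = {name: (start, end)}.
    """
    del_start, del_end = deletion
    result = {}
    for name, (g_start, g_end) in genes.items():
        if g_end < del_start or g_start > del_end:
            result[name] = "intact"    # no overlap with the deletion
        elif del_start <= g_start and g_end <= del_end:
            result[name] = "entire"    # gene lies fully inside the deletion
        else:
            result[name] = "partial"   # gene spans a deletion breakpoint
    return result


# Hypothetical coordinates (bp) for a deletion and five neighboring genes.
deletion = (1_000, 5_800)
genes = {
    "geneA": (400, 1_500),    # overlaps the left breakpoint
    "geneB": (1_600, 3_000),  # fully inside the deletion
    "geneC": (3_100, 5_000),  # fully inside the deletion
    "geneD": (5_200, 6_400),  # overlaps the right breakpoint
    "geneE": (6_500, 7_000),  # outside the deletion
}
print(classify_disruption(deletion, genes))
```

A deletion that truncates the genes at both of its breakpoints and removes the genes between them, as described for the second deletion above, would produce the partial, entire, entire, partial pattern shown by geneA to geneD.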

Despite the broad spectrum of deletions that we observed in the L4 Mtb lineage infecting Latin American individuals, it is clear that the bacteria are not losing their capacity to thrive in humans. Gene loss, even the loss of whole biochemical pathways, is an adaptation mechanism for specialist intracellular pathogens. This phenomenon has been demonstrated in protozoan parasites like Cryptosporidium spp. [38, 39] and in pathogenic bacteria like Shigella [40]. The latter evolved from E. coli through an accelerated process of gene loss and the acquisition of plasmids that harbor virulence genes. This process rendered the bacteria highly specialized for the human intestine and highly virulent [41–43]. For Mtb L4 Latin American strains, we might be witnessing a similar process in which gene losses shape the genetic adaptations of these mycobacteria to the changing Latin American human populations.

Supporting information

S1 Fig. Flow chart depicting the genome selection strategy.

(TIF)


S2 Fig. Maximum-likelihood phylogenomic tree, constructed using 2,726 parsimony-informative single-copy genes, depicting the phylogenetic relationships among the 522 QC-filtered Latin American M. tuberculosis L4 genomes.

At the bottom, in brown lines, is the L2 outgroup. Numbers at nodes indicate ultrafast bootstrap support. Sublineages are labeled in their respective branches and highlighted with different colors.

(TIF)


S3 Fig. Stacked bar plot representing the drug-resistant type profiles of the different Latin American M. tuberculosis L4 sublineages.

Drug-resistant types are depicted as normalized values on the X axis.

(TIF)


S1 File

(CSV)


Data Availability

All newly generated genomic read data files are available from the NCBI SRA database under BioProject PRJNA867148.

Funding Statement

Minciencias, Colombia. Code: 111584467121, Contract No. 393-2020.

References

1. Barberis I, Bragazzi NL, Galluzzo L, Martini M. The history of tuberculosis: From the first historical records to the isolation of Koch’s bacillus. Journal of Preventive Medicine and Hygiene. 2017;58: E9–E12.
2. Global Tuberculosis Programme. Global tuberculosis report 2021. 2021.
3. Coscolla M, Gagneux S, Menardo F, Loiseau C, Ruiz-Rodriguez P, Borrell S, et al. Phylogenomics of Mycobacterium africanum reveals a new lineage and a complex evolutionary history. Microbial Genomics. 2021;7: 000477. doi: 10.1099/mgen.0.000477
4. Napier G, Campino S, Merid Y, Abebe M, Woldeamanuel Y, Aseffa A, et al. Robust barcoding and identification of Mycobacterium tuberculosis lineages for epidemiological and clinical studies. Genome Medicine. 2020;12: 1–10. doi: 10.1186/s13073-020-00817-3
5. Brynildsrud OB, Pepperell CS, Suffys P, Grandjean L, Monteserin J, Debech N, et al. Global expansion of Mycobacterium tuberculosis lineage 4 shaped by colonial migration and local adaptation. Science Advances. 2018;4: eaat5869. doi: 10.1126/sciadv.aat5869
6. Stucki D, Brites D, Jeljeli L, Coscolla M, Liu Q, Trauner A, et al. Mycobacterium tuberculosis lineage 4 comprises globally distributed and geographically restricted sublineages. Nature Genetics. 2016;48: 1535–1543. doi: 10.1038/ng.3704
7. Gupta A, Alland D. Reversible gene silencing through frameshift indels and frameshift scars provide adaptive plasticity for Mycobacterium tuberculosis. Nature Communications. 2021;12: 1–11. doi: 10.1038/s41467-021-25055-y
8. Ates LS. New insights into the mycobacterial PE and PPE proteins provide a framework for future research. Molecular Microbiology. 2020;113: 4–21. doi: 10.1111/mmi.14409
9. Karboul A, Mazza A, van Pittius NCG, Ho JL, Brousseau R, Mardassi H. Frequent Homologous Recombination Events in Mycobacterium tuberculosis PE/PPE Multigene Families: Potential Role in Antigenic Variability. Journal of Bacteriology. 2008;190: 7838–7846. doi: 10.1128/JB.00827-08
10. Tsolaki AG, Hirsh AE, DeRiemer K, Enciso JA, Wong MZ, Hannan M, et al. Functional and evolutionary genomics of Mycobacterium tuberculosis: Insights from genomic deletions in 100 strains. Proceedings of the National Academy of Sciences. 2004;101: 4865–4870. doi: 10.1073/pnas.0305634101
11. ten Bokum AMC, Movahedzadeh F, Frita R, Bancroft GJ, Stoker NG. The case for hypervirulence through gene deletion in Mycobacterium tuberculosis. Trends in Microbiology. 2008;16: 436–441. doi: 10.1016/j.tim.2008.06.003
12. Isaza JP, Duque C, Gomez V, Robledo J, Barrera LF, Alzate JF. Whole genome shotgun sequencing of one Colombian clinical isolate of Mycobacterium tuberculosis reveals DosR regulon gene deletions. FEMS Microbiology Letters. 2012;330: 113–120. doi: 10.1111/j.1574-6968.2012.02540.x
13. Warren RM, Gey van Pittius NC, Barnard M, Hesseling A, Engelke E, de Kock M, et al. Differentiation of Mycobacterium tuberculosis complex by PCR amplification of genomic regions of difference. The International Journal of Tuberculosis and Lung Disease. 2006;10: 818–822.
14. Garzon-Chavez D, Garcia-Bereguiain MA, Mora-Pinargote C, Granda-Pardo JC, Leon-Benitez M, Franco-Sotomayor G, et al. Population structure and genetic diversity of Mycobacterium tuberculosis in Ecuador. Scientific Reports. 2020;10: 6237. doi: 10.1038/s41598-020-62824-z
15. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal. 2011;17: 10. doi: 10.14806/ej.17.1.200
16. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology. 2012;19: 455–477. doi: 10.1089/cmb.2012.0021
17. Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011;27: 2987–2993. doi: 10.1093/bioinformatics/btr509
18. Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, et al. Twelve years of SAMtools and BCFtools. GigaScience. 2021;10: giab008. doi: 10.1093/gigascience/giab008
19. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. Journal of Molecular Biology. 1990;215: 403–410. doi: 10.1016/S0022-2836(05)80360-2
20. Katoh K, Standley DM. MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Molecular Biology and Evolution. 2013;30: 772–780. doi: 10.1093/molbev/mst010
21. Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, et al. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Molecular Biology and Evolution. 2020;37: 1530–1534. doi: 10.1093/molbev/msaa015
22. Lanfear R, Calcott B, Kainer D, Mayer C, Stamatakis A. Selecting optimal partitioning schemes for phylogenomic datasets. BMC Evolutionary Biology. 2014;14: 1–14. doi: 10.1186/1471-2148-14-82
23. Marçais G, Delcher AL, Phillippy AM, Coston R, Salzberg SL, Zimin A. MUMmer4: A fast and versatile genome alignment system. Darling AE, editor. PLOS Computational Biology. 2018;14: e1005944. doi: 10.1371/journal.pcbi.1005944
24. Nattestad M, Schatz MC. Assemblytics: a web analytics tool for the detection of variants from an assembly. Bioinformatics. 2016;32: 3021–3023. doi: 10.1093/bioinformatics/btw369
25. Carver T, Harris SR, Berriman M, Parkhill J, McQuillan JA. Artemis: An integrated platform for visualization and analysis of high-throughput sequence-based experimental data. Bioinformatics. 2012;28: 464–469. doi: 10.1093/bioinformatics/btr703
26. Comas I, Coscolla M, Luo T, Borrell S, Holt KE, Kato-Maeda M, et al. Out-of-Africa migration and Neolithic coexpansion of Mycobacterium tuberculosis with modern humans. Nature Genetics. 2013;45: 1176–1182. doi: 10.1038/ng.2744
27. Baena A, Cabarcas F, Alvarez-Eraso KL, Isaza JP, Alzate JF, Barrera LF. Differential determinants of virulence in two Mycobacterium tuberculosis Colombian clinical isolates of the LAM09 family. Virulence. 2019. doi: 10.1080/21505594.2019.1642045
28. Mahé P, Azami M El, Barlas P, Tournoud M. A large scale evaluation of TBProfiler and Mykrobe for antibiotic resistance prediction in Mycobacterium tuberculosis. PeerJ. 2019;2019. doi: 10.7717/peerj.6857
29. Freschi L, Vargas R, Husain A, Kamal SMM, Skrahina A, Tahseen S, et al. Population structure, biogeography and transmissibility of Mycobacterium tuberculosis. Nature Communications. 2021;12: 6099. doi: 10.1038/s41467-021-26248-1
30. Keshri V, Arbuckle K, Chabrol O, Rolain J-M, Raoult D, Pontarotti P. The functional convergence of antibiotic resistance in β-lactamases is not conferred by a simple convergent substitution of amino acid. Evolutionary Applications. 2019;12: 1812–1822. doi: 10.1111/eva.12835
31. Fajardo-Lubián A, Ben Zakour NL, Agyekum A, Qi Q, Iredell JR. Host adaptation and convergent evolution increases antibiotic resistance without loss of virulence in a major human pathogen. Skurnik D, editor. PLOS Pathogens. 2019;15: e1007218. doi: 10.1371/journal.ppat.1007218
32. Cole ST, Brosch R, Parkhill J, Garnier T, Churcher C, Harris D, et al. Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature. 1998;393: 537–544. doi: 10.1038/31159
33. Liu Z, Jiang Z, Wu W, Xu X, Ma Y, Guo X, et al. Identification of region of difference and H37Rv-related deletion in Mycobacterium tuberculosis complex by structural variant detection and genome assembly. Frontiers in Microbiology. 2022;13: 1–14. doi: 10.3389/fmicb.2022.984582
34. Bespiatykh D, Bespyatykh J, Mokrousov I, Shitikov E. A Comprehensive Map of Mycobacterium tuberculosis Complex Regions of Difference. mSphere. 2021;6: 1–14. doi: 10.1128/mSphere.00535-21
35. Pandey DP, Gerdes K. Toxin-antitoxin loci are highly abundant in free-living but lost from host-associated prokaryotes. Nucleic Acids Research. 2005;33: 966–976. doi: 10.1093/nar/gki201
36. Minato Y, Gohl DM, Thiede JM, Chacón JM, Harcombe WR, Maruyama F, et al. Genomewide Assessment of Mycobacterium tuberculosis Conditionally Essential Metabolic Pathways. mSystems. 2019;4. doi: 10.1128/mSystems.00070-19
37. DeJesus MA, Gerrick ER, Xu W, Park SW, Long JE, Boutte CC, et al. Comprehensive Essentiality Analysis of the Mycobacterium tuberculosis Genome via Saturating Transposon Mutagenesis. mBio. 2017;8. doi: 10.1128/mBio.02133-16
38. Xu P, Widmer G, Wang Y, Ozaki LS, Alves JM, Serrano MG, et al. The genome of Cryptosporidium hominis. Nature. 2004;431: 1107–1112. doi: 10.1038/nature02977
39. Abrahamsen MS, Templeton TJ, Enomoto S, Abrahante JE, Zhu G, Lancto CA, et al. Complete genome sequence of the apicomplexan, Cryptosporidium parvum. Science. 2004;304: 441–445. doi: 10.1126/science.1094786
40. Lan R, Reeves PR. Escherichia coli in disguise: molecular origins of Shigella. Microbes and Infection. 2002;4: 1125–1132. doi: 10.1016/s1286-4579(02)01637-4
41. The HC, Thanh DP, Holt KE, Thomson NR, Baker S. The genomic signatures of Shigella evolution, adaptation and geographical spread. Nature Reviews Microbiology. 2016;14: 235–250. doi: 10.1038/nrmicro.2016.10
42. Hershberg R, Tang H, Petrov DA. Reduced selection leads to accelerated gene loss in Shigella. Genome Biology. 2007;8: R164. doi: 10.1186/gb-2007-8-8-r164
43. Yang F, Yang J, Zhang X, Chen L, Jiang Y, Yan Y, et al. Genome dynamics and diversity of Shigella species, the etiologic agents of bacillary dysentery. Nucleic Acids Research. 2005;33: 6445–6458. doi: 10.1093/nar/gki954

PLoS One. doi: 10.1371/journal.pone.0285417.r001

Decision Letter 0

Christophe Sola

9 Feb 2023

PONE-D-22-28017

Large genomic deletions delineate Mycobacterium tuberculosis L4 sublineages in South American countries

PLOS ONE

Dear Dr. Juan Fernando Alzate,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process (point by point, as raised by reviewers). In particular, the paper appears too lengthy and needs a major revision according to one of the two reviewers. May I add that, from an evolutionary genomic perspective, "South America" or "Latin America" does not form such a cultural or geographical unity (indigenous, Portuguese, and Spanish influences), and the paper as such could be improved by trying to distinguish specifically what is Peru-specific as compared to the MTBC characteristics of other South American countries.

=============================

Please ensure that your decision is justified on PLOS ONE’s publication criteria and not, for example, on novelty or perceived impact.

For Lab, Study and Registered Report Protocols: These article types are not expected to include results but may include pilot data.

==============================

Please submit your revised manuscript by Mar 26 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Christophe Sola, Pharm.D. Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please provide additional details regarding participant consent. In the ethics statement in the Methods and online submission information, please ensure that you have specified (1) whether consent was informed and (2) what type you obtained (for instance, written or verbal, and if verbal, how it was documented and witnessed). If your study included minors, state whether you obtained consent from parents or guardians. If the need for consent was waived by the ethics committee, please include this information.

If you are reporting a retrospective study of medical records or archived samples, please ensure that you have discussed whether all data were fully anonymized before you accessed them and/or whether the IRB or ethics committee waived the requirement for informed consent. If patients provided informed written consent to have data from their medical records used in research, please include this information

3. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide.

Additional Editor Comments (if provided):

According to one of the two reviewers, this paper deserves to be rewritten. The new findings are not evident. It is apparently lengthy; focus on the main discoveries and on the methodology, if brand new.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: N/A

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: No

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: COMMENTS to authors

1. Major critique:

Overall, the paper should be made concise and clear

Now it is very lengthy.

Interest/novelty is not very clear. To me, the main interest may lie not in the study of the phylogeny in South America but rather in the bioinformatics approach.

Other comments

2. Short Title: Genomic deletions in LATAM M. tuberculosis L4 lineage

LATAM abbreviation seems uncommon to me.

3. Title Large genomic deletions delineate Mycobacterium tuberculosis L4 sublineages in South American countries

Very general and obvious. It may be applied to any world region or globally.

4. ABSTRACT

Poorly structured with very long introductory sentences (half of abstract) and unclear part of what was the objective and content of this study. In contrast, some non-essential sentences are included (“Initially, we performed careful quality control of public read datasets and applied several thresholds to filter out low-quality data”)

How representative are these genomes: 522 L4 Latin American Mtb genomes?

5. “Using a genome de novo assembly strategy and phylogenomic methods, we spotted new south American lineages that have not been revealed yet.”

Do you mean you have found new large deletions?

Otherwise, assembly of genomes is not needed for Mtb phylogenetic analysis (that is based on genome-wide SNPs, this species is devoid of HGT).

6. “Additionally, we describe genomic deletion profiles of these strains from an evolutionary perspective and report Mycobacterium tuberculosis L4 sublineages signature-like gene deletions.”

Are these deletions novel and never reported before?

See Bespiatykh, D., Bespyatykh, J., Mokrousov, I., Shitikov, E. (2021). A comprehensive map of mycobacterium tuberculosis complex regions of difference. mSphere. 6 (4), e0053521. doi: 10.1128/mSphere.00535-21 and references therein

INTRODUCTION

7. Very long (this is a scientific paper, not an MS or PhD thesis).

METHODS

8. Bacterial culture and DNA extraction for the Colombian genomes – THIS SECTION IS TOO DETAILED. A ref is enough

9. Latin American Mycobacterium tuberculosis L4 genomic data downloaded from the NCBI SRA –

This section needs a flowchart showing how the genomes were searched for and selected or filtered out.

10. L4 is very large and heterogeneous, and I am not sure that 522 genomes are sufficient.

11. Some details on lines 170-177 seem too technical.

12. Line 181 “We used 2,726 M. tuberculosis conserved single-copy genes for the phylogenomic analysis.” –

Do you mean that you used all these genes, 2,895,445 total sites, and not only the variant SNPs found through alignment of reads to the reference genome H37Rv?

This is quite redundant. A concatenated FASTA of only the variant SNPs would suffice.

Reviewer #2: This is a very nice manuscript telling us about Mycobacterium tuberculosis L4 lineages in South American countries.

I miss information on whether the samples are nationwide or not. I assume they are not. There could be a subtype bias, and the distribution of subtypes could be slightly different.

In this study only short-read WGS has been performed. Just one reference based on long and short reads would have been a help to the analysis instead of H37Rv. But that is optional!

Figure 1 can be excluded and these results could be contained in the manuscript.

I would appreciate it if the spacing between numbers and units in the M&M section were consistent.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Erik Michael Rasmussen

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2023 May 19;18(5):e0285417. doi: 10.1371/journal.pone.0285417.r002

Author response to Decision Letter 0

16 Feb 2023

In particular, the paper appears too lengthy and needs a major revision according to one of the two reviewers.

May I add that, from an evolutionary genomic perspective, "South America" or "Latin America" does not form such a cultural or geographical unity (indigenous, Portuguese, and Spanish influences), and the paper as such could be improved by trying to distinguish specifically what is Peru-specific as compared to the MTBC characteristics of other South American countries.

R/: Dear Editor, thanks for your kind assistance in handling the manuscript. We appreciate the time that you and the reviewers invested in the process.

Regarding your comment that Latin America is not a cultural or geographical unit, we respectfully disagree.

The term Latin America is of French origin and was coined during the 1860s and refers to the American countries where languages were Spanish and Portuguese, and to a lesser extent, French. This concept has been gradually accepted since that time. You can find it in the Encyclopedia Britannica (https://www.britannica.com/place/Latin-America).

Geographically speaking, Latin America is well defined, comprising the countries of the American continent from Mexico to its southern tip. Similar biogeographic features mean that we share some similar ecosystems.

Culturally, we share two main languages, Spanish and Portuguese, and the creed is mostly Catholic, with a high rate of practitioners of this religion. From the genetic point of view, Latin American populations are known to be among the most mestizo in the world, given the extensive genetic mixing that occurred between Latin Europeans, Amerindians, and Africans. This topic was well reviewed by Adhikari et al. in 2016 (https://doi.org/10.1016/j.gde.2016.09.003).

The contemporary population of Latin America is the result of complex, recent demographic events including admixture, bottlenecking, and subsequent rapid post-colonial population growth. These demographic processes can result in founder effects that create the conditions necessary for disease variants to drift to detectable frequencies. This directly affects disease susceptibility and is critically important because it may determine which pathogenic strains circulate among these specific populations (https://doi.org/10.1016/j.gde.2018.07.006).

This particular admixed genetic background contributes substantially to phenotypes related to health and disease, as demonstrated by Norris et al., 2018 (doi: 10.1186/s12864-018-5195-7).

From the point of view of infectious diseases, we share similar profiles, with high rates of intestinal and extraintestinal parasitic infections. This is precisely a point of great difference with Saxon America, that is, the USA and Canada. In these countries, human populations have different genetic backgrounds with almost no ethnic mixing. Furthermore, the USA and Canada are developed nations and display different cultural and infectious disease profiles. Additionally, in recent years the migration of Latin American citizens within the countries of the region has increased dramatically, favoring the dispersion and homogenization of pathogenic microorganisms. Therefore, we believe that it is appropriate to understand tuberculosis within the framework of Latin American countries, given their geographical relationship, cultural similarities, population dynamics, and the mestizo genetic base of their population. Drawing a parallel between Latin America and the Asian or European regions: in the latter two, where cultural, linguistic, and religious dissimilarities are much greater, there is no objection to understanding these blocks as defined geographical and cultural groups.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

Reviewer #2: Yes

R:/ Thanks for the review

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: N/A

R:/ Thanks for the review

3. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: Yes

R:/ Thanks for the review

4. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: No

Reviewer #2: Yes

R:/ Thanks for the review. English was reviewed again.

5. Review Comments to the Author

Reviewer #1: COMMENTS to authors

1. Major critique:

Overall, the paper should be made concise and clear

Now it is very lengthy.

Interest/novelty is not very clear. To me, the main interest may lie not in the study of the phylogeny in South America, but rather in the bioinformatics approach.

R:/ Thanks for the review. The manuscript was modified as suggested by the reviewers. The introduction and methods sections were reduced. The abstract was modified making it more concise and making explicit the novelty of the work. We consider that the bioinformatic approach, as well as the comparative genomic results, are both interesting for the scientific community.

Other comments

2. Short Title: Genomic deletions in LATAM M. tuberculosis L4 lineage

LATAM abbreviation seems uncommon to me.

R:/ The abbreviation LATAM was replaced with "Latin American" throughout the document.

3. Title Large genomic deletions delineate Mycobacterium tuberculosis L4 sublineages in South American countries

Very general and obvious. It may be applied to any world region or globally.

R:/ We partially agree with the reviewer, but we still consider the title original and valid. Whether it could be applied to any other world region, or to the whole planet, remains speculation for now. Our results are solid, and we focused our effort on Latin America. Moreover, we observed at least three novel large deletions in one of the most frequent sublineages found in Mtb-infected individuals in South America.

4. ABSTRACT

How representative are these genomes: 522 L4 Latin American Mtb genomes?

R:/ The Abstract was modified reducing the introductory sentences and adding more information about the aim of the study and the most relevant findings.

We started this work with more than 1000 Latin American strains that have public genomic data (SRA repository) with an acceptable sequencing depth. After taxonomy (L4 lineage) and quality filtering, these 522 were the ones that best met our selection thresholds. For us, it was a paramount goal to reduce the noise that can be introduced by low-quality genomic reads. Although the number of strains could be increased, this would not change our main findings of novel country-specific sublineages/clades and the genomic deletion profiles.

5. “Using a genome de novo assembly strategy and phylogenomic methods, we spotted new south American lineages that have not been revealed yet.”

Do you mean you have found new large deletions?

Otherwise, assembly of genomes is not needed for Mtb phylogenetic analysis (that is based on genome-wide SNPs, this species is devoid of HGT).

R:/ We are referring to the definition of lineage from a phylogenetic point of view. It is synonymous with clade: in this case, a subspecies clade that evolved in Latin American countries from the L4 Mtb strains originally introduced from Europe.

6. “Additionally, we describe genomic deletion profiles of these strains from an evolutionary perspective and report Mycobacterium tuberculosis L4 sublineages signature-like gene deletions.”

Are these deletions novel and never reported before?

R:/ Indeed, we found at least three novel genomic deletions. We compared our results to those published by Bespiatykh et al. (doi:10.1128/mSphere.00535-21) and Liu et al. (doi:10.3389/fmicb.2022.984582), and neither study reported the three large deletions that we identified here.

INTRODUCTION

7. Very long (this is sci paper not MS or PhD thesis).

R:/ The introduction length was significantly reduced in the revised version of the manuscript.

METHODS

8. Bacterial culture and DNA extraction for the Colombian genomes – THIS SECTION IS TOO DETAILED. A ref is enough

R:/ We deleted the description of the DNA extraction procedure and kept the references where the protocols are described in more detail.

9. Latin American Mycobacterium tuberculosis L4 genomic data downloaded from the NCBI SRA –

This section needs a flowchart showing how genomes were searched for and selected or filtered out

R:/ A flow chart was prepared and added as supplementary material.

10. L4 is very large and heterogeneous and I am not sure that 522 genomes are sufficient

R:/ We agree with the reviewer that L4 is a vast lineage, but it has been demonstrated which sublineages were introduced into America from the 16th century onward (DOI:10.1126/sciadv.aat5869), and our results are in concordance with these findings. We analyzed more than 500 high-quality genomes representing 9 countries separated by thousands of kilometers. We think that, compared with similar current scientific works, this is a representative number. The phylogenomic results also support the phylogeographic findings presented in the paper.

Additionally, as has already been described elsewhere (Nat Genet. 2016 December; 48(12): 1535–1543. doi:10.1038/ng.3704), L4 possesses geographically restricted sublineages. This phenomenon is congruent with our observations of the evolutionary patterns in Latin American human populations.

11. Some details on lines 170-177 seem too technical.

R:/ We agree with the reviewer. However, as other journals have pointed out to us, it is important to describe the details of the bioinformatic pipelines so that other researchers can reproduce the results. Following this comment from the reviewer, we removed several details. The description in the methods section was modified as follows:

“Filtered reads were assembled using SPADES v3.14.1 [22]. The assemblies' descriptive statistics were calculated with an in-house python script.

Average sequencing depth was calculated using SAMTOOLS [23] coverage tool while the alternate allele count was obtained by counting the number of variants from the VCF file created with the program BCFTOOLS mpileup [24].”
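The in-house script mentioned in the quoted passage is not reproduced in this response. As a rough illustration only, the kind of descriptive statistics such a script might report can be sketched as follows. The function name `assembly_stats`, the choice of statistics (contig count, total length, largest contig, N50), and the input format (a plain list of contig lengths) are assumptions made for this sketch and are not taken from the manuscript.

```python
def assembly_stats(contig_lengths):
    """Return basic descriptive statistics for one assembly.

    contig_lengths: iterable of contig lengths in bp (hypothetical input;
    in practice these would be parsed from the assembler's FASTA output).
    """
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)

    # N50: length of the contig at which the cumulative sum of the
    # length-sorted contigs first reaches half of the total assembly size.
    n50 = 0
    running = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break

    return {
        "contigs": len(lengths),
        "total_length": total,
        "largest": lengths[0] if lengths else 0,
        "n50": n50,
    }


if __name__ == "__main__":
    # Toy example with four contigs.
    print(assembly_stats([100, 200, 300, 400]))
```

Statistics of this kind are what quality thresholds are usually applied to when filtering out fragmented or incomplete assemblies.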

12. Line 181 “We used 2,726 M. tuberculosis conserved single-copy genes for the phylogenomic analysis.” –

Do you mean that you used all these genes, 2,895,445 total sites, and not only the variant SNPs found through alignment of reads to the reference genome H37Rv?

This is quite redundant. A concatenated FASTA of only the variant SNPs would suffice.

R:/ Indeed, we used all those genes. We understand that a read-mapping strategy is more commonly used in this kind of work because of its lower computational demand. In our case, we preferred a different strategy regardless of its computational load. We aimed to extract each orthologous CDS sequence (the complete CDS in most strains) and align each gene individually. Then, we concatenated all 2,726 single-copy orthologous genes into a single supermatrix that was fed into the IQTREE2 program. Although the matrix was large, IQTREE2's first step is to mask conserved sites and work only with the positions that are phylogenetically informative. In this case, the complete matrix comprises 2,895,445 sites, but only 8,339 were identified by IQTREE2 as parsimony informative. This step does not require much time on our servers. This information is described in the second paragraph of the Phylogenomic analysis subheading of the methods section.
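The relationship between the full supermatrix and its parsimony-informative subset can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: the function names are hypothetical, every strain is assumed to be present in every gene alignment (the response notes that the complete CDS was recovered in most, not all, strains), and only unambiguous A/C/G/T characters are counted. A site is treated as parsimony informative when at least two different bases each occur in at least two sequences.

```python
from collections import Counter


def concatenate(gene_alignments):
    """Join per-gene alignments into one supermatrix.

    gene_alignments: list of dicts mapping strain name -> aligned sequence.
    Assumes (for this sketch) that every strain appears in every gene.
    """
    strains = sorted(gene_alignments[0])
    return {s: "".join(gene[s] for gene in gene_alignments) for s in strains}


def count_parsimony_informative(supermatrix):
    """Count columns where >= 2 bases are each seen in >= 2 sequences."""
    informative = 0
    for column in zip(*supermatrix.values()):
        # Ignore gaps and ambiguity codes; count only A, C, G, T.
        counts = Counter(c.upper() for c in column if c.upper() in "ACGT")
        shared_states = sum(1 for n in counts.values() if n >= 2)
        if shared_states >= 2:
            informative += 1
    return informative


if __name__ == "__main__":
    # Two toy "genes" for four hypothetical strains.
    gene1 = {"s1": "ACGT", "s2": "ACGA", "s3": "ATGA", "s4": "ATGT"}
    gene2 = {"s1": "GG", "s2": "GG", "s3": "GC", "s4": "GG"}
    matrix = concatenate([gene1, gene2])
    print(len(matrix["s1"]), count_parsimony_informative(matrix))
```

In the toy data, the single C in the second gene is a singleton and therefore does not count as informative, which is why only a small fraction of the sites in a large, highly conserved matrix carries phylogenetic signal.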

Reviewer #2: This is a very nice manuscript telling us about Mycobacterium tuberculosis L4 lineages in South American countries.

I miss information on whether the samples are nationwide or not. I assume they are not. There could be a subtype bias, and the distribution of subtypes could be slightly different.

In this study, only short-read WGS was performed. Even a single reference based on long and short reads, used instead of H37Rv, would have helped the analysis. But that is optional!

R:/ The genomic data that we generated, within our limited budget, come from the city of Medellin, Colombia, the second largest city in the country, which has a high TB incidence rate within the country. Luckily, there were public genome sequences of L4 TB isolates from another Colombian Andean city, Manizales. These genomes were sequenced by another researcher a few years ago. Medellin and Manizales are both in the Andean region of the country, have similar geographic features, and are separated by around 227 km.

The coverage is not nationwide, so we cannot confirm the national subtype frequencies. To fulfill this goal, we would require a much greater budget due to the size of the country, which is roughly twice the size of Spain, and its large population of nearly 50 million inhabitants.

We agree that we cannot refer to the actual frequencies of the sublineages across the country, and we could have missed some other subtypes. But the main findings of our work are well supported, taking into account that we wanted to see the broader picture of the historical evolution and genome deletion profiles on a subcontinental scale.

Figure 1 can be excluded and these results could be contained in the manuscript.

I would appreciate it if the spacing between numbers and units in the M&M section were consistent.

R:/ We consider that the quality control of genomic data is extremely important to establish the validity of the evolutionary and comparative genomic analysis. This information can be useful for other researchers who want to replicate our bioinformatic strategy. For this reason, if possible, we would like to keep Figure 1 as a main figure in the manuscript.

Attachment

Submitted filename: Response to reviewers_R1.docx

PLoS One. doi: 10.1371/journal.pone.0285417.r003

Decision Letter 1

Li Xing

24 Apr 2023

Large genomic deletions delineate Mycobacterium tuberculosis L4 sublineages in South American countries

PONE-D-22-28017R1

Dear Dr. Juan Fernando Alzate,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Li Xing

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

**********

6. Review Comments to the Author

Reviewer #1: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

**********

PLoS One. doi: 10.1371/journal.pone.0285417.r004

Acceptance letter

Li Xing

10 May 2023

PONE-D-22-28017R1

Large genomic deletions delineate Mycobacterium tuberculosis L4 sublineages in South American countries

Dear Dr. Alzate:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Professor Li Xing

Academic Editor

PLOS ONE

Associated Data


Supplementary Materials

S1 Fig. Flow chart depicting the genome selection strategy.

(TIF)

At the bottom, in brown lines, is the L2 outgroup. Numbers at nodes indicate ultrafast bootstrap support. Sublineages are labeled in their respective branches and highlighted with different colors.

(TIF)

S3 Fig. Stacked bar plot representing the drug-resistant type profiles of the different Latin American M. tuberculosis L4 sublineages.

Drug-resistant types are depicted as normalized values on the X axis.

(TIF)

S1 File

(CSV)


Data Availability Statement

All newly generated genomic read data files are available from the NCBI SRA Database bioproject PRJNA867148.

1. Barberis I, Bragazzi NL, Galluzzo L, Martini M. The history of tuberculosis: From the first historical records to the isolation of Koch’s bacillus. Journal of Preventive Medicine and Hygiene. 2017;58: E9–E12.

2. Global Tuberculosis Programme. Global tuberculosis report 2021. 2021.

3. Coscolla M, Gagneux S, Menardo F, Loiseau C, Ruiz-Rodriguez P, Borrell S, et al. Phylogenomics of Mycobacterium africanum reveals a new lineage and a complex evolutionary history. Microbial Genomics. 2021;7: 000477. doi: 10.1099/mgen.0.000477

4. Napier G, Campino S, Merid Y, Abebe M, Woldeamanuel Y, Aseffa A, et al. Robust barcoding and identification of Mycobacterium tuberculosis lineages for epidemiological and clinical studies. Genome Medicine. 2020;12: 1–10. doi: 10.1186/s13073-020-00817-3

5. Brynildsrud OB, Pepperell CS, Suffys P, Grandjean L, Monteserin J, Debech N, et al. Global expansion of Mycobacterium tuberculosis lineage 4 shaped by colonial migration and local adaptation. Science advances. 2018;4: eaat5869–eaat5869. doi: 10.1126/sciadv.aat5869

6. Stucki D, Brites D, Jeljeli L, Coscolla M, Liu Q, Trauner A, et al. Mycobacterium tuberculosis lineage 4 comprises globally distributed and geographically restricted sublineages. Nature Genetics. 2016;48: 1535–1543. doi: 10.1038/ng.3704

7. Gupta A, Alland D. Reversible gene silencing through frameshift indels and frameshift scars provide adaptive plasticity for Mycobacterium tuberculosis. Nature Communications. 2021;12: 1–11. doi: 10.1038/s41467-021-25055-y

8. Ates LS. New insights into the mycobacterial PE and PPE proteins provide a framework for future research. Molecular microbiology. 2020;113: 4–21. doi: 10.1111/mmi.14409

9. Karboul A, Mazza A, van Pittius NCG, Ho JL, Brousseau R, Mardassi H. Frequent Homologous Recombination Events in Mycobacterium tuberculosis PE/PPE Multigene Families: Potential Role in Antigenic Variability. Journal of Bacteriology. 2008;190: 7838–7846. doi: 10.1128/JB.00827-08

10. Tsolaki AG, Hirsh AE, DeRiemer K, Enciso JA, Wong MZ, Hannan M, et al. Functional and evolutionary genomics of Mycobacterium tuberculosis: Insights from genomic deletions in 100 strains. Proceedings of the National Academy of Sciences. 2004;101: 4865–4870. doi: 10.1073/pnas.0305634101

11. ten Bokum AMC, Movahedzadeh F, Frita R, Bancroft GJ, Stoker NG. The case for hypervirulence through gene deletion in Mycobacterium tuberculosis. Trends in microbiology. 2008;16: 436–441. doi: 10.1016/j.tim.2008.06.003

12. Isaza JP, Duque C, Gomez V, Robledo J, Barrera LF, Alzate JF. Whole genome shotgun sequencing of one Colombian clinical isolate of Mycobacterium tuberculosis reveals DosR regulon gene deletions. FEMS Microbiology Letters. 2012;330: 113–120. doi: 10.1111/j.1574-6968.2012.02540.x

13. Warren RM, Gey van Pittius NC, Barnard M, Hesseling A, Engelke E, de Kock M, et al. Differentiation of Mycobacterium tuberculosis complex by PCR amplification of genomic regions of difference. The international journal of tuberculosis and lung disease: the official journal of the International Union against Tuberculosis and Lung Disease. 2006;10: 818–822.

14. Garzon-Chavez D, Garcia-Bereguiain MA, Mora-Pinargote C, Granda-Pardo JC, Leon-Benitez M, Franco-Sotomayor G, et al. Population structure and genetic diversity of Mycobacterium tuberculosis in Ecuador. Scientific Reports. 2020;10: 6237. doi: 10.1038/s41598-020-62824-z

15. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal. 2011;17: 10. doi: 10.14806/ej.17.1.200

16. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology. 2012;19: 455–477. doi: 10.1089/cmb.2012.0021

17. Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011;27: 2987–2993. doi: 10.1093/bioinformatics/btr509

18. Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, et al. Twelve years of SAMtools and BCFtools. GigaScience. 2021;10: giab008. doi: 10.1093/gigascience/giab008

19. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. Journal of molecular biology. 1990;215: 403–410. doi: 10.1016/S0022-2836(05)80360-2

20. Katoh K, Standley DM. MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Molecular Biology and Evolution. 2013;30: 772–780. doi: 10.1093/molbev/mst010

21. Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, et al. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Mol Biol Evol. 2020;37: 1530–1534. doi: 10.1093/molbev/msaa015

22. Lanfear R, Calcott B, Kainer D, Mayer C, Stamatakis A. Selecting optimal partitioning schemes for phylogenomic datasets. BMC Evolutionary Biology. 2014;14: 1–14. doi: 10.1186/1471-2148-14-82

23. Marçais G, Delcher AL, Phillippy AM, Coston R, Salzberg SL, Zimin A. MUMmer4: A fast and versatile genome alignment system. Darling AE, editor. PLOS Computational Biology. 2018;14: e1005944. doi: 10.1371/journal.pcbi.1005944

24. Nattestad M, Schatz MC. Assemblytics: a web analytics tool for the detection of variants from an assembly. Bioinformatics. 2016;32: 3021–3023. doi: 10.1093/bioinformatics/btw369

25. Carver T, Harris SR, Berriman M, Parkhill J, McQuillan JA. Artemis: An integrated platform for visualization and analysis of high-throughput sequence-based experimental data. Bioinformatics. 2012;28: 464–469. doi: 10.1093/bioinformatics/btr703

26. Comas I, Coscolla M, Luo T, Borrell S, Holt KE, Kato-Maeda M, et al. Out-of-Africa migration and Neolithic coexpansion of Mycobacterium tuberculosis with modern humans. Nature Genetics. 2013;45: 1176–1182. doi: 10.1038/ng.2744

27. Baena A, Cabarcas F, Alvarez-Eraso KL, Isaza JP, Alzate JF, Barrera LF. Differential determinants of virulence in two Mycobacterium tuberculosis Colombian clinical isolates of the LAM09 family. Virulence. 2019. doi: 10.1080/21505594.2019.1642045

28. Mahé P, Azami M El, Barlas P, Tournoud M. A large scale evaluation of TBProfiler and Mykrobe for antibiotic resistance prediction in Mycobacterium tuberculosis. PeerJ. 2019;2019. doi: 10.7717/peerj.6857

29. Freschi L, Vargas R, Husain A, Kamal SMM, Skrahina A, Tahseen S, et al. Population structure, biogeography and transmissibility of Mycobacterium tuberculosis. Nature Communications. 2021;12: 6099. doi: 10.1038/s41467-021-26248-1

30. Keshri V, Arbuckle K, Chabrol O, Rolain J-M, Raoult D, Pontarotti P. The functional convergence of antibiotic resistance in β-lactamases is not conferred by a simple convergent substitution of amino acid. Evolutionary Applications. 2019;12: 1812–1822. doi: 10.1111/eva.12835

31. Fajardo-Lubián A, Ben Zakour NL, Agyekum A, Qi Q, Iredell JR. Host adaptation and convergent evolution increases antibiotic resistance without loss of virulence in a major human pathogen. Skurnik D, editor. PLOS Pathogens. 2019;15: e1007218. doi: 10.1371/journal.ppat.1007218

32. Cole ST, Brosch R, Parkhill J, Garnier T, Churcher C, Harris D, et al. Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature. 1998;393: 537–544. doi: 10.1038/31159

33. Liu Z, Jiang Z, Wu W, Xu X, Ma Y, Guo X, et al. Identification of region of difference and H37Rv-related deletion in Mycobacterium tuberculosis complex by structural variant detection and genome assembly. Frontiers in Microbiology. 2022;13: 1–14. doi: 10.3389/fmicb.2022.984582

34. Bespiatykh D, Bespyatykh J, Mokrousov I, Shitikov E. A Comprehensive Map of Mycobacterium tuberculosis Complex Regions of Difference. mSphere. 2021;6: 1–14. doi: 10.1128/mSphere.00535-21

35. Pandey DP, Gerdes K. Toxin-antitoxin loci are highly abundant in free-living but lost from host-associated prokaryotes. Nucleic acids research. 2005;33: 966–976. doi: 10.1093/nar/gki201

36. Minato Y, Gohl DM, Thiede JM, Chacón JM, Harcombe WR, Maruyama F, et al. Genomewide Assessment of Mycobacterium tuberculosis Conditionally Essential Metabolic Pathways. mSystems. 2019;4. doi: 10.1128/mSystems.00070-19

37. DeJesus MA, Gerrick ER, Xu W, Park SW, Long JE, Boutte CC, et al. Comprehensive Essentiality Analysis of the Mycobacterium tuberculosis Genome via Saturating Transposon Mutagenesis. mBio. 2017;8. doi: 10.1128/mBio.02133-16

38. Xu P, Widmer G, Wang Y, Ozaki LS, Alves JM, Serrano MG, et al. The genome of Cryptosporidium hominis. Nature. 2004;431: 1107–1112. doi: 10.1038/nature02977

39. Abrahamsen MS, Templeton TJ, Enomoto S, Abrahante JE, Zhu G, Lancto CA, et al. Complete genome sequence of the apicomplexan, Cryptosporidium parvum. Science (New York, NY). 2004;304: 441–445. doi: 10.1126/science.1094786

40. Lan R, Reeves PR. Escherichia coli in disguise: molecular origins of Shigella. Microbes and Infection. 2002;4: 1125–1132. doi: 10.1016/s1286-4579(02)01637-4

41. The HC, Thanh DP, Holt KE, Thomson NR, Baker S. The genomic signatures of Shigella evolution, adaptation and geographical spread. Nature Reviews Microbiology. 2016;14: 235–250. doi: 10.1038/nrmicro.2016.10

42. Hershberg R, Tang H, Petrov DA. Reduced selection leads to accelerated gene loss in Shigella. Genome Biology. 2007;8: R164. doi: 10.1186/gb-2007-8-8-r164

43. Yang F, Yang J, Zhang X, Chen L, Jiang Y, Yan Y, et al. Genome dynamics and diversity of Shigella species, the etiologic agents of bacillary dysentery. Nucleic Acids Research. 2005;33: 6445–6458. doi: 10.1093/nar/gki954
