Abstract
Despite a widespread global distribution and highly variable disease phenotype, there is little DNA sequence diversity among isolates of Mycobacterium tuberculosis. In addition, many regional population genetic surveys have revealed a stereotypical structure in which a single clone, lineage, or clade makes up the majority of the population. It is often assumed that dominant clones are highly adapted, that is, the overall structure of M. tuberculosis populations is the result of positive selection. In order to test this assumption, we analyzed genetic data from extant populations of bacteria circulating in Aboriginal communities in Saskatchewan, Canada. Demographic parameters of the bacterial population were estimated from archival epidemiological data collected over the ∼130 years since the onset of epidemic tuberculosis in the host communities. Bacterial genetic data were tested against neutral theory expectations, and the local evolutionary history of M. tuberculosis was investigated by phylogenetic analysis. Our findings are not consistent with positive selection on the bacterial population. Instead, we uncovered founder effects persisting over decades and barriers to gene flow within the bacterial population. Simulation experiments suggested that a combination of these neutral influences could result in the stereotypical structure of M. tuberculosis populations. Some aspects of population structure were suggestive of background selection, and data were on the whole consistent with combined effects of population bottlenecks, subdivision, and background selection. Neutral phenomena, namely, bottlenecks and partitions within populations, are prominent influences on the evolution of M. tuberculosis and likely contribute to restricted genetic diversity observed within this species. Given these influences, a complex evolutionary model will be required to define the relative fitness of different M. tuberculosis lineages and, ultimately, to uncover the genetic basis for its success as a pathogen.
Keywords: Mycobacterium tuberculosis, founder effect, genetic drift, population subdivision, background selection, Aboriginal
Introduction
Mycobacterium tuberculosis is an extremely important human pathogen, estimated to infect one third of the world's population and to cause more deaths on an annual basis than any other pathogenic bacterium (WHO 2008). Unable to survive freely in the environment and restricted to human hosts (Murray et al. 2003), the organism is clearly well adapted for infection and transmission among humans. However, little is known about specific neutral and Darwinian influences on the structure of M. tuberculosis populations.
Diverse population structures have been identified among human pathogens. Patterns of genetic variation consistent with successive selective sweeps (Bush et al. 1999; Shankarappa et al. 1999), neutral evolution (Fraser et al. 2005; Hershberg et al. 2008; Wirth et al. 2008) and a combination of genetic drift and positive selection (Roumagnac et al. 2006) have all been described. Many regional surveys of natural M. tuberculosis populations have revealed a highly prevalent bacterial strain or strain family (van Soolingen et al. 1995; Bhanu et al. 2002; Drobniewski et al. 2002, 2005; Sharma et al. 2003; Victor et al. 2004; Chihota et al. 2007; Lazzarini et al. 2007; Namouchi et al. 2008). This pattern is often attributed to positive selection, with the assumption that strain prevalence is directly related to fitness. However, some common strains appear to have low relative fitness (de Jong et al. 2008). Furthermore, traits thought to confer a selective advantage, such as the immunomodulatory phenolic glycolipid PGL-tb or the ability to readily acquire drug resistance mutations, are not found uniformly throughout “successful” (i.e., prevalent) strain families (Glynn et al. 2002; European Concerted Action on New Generation Genetic Markers and Techniques for the Epidemiology and Control of Tuberculosis 2006; Reed et al. 2007). Finally, strains that appear “invasive” in some regions have failed to increase in prevalence in others (Glynn et al. 2002; European Concerted Action on New Generation Genetic Markers and Techniques for the Epidemiology and Control of Tuberculosis 2006). The phenomenon of prevalent or dominant strains is equally consistent with a founder effect.
Lending further complexity to its global population structure, M. tuberculosis is geographically subdivided, with specific bacterial clades predominating in different continental regions (Sreevatsan et al. 1997; Baker et al. 2004; Hirsh et al. 2004; Filliol et al. 2006; Gutacker et al. 2006; Hershberg et al. 2008). Because there is evidence of genetic differentiation between continental human populations (Rosenberg et al. 2002), it is possible that regionally predominant bacterial clones have adapted to specific environmental niches (Gagneux et al. 2006; de Jong et al. 2008). It is not known whether there are substantial divisions within continental M. tuberculosis populations: one global survey failed to find evidence of fine-scale phylogeographic structure (Hirsh et al. 2004). Subcontinental divisions are more likely to result from neutral processes and would emphasize the role of genetic drift in structuring populations of M. tuberculosis. Neutral patterns of genetic variation consistent with isolation by distance and biogeographic subdivisions have been described among other bacterial pathogens, such as Helicobacter pylori (Linz et al. 2007; Moodley et al. 2009).
We have assembled a comprehensive and unique data set of M. tuberculosis isolates and TB clinical data collected over two decades (1986–2004) from Aboriginal Canadian communities, as well as archival data spanning the ∼130 years since epidemic forms of tuberculosis first became established in these communities. The comprehensive nature of the genetic data set allowed us to formally test bacterial population genetic data for signatures of natural selection on the bacteria. We were also able to estimate a variety of relevant bacterial demographic parameters from the historical data and thus identify potential influences of demography on population structure. Historical data were used to date epidemiological shifts in different regions from zero/minimal disease burden to severe, epidemic forms of tuberculosis. This allowed us to estimate the timing of a bacterial population bottleneck associated with spread of tuberculosis from a small number of founder pathogens. In addition, we delineated historical and contemporary human social structuring phenomena, which could affect the organization of pathogen populations. Relative to other population groups, indigenous Americans are likely to have much less human genetic variation (Ramachandran et al. 2005), which could complicate the identification of neutral barriers to gene flow within the pathogen population. Of the four ethnic groups that make up the majority of the study population, three were included in a recent study of human genetic variation (Wang et al. 2007); the three groups clustered together and were distinguishable from other Aboriginal American populations, as well as from global continental populations.
In the present fine-scale spatial and temporal analyses, we have compared positive selection and demography as explanatory models for the observed data. We have also examined alternative models of selection to see if any offers a better fit to the data than positive selection. Contrary to commonly cited hypotheses about the evolution of M. tuberculosis, we find considerable evidence of neutral effects, which are likely to influence the global spread of important strains, such as drug resistant and highly virulent organisms. In addition, there are important implications for studies of genetic variability among M. tuberculosis complex organisms, as a priori assumptions about relative fitness will affect the design and interpretation of these studies.
Materials and Methods
Study Population
The data set for this study consists of clinical data and bacterial isolates from all culture-positive cases of tuberculosis, diagnosed during the study interval 1986–2004, in individuals of Aboriginal ancestry from 57 First Nations communities in the province of Saskatchewan, Canada. Community types include reserves (lands formally associated with specific First Nations groups as a result of legal agreements) as well as Métis communities (>25% Métis population, according to the Canadian census). The estimated number of full-time residents of these 57 communities was 32,907 in 2004 (Saskatchewan Health 2004). Ethnocultural groups represented in the study population include Métis, Denesuline, Nêhiyawak (Cree), Anishinaabeg (Nahkawininiwak/Plains Ojibway), Dakota, and Nakota.
Genotyping Techniques
Restriction fragment length polymorphism (RFLP) analysis was performed on all bacterial isolates; this TB typing method is based on the number and location(s) of the IS6110 transposable element (van Embden et al. 1993). RFLP patterns were compared with BioNumerics 5.0 (Applied Maths, Kortrijk, Belgium). RFLP bands were identified as identical if their sizes differed by <0.59% of total pattern length. Bands of specific sizes were coded as present or absent in each unique RFLP type, a matrix of 0–1 band values was generated within BioNumerics, and this was used for later analyses. At least one representative of each unique RFLP type was also subjected to spacer oligonucleotide typing (“spoligotyping”). The direct-repeat (DR) locus of the M. tuberculosis genome contains unique, conserved oligonucleotide spacers between repeats: Spoligotype patterns are based on the presence or absence of these 43 individual spacers (Kamerbeek et al. 1997). Large sequence polymorphism (LSP) analysis was performed on a representative of each individual spoligotype, as well as each RFLP group (groups of RFLP types with banding patterns ≥80% similar). Clade- and lineage-defining genomic deletions have been identified for M. tuberculosis complex (MTBC) organisms (Brosch et al. 2002; Tsolaki et al. 2004; Gagneux and Small 2007). Polymerase chain reaction (PCR) was used to screen for these phylogenetically informative LSPs; PCR conditions and flanking primer sequences have been previously published (Gagneux et al. 2006).
Classification of Bacterial Isolates Based on Patient Origin
TB isolates were grouped according to their geographic source: Individual isolates were assigned to the Aboriginal community in which the source patient was living at the time of diagnosis.
Historical Classification (Temporal Population Analysis).
Archival epidemiological data were used to classify Aboriginal communities (and bacterial isolates) according to the timing of postcontact epidemics of tuberculosis, which profoundly affected First Nations communities (Lux 2001). The onset of severe, epidemic tuberculosis in each community was assumed to coincide with the establishment and expansion of founding bacterial clone(s), either native or imported by European colonists. Historical data were thus used to estimate the “time since founding” of contemporary M. tuberculosis populations circulating in these communities. We defined two assemblages of communities within the study population: group 1 and group 2. Epidemic forms of tuberculosis became established in group 1 communities between 1874 and 1920. This group was also subclassified according to epidemic peak times: In group 1a communities, maximum disease rates occurred prior to 1900 (i.e., epidemic onset and peak both occurred between 1874 and 1900); in group 1b communities, the peak occurred between 1900 and 1940 (i.e., epidemic onset between 1874 and 1920 and epidemic peak between 1900 and 1940). By contrast, onset of epidemic tuberculosis was delayed until after 1920 in group 2 communities, all of which experienced peak disease rates after 1940. Epidemic timing differed among Aboriginal communities primarily as a result of community-to-community variability in the time course of European-First Nations acculturation. Acculturation was often accompanied by decreased food security, crowded living conditions, and institutionalization of First Nations peoples, all factors that promote the spread of tuberculosis (Milloy 1999; Lux 2001). Further historical details are provided in the supporting information (supplementary text S1, Supplementary Material online).
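The classification rules described above can be summarised in a short sketch. The function names are ours, and the handling of boundary years (for example, assigning an onset in exactly 1920 to group 1) is an assumption made for illustration; it is not taken from the archival protocol.

```python
from typing import Optional

STUDY_MIDPOINT = 1995  # midpoint of the 1986-2004 study interval (table 1)


def classify_community(onset_year: int, peak_year: Optional[int] = None) -> str:
    """Assign a community to group 1, 1a, 1b, or 2 from epidemic onset and peak years.

    Boundary years are an assumption: an onset in 1920 counts as group 1,
    and a peak in 1900 or 1940 counts as group 1b.
    """
    if onset_year > 1920:
        return "group 2"
    if peak_year is None:
        return "group 1"  # onset known, peak not documented
    if peak_year < 1900:
        return "group 1a"
    if peak_year <= 1940:
        return "group 1b"
    return "group 1"  # peak falls outside both subclass windows


def years_since_founding(onset_year: int) -> int:
    """Time since founding, measured to the midpoint of the study period."""
    return STUDY_MIDPOINT - onset_year
```

With these definitions, onsets of 1874 and 1920 correspond to 121 and 75 years since founding, which matches the group 1 range reported in table 1.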
Geopolitical Classification (Spatial Population Analysis).
The longitude and latitude of each community were identified from the Atlas of Canada (Natural Resources). The predominant ethnocultural group (i.e., Métis, Denesuline, Nêhiyawak, Anishinaabeg, Dakota, and Nakota) represented in each community was identified from the Encyclopedia of Saskatchewan (Canadian Plains Research Center), as well as from material published online by the communities themselves. In Canada, “Status Indian” rights, which include formal affiliation with a First Nations community and reserve, are inherited by the children of community members (Canadian Plains Research Center). As a result, the organization of First Nations individuals in Aboriginal communities reflects kinship structure, in addition to cultural identity. In the past, federal administration of Aboriginal communities was performed by “Indian agencies”: These agencies administered several reserves, usually in geographic proximity with each other. Communities within agencies shared resources such as schools, farming instructors, and religious officiates. The historic agency affiliation of each community was identified and used to group communities. In order to identify broader structural effects, adjacent agencies were also grouped into eight geographic regions covering the entire province. The resulting hierarchical structure (see table 4) combines the effects of geography, ecology (with environments ranging from Plains to sub-Arctic), kinship, ethnicity, and organization of government institutions.
Table 4.
Data Source for AMOVA.
| Level of Structure | Name of Structure | No. of Divisions | Mean Size^a | Median Size^a | Size Range^a |
| 1 | Bacterial isolates | 451 | 1 | 1 | 1 |
| 2 | Human communities | 48 | 577 | 650 | 88–2,929 |
| 3 | Indian agencies^b | 14 | 2,700 | 2,246 | 264–5,697 |
| 4 | Regions | 8 | 5,407 | 4,497 | 264–15,301 |
| N/A | Ethnic groups | 6 | 7,210 | 2,428 | 969–30,112 |
^a Human census of communities, agencies, regions, and ethnic groups. Source: 2004 health registrations (Saskatchewan Health 2004).
^b This is a historical term, referring to federal bureaucratic structures that oversaw the administration of several Aboriginal communities.
Estimates of Bacterial Demographic Parameters μ, Ne, and θ from Epidemiological Data
The group 2 bacterial effective population size, referred to hereafter as Ne(epi), was estimated from archival epidemiological data (harmonic mean of time series disease data). Along with Ne(epi), published estimates of the rate of IS6110 substitution were used to estimate μ(epi) and θ(epi). More details are found in the supplementary text S2, Supplementary Material online.
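As a rough illustration of the estimate described above: the harmonic mean of a time series gives most weight to its smallest values, so years with few cases dominate Ne(epi). The yearly counts and the per-generation rate used in the usage note are hypothetical placeholders, and the function names are ours; the actual series and rate are described in supplementary text S2. The relation θ = 2Neμ follows the haploid definition used for the simulations.

```python
from statistics import harmonic_mean
from typing import Sequence


def effective_size(population_series: Sequence[float]) -> float:
    """Harmonic mean of a time series of population sizes (all values must be > 0)."""
    return harmonic_mean(population_series)


def population_mutation_rate(ne: float, mu: float) -> float:
    """theta = 2 * Ne * mu for a haploid population."""
    return 2.0 * ne * mu
```

For example, a hypothetical series of 10, 100, and 1,000 infectious cases has an arithmetic mean of 370 but a harmonic mean of about 27, which shows how strongly an early bottleneck can depress the effective size.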
Statistical Analysis
Bacterial population diversity was assessed with the Simpson index (Ds) and the Simpson evenness index (Ds[e]) (Simpson 1949; Magurran 2003), with index values based on RFLP-defined haplotype frequencies. Variance of the Simpson index was calculated according to the formula suggested by Simpson for small populations (Simpson 1949). Populations were then compared with a two-tailed t-test. Index values, variances, and population comparisons using both indices were verified by bootstrap analysis: for each sample, 5,000 populations of size equal to the study population were simulated, with probability of each haplotype in the simulated population based on its frequency in the observed data. Variance of index values was calculated across simulated populations and differences between populations were evaluated with a two-tailed t-test based on mean index values and pooled variance.
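The indices and the bootstrap procedure can be sketched as follows. This is a minimal sketch and not the code used in the study: it assumes the finite-sample form of the Simpson index, Ds = 1 − Σ n_i(n_i − 1)/[N(N − 1)], and the evenness measure (1/D)/S, where D is the dominance term and S the number of haplotypes. The function names, the default seed, and the example counts in the usage note are hypothetical.

```python
import random
from collections import Counter
from statistics import mean, variance
from typing import Callable, Sequence, Tuple


def simpson_dominance(counts: Sequence[int]) -> float:
    """Probability that two isolates drawn without replacement share a haplotype."""
    n = sum(counts)
    return sum(c * (c - 1) for c in counts) / (n * (n - 1))


def simpson_diversity(counts: Sequence[int]) -> float:
    """Simpson index of diversity, Ds = 1 - D."""
    return 1.0 - simpson_dominance(counts)


def simpson_evenness(counts: Sequence[int]) -> float:
    """Simpson evenness, (1/D)/S; undefined when every haplotype is a singleton."""
    richness = sum(1 for c in counts if c > 0)
    return (1.0 / simpson_dominance(counts)) / richness


def bootstrap_index(
    counts: Sequence[int],
    index: Callable[[Sequence[int]], float] = simpson_diversity,
    replicates: int = 5000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Mean and variance of an index over populations resampled from observed frequencies."""
    rng = random.Random(seed)
    labels = list(range(len(counts)))
    n = sum(counts)
    values = []
    for _ in range(replicates):
        # Draw n isolates, each haplotype with probability equal to its observed frequency.
        sample = rng.choices(labels, weights=counts, k=n)
        values.append(index(list(Counter(sample).values())))
    return mean(values), variance(values)
```

For a hypothetical sample with haplotype counts 5, 3, 1, and 1, `simpson_diversity` returns about 0.71 and `simpson_evenness` about 0.87. Bootstrapped means and variances for two populations can then be compared with a t-test on pooled variance, as described above.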
Population genetic analyses were performed in Arlequin (version 3.1) (Excoffier 2005). Mismatch distribution analysis was based on pairwise differences between RFLP haplotypes (coded as binary data, as described above); 1,000 bootstrap replicates were performed for each analysis. For the population structure analyses, analysis of molecular variance (AMOVA) was used to partition total genetic variation within and among individuals, communities, agencies, ethnic groups, and geographic regions. As implemented in Arlequin, this was based on pairwise differences between RFLP haplotypes; significance levels of differences in covariance components were computed according to a nonparametric permutation procedure (10,000 permutations/analysis). The Mantel test was also performed, to assess the degree of correlation between great circle distances and genetic distances. Genetic distances reported here include population pairwise FST values, and Nei's average number of differences between populations (Excoffier 2005). A nonparametric permutation procedure determined the level of significance of correlation between matrices.
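The logic of the Mantel test can be illustrated with a small permutation sketch. It is not Arlequin's implementation: the one-sided P value, the Pearson statistic, and all names below are assumptions chosen for clarity. Rows and columns of one distance matrix are permuted together, and the observed correlation is compared with the permutation distribution.

```python
import random
from itertools import combinations
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def _pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length lists (neither may be constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def mantel(dist_a: Matrix, dist_b: Matrix, permutations: int = 10000,
           seed: int = 0) -> Tuple[float, float]:
    """Return (correlation, one-sided permutation P value) for two distance matrices."""
    n = len(dist_a)
    pairs = list(combinations(range(n), 2))  # upper triangle only
    a = [dist_a[i][j] for i, j in pairs]

    def corr(order: List[int]) -> float:
        # Relabel populations in matrix B according to the permutation.
        return _pearson(a, [dist_b[order[i]][order[j]] for i, j in pairs])

    order = list(range(n))
    observed = corr(order)
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(order)
        if corr(order) >= observed:
            hits += 1
    # Count the observed arrangement itself so the P value is never zero.
    return observed, (hits + 1) / (permutations + 1)
```

In this setting, `dist_a` would hold great circle distances between communities and `dist_b` the corresponding pairwise genetic distances.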
Great circle distances (in radians) were calculated according to the haversine formula (Sinnott 1984), which reduces rounding errors for small distances.
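A minimal sketch of the haversine calculation is shown below. Inputs are assumed to be latitude and longitude in decimal degrees, and the function name is ours. The result is the central angle in radians; multiplying by a mean Earth radius of roughly 6,371 km would convert it to kilometres.

```python
import math


def haversine_radians(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great circle distance between two points, as a central angle in radians."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 so floating-point rounding cannot push asin out of its domain.
    return 2 * math.asin(min(1.0, math.sqrt(a)))
```

Because the formula works with the sine of half the angular difference, it remains numerically stable for nearby communities, where formulas based on the arccosine lose precision.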
Neutrality of RFLP haplotypes was assessed using an infinite alleles model, with each haplotype treated as a single allele. The Ewens–Watterson (E–W) homozygosity test (Ewens 1972; Watterson 1978) (as implemented in Arlequin 3.1) and Ewens’ conditional sample frequency spectrum (CSFS) test (Ewens 1973, 2004) were used to compare the observed haplotype frequency spectrum with neutral expectations. For the CSFS test, the approximate value for the expected number of singletons (eq. 8, Ewens 1973) and the conservative Poisson approximation of probability greater than or equal to the observed number of singletons were used. The use of the exact formulae resulted in inflated type 1 error rates in simulation experiments. We postulated that some characteristics of the study/simulated populations, possibly their small size and high mutation rate, amplified the effect of stochastic variation on the CSFS test. For this reason, a more conservative version of the test was used to explore the hypotheses of interest (neutral equilibrium vs. other explanations of the observed genetic data).
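Two ingredients of these tests are simple enough to sketch: the observed homozygosity that the E–W test compares with its neutral expectation, and the Poisson upper-tail probability used in the conservative CSFS test. The expected number of singletons (eq. 8, Ewens 1973) is treated here as an input and is not derived; the function names and the example counts in the usage note are hypothetical.

```python
import math
from typing import Sequence


def observed_homozygosity(counts: Sequence[int]) -> float:
    """Sum of squared haplotype frequencies, F = sum(p_i ** 2)."""
    n = sum(counts)
    return sum((c / n) ** 2 for c in counts)


def poisson_upper_tail(observed: int, expected: float) -> float:
    """P(X >= observed) for X ~ Poisson(expected)."""
    lower = sum(
        math.exp(-expected) * expected ** k / math.factorial(k)
        for k in range(observed)
    )
    return 1.0 - lower
```

For a hypothetical sample with haplotype counts 5, 3, 1, and 1, the observed homozygosity is 0.36. An excess of singletons relative to the neutral expectation yields a small upper-tail probability.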
Phylogenetic Analysis
Standard phylogenetic analysis (parsimony, Neighbor-Joining, and Bayesian methods) of RFLP, spoligotyping, and LSP data produced low-resolution trees with very short branch lengths and large polytomies (data not shown). Given the short time span during which these isolates diverged, it is likely that ancestral types are extant. We therefore chose to use networks to describe relationships among bacterial strains. Separate networks were constructed from spoligotyping data for each LSP-defined lineage (“H37Rv-like,” Rd 219, Rd 182, and Rd 115). The largest group (H37Rv-like) is shown in figure 1. Networks are parsimonious reconstructions of oligonucleotide spacer loss events, with spacer loss considered to be a one-way event. This is consistent with the “closed genome” of M. tuberculosis and a recent report of the relationship between spoligotype and other genetic markers of ancestral versus modern lineages (Flores et al. 2007). LSP-defined lineages were treated separately because organisms with different genetic backgrounds have been known to converge on the same spoligotype (Warren et al. 2002; Filliol et al. 2006; Gutacker et al. 2006). A minimum spanning tree (MST) of RFLP haplotypes is shown in figure 2. The tree was generated with the Prim–Jarnik algorithm, as implemented in BioNumerics 5.0 (Applied Maths, Kortrijk, Belgium). The BURST priority rule maximizing single and double locus variants was used during network searches. Permutation resampling was done to assess statistical support for network topology.
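For readers unfamiliar with the approach, the following sketch builds a minimum spanning tree over binary band profiles with Prim's algorithm, using the number of band differences as the distance. It does not reproduce the BioNumerics implementation; in particular, the BURST priority rule for breaking ties is not implemented, and the names and the example profiles in the usage note are hypothetical.

```python
from typing import List, Sequence, Tuple


def hamming(a: Sequence, b: Sequence) -> int:
    """Number of band positions at which two binary profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def prim_mst(haplotypes: Sequence[Sequence]) -> List[Tuple[int, int, int]]:
    """Return MST edges as (parent index, child index, distance), rooted at index 0."""
    n = len(haplotypes)
    if n == 0:
        return []
    # best[j] = (shortest known distance from j to the tree, tree node providing it)
    best = {j: (hamming(haplotypes[0], haplotypes[j]), 0) for j in range(1, n)}
    edges = []
    while best:
        j = min(best, key=lambda k: best[k][0])  # ties go to the earliest entry
        dist, parent = best.pop(j)
        edges.append((parent, j, dist))
        # Newly attached node j may offer a shorter link to the remaining nodes.
        for k in best:
            d = hamming(haplotypes[j], haplotypes[k])
            if d < best[k][0]:
                best[k] = (d, j)
    return edges
```

For the hypothetical profiles "1100", "1110", "1111", and "0100", the resulting tree has three edges, each of length 1.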
FIG. 1.
Haplogroup network, H37Rv-like organisms. Network (parsimonious reconstruction of one-way oligonucleotide spacer loss events) of 434 bacterial isolates. Each distinct spoligotype defines a haplogroup. Because homoplasy is known to be a problem with spoligotyping data, each LSP-defined lineage was considered separately, and only the largest (H37Rv-like) of these LSP-defined groups is shown. Haplogroups are shown as nodes and are labeled with the shared type number (s53, s34, etc.) assigned within the international spoligotyping database (Brudey et al. 2006). Orphan types (not found within the database) are designated n-1 to n-8. Node area is proportional to number of tuberculosis cases associated with the spoligotype. Scale of case numbers associated with different spoligotypes is indicated by the labeled, open circles adjacent to the network. Individual spacer loss events are represented by gray beads on the lines joining nodes. Sections within the nodes are color coded to indicate their community type of origin. Dark blue sections represent bacterial isolates from group 1a communities (epidemic peak <1900), medium blue isolates are from group 1b communities (epidemic peak 1900–1940), and light blue sections originated in group 2 communities (epidemic peak >1940). Color-coded sections within networks are drawn to scale, to demonstrate the relative contributions of isolates from the three types of communities. The general appearance of the network is consistent with the influence of genetic drift, as opposed to positive selection (see text).
FIG. 2.
MST of RFLP haplotypes. Single and double locus variants were prioritized during clustering. All branches but one in the topology had 100% statistical support in permutation resampling (1,000 iterations). The branch with 50% support is shown dashed. Branches of length > 5 are not shown. Node size (area) is proportional to associated TB case numbers (scale indicated in open circles). Nodes are color coded according to geographic region of origin (see table 4). Pie charts are used to illustrate relative frequencies of haplotypes found in more than one region. Regions 1 and 7 belong to group 1 (TB epidemic initiation < 1920) and are distinguished by a black circle in the node interior; the remaining regions are in group 2 (TB epidemic initiation > 1920). Haplotypes belonging to the ancestral haplogroup (s53) are outlined in gray. Mismatch-distribution analysis was performed on the same data; results are shown and discussed in Supplementary Materials online.
Simulation Experiments, Group 2 Population
Simulation experiments were performed to determine how statistical tests of neutrality behave in response to deviations from the condition of a constant, panmictic population (demographic deviations)—in the absence of positive or negative selection. The following parameters were included in the simulations: time since bacterial population bottleneck/epidemic founding event (t0), rate of population growth (α), sample size from each island in a subdivided population (ni), population migration rate between islands (2Nm), and population mutation rate (θ = 2Nμ). For the purpose of the simulations, θ—denoted hereafter as θ(gen)—was estimated from the bacterial genetic data. This was done to separate the effect of demographic deviations from uncertainty inherent in estimates of θ from historical, epidemiological and experimental data. Details of simulation parameter estimation can be found in the supporting material (supplementary text S3, Supplementary Material online).
We used the computer program “ms” (Hudson 2002) to produce simulated samples under selective neutrality and a variety of demographic conditions. We ran the two neutrality tests in 14 different scenarios, 1,000 repetitions each, for a total of 28,000 simulated data sets that resemble our real-world data set of 386 group 2 haplotypes. Neutral scenarios consisted of a panmictic population of constant size (baseline), bottleneck alone (at four different times in the past), subdivision alone, and combinations of bottleneck with subdivision into four or six islands (i.e., eight combination scenarios).
For each of the 28 neutral scenarios and the corresponding 1,000 data sets, we used a script written in Perl to convert the output of ms into a series of files accessible by Arlequin software (Schneider et al. 2000). Using the same script, we found the approximate Poisson distribution and the resulting P value for the number of singletons observed in each sample (i.e., the P value for the CSFS test). We then ran the Arlequin software on each of the 28,000 data sets to compute the P value for the Ewens’ homozygosity test.
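The conversion step can be illustrated with a short Python analogue; the Perl script used in the study is not reproduced here. The sketch assumes the standard text output of ms (a command line, a seed line, then one block per replicate introduced by `//`) and treats each distinct simulated sequence as one allele, which is our reading of the infinite alleles treatment. All function names are hypothetical.

```python
from collections import Counter
from typing import List


def parse_ms(text: str) -> List[List[int]]:
    """Return haplotype counts (largest first) for each replicate in ms output."""
    lines = text.strip().splitlines()
    nsam = int(lines[0].split()[1])  # first line echoes: ms nsam nreps ...
    replicates, current = [], None
    for raw in lines[1:]:
        line = raw.strip()
        if line == "//":  # start of a new replicate block
            if current is not None:
                replicates.append(current)
            current = []
        elif current is not None and line and set(line) <= {"0", "1"}:
            current.append(line)  # one simulated sequence
    if current is not None:
        replicates.append(current)
    counts = []
    for seqs in replicates:
        if not seqs:
            # No segregating sites: every sampled sequence is identical.
            counts.append([nsam])
        else:
            counts.append(sorted(Counter(seqs).values(), reverse=True))
    return counts


def singletons(counts: List[int]) -> int:
    """Number of haplotypes observed exactly once."""
    return sum(1 for c in counts if c == 1)
```

The per-replicate counts can then be written out in whatever input format the downstream software requires, and the singleton count feeds the CSFS test.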
Results
Clonal Bacterial Population Structure
There were 561 culture-confirmed cases of tuberculosis in the study population between 1986 and 2004; bacterial isolates were available from 451 of these cases. RFLP analysis defined 98 different haplotypes among these bacterial isolates. The rank order of the four most frequent haplotypes was exactly the same in the first 9 years as in the second 9-year interval. Spoligotyping analysis revealed 26 different spoligotypes, with several different RFLP types sharing the same spoligotype. All tested isolates contained the katG CTG→CGG mutation, indicating they are members of the Euro-American M. tuberculosis clade (Gagneux et al. 2006). The majority of RFLP-defined haplotypes (90/98, representing 434/451 isolates) belonged to the H37Rv-like lineage (also known as principal genetic group 3, genetic cluster VIII) (Gagneux and Small 2007); they did not contain any of the 10 lineage-defining deletions of the other members of this clade. Members of the H37Rv-like lineage are rare relative to other lineages within the Euro-American clade (Gutacker et al. 2006); in situ expansion and diversification of a single H37Rv-like founder thus appears to account for the majority of cases of tuberculosis in this population. A smaller group of RFLP haplotypes (4/98, representing 13 isolates) contained the Rd 219 deletion. Three RFLP haplotypes (representing 3 isolates) belonged to the Rd 182 lineage. One singleton RFLP haplotype contained another Euro-American clade deletion, Rd 115.
Evidence of a Founder Effect.
Study strains were isolated from 443 individuals (8 individuals had repeat episodes of disease within the study interval), who lived in 48 different communities at the time of diagnosis. Of these, 26 host communities (61 bacterial isolates) could be classified as group 1: Epidemics of tuberculosis were initiated in these communities in the early reserve era (1874–1920, see table 1). Twenty-one communities (386 isolates) were classified as group 2: Epidemics of tuberculosis were delayed until the late reserve era (1920–2004). One community (four isolates) could not be classified. Relative to its size, a greater variety of RFLP haplotypes, spoligotypes, and LSP types was observed in the group 1 population of bacteria (table 2), suggesting diversification through mutation and/or migration events during the ∼130 years since epidemic tuberculosis became established in these communities. The more recently established group 2 population of bacteria was dominated by a small number of highly prevalent RFLP haplotypes (eight haplotypes accounted for 300/386 cases, see table 3), consistent with a founder effect. Differences in the structure of group 1 and group 2 populations, reflected in diversity index values (Simpson index and Simpson evenness index), were highly statistically significant (P < 0.0001 for both indices, two-tailed t-tests on bootstrapped data for both indices, as well as directly on observed data for the Simpson index).
Table 1.
Historical Classification of Communities and Their Bacterial Isolates.
| Classification | Epidemic Onset (t0)^a | Time Since Founder^b | Subclassification | Epidemic Peak | Time Since Peak^b |
| Group 1 | 1874–1920 | 75–121 years | Group 1a | <1900 | >95 years |
| | | | Group 1b | 1900–1940 | 55–95 years |
| Group 2 | >1920 | <75 years | | >1940 | <55 years |
^a t0 refers to the timing of an epidemiological shift to severe epidemic tuberculosis.
^b Calculated to midpoint of study period, 1995.
Table 2.
Diversity of Bacteria from Different Types of Communities.
| Variable | Group 1 (t0 < 1920) | Group 2 (t0 > 1920) |
| Host communities | 26 | 21 |
| Bacterial isolates | 61 | 386 |
| RFLP haplotypes | 48 | 60 |
| Spoligotypes | 22 | 14 |
| LSP types^a, no. | 4 | 2 |
| LSP, description^b | Rv, 219, 182, 115 | Rv, 219 |
| Ds^c (95% confidence interval) | 0.99 (0.98–1.00)* | 0.89 (0.88–0.91)* |
| Ds, bootstrap^d | 0.97 (0.95–0.99)* | 0.89 (0.87–0.91)* |
| Ds(e)^e | 1.73 | 0.16 |
| Ds(e), bootstrap | 1.14 (0.74–1.54)* | 0.21 (0–0.61)* |
^a LSP types are based on presence/absence of global lineage-defining genomic deletions.
^b Rv: H37Rv-like lineage; 219: Rd 219 lineage; 182: Rd 182 lineage; 115: Rd 115 lineage.
^c Ds: Simpson index of diversity, based on relative frequencies of RFLP haplotypes.
^d Bootstrap: mean index values from 5,000 bootstrap simulations.
^e Ds(e): Simpson index of evenness, based on frequencies of RFLP haplotypes relative to total number of types in the sample.
*P < 0.0001, two-tailed t-test comparing group 1 with group 2.
Table 3.
RFLP Haplotype Frequency Spectrum, Group 1 versus Group 2.
| Group | Haplotype Frequency (Number of Occurrences)^a |
| 1 | 5, 4, 2(6), 1(39) |
| 2 | 89, 64, 35, 34, 24, 19, 18, 17, 7, 6, 5(3), 4(2), 3, 2(3), 1(40) |
^a The frequency of an RFLP type (number of times this frequency appears in the population). For example, if haplotype A appears three times and haplotype B appears three times, the notation would be 3(2). Unless noted otherwise, a particular frequency occurred once. This is the configuration distribution: see Ewens (2004, p. 112), for an example.
Group 1 host communities (and bacterial isolates) were subclassified according to epidemic peak times. Fourteen communities (32 isolates) were classified as group 1a (epidemic peak <1900). Ten communities (23 isolates) were classified as group 1b (epidemic peak 1900–1940). There were insufficient data to subclassify the remaining two group 1 communities (six isolates). Differences in diversity index values between group 1a and group 1b populations were not statistically significant (data not shown).
Barriers to Pathogen Migration within a Subcontinental Population.
The Mantel test revealed a modest level of correlation between geographic distance among host communities and pairwise FST values of the bacterial populations (correlation coefficient 0.22, P = 0.001). Nei's corrected average pairwise differences (see Materials and Methods) and geographic distances were not significantly correlated (correlation coefficient 0.003, P = 0.48). Nei's distance measure is likely the more appropriate comparator, given that M. tuberculosis is a haploid, asexual organism. These results suggest that on this scale, flow of M. tuberculosis between human populations is not a simple function of physical distance.
Results of AMOVA are shown in table 5. The hierarchical structuring scheme is outlined in table 4. AMOVA indicated a great degree of genetic differentiation (FST = 0.34), with the most inclusive categories (region and agency) accounting for a large proportion of total variation (28–31%). Results were statistically significant at all hierarchical levels (community, agency, and region), indicating some degree of differentiation between even the smallest populations. Comparisons between ethnic groups (which are not entirely independent of reserve, agency, and region) also accounted for a relatively high proportion of variation (19%). Together with results of the Mantel test, these data suggest that barriers to pathogen migration separate even relatively small human populations.
Table 5.
AMOVA (a).
| Level | Source of Variation | Degrees of Freedom | Sum of Squares | Variance Component | % of Variation | Fixation Index | P Value |
| Levels 1–3 | |||||||
| 1 | Within communities | 398 | 1,855 | Vc = 4.66 | 65.66 | FST = 0.34 | <0.0001 |
| 2 | Among communities, within agencies | 29 | 204 | Vb = 0.44 | 6.23 | FSC = 0.09 | <0.0001 |
| 3 | Among agencies | 13 | 861 | Va = 1.99 | 28.10 | FCT = 0.28 | <0.0001 |
| Levels 2–4 | |||||||
| 2 | Within agencies | 436 | 2,077 | Vc = 4.76 | 66.35 | FST = 0.34 | <0.0001 |
| 3 | Among agencies, within regions | 6 | 51 | Vb = 0.22 | 3.12 | FSC = 0.04 | <0.0001 |
| 4 | Among regions | 7 | 840 | Va = 2.19 | 30.54 | FCT = 0.31 | <0.0001 |
| Ethnic groups | |||||||
| N/A | Within ethnic groups | 443 | 2,474 | Vb = 5.59 | 80.71 | FST = 0.19 | <0.0001 |
| N/A | Among ethnic groups | 5 | 448 | Va = 1.34 | 19.29 | | |
(a) Three analyses are shown. The first two incorporate the hierarchical geographic/social population structure described in the Materials and Methods section; as Arlequin could accommodate only three hierarchical levels, two separate analyses were done to cover the four levels of the hierarchy. The third analysis is based on ethnic group affiliation of individual Aboriginal communities.
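The fixation indices in table 5 can be recovered from the reported variance components. The sketch below assumes the standard three-level AMOVA definitions (FCT = Va/total, FSC = Vb/(Vb + Vc), FST = (Va + Vb)/total); the function name is hypothetical, and small discrepancies in the percentage column are to be expected because the published components are rounded.

```python
def fixation_indices(va, vb, vc):
    """Fixation indices and % of variation from three AMOVA variance components.

    va: among groups; vb: among populations within groups; vc: within populations.
    """
    total = va + vb + vc
    return {
        "FCT": va / total,
        "FSC": vb / (vb + vc),
        "FST": (va + vb) / total,
        "percent": [100 * v / total for v in (va, vb, vc)],
    }

# Variance components as reported in table 5.
levels_1_3 = fixation_indices(va=1.99, vb=0.44, vc=4.66)
levels_2_4 = fixation_indices(va=2.19, vb=0.22, vc=4.76)

for name, res in (("Levels 1-3", levels_1_3), ("Levels 2-4", levels_2_4)):
    print(name, {k: round(v, 2) for k, v in res.items() if k != "percent"})
```

With only two levels, as in the ethnic group analysis, the same logic gives FST = Va/(Va + Vb) = 1.34/6.93, or about 0.19.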
Dispersal of Neutral Variants through a Series of Epidemics.
We looked for phylogenetic signatures of positive selection, which could explain the high prevalence of specific, “dominant” M. tuberculosis strain types in this and other studies. Network analysis of the largest clonal group of bacteria (H37Rv-like, made up of 434 isolates with 23 different spoligotypes) is shown in figure 1. The general appearance of the network suggests intact preservation of the neutral DR locus during temporal and geographic dispersal of the organism. The largest, most “successful” haplogroup (s53) is also the putative ancestor by virtue of the fact that it has the largest number of intact oligonucleotide spacers; this type is prevalent in all three populations (groups 1a, 1b, and 2). A core group of closely related spoligotypes (s34, s37, and s784) likely evolved over the course of the first epidemic (group 1a) and went on to found the second (group 1b) and third (group 2) epidemics. Three spoligotypes (s4, n-4, and n-5) emerge in the middle epidemic (group 1b) and are associated with a large number of group 2 cases. There is no evidence of clonal replacement, as emergent types cocirculate with common, ancestral spoligotypes, in the three populations.
An MST of RFLP haplotypes belonging to the H37Rv-like lineage is shown in figure 2. In comparison with the spoligotyping data, there is more evidence of differentiation of RFLP haplotypes among groups and regions. Haplotypes from the smaller of two clusters within the network belong to the ancestral haplogroup s53, as well as s37 (one node). The larger cluster is more geographically dispersed, perhaps as a result of temporal and geographic variation in patterns of connectivity within and among regions. The hub of this cluster belongs to haplogroup s34 and is distributed among group 1a, 1b, and 2 communities. The centrality of this node suggests that the type reached high frequency during the earliest epidemics (i.e., group 1a), then dispersed and diversified during subsequent waves of epidemic tuberculosis. Outlying haplotypes scattered around the figure may represent network decay as individual haplotypes go extinct; “burst” dynamics of transposition could also play a role.
Study Populations Deviate from Neutral Expectations in Formal Tests of Natural Selection.
Results of E–W and Ewens’ CSFS tests applied to bacterial genetic data (RFLP) from different regions and types of host communities are shown in table 6. There is evidence of between-region variability, which could be the result of sample size differences or distinct evolutionary or demographic histories. Statistical significance of E–W and CSFS tests increased when different regions were pooled together, implying a sample size effect and possible sensitivity of the tests to population subdivision.
Table 6.
Tests of Neutrality, Infinite Allele Model.
| Classification | Region (a) | n (b) | k (c) | a (d) | Expected homozygosity (e) | Observed homozygosity (f) | P Value (E–W) (g) | Sim. (h) | P Value (CSFS) (i) | Sim. (h) |
| Group 1 | 1 | 31 | 23 | 18 | 0.055 | 0.061 | 0.13 | 0.23 | 0.00013 | 0 |
| | 7 | 29 | 25 | 21 | 0.045 | 0.044 | 0.47 | 0.62 | 2 × 10⁻⁵ | 0 |
| | Total | 61 (j) | 48 | 40 | 0.023 | 0.028 | 0.06 | 0.50 | 4 × 10⁻¹¹ | 0 |
| Group 2 | 2 | 176 | 22 | 17 | 0.142 | 0.318 | 0.0001 | 0.01 | 1 × 10⁻⁶ | 0 |
| | 4 | 114 | 19 | 13 | 0.139 | 0.253 | 0.0004 | 0.02 | 0.00017 | 0 |
| | 5 | 47 | 15 | 8 | 0.133 | 0.177 | 0.12 | 0.22 | 0.03 | 0.09 |
| | 6 | 24 | 5 | 3 | 0.377 | 0.639 | 0.01 | 0.05 | 0.12 | 0.25 |
| | 8 | 15 | 9 | 7 | 0.155 | 0.209 | 0.01 | 0.05 | 0.03 | 0.09 |
| | 3 | 9 | 2 | 0 | 0.671 | 0.654 | 0.41 | 0.56 | N/A | N/A |
| | Total | 386 (j) | 60 | 41 | 0.051 | 0.109 | <0.0001 | 0.06 | 1 × 10⁻¹³ | 0 |
(a) Numeric code of region.
(b) Sample size of region (number of bacterial isolates).
(c) Number of different RFLP haplotypes.
(d) Number of singleton haplotypes.
(e) Under neutral equilibrium conditions, probability that two randomly chosen haplotypes from a population of the same size, with the same total number of haplotypes, would be identical by descent (1 − π). From the E–W sampling formula.
(f) Observed value of (1 − π).
(g) P value for Ewens–Watterson–Slatkin exact test (10,000 simulations).
(h) Proportion of 1,000 neutral simulated samples with P value ≤ observed value for reported neutrality test (E–W or CSFS). Bottleneck set at Ne generations in the past (see Results text). For pooled samples, comparison was with the six-island model. For within-region samples, comparison was with simulated bottleneck only.
(i) Probability that a sample of the same size and with the same number of haplotypes would have greater than or equal to the observed number of singletons. Neutrality and stable mutation rate are assumed.
(j) Regions do not add up to totals due to exclusion of group 1 communities within predominantly group 2 regions and vice versa. Two communities (two isolates) were excluded from region-by-region calculations for this reason.
The E–W test performed differently in group 1 versus group 2 populations. Values of the homozygosity index derived from the genetic data (gen) were close to those expected in a neutral model and were not statistically significant for regions within the group 1 population, nor for group 1 as a whole. By contrast, values from group 2 regions were, with one exception (a region with only 9 cases), higher than expected and statistically significant in 4/6 regions. Region 5 (n = 47) is an outlier. It is a very large territory, and the difference in population structure may reflect within-region heterogeneity in TB epidemic dynamics.
Archival epidemiological data were used to derive an independent estimate of the expected homozygosity index for the group 2 population. Values of Ne(epi), μ(epi), and θ(epi), estimated directly from published data and historical records (see Materials and Methods and supplementary text S2, Supplementary Material online), were used to generate this expected value (epi). The result is slightly higher than the value generated by Ewens’ sampling formula (gen) (Ewens 1972) but is well below the observed value (table 6), confirming that the group 2 population structure deviates from neutral equilibrium expectations.
The CSFS test revealed a statistically significant excess of rare (singleton) haplotypes in all group 1 regions and for the population as a whole. Results were also significant for group 2 as a whole. The two largest regions within group 2 also had highly statistically significant results, whereas results were borderline significant (P = 0.03) for a further two regions and not significant in one (P = 0.12). Results from the group 2 population are paradoxical, in that measures of overall diversity are low (reflected in a high homozygosity index), yet there are excess singleton haplotypes. These aspects of the population arise from its highly uneven/leptokurtic structure (see fig. 3 for group 2 as a whole): Common haplotypes are more common; there are more rare types (singletons), and hence there is a paucity of the intermediate frequencies expected under neutrality. This pattern has, for example, also been seen with RFLP data from human mitochondrial DNA (Whittam et al. 1986).
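The apparent paradox of low diversity together with excess singletons can be made concrete with a small sketch. It computes the summary quantities reported in table 6 (n, k, number of singletons, and homozygosity as the sum of squared haplotype frequencies) for two hypothetical samples of equal size and equal number of haplotypes. The haplotype labels and counts are invented for illustration, and whether a sample-size correction was applied to the homozygosity values in the study is not stated in this section.

```python
from collections import Counter

def frequency_summary(haplotypes):
    """Return (n, k, singletons, homozygosity) for a list of haplotype labels."""
    counts = Counter(haplotypes)
    n = sum(counts.values())
    k = len(counts)
    singletons = sum(1 for c in counts.values() if c == 1)
    # Probability that two draws (with replacement) carry the same haplotype.
    homozygosity = sum((c / n) ** 2 for c in counts.values())
    return n, k, singletons, homozygosity

# Hypothetical uneven sample: one dominant type, one minor type, five singletons.
uneven = ["A"] * 12 + ["B"] * 3 + list("CDEFG")
# Hypothetical even sample with the same n = 20 and k = 7.
even = ["A"] * 3 + ["B"] * 3 + ["C"] * 3 + ["D"] * 3 + ["E"] * 3 + ["F"] * 3 + ["G"] * 2

print(frequency_summary(uneven))  # high homozygosity, many singletons
print(frequency_summary(even))    # low homozygosity, no singletons
```

In the uneven sample a single dominant haplotype drives homozygosity up while the remaining haplotypes are mostly singletons, which is the qualitative pattern described for group 2.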
FIG. 3.
Tests of neutrality. (A) RFLP haplotype frequency spectrum for group 2 populations (t0 > 1920) is shown in white, ordered from the most to the least prevalent (each bar represents an individual haplotype). The neutral frequency spectrum predicted by Ewens’ sampling formula is shown in black. Relative to neutral expectations, group 2 populations have more prevalent “dominant” haplotypes and an excess of very rare haplotypes, with few moderate frequency haplotypes. (B) Performance of neutrality tests in simulated populations with demographic conditions similar to the study population. The percentage of 1,000 simulated neutral samples (each n = 386) with a false positive test result (0.01 significance level) is shown on the Y axis. Proceeding from left to right along the X axis, the first panel depicts a panmictic population with and without a past bottleneck (timing is expressed in terms of Ne generations). The second and third panels show results when simulated populations were divided into four and six islands, respectively, with a bottleneck at 0.4 to 1 Ne generations in the past. The CSFS test is primarily affected by population subdivision, and imposition of bottlenecks does not appear to add to this effect. Performance of the E–W test is strongly affected by population subdivision and bottlenecks, particularly recent ones.
Tests of Natural Selection Are Affected by Realistic Demographic Scenarios.
We used our demographic parameter estimates for the group 2 population (see supplementary text S3, Supplementary Material online) to simulate a variety of plausible demographic scenarios in order to determine what effect these might have on population structure, and performance of tests of selection. Performance of the E–W and CSFS tests was evaluated first under neutral equilibrium conditions (stable population size and no subdivision). The E–W test had an appropriate number of false positives (0.5% of samples with P ≤0.01), whereas the CSFS test had a slight excess (2.2% of samples with P ≤0.01). Bottlenecks at Ne, 0.8Ne, 0.6Ne, and 0.4Ne bacterial generations in the past (corresponding to 66, 52, 39, and 26 years, using a 62-week generation time) were simulated with and without population subdivision (four- and six-island models). Demographic effects on performance were very different for the two tests (fig. 3). Bottlenecks had a minimal effect on the CSFS test: 2.7% of samples had P ≤0.01 with a bottleneck at Ne generations in the past. Recent bottlenecks (0.6Ne, 0.4Ne) resulted in less than a 2-fold increase in false positives relative to the baseline (4.2% and 4.0%, respectively). Population subdivision, on the other hand, had a dramatic effect on numbers of false positives (fig. 3). There was no evidence of an additive effect of bottlenecks and population subdivision on test performance: within subdivision categories (i.e., none, four-island, and six-island), false positive rates remained relatively flat across progressively more extreme bottleneck conditions.
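The conversion from bottleneck timing in units of Ne generations to calendar years can be sketched as follows. The 62-week generation time is taken from the text. The value Ne = 55 generations is an assumption, back-calculated here only because it reproduces the quoted years; it is not a value stated in this section, and the actual estimate is described in the supplementary text.

```python
WEEKS_PER_YEAR = 52
GENERATION_WEEKS = 62   # local generation time quoted in the text
NE_GENERATIONS = 55     # ASSUMPTION: back-calculated, not taken from the source

def bottleneck_years(fraction_of_ne):
    """Years before present for a bottleneck at fraction_of_ne * Ne generations."""
    generations = fraction_of_ne * NE_GENERATIONS
    return generations * GENERATION_WEEKS / WEEKS_PER_YEAR

fractions = (1.0, 0.8, 0.6, 0.4)
years = [round(bottleneck_years(f)) for f in fractions]
print(dict(zip(fractions, years)))
```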
By contrast, the E–W test was strongly affected by bottlenecks: there was a 7-fold increase in false positive rates (from 0.5 to 3.5%) in the presence of even a remote bottleneck (Ne generations). The rate of false positives increased linearly as bottlenecks became progressively more recent (r² = 0.86). In contrast with the CSFS test, effects of population subdivision and bottlenecks were additive: false positive rates increased linearly within categories of division, as progressively more recent bottlenecks were imposed, and the imposition of a more extreme model of division raised overall rates (fig. 3).
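The neutral equilibrium baseline against which both tests were evaluated (stable size, no subdivision) can be illustrated with a Hoppe urn, a standard way of drawing samples from the infinite alleles model. This is a minimal sketch only: it includes neither bottlenecks nor island structure, the function names and the values n = 50 and θ = 5 are hypothetical, and it is not the simulation software used in the study.

```python
import random
from collections import Counter

def hoppe_urn_sample(n, theta, rng):
    """Draw allele labels for n genes under the neutral infinite alleles model."""
    labels = []
    next_label = 0
    for i in range(n):
        # The (i+1)-th gene is a new allele with probability theta / (theta + i).
        if rng.random() < theta / (theta + i):
            labels.append(next_label)
            next_label += 1
        else:
            labels.append(rng.choice(labels))  # copy a previously drawn gene
    return labels

def expected_k(n, theta):
    """Expected number of distinct alleles in a sample of n (Ewens 1972)."""
    return sum(theta / (theta + i) for i in range(n))

rng = random.Random(42)
reps = 2000
k_values = [len(Counter(hoppe_urn_sample(50, 5.0, rng))) for _ in range(reps)]
mean_k = sum(k_values) / reps
print(f"simulated mean k = {mean_k:.2f}, expected k = {expected_k(50, 5.0):.2f}")
```

Agreement between the simulated and expected numbers of haplotypes is a basic check that the sampler behaves neutrally; false positive rates such as those in figure 3 would then be obtained by applying a neutrality test to each simulated sample and counting the rejections.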
Actual P values generated by testing simulated populations were compared with values generated from the observed data (table 6): The proportion of 1,000 simulated samples with P ≤ observed value was recorded for each region and group. Within-region values were compared with undivided, simulated populations, and pooled values were compared with divided, simulated populations. A bottleneck at Ne bacterial generations in the past (65 years) was chosen; based on archival data, this is a realistic scenario (t0 for group 2 communities ∼1930, using the local estimate of generation time = 62 weeks; see supplementary text S2, Supplementary Material online). Accounting for demographic conditions decreased the significance of E–W test results. Results were significant for the largest group 2 region and borderline significant for the next largest, but were otherwise not statistically significant. By contrast, values of the CSFS test remained significant for 5/8 regions and for both pooled groups. Taken together, the simulation experiments suggest that demographic factors may not account completely for the excess of rare (singleton) haplotypes observed in group 1 and group 2 populations. However, the phenomenon of highly prevalent, “dominant” haplotypes observed in group 2 populations may result from a combination of recent bottlenecks and population subdivision.
Discussion
Evidence for a Founder Effect and against Positive Selection in the Study Population
As European influence on Saskatchewan Aboriginal populations became more pervasive, these populations became newly susceptible to epidemic forms of tuberculosis (Lux 2001) (e.g., as a result of increased population density, see Materials and Methods and supplementary text S1, Supplementary Material online). We have shown that bacterial populations from human communities in which European acculturation (and epidemic tuberculosis) became established relatively recently (group 2) are dominated by a few molecular variants. Long-established bacterial populations (group 1) are, by contrast, much more diverse relative to their size. If the high prevalence of certain M. tuberculosis strains were due to increased fitness, we would have expected to observe this “dominance effect” to the same extent in both populations, or perhaps to a greater extent in group 1, which has had more time for fixation of a fit mutant. The difference in diversity between the two populations is more consistent with a founder effect in group 2 and diversification of group 1 due to cumulative mutation and migration events. The haplogroup network (fig. 1) further supports the idea that genetic drift plays a primary role in structuring these pathogen populations. In contrast with the situation where highly adapted variants sweep through a population and periodically purge it of preexisting variation (Shankarappa et al. 1999), the network is consistent with a scenario in which stable bacterial types spread through consecutive epidemics. In addition, common (“successful”) variants are scattered across the network in a manner consistent with stochastic effects, as opposed to a single lineage dominating in terms of its diversity or size. Prevalent haplotypes (defined by RFLP) are similarly scattered across the MST (fig. 2).
In most circumstances it is difficult to determine whether the high frequency of a genetic variant is due to its relatively high fitness or to a founder effect (Smith et al. 2006); both explanations have been posited for “Beijing” and other prevalent lineages of M. tuberculosis (Mokrousov et al. 2005). Comparative studies of M. tuberculosis lineage phenotypes provide some evidence that high prevalence is not likely the result of positive selection. For example, the Beijing lineage has been associated with severe forms of disease and dissemination out of the inoculated organ in animal studies (Tsenova et al. 2005); epidemiological studies have in turn revealed an association between infection with a Beijing strain and extrapulmonary, that is, noncommunicable forms of tuberculosis in humans (Kong et al. 2007). Although pulmonary and extrapulmonary forms of disease are not mutually exclusive, it is difficult to imagine how bacterial strains with a propensity to cause “dead-end” forms of disease would be specifically selected. Two studies failed to discover any association between the Beijing lineage and cavitary (i.e., highly transmissible) pulmonary tuberculosis (Borgdorff et al. 2004; Kong et al. 2007). Our analyses favor a neutral explanation (founder effect) for the high prevalence of some M. tuberculosis variants. Given its status as an obligate pathogen with a limited host range (i.e., humans), as well as the fact that in most (∼90%) cases the pathogen dies with its host, never having been transmitted, bottlenecks would generally be expected to feature in M. tuberculosis population dynamics. However, we cannot exclude the possibility that bottlenecks are a special feature of our study population and are not important in other settings.
Human Social Structures Maintain a High Degree of Bacterial Population Differentiation.
Clear evidence of continental-level population structure has been found in a number of global surveys of M. tuberculosis organisms (Sreevatsan et al. 1997; Baker et al. 2004; Hirsh et al. 2004; Filliol et al. 2006; Gutacker et al. 2006). We found evidence of significant population differentiation (FST > 0.3) at a much finer geographic scale, with the greatest degree of bacterial differentiation (∼30%) between moderately sized host populations (Indian agencies and geographic regions with mean human census size 2,700 and 5,407, respectively). Bacterial differentiation on such a small scale is not likely to be maintained by natural selection—it is improbable that aspects of the bacterial environment vary predictably enough among host communities of this size to promote pathogen subspecialization (TB control policies such as Bacillus Calmette-Guérin vaccination are also uniformly applied to the entire study population). In addition, the human study population is relatively genetically homogeneous (Wang et al. 2007); there is unlikely to be enough differentiation among host communities to allow precise coadaptation of regionally predominant bacterial clones to each human subpopulation. It is more likely that bacterial differentiation is the result of fractured human disease transmission networks with consequent isolation of regional pathogen populations.
Some previously published reports hinted at within-continent subdivision of global TB populations; for example, a TB survey of Zimbabwe and Zambia revealed a predominant strain that is related to, but distinct from, a strain commonly found in Cape Town (Chihota et al. 2007). Mycobacterium tuberculosis is inefficiently transmitted relative to other pathogens (Iseman 2000); spatial spread of pathogens is predicted to increase as a function of transmissibility (Viboud et al. 2006). Some environments, such as prisons, mines, and crowded dwellings promote transmission of tuberculosis to a dramatic degree (Iseman 2000). The occasional spread of M. tuberculosis to these highly favorable environments would allow the expansion of founder types and homogenization of a moderate number of disconnected populations. A similar moderate-level distance effect has been observed with influenza epidemics, where populations connected by workflows demonstrate epidemic synchrony (Viboud et al. 2006).
The Indian residential school system provides an obvious local explanation for distance effects observed within our study population of M. tuberculosis organisms. Children from different communities, but likely from within the same Indian agency or geographic region, lived together in these institutions; conditions within schools were historically highly conducive to tuberculosis transmission (Milloy 1999).
Combined Effects of Bottlenecks and Population Subdivision.
In group 2 bacterial populations, which were subject to a recent bottleneck, Ewens’ summary statistic (the “homozygosity” index) was significantly elevated relative to expected values in a neutral, infinite alleles model; this is often interpreted to indicate positive selection. The E–W test did not detect a deviation from neutral expectations in group 1 populations (no recent bottleneck), implying sensitivity of the test to recent bottlenecks. This was confirmed in simulation experiments modeled on group 2 populations, where progressively more recent bottlenecks resulted in increasing rates of false positive E–W tests. In addition, the effects of bottlenecks and population subdivision appeared to be additive in these experiments. This is not surprising, given that population subdivision is known to decrease overall measures of genetic diversity (Hartl and Clark 2007) and could thus delay recovery of population diversity from a bottleneck. Overall, the data suggest that recent bottlenecks in combination with subdivision may produce haplotype frequency patterns that appear to show the influence of positive selection.
Skew in Frequency Spectrum toward Rare Haplotypes.
Relative to neutral equilibrium expectations, excess numbers of rare haplotypes were observed in both group 1 and group 2 populations. Among standard explanations of this phenomenon—namely, rapid population expansion, selective sweep, and background selection (Fu 1997)—background selection is most compatible with our observations. Mismatch distributions of RFLP haplotypes in group 1 and 2 populations were bimodal, that is, not compatible with rapid expansion (see supplementary text S4 and supplementary fig. S2, Supplementary Material online). Within-region and within subgroup (groups 1a and 1b) distributions of pairwise differences were also bimodal (data not shown), suggesting that the bimodal distribution did not result from pooling populations. Interestingly, a bimodal distribution of pairwise differences among haplotypes was also observed in the Hershberg study of global single nucleotide polymorphism (SNP) data. Bacterial population expansion may be constrained by the architecture of host transmission networks, as well as depletion of highly susceptible host populations by deaths from tuberculosis. Selective sweep is unlikely on the basis of our temporal analysis of population structure, as well as the haplogroup and haplotype networks. Background selection is plausible, given the lifestyle of M. tuberculosis and genetics of transposable elements in this organism. Alternatively, our simulation experiments suggest that population subdivision may play a role in the skew to rare haplotypes. Background selection and population subdivision could also act together to increase the proportion of rare haplotypes.
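A mismatch distribution of the kind referred to above is simply the histogram of pairwise differences among haplotypes. The sketch below assumes, for illustration only, that each RFLP haplotype is encoded as a string of band presence/absence characters of equal length; the six toy haplotypes are hypothetical and form two divergent clusters, which yields a bimodal distribution.

```python
from collections import Counter
from itertools import combinations

def mismatch_distribution(haplotypes):
    """Histogram of pairwise differences between equal-length haplotype strings."""
    diffs = Counter()
    for h1, h2 in combinations(haplotypes, 2):
        diffs[sum(a != b for a, b in zip(h1, h2))] += 1
    return dict(sorted(diffs.items()))

# Hypothetical band patterns (1 = band present): two clusters of three types.
toy = ["110000", "110001", "100000", "001111", "001110", "011111"]
dist = mismatch_distribution(toy)
print(dist)
```

Within-cluster comparisons contribute the mode at small numbers of differences and between-cluster comparisons contribute the second mode; a population that had undergone a single rapid expansion would instead be expected to show one mode.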
Role of Population Subdivision in Skew to Rare Haplotypes.
Skew in the haplotype frequency spectrum is not often ascribed to population subdivision, although subdivision is well known to increase the proportion of private alleles, that is, alleles found uniquely within a subpopulation (Slatkin 1985). Some investigators have described an interaction between sampling strategies, population subdivision, and haplotype frequency skew (Ptak and Przeworski 2002; De and Durrett 2007). It is possible that the skew observed in our real and simulated populations is the result of taking small samples from a relatively large number of subpopulations. There was a statistically significant excess of singletons within most regions, suggesting influences other than population subdivision. Analyses at the community and agency level did not suggest that finer scale divisions were playing a role (data not shown), but sample sizes are small. Subdivision resulted in large numbers of false positive CSFS tests in simulated populations (fig. 3), but levels of significance of these tests were lower than the observed data, suggesting that observed data were more extreme than would be predicted on the basis of demographic conditions alone.
Mycobacterium tuberculosis Genetic Data Are Consistent with Background Selection.
Background selection occurs when neutral sites linked to deleterious mutations are purged from a population (Charlesworth et al. 1993). It is predicted to have the greatest effect on genetic variation under the following circumstances: high (deleterious) mutation rate, low rate of recombination, and asexual reproduction (Charlesworth et al. 1993). Mycobacterium tuberculosis has a “closed genome”: There is no evidence of substantial gene exchange with other organisms or among lineages of M. tuberculosis (Alland et al. 2003). The overall mutation rate of M. tuberculosis is unknown, but the rate of at least one class of mutations—transposition of IS6110—is known to be high (minimum estimate from this study 0.143 mutation/generation, see Materials and Methods). IS6110 transposition events can have deleterious effects on fitness as a result of insertion into a coding or regulatory region, or recombination between elements (McEvoy et al. 2007). A comprehensive mapping study of IS6110 insertion sites in clinical isolates of M. tuberculosis suggested purifying selection, as intergenic sites were overrepresented and sites where element insertion would be predicted to have a deleterious effect on growth, survival, or virulence were underrepresented or absent (Yesilkaya et al. 2005). There was no evidence of transpositional sequence specificity to explain the nonrandom insertion of IS6110 elements. This suggests that deleterious IS6110 mediated mutations occur, possibly frequently, but that these mutants are rapidly purged from the population.
Combined Effects of Bottlenecks, Subdivision, and Background Selection.
The overall effect of background selection is to decrease diversity of neutral genetic markers (Charlesworth et al. 1993). Charlesworth also demonstrated a skew in the frequency spectrum toward rare alleles at neutral sites in small populations with low rates of recombination. The skew to rare frequencies in our RFLP haplotype spectra is consistent with this effect: Novel configurations of IS elements may not be allowed to increase in frequency as a result of physical linkage with deleterious mutations. It should be noted that although the RFLP haplotype (based on number and location of IS6110 elements) is a marker of neutral variation, it may also, as discussed above, directly identify mutations of functional consequence. Some configurations of IS elements may be rare because elements reside in regions where their presence is poorly tolerated. Although Hershberg et al. concluded that purifying selection is unlikely to be a strong influence on the structure of the MTBC, a skew to rare types also appears in their study of SNPs in a global sample of M. tuberculosis (Hershberg et al. 2008).
Several other observations from our study are consistent with background selection, possibly complementing the influences of bottlenecks and divisions within populations. For example, a combination of population bottleneck and background selection would be expected to result in preservation of ancestral/founding types at high frequencies, as many new mutants associated with population reexpansion would be purged by purifying selection. This was seen in the network analysis of spoligotyping data (fig. 1): Founder types of sequential epidemics (groups 1a, 1b, and 2) remain at high frequency, and most types are very closely related to the ancestral type (implying that a small number of mutational events have been tolerated). Background selection also acts synergistically with modest migration rates to increase FST (Charlesworth et al. 1993; Nordborg 1997). High FST values observed among study populations are thus also consistent with purifying selection. Lastly, the leptokurtic RFLP haplotype frequency spectrum of group 2 populations can be seen with background selection (Fay and Wu 2000; Ewens 2004). That there are both neutral and Darwinian influences on population structure may reflect the complexity of within- versus between-host dynamics. Transmission is constrained by the vagaries of human behavior, but the within-host bacterial population is large (∼10⁹ organisms in a 2-cm lung cavity) and subject to a variety of insults from the immune armamentarium.
Summary
We have discovered a rather complex set of influences on the structure of a well-characterized population of M. tuberculosis. The model that best fits the data combines population bottlenecks and fine-scale subdivision, with or without background selection. This complex model is also consistent with the epidemiology, genetics, and global population structure of M. tuberculosis. Biomedical implications of these findings are discussed in more detail in supplementary text S5, Supplementary Material online. The neutral influences we have described here should be incorporated into modeling studies of important strains, in order to improve the predictive value of these models. Future studies should also focus on clarifying the role of purifying/background selection in natural populations of M. tuberculosis. To this end, an accurate estimate of the rate of all types of mutation (not just transposition) in natural populations of M. tuberculosis would help clarify whether all classes of mutation (that is, errors of replication, DNA damage, and the activity of other autonomous elements such as phage) are likely to be under purifying selection.
Supplementary Material
Supplementary text S1–S5 and supplementary figure S2 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
Acknowledgments
We gratefully acknowledge the assistance of Edward Burns (professor emeritus, State University of New York at Binghamton) with the statistical analysis. We thank Michael P. Cummings (University of Maryland) for assistance with the phylogenetic analyses. The Workshop on Molecular Evolution at the Marine Biological Laboratory (Woods Hole, MA) provided many opportunities for thoughtful discussion of study design and results. We thank Tran Van (Stanford University) for technical assistance. C.P. was supported by NIH grant 5K08AI67458-3. M.L. and M.W.F. were supported in part by NIH grant GM28016.
V.H.H. and C.P. collected and archived data and materials. C.P., V.H.H., and M.W.F. designed the study. M.L. performed simulation experiments; W.W. performed RFLP analysis of a subset of isolates; all other analyses were performed by C.P. C.P. drafted the paper. C.P., V.H.H., M.L., G.K.S., and M.W.F. helped interpret results. C.P., V.H.H., M.L., G.K.S., and M.W.F. revised the manuscript.
References
- Alland D, Whittam TS, Murray MB, et al. (11 co-authors) Modeling bacterial evolution with comparative-genome-based marker systems: application to Mycobacterium tuberculosis evolution and pathogenesis. J Bacteriol. 2003;185:3392–3399. doi: 10.1128/JB.185.11.3392-3399.2003. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Baker L, Brown T, Maiden MC, Drobniewski F. Silent nucleotide polymorphisms and a phylogeny for Mycobacterium tuberculosis. Emerg Infect Dis. 2004;10:1568–1577. doi: 10.3201/eid1009.040046. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bhanu NV, van Soolingen D, van Embden JD, Dar L, Pandey RM, Seth P. Predominace of a novel Mycobacterium tuberculosis genotype in the Delhi region of India. Tuberculosis (Edinb) 2002;82:105–112. doi: 10.1054/tube.2002.0332. [DOI] [PubMed] [Google Scholar]
- Borgdorff MW, Van Deutekom H, De Haas PE, Kremer K, Van Soolingen D. Mycobacterium tuberculosis, Beijing genotype strains not associated with radiological presentation of pulmonary tuberculosis. Tuberculosis (Edinb) 2004;84:337–340. doi: 10.1016/j.tube.2003.10.002. [DOI] [PubMed] [Google Scholar]
- Brosch R, Gordon SV, Marmiesse M, et al. (15 co-authors) A new evolutionary scenario for the Mycobacterium tuberculosis complex. Proc Natl Acad Sci U S A. 2002;99:3684–3689. doi: 10.1073/pnas.052548299. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Brudey K, Driscoll JR, Rigouts L, et al. (66 co-authors) Mycobacterium tuberculosis complex genetic diversity: mining the fourth international spoligotyping database (SpolDB4) for classification, population genetics and epidemiology. BMC Microbiol. 2006;6:23. doi: 10.1186/1471-2180-6-23. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bush RM, Fitch WM, Bender CA, Cox NJ. Positive selection on the H3 hemagglutinin gene of human influenza virus A. Mol Biol Evol. 1999;16:1457–1465. doi: 10.1093/oxfordjournals.molbev.a026057. [DOI] [PubMed] [Google Scholar]
- Canadian Plains Research Center. The encyclopedia of Saskatchewan. Regina, Saskatchewan: Canadian Plains Research Center, University of Regina; cited 2009 December 7. Available from http://www.esask.ca/. [Google Scholar]
- Charlesworth B, Morgan MT, Charlesworth D. The effect of deleterious mutations on neutral molecular variation. Genetics. 1993;134:1289–1303. doi: 10.1093/genetics/134.4.1289. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chihota V, Apers L, Mungofa S, et al. (13 co-authors) Predominance of a single genotype of Mycobacterium tuberculosis in regions of Southern Africa. Int J Tuberc Lung Dis. 2007;11:311–318. [PubMed] [Google Scholar]
- De A, Durrett R. Stepping-stone spatial structure causes slow decay of linkage disequilibrium and shifts the site frequency spectrum. Genetics. 2007;176:969–981. doi: 10.1534/genetics.107.071464. [DOI] [PMC free article] [PubMed] [Google Scholar]
- de Jong BC, Hill PC, Aiken A, et al. (15 co-authors) Progression to active tuberculosis, but not transmission, varies by Mycobacterium tuberculosis lineage in The Gambia. J Infect Dis. 2008;198:1037–1043. doi: 10.1086/591504. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Drobniewski F, Balabanova Y, Nikolayevsky V, Ruddy M, Kuznetzov S, Zakharova S, Melentyev A, Fedorin I. Drug-resistant tuberculosis, clinical virulence, and the dominance of the Beijing strain family in Russia. JAMA. 2005;293:2726–2731. doi: 10.1001/jama.293.22.2726. [DOI] [PubMed] [Google Scholar]
- Drobniewski F, Balabanova Y, Ruddy M, et al. (12 co-authors) Rifampin- and multidrug-resistant tuberculosis in Russian civilians and prison inmates: dominance of the beijing strain family. Emerg Infect Dis. 2002;8:1320–1326. doi: 10.3201/eid0811.020507. [DOI] [PMC free article] [PubMed] [Google Scholar]
- European Concerted Action on New Generation Genetic Markers and Techniques for the Epidemiology and Control of Tuberculosis. Beijing/W genotype Mycobacterium tuberculosis and drug resistance. Emerg Infect Dis. 2006;12:736–743. doi: 10.3201/eid1205.050400. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ewens WJ. The sampling theory of selectively neutral alleles. Theor Popul Biol. 1972;3:87–112. doi: 10.1016/0040-5809(72)90035-4. [DOI] [PubMed] [Google Scholar]
- Ewens WJ. Testing for increased mutation rate for neutral alleles. Theor Popul Biol. 1973;4:251–258. doi: 10.1016/0040-5809(73)90010-5. [DOI] [PubMed] [Google Scholar]
- Ewens WJ. Mathematical population genetics. New York: Springer; 2004. [Google Scholar]
- Excoffier L. Arlequin ver. 3.0: an integrated software package for population genetic data analysis. Evol Bioinform Online. 2005;1:47–50. [PMC free article] [PubMed] [Google Scholar]
- Fay JC, Wu CI. Hitchhiking under positive Darwinian selection. Genetics. 2000;155:1405–1413. doi: 10.1093/genetics/155.3.1405. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Filliol I, Motiwala AS, Cavatore M, et al. (25 co-authors) Global phylogeny of Mycobacterium tuberculosis based on single nucleotide polymorphism (SNP) analysis: insights into tuberculosis evolution, phylogenetic accuracy of other DNA fingerprinting systems, and recommendations for a minimal standard SNP set. J Bacteriol. 2006;188:759–772. doi: 10.1128/JB.188.2.759-772.2006. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Flores L, Van T, Narayanan S, DeRiemer K, Kato-Maeda M, Gagneux S. Large sequence polymorphisms classify Mycobacterium tuberculosis strains with ancestral spoligotyping patterns. J Clin Microbiol. 2007;45:3393–3395. doi: 10.1128/JCM.00828-07. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fraser C, Hanage WP, Spratt BG. Neutral microepidemic evolution of bacterial pathogens. Proc Natl Acad Sci U S A. 2005;102:1968–1973. doi: 10.1073/pnas.0406993102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fu YX. Statistical tests of neutrality of mutations against population growth, hitchhiking and background selection. Genetics. 1997;147:915–925. doi: 10.1093/genetics/147.2.915. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gagneux S, DeRiemer K, Van T, et al. (13 co-authors) Variable host–pathogen compatibility in Mycobacterium tuberculosis. Proc Natl Acad Sci U S A. 2006;103:2869–2873. doi: 10.1073/pnas.0511240103. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gagneux S, Small PM. Global phylogeography of Mycobacterium tuberculosis and implications for tuberculosis product development. Lancet Infect Dis. 2007;7:328–337. doi: 10.1016/S1473-3099(07)70108-1. [DOI] [PubMed] [Google Scholar]
- Glynn JR, Whiteley J, Bifani PJ, Kremer K, van Soolingen D. Worldwide occurrence of Beijing/W strains of Mycobacterium tuberculosis: a systematic review. Emerg Infect Dis. 2002;8:843–849. doi: 10.3201/eid0808.020002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gutacker MM, Mathema B, Soini H, Shashkina E, Kreiswirth BN, Graviss EA, Musser JM. Single-nucleotide polymorphism-based population genetic analysis of Mycobacterium tuberculosis strains from 4 geographic sites. J Infect Dis. 2006;193:121–128. doi: 10.1086/498574. [DOI] [PubMed] [Google Scholar]
- Hartl DL, Clark AG. Principles of population genetics. Sunderland (MA): Sinauer Associates; 2007. [Google Scholar]
- Hershberg R, Lipatov M, Small PM, et al. (11 co-authors) High functional diversity in Mycobacterium tuberculosis driven by genetic drift and human demography. PLoS Biol. 2008;6:e311. doi: 10.1371/journal.pbio.0060311. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hirsh AE, Tsolaki AG, DeRiemer K, Feldman MW, Small PM. Stable association between strains of Mycobacterium tuberculosis and their human host populations. Proc Natl Acad Sci U S A. 2004;101:4871–4876. doi: 10.1073/pnas.0305627101. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hudson RR. Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics. 2002;18:337–338. doi: 10.1093/bioinformatics/18.2.337. [DOI] [PubMed] [Google Scholar]
- Iseman MD. A clinician's guide to tuberculosis. Philadelphia (PA): Lippincott Williams & Wilkins; 2000. [Google Scholar]
- Kamerbeek J, Schouls L, Kolk A, et al. (11 co-authors) Simultaneous detection and strain differentiation of Mycobacterium tuberculosis for diagnosis and epidemiology. J Clin Microbiol. 1997;35:907–914. doi: 10.1128/jcm.35.4.907-914.1997. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kong Y, Cave MD, Zhang L, Foxman B, Marrs CF, Bates JH, Yang ZH. Association between Mycobacterium tuberculosis Beijing/W lineage strain infection and extrathoracic tuberculosis: insights from epidemiologic and clinical characterization of the three principal genetic groups of M. tuberculosis clinical isolates. J Clin Microbiol. 2007;45:409–414. doi: 10.1128/JCM.01459-06. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lazzarini LC, Huard RC, Boechat NL, et al. (16 co-authors) Discovery of a novel Mycobacterium tuberculosis lineage that is a major cause of tuberculosis in Rio de Janeiro, Brazil. J Clin Microbiol. 2007;45:3891–3902. doi: 10.1128/JCM.01394-07. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Linz B, Balloux F, Moodley Y, et al. (16 co-authors) An African origin for the intimate association between humans and Helicobacter pylori. Nature. 2007;445:915–918. doi: 10.1038/nature05562. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lux M. Medicine that walks: disease, medicine, and Canadian plains native people. Toronto (Canada): University of Toronto Press; 2001. pp. 1880–1940. [Google Scholar]
- Magurran AE. Measuring biological diversity. Oxford: Wiley-Blackwell; 2003. [Google Scholar]
- McEvoy CR, Falmer AA, Gey van Pittius NC, Victor TC, van Helden PD, Warren RM. The role of IS6110 in the evolution of Mycobacterium tuberculosis. Tuberculosis (Edinb) 2007;87:393–404. doi: 10.1016/j.tube.2007.05.010. [DOI] [PubMed] [Google Scholar]
- Milloy JS. A national crime: the Canadian Government and the residential school system. Winnipeg (Canada): University of Manitoba Press; 1999. [Google Scholar]
- Mokrousov I, Ly HM, Otten T, Lan NN, Vyshnevskyi B, Hoffner S, Narvskaya O. Origin and primary dispersal of the Mycobacterium tuberculosis Beijing genotype: clues from human phylogeography. Genome Res. 2005;15:1357–1364. doi: 10.1101/gr.3840605. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Moodley Y, Linz B, Yamaoka Y, et al. (15 co-authors) The peopling of the Pacific from a bacterial perspective. Science. 2009;323:527–530. doi: 10.1126/science.1166083. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Murray PR, Baron EJ, Jorgensen JH, Pfaller MA, Yolken RH. Manual of clinical microbiology. Washington (DC): ASM Press; 2003. [Google Scholar]
- Namouchi A, Karboul A, Mhenni B, Khabouchi N, Haltiti R, Ben Hassine R, Louzir B, Chabbou A, Mardassi H. Genetic profiling of Mycobacterium tuberculosis in Tunisia: predominance and evidence for the establishment of a few genotypes. J Med Microbiol. 2008;57:864–872. doi: 10.1099/jmm.0.47483-0. [DOI] [PubMed] [Google Scholar]
- Natural Resources C. The atlas of Canada. Ottawa, ON: Natural Resources Canada; cited 2009 Dec 7. Available from: http://atlas.nrcan.gc.ca/site/english/index.html. [Google Scholar]
- Nordborg M. Structured coalescent processes on different time scales. Genetics. 1997;146:1501–1514. doi: 10.1093/genetics/146.4.1501. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ptak SE, Przeworski M. Evidence for population growth in humans is confounded by fine-scale population structure. Trends Genet. 2002;18:559–563. doi: 10.1016/s0168-9525(02)02781-6. [DOI] [PubMed] [Google Scholar]
- Ramachandran S, Deshpande O, Roseman CC, Rosenberg NA, Feldman MW, Cavalli-Sforza LL. Support from the relationship of genetic and geographic distance in human populations for a serial founder effect originating in Africa. Proc Natl Acad Sci U S A. 2005;102:15942–15947. doi: 10.1073/pnas.0507611102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Reed MB, Gagneux S, Deriemer K, Small PM, Barry CE 3rd. The W-Beijing lineage of Mycobacterium tuberculosis overproduces triglycerides and has the DosR dormancy regulon constitutively upregulated. J Bacteriol. 2007;189:2583–2589. doi: 10.1128/JB.01670-06. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, Zhivotovsky LA, Feldman MW. Genetic structure of human populations. Science. 2002;298:2381–2385. doi: 10.1126/science.1078311. [DOI] [PubMed] [Google Scholar]
- Roumagnac P, Weill FX, Dolecek C, et al. (11 co-authors) Evolutionary history of Salmonella typhi. Science. 2006;314:1301–1304. doi: 10.1126/science.1134933. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Saskatchewan Health HISC. Regina, Saskatchewan: Covered population 2004. 2004. [Google Scholar]
- Schneider S, Roessli D, Excoffier L. Arlequin ver 2: a software for population genetics data analysis. Geneva, Switzerland: Genetics and Biometry Laboratory, University of Geneva; 2000. [Google Scholar]
- Shankarappa R, Margolick JB, Gange SJ, et al. (12 co-authors) Consistent viral evolutionary changes associated with the progression of human immunodeficiency virus type 1 infection. J Virol. 1999;73:10489–10502. doi: 10.1128/jvi.73.12.10489-10502.1999. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sharma MK, Al-Azem A, Wolfe J, Hershfield E, Kabani A. Identification of a predominant isolate of Mycobacterium tuberculosis using molecular and clinical epidemiology tools and in vitro cytokine responses. BMC Infect Dis. 2003;3:3. doi: 10.1186/1471-2334-3-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Simpson EH. Measurement of diversity. Nature. 1949;163:688. [Google Scholar]
- Sinnott RW. Virtues of the Haversine. Sky Telescope. 1984;68:158. [Google Scholar]
- Slatkin M. Rare alleles as indicators of gene flow. Evolution. 1985;39:53–65. doi: 10.1111/j.1558-5646.1985.tb04079.x. [DOI] [PubMed] [Google Scholar]
- Smith NH, Gordon SV, de la Rua-Domenech R, Clifton-Hadley RS, Hewinson RG. Bottlenecks and broomsticks: the molecular evolution of Mycobacterium bovis. Nat Rev Microbiol. 2006;4:670–681. doi: 10.1038/nrmicro1472. [DOI] [PubMed] [Google Scholar]
- Sreevatsan S, Pan X, Stockbauer KE, Connell ND, Kreiswirth BN, Whittam TS, Musser JM. Restricted structural gene polymorphism in the Mycobacterium tuberculosis complex indicates evolutionarily recent global dissemination. Proc Natl Acad Sci U S A. 1997;94:9869–9874. doi: 10.1073/pnas.94.18.9869. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tsenova L, Ellison E, Harbacheuski R, Moreira AL, Kurepina N, Reed MB, Mathema B, Barry CE 3rd, Kaplan G. Virulence of selected Mycobacterium tuberculosis clinical isolates in the rabbit model of meningitis is dependent on phenolic glycolipid produced by the bacilli. J Infect Dis. 2005;192:98–106. doi: 10.1086/430614. [DOI] [PubMed] [Google Scholar]
- Tsolaki AG, Hirsh AE, DeRiemer K, Enciso JA, Wong MZ, Hannan M, Goguet de la Salmoniere YO, Aman K, Kato-Maeda M, Small PM. Functional and evolutionary genomics of Mycobacterium tuberculosis: insights from genomic deletions in 100 strains. Proc Natl Acad Sci U S A. 2004;101:4865–4870. doi: 10.1073/pnas.0305634101. [DOI] [PMC free article] [PubMed] [Google Scholar]
- van Embden JD, Cave MD, Crawford JT, Dale JW, Eisenach KD, Gicquel B, Hermans P, Martin C, McAdam R, Shinnick TM. Strain identification of Mycobacterium tuberculosis by DNA fingerprinting: recommendations for a standardized methodology. J Clin Microbiol. 1993;31:406–409. doi: 10.1128/jcm.31.2.406-409.1993. [DOI] [PMC free article] [PubMed] [Google Scholar]
- van Soolingen D, Qian L, de Haas PE, Douglas JT, Traore H, Portaels F, Qing HZ, Enkhsaikan D, Nymadawa P, van Embden JD. Predominance of a single genotype of Mycobacterium tuberculosis in countries of east Asia. J Clin Microbiol. 1995;33:3234–3238. doi: 10.1128/jcm.33.12.3234-3238.1995. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Viboud C, Bjornstad ON, Smith DL, Simonsen L, Miller MA, Grenfell BT. Synchrony, waves, and spatial hierarchies in the spread of influenza. Science. 2006;312:447–451. doi: 10.1126/science.1125237. [DOI] [PubMed] [Google Scholar]
- Victor TC, de Haas PE, Jordaan AM, van der Spuy GD, Richardson M, van Soolingen D, van Helden PD, Warren R. Molecular characteristics and global spread of Mycobacterium tuberculosis with a western cape F11 genotype. J Clin Microbiol. 2004;42:769–772. doi: 10.1128/JCM.42.2.769-772.2004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wang S, Lewis CM, Jakobsson M, et al. (27 co-authors) Genetic variation and population structure in native Americans. PLoS Genet. 2007;3:e185. doi: 10.1371/journal.pgen.0030185. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Warren RM, Streicher EM, Sampson SL, van der Spuy GD, Richardson M, Nguyen D, Behr MA, Victor TC, van Helden PD. Microevolution of the direct repeat region of Mycobacterium tuberculosis: implications for interpretation of spoligotyping data. J Clin Microbiol. 2002;40:4457–4465. doi: 10.1128/JCM.40.12.4457-4465.2002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Watterson GA. The homozygosity test of neutrality. Genetics. 1978;88:405–417. doi: 10.1093/genetics/88.2.405. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Whittam TS, Clark AG, Stoneking M, Cann RL, Wilson AC. Allelic variation in human mitochondrial genes based on patterns of restriction site polymorphism. Proc Natl Acad Sci U S A. 1986;83:9611–9615. doi: 10.1073/pnas.83.24.9611. [DOI] [PMC free article] [PubMed] [Google Scholar]
- WHO. Global tuberculosis control—surveillance, planning, financing. Geneva, Switzerland: WHO; 2008. [Google Scholar]
- Wirth T, Hildebrand F, Allix-Beguec C, et al. (13 co-authors) Origin, spread and demography of the Mycobacterium tuberculosis complex. PLoS Pathog. 2008;4:e1000160. doi: 10.1371/journal.ppat.1000160. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yesilkaya H, Dale JW, Strachan NJ, Forbes KJ. Natural transposon mutagenesis of clinical isolates of Mycobacterium tuberculosis: how many genes does a pathogen need? J Bacteriol. 2005;187:6726–6732. doi: 10.1128/JB.187.19.6726-6732.2005. [DOI] [PMC free article] [PubMed] [Google Scholar]