Journal of Clinical Microbiology. 2001 Sep;39(9):3179–3185. doi: 10.1128/JCM.39.9.3179-3185.2001

Identification and Characterization of Variable-Number Tandem Repeats in the Yersinia pestis Genome

Alexandra M Klevytska 1, Lance B Price 1, James M Schupp 1, Patricia L Worsham 2, Jane Wong 3, Paul Keim 1,*
PMCID: PMC88315  PMID: 11526147

Abstract

Yersinia pestis, the infamous plague-causing pathogen, appears to have emerged in relatively recent history. Evidence of this fact comes from several studies that document a lack of nucleotide diversity in the Y. pestis genome. In contrast, we report that variable-number tandem repeat (VNTR) sequences are common in the Y. pestis genome and occur frequently in gene coding regions. Larger tandem repeat arrays, most useful for phylogenetic analysis, are present at an average of 2.18 arrays per 10 kbp and are distributed evenly throughout the genome and the two virulence plasmids, pCD1 and pMT1. We examined allelic diversity at 42 chromosomal VNTR loci in 24 selected isolates (12 globally distributed and 12 from Siskiyou County, Calif.). Vast differences in diversity were observed among the 42 VNTR loci, ranging from 2 to 11 alleles. We found that the maximum copy number of repeats in an array was highly correlated with diversity (R = 0.86). VNTR-based phylogenetic analysis of the 24 strains successfully grouped isolates from biovar orientalis and most of the antiqua and mediaevalis strains. Hence, multiple-locus VNTR analysis (MLVA) appears capable of both distinguishing closely related strains and successfully classifying more distant relationships. Harnessing the power of MLVA to establish standardized databases will enable researchers to better understand plague ecology and evolution around the world.


The etiologic agent of plague, Yersinia pestis, is a gram-negative bacillus and a member of the family Enterobacteriaceae (4). Hundreds of millions of people have died in three major plague pandemics. The first pandemic, known as Justinian's plague, occurred during the 6th century, striking populations in Africa and the Mediterranean countries (4). The well-known Black Death plague began in the 14th century and by the 17th century had killed a fourth of the European population (4). The third pandemic began in China in the late 1800s and was quickly dispersed by boat and rail around the world (16). The third pandemic persists today, primarily in rodent reservoirs, and rarely causes human deaths. Three biovars, antiqua, mediaevalis, and orientalis, have been established for classification of Y. pestis based on glycerol fermentation and nitrate reduction. However, until recently, strain differentiation within these biovars has been limited.

Recent technological advances in molecular biology have facilitated strain discrimination among pathogenic bacteria. Multilocus sequence typing has demonstrated extremely low diversity in genes of such recently emerged pathogens as Y. pestis (1, 2, 5) and the gram-positive bacterium Bacillus anthracis (18). Molecular techniques based on restriction enzyme digestion patterns have recently been applied to Y. pestis strain differentiation. The hypothesis that the first, second, and third pandemics were caused by progenitor strains of the antiqua, mediaevalis, and orientalis biovars, respectively, has been substantiated by strain typing using rRNA restriction patterns (ribotyping) (6) and the IS100 insertion element restriction fragment length polymorphism (RFLP) (1). Pulsed-field gel electrophoresis detects large restriction fragment size differences and has been applied to Y. pestis strain typing, but it has resulted in mixed findings (6, 12). Of these methods, the IS100 insertion element RFLP analysis appears to have the highest resolution, perhaps due to the higher mutation rates associated with IS elements (1).

It has been known since the late 1960s that eukaryotic genomes contain large numbers of repeated DNA sequences (3). However, not until two decades later were these hypervariable minisatellite regions used to detect DNA fingerprints in the human genome (8). Because of repetitive sequence length variation observed among individuals, these tandem repeat arrays were renamed variable number of tandem repeat (VNTR) loci by Nakamura et al. (14). The ongoing sequencing efforts for a number of archaeal and bacterial genomes have led to the discovery and application of VNTRs to DNA fingerprinting of prokaryotic species as well. For example, high-diversity VNTR loci have been used to genotype strains of Neisseria gonorrhoeae (17), Helicobacter pylori (13), Haemophilus influenzae (23), B. anthracis (9), and, in a limited case, Y. pestis (2).

In the work presented here, we report an analysis of the DNA sequences of the Y. pestis chromosome and two plasmids, pMT1 and pCD1, carried out to identify and characterize VNTR loci. We further analyzed the relationship between sequence structure and diversity using 42 VNTR loci. We report the effects of sequence structure on VNTR polymorphism and the phylogenetic potential of VNTR markers for Y. pestis strain discrimination.

MATERIALS AND METHODS

Nomenclature.

Any DNA sequence repeated side by side is referred to as a direct repeat or tandem repeat array. The simplest sequence motif of a direct repeat array has a repeat length, measured in base pairs. The number of times that this simple sequence motif is present in the array is referred to as the copy number. The repeat length multiplied by the copy number is the array size, also measured in base pairs. For example, the tandem repeat array NATATATN, where N is any DNA, contains the dinucleotide motif AT with a repeat length of 2 bp, a copy number of 3, and an array size of 6 bp. By convention, we state the repeat length before the copy number. For instance, the array in the above example is a 2 by 3.
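The nomenclature above can be restated as a small data structure. The following Python sketch is illustrative only; the class name `TandemArray` and its fields are assumptions introduced here, not part of the original methods.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TandemArray:
    """Hypothetical container for one tandem repeat array."""
    repeat_length: int   # length of the repeated motif, in bp
    copy_number: float   # number of motif copies (partial copies such as 6.5 are allowed)

    @property
    def array_size(self) -> float:
        # Array size in bp is the repeat length multiplied by the copy number.
        return self.repeat_length * self.copy_number

    def label(self) -> str:
        # Convention used in the text: repeat length first, then copy number.
        return f"{self.repeat_length} by {self.copy_number:g}"


# The NATATATN example: motif AT, repeat length 2 bp, copy number 3.
at_array = TandemArray(repeat_length=2, copy_number=3)
print(at_array.label(), at_array.array_size)
```

For the AT example this yields the label "2 by 3" and an array size of 6 bp.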

A direct repeat array may contain either a mononucleotide repeat motif, also called a homopolymeric motif, such as a poly(T) tract, or a heteropolymeric motif, which is a dinucleotide or greater repeat motif of mixed nucleotide composition. A tandem repeat array with a short repeat length is frequently referred to as a simple sequence repeat (SSR) or short tandem repeat. van Belkum et al. (22) have previously defined SSRs as repeat arrays with repeat lengths of 2 to 6 bp. For the sake of consistency with one of the direct repeat search programs used, here we refer to SSRs as arrays with repeat lengths of 1 to 10 nucleotides. A VNTR maintains a repeat length but differs among strains in copy number. PCR amplification of a VNTR locus from different strains will reveal PCR fragment length polymorphisms that reflect the increased or decreased copy number as a result of insertion or deletion events within the array.
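Because a VNTR keeps its repeat length and changes only in copy number, the copy number of an allele can in principle be inferred from the PCR fragment size. The sketch below assumes that the total length of non-repeat sequence inside the amplicon (primers plus flanking DNA) is known from a reference sequence; the function name and the example numbers are hypothetical.

```python
def copy_number_from_amplicon(amplicon_bp, flank_bp, repeat_length):
    """Estimate the copy number of a VNTR allele from its PCR fragment size.

    amplicon_bp   -- observed fragment length in bp
    flank_bp      -- total non-repeat sequence in the amplicon in bp (assumed known)
    repeat_length -- length of the repeat motif in bp
    """
    return (amplicon_bp - flank_bp) / repeat_length


# Hypothetical locus: 150 bp of flanking sequence and a 7-bp repeat.
print(copy_number_from_amplicon(234, 150, 7))  # 84 bp of repeats / 7 bp per copy
```

A fractional result corresponds to a partial repeat copy or to sizing error in the fragment analysis.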

Detection of tandemly repeated sequences in the Y. pestis genome.

In these studies, we have used and compared two approaches for identifying direct repeat arrays. One approach employed the Genequest software program (Dnastar package; LaserGene, Inc., Madison, Wis.) direct repeat function set for the smallest scanning window size possible (8 nucleotides) to locate direct repeat arrays. This setting permits detection of arrays as small as 1 by 9, 2 by 5, 3 by 3, 4 by 3, etc.; however, it will not detect arrays of 8 bases or less. Genequest will also readily detect open reading frames (ORFs) and was used in conjunction with the direct repeat searches to determine their presence in and out of probable coding regions. The second approach used software designed by Gur-Arie et al. (7) to detect SSRs with a repeat length between 1 and 10 bp (ftp://ftp.technion.ac.il/pub/supported/biotech/ssr.exe). The SSR program generates an output file that contains the repeat length in base pairs, the nucleotide composition of the motif, the copy number, and the location of the array.
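To make the search criteria concrete, the following Python sketch scans a sequence for perfect tandem repeats. It is not the Genequest algorithm or the SSR program of Gur-Arie et al.; it is a simplified illustration that reports only perfect repeats, and the function name and default thresholds (motifs of 1 to 10 bp, arrays of at least 9 bp) are assumptions chosen to mirror the limits described above.

```python
def find_tandem_repeats(seq, max_repeat_length=10, min_array_size=9):
    """Return (start, repeat_length, copy_number, motif) for perfect tandem arrays.

    Simplified sketch: scans left to right, keeps the largest array that
    starts at each position, and prefers the shortest motif on ties.
    """
    seq = seq.upper()
    hits = []
    i = 0
    while i < len(seq):
        best = None
        for k in range(1, max_repeat_length + 1):
            motif = seq[i:i + k]
            if len(motif) < k:
                break  # ran off the end of the sequence
            copies = 1
            # Count how many times the motif repeats back to back.
            while seq[i + copies * k:i + (copies + 1) * k] == motif:
                copies += 1
            size = copies * k
            if copies >= 2 and size >= min_array_size:
                if best is None or size > best[1] * best[2]:
                    best = (i, k, copies, motif)
        if best is not None:
            hits.append(best)
            i += best[1] * best[2]  # skip past the reported array
        else:
            i += 1
    return hits


# Toy sequence: a 2 by 5 array of AT embedded in unrelated bases.
print(find_tandem_repeats("GGC" + "AT" * 5 + "CCG"))
```

With the default `min_array_size=9`, a 2 by 5 or 3 by 3 array is reported while a 2 by 4 array (8 bp) is not, matching the detection limit described for the smallest scanning window.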

The unfinished Y. pestis strain Colorado 92 (CO92) chromosome sequence was downloaded from the Sanger Center Microbial Genomes web page (http://sanger.ac.uk/Projects/Y_pestis/) on 21 April 2000 in 105 contigs. The complete sequences for the Y. pestis plasmids pCD1 and pMT1 were downloaded from the National Center for Biotechnology Information (NCBI) web site (GenBank accession numbers NC_001972 for pCD1 and NC_001976 for pMT1). Plasmid sequences were screened for direct repeats in Genequest according to the parameters described above.

Clustering of arrays in the chromosome as detected with Genequest was examined in 10-, 25-, and 50-kbp intervals and was compared to an expected Poisson distribution [$P(x) = e^{-\mu}\mu^{x}/x!$, $\mu = 2.14$ arrays per 10-kbp interval] as previously described (25).
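As an illustration of the Poisson expectation, the short Python sketch below evaluates P(x) for the stated mean of 2.14 arrays per 10-kbp interval. The function name `poisson_pmf` is introduced here for illustration; only the formula and the value of μ come from the text.

```python
import math


def poisson_pmf(x, mu):
    """Probability of observing exactly x arrays in an interval with mean mu."""
    return math.exp(-mu) * mu ** x / math.factorial(x)


MU = 2.14  # mean arrays per 10-kbp interval, as stated in the text

# Expected fraction of 10-kbp intervals containing 0, 1, ..., 6 arrays.
for x in range(7):
    print(x, round(poisson_pmf(x, MU), 4))
```

Multiplying each probability by the number of intervals examined gives the expected interval counts against which the observed distribution is compared.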

PCR screening of direct repeats for variability.

All reagents were obtained from Life Technologies, unless otherwise noted. A diverse collection of 77 tandem repeat arrays identified with Genequest was screened for locus variability. When available, three representatives were screened for each repeat length. These loci were amplified from 12 Y. pestis isolates, representing all biovar types (Table 1). Variability was detected in 42 of the screened loci (Table 2). Thirty-five loci were dropped from further analyses due to lack of variability (27 loci), poor amplification (5 loci), or no amplification (3 loci). Six of the 27 monomorphic loci showed significant homology to ORFs identified in members of the Enterobacteriaceae when BLAST searched on the NCBI database (http://www.ncbi.nlm.nih.gov/BLAST/). Another five, detected in an earlier incomplete Y. pestis chromosome sequence assembly, were no longer present in multiple copies in the 21 April 2000 sequence, suggesting they were artifacts of an inaccurate assembly. Sequences from the 42 VNTR loci were checked against the 21 April 2000 Y. pestis genome sequence to ensure that the same VNTR locus was not duplicated in our analyses. The completed Y. pestis sequence (February 2001) was revisited to establish final genomic coordinates (Table 2).

TABLE 1.

Y. pestis strains

Strain(a)         Geographical origin             Biovar(b)
Angola            Angola                          A
Antiqua           Democratic Republic of Congo    A
Pestoides F       Former Soviet Union             A
Harbin 35         Manchuria                       M
Kim 10 Variant    Kurdistan                       M
Nicholisk 41      Manchuria                       M
Pestoides A       Former Soviet Union             M
Pestoides Ba      Former Soviet Union             M
195/P             India                           O
Java 9            Indonesia                       O
La Paz            Bolivia                         O
CO92              Colorado                        O
83A-5257          California                      O
85A-4160          California                      O
86A-3503          California                      O
86A-3654          California                      O
89A-7521          California                      O
89A-7544          California                      O
89A-7545          California                      O
90A-415           California                      O
90A-598           California                      O
90A-4021          California                      O
90A-6072          California                      O
90A-7215          California                      O

(a) Strains from the U.S. Army Medical Research Institutes for Infectious Diseases collection are indicated with asterisks. Strains from the California Department of Health are indicated with daggers.

(b) A, antiqua; M, mediaevalis; O, orientalis.

TABLE 2.

Y. pestis VNTR loci and primers used for their amplification (a)

Locus  CO92 array  No. of alleles  DI(b)  Forward primer (5′ to 3′)  Reverse primer (5′ to 3′)
M02 1 by 11 5 0.76 GCCTTGGCGCTGACTCCATTGTGC Hex-GGCCTATTTATCTTAACCACGACTGAACCTC
M06 2 by 6 2 0.17 Fam-GATAGATCTCCGAAGGCAGATCGCAATAGGTC GGGCGATAGGATAGCTTGATGCGTTGTTTTAC
M09 3 by 6 4 0.51 Fam-GACCTCGATCTGCTTAGAACCTTTGTAGCTGTTGC GTTGCATTTGTTGGCTAACTGCTGACTGAGTTC
M12(c) 4 by 10 7 0.82 GAAGCGGCAACAATTTACCGTTATTTATGCT Ned-TTTATTCGCCTCCCCTTCGAACTTGAA
M15 5 by 2 4 0.57 GTCACCTCTCAGGCGGGAATCATCTCTC GCATAACGTCTTCAGTGCGTTGTGGC
M18 6 by 6 9 0.82 GGGGTGTTAATTGTGAGGCGTGTTGTC Hex-CCCTACCCGCCACTCTCTTGGTAGC
M21 7 by 5 3 0.49 Ned-GATTTATGAATGGCTACAACGTCGTCGCA GTAGTGATACAGGCAAATCCAAGAGCGCA
M22 7 by 12 9 0.82 Fam-GCGTGATACCAAAGGCTGGCTCACC GGCACTTTGGGTACGGAACGTCATCAC
M23 7 by 6.5 9 0.83 Hex-GTTAAAACTTAATTAACCAACTTAAGAGTCGCCATATC GTTATCAGATTTCGCTTGAAGTAGGTTTAACGATGAC
M25 7 by 20 10 0.83 Hex-GTTTAGCTGTAAATAGATTTAGAAGCCTCGTCTTTTGAC GATATAAATGAGTTGATTCAGGTGTTCATATTTAACGAAAC
M26 8 by 2 2 0.49 GCTATTTTTTGCGGTTAGTCACATTTGATATTTG GTCCCTTTCCTCACTGGTTCGACTTGTAAG
M27 8 by 6 10 0.84 Fam-GTCTAACTGGCGCGGCATTCTTGC GGGTGTTCTTATGTCATCCGCCAACAAAC
M28 8 by 7 7 0.75 GTTTGGCGGTTGGGCGTACCTTGGTA Ned-AGCGCCCGTAGACGCTTTCGAAATAGC
M29 8 by 8 7 0.83 Ned-GAGCGGCGGGTTCTCATGCTGAT GTTTAAGCAGTAGATCTAAAGCGTTATGAATATTGGTGTTA
M31 8 by 8 7 0.78 Fam-GGTTTGCAGGTTTTTGTTGTGGATTATGGACTTAGAT GGCGGGATGGCGTATCGGTTGC
M33 9 by 22 7 0.76 Hex-AGCAACCTGTGCCGCCTCGATATAAG GAGACGGGCGAAATTGAAGCACAGTTAT
M34 9 by 6 11 0.81 Fam-GAATCGCGGGTTGACGCTGTTGAGC GCTGAACAGCCCCATAAAACCGGAGC
M36 10 by 1 2 0.38 GTTAGACAAAACGTTTCTCGATGATTTGTAAGC GACAAACAATAAAAATTCACGATTTATACCCGTC
M37 10 by 6 5 0.72 GCCACAGGAAGAGGACATTTCAGAGAAAAC GTTGCTAAAACGATACCGCTACGATCAGC
M41 12 by 2 3 0.15 GCAGGGGACACCGAGCAGATTTATGC GCTTTCGCTTTAGGCCTGACCTGTTCTGC
M42 12 by 3 2 0.18 GCCGGTAGAGGCGTTGTCTTTGGTTTTTTC GTTTTGGGGTTCAGTGCACGCTTGTGAC
M43 12 by 5 4 0.69 GAGTGCGCGACGGTATGGTGC GCCGCGCATTTATTGATGGTGTC
M49 14 by 3 2 0.44 GTAATACTTACGCCTTGGCAGCAGTGTTCACGAC GTGGGGTGTTCTACGGTGGATTGTTTTTAGGC
M51 15 by 3 2 0.49 GCAACCCGCTGAAGTTGTAAAAACCGAC GCGTTGATCTTCGCGGCCTTCAC
M52 15 by 4 2 0.49 GTGGCCTAACCCGTTTTACCGGTGTAGC GCGGTTTTGTCAATCACGAATCAGGACTC
M54 16 by 3 3 0.57 GTATGCTTAGCGCCAGTGATAACGAGTC GATCGCGTCATCGGGGTTTGTC
M55 16 by 2 3 0.40 GTCATGGGTGATGCTGTTGCTCTCATTTTATAGTTGTAGTGA GCCTTAATGGTTGAATGCGCGAATGAGTCAGATAAC
M56 17 by 2 3 0.40 GTGCCAGTGTTTCGAGCATAGCCAATGAAATAC GTACCGCAGCCCAGACTCCTTACTGGAAAC
M58 17 by 7.5 6 0.78 Hex-GCGATAACCCACATTATCACAATAACCAACAC GCTGATGGAACCGGTATGCTGAATTTGC
M59 17 by 8 4 0.68 Ned-GCTTAGCCGCCAGAAAAGGTGAGTTGGC GATAATGGCGGTAGCCGGAATCTGATAATCATC
M61 18 by 5 3 0.60 GCGCCACAATTAGGGCAACTGC GCCGCTTTAATGGTTTGTGAAATGAC
M65 19 by 2.5 3 0.29 GTTGTATGTGCGTTGGTTAGGGAAGGC GTCATTTACTCCGGTAGTTTATTGGGTATTGAAC
M66 20 by 2 3 0.50 GAGATGGATTAACCAGATGTCTTAAAAACTATCGTAAC GCGAATCGGCGGCCCAAAC
M68 20 by 3 3 0.49 GATAAAGCGCAATGGCAAGAGAAAGC GCCTGGCAATTGTTCAGCGAATC
M69 21 by 2 2 0.00 GCGGTGCTGTTGTTAATGATTAGGTGTTCAC GCCCTCATCACAAAATACCTAAAATAGTCAATAGC
M71 21 by 2 2 0.15 GCGTTGCCAGCCGCAGCGATAC GCACCCCTGCTCTGGGTCACGC
M72 22 by 2 2 0.38 GCGACACGCCCTTTCAATGAGATACAC GTAGATCACCGCTAAATGCGAAGGTCCAC
M73 30 by 2 2 0.28 GCTTTCTGGCAATGCGATAGTTAGGCATCTC GTTAATTTAACTCAATATTGTCGCTATGGT
M74 36 by 2 2 0.28 GATAGAATAGCGCTTCTTTTATTATTGAGATGATGAC GTGCTTGTGGCAGGTGGGTATGAC
M76 41 by 2 2 0.49 GCGGCCTGATAAGGGATATTGGAAGC GGCGAAATTCATTAAAGAGGATCCTGACAC
M77 45 by 4 2 0.15 GAGTATTGCGAAGGGGTGATAAATGAAGC GTGCCAGAGTCCTTGGTTAAACAAATAGAAGAAC
M79 8 by 10 10 0.86 Fam-GCCCTTATCTACTGGGCCAAGCTAACGC GCCATGGCGGGATGTAATGGCAC
(a) The primers marked by an asterisk have a phosphoramidite fluorescent dye (Fam, Hex, or Ned) covalently linked to the 5′ nucleotide.

(b) DI, Nei's diversity index.

(c) The same VNTR locus as that described by Adair et al. (2); however, the PCR primers differ.

DNA was extracted from the 12 diverse strains using a traditional phenol-chloroform method. For the 12 California strains, DNA was prepared by a simple heat-soak method, previously described by Keim et al. (9). Primers were designed using Primer Select (Dnastar) with annealing temperatures as close to 65°C as possible but ranging from 59 to 69°C. However, primer-annealing temperatures did not differ by more than 2°C within a pair.

Reaction mixtures for PCR amplifications contained the final concentrations of the following reagents: 1× PCR buffer without MgCl2; 2 mM MgCl2; 200 μM dATP; 200 μM dCTP; 200 μM dGTP; 200 μM dTTP; 1 μM R110, R6G, or Tamra phosphoramidite fluorescently labeled dUTP (Applied Biosystems, Foster City, Calif.); 0.5 U of Taq polymerase; 0.5 ng of template DNA; 0.2 μM forward primer; 0.2 μM reverse primer; and filtered sterile water to a final volume of 20 μl. The primers marked by an asterisk in Table 2 have a phosphoramidite fluorescent dye (Fam, Hex, or Ned) covalently linked to the 5′ nucleotide.

Reaction conditions using the 5′-labeled primers were identical to those given above, with two exceptions. The fluorescent dUTP was omitted from the reaction mixture, and 5′-labeled primers were multiplexed, such that 18 primers were combined in four mixes. Mix 1 contained primers for the M06, M09, M18, M21, M28, and M34 loci; mix 2 contained primers for the M12, M23, M31, M58, and M82 loci; mix 3 contained primers for the M27, M29, and M33 loci; and mix 4 contained primers for the M22, M25, and M59 loci. All PCRs were performed in MJ Research 96-well DNA Engines equipped with hot bonnets. Reaction mixtures were raised to an initial temperature of 94°C for 5 min to denature the DNA. Thereafter, reaction mixtures were cycled for 20 s at 94°C, 20 s at 60°C, and 45 s at 72°C for a total of 35 cycles, followed by a final polymerase extension step at 72°C for 5 min.

Fluorescently labeled amplicons were visualized by polyacrylamide gel electrophoresis (PAGE) on an Applied Biosystems 377 DNA sequencer using GeneScan fragment analysis. The PCR product was diluted fivefold and then mixed with formamide, dextran blue loading dye, and a custom Bio Ventures Rox 1000 fluorescently labeled size standard at a ratio of 12:5:1:6, respectively. Virtual filter set A was used to detect amplicons labeled by direct incorporation with fluorescent dUTP. Virtual filter set D was used to detect primer-labeled amplicons. Amplicons were sized with Applied Biosystems GeneScan analysis software.

Statistical and phylogenetic analyses.

The degree of VNTR variability for a locus was assessed by the number of alleles observed or by Nei's diversity index [$\mathrm{DI} = 1 - \sum (\text{allele frequency})^{2}$]. Because the calculations for Nei's diversity index were based on allelic frequency, only the 12 mixed-biovar strains were used to calculate the diversity index for each VNTR locus. The neighbor-joining dendrogram was generated using all 42 VNTR marker loci (Table 2) in PAUP4a (software program; D. Swofford, Sinauer Associates, Inc., Sunderland, Mass.). The simple matching coefficient and midpoint rooting options were used for this analysis.
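The diversity index can be computed directly from the allele observed in each strain at a locus. The Python sketch below follows the formula as stated (one minus the sum of squared allele frequencies, with no small-sample correction); the function name and the example allele calls are hypothetical.

```python
from collections import Counter


def nei_diversity_index(alleles):
    """DI = 1 - sum(frequency^2) over the alleles observed at one locus."""
    n = len(alleles)
    counts = Counter(alleles)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


# Hypothetical allele calls (copy numbers) for 12 strains at one locus.
calls = [6, 6, 6, 6, 6, 6, 6, 8, 8, 8, 8, 8]   # two alleles, split 7:5
print(round(nei_diversity_index(calls), 2))

# A locus with a single allele in all 12 strains has no diversity.
print(nei_diversity_index([6] * 12))
```

With 12 strains, a two-allele locus can only take a few index values (for example, an 11:1 split gives about 0.15 and a 7:5 split about 0.49), and a locus that is monomorphic in the 12 strains scores 0 even if a second allele appears among other isolates.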

RESULTS

VNTR types and occurrence in the Y. pestis genome.

In order to identify tandem repeat arrays and potential VNTRs, we examined 105 contigs of Y. pestis genomic sequence undergoing gap closure prior to final assembly. The chromosomal sequence of the CO92 strain was analyzed using two approaches for identifying tandem repeat arrays, Genequest and an SSR search program (7).

We used Genequest to identify an average of 2.18 arrays per 10 kbp, with repeat lengths ranging from 1 to 143 bp. SSRs with repeat lengths of 9 bp or less comprised the vast majority (84%) of all detected tandem repeat arrays. The two most common repeat lengths among these SSRs were 3 and 6 bp (46%) (Fig. 1). Just over half (53%) of all tandem repeat arrays were found in ORFs of at least 700 bp. A substantial portion (72%) of tandem repeat arrays identified in these ORFs had repeat lengths of 3 bp or a multiple of 3. However, a surprisingly large fraction (17%) of ORF tandem repeat arrays contained 7- or 8-bp repeat lengths. Given the high density of genes in most bacterial genomes, the triplet ORF bias and the nontriplet non-ORF bias strongly suggest that tandemly repeated sequences mutate by insertion and/or deletion events that must stay in frame if present in genes.
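The in-frame argument reduces to simple arithmetic: gaining or losing whole repeat units inside a gene preserves the reading frame only when the number of bases changed is a multiple of 3. A minimal Python sketch, with a hypothetical function name, follows.

```python
def indel_preserves_frame(repeat_length, copies_changed=1):
    """True if gaining or losing `copies_changed` repeat units keeps the reading frame."""
    return (repeat_length * copies_changed) % 3 == 0


for length in (3, 6, 7, 8, 9):
    print(length, indel_preserves_frame(length))
```

A 7- or 8-bp repeat returns to the original frame only after a change of three copies, which is why single-copy changes at such loci within ORFs are expected to cause frameshifts.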

FIG. 1.

Number of tandem repeat sequence arrays in the Y. pestis chromosome. Log of array numbers with repeat lengths from 1 to 21 bp detected by Genequest (black bars) and repeat lengths from 1 to 10 bp detected by the SSR program (gray bars) are presented. Larger repeat lengths of 22, 30, 36, 41, 45, 49, 115, 122, 123, 141, and 145 bp were observed but are not presented, as each was observed only once (except the 123-bp repeat length, which was observed three times).

The two larger Y. pestis plasmids were also analyzed using Genequest in order to identify extrachromosomal tandem repeat arrays. The pCD1 (GenBank accession no. NC_001972) and pMT1 (GenBank accession no. NC_001976) plasmids had array densities of 2.13 and 2.18 per 10 kbp, respectively. The direct repeat array composition of the plasmids was essentially identical to that of the chromosome. Hence, at this stage there is no evidence for gross differences in the repeat array composition between Y. pestis plasmids and its chromosome.

While Genequest permits detection of imperfect and nontandem (or interspersed) repeat arrays, it will not detect arrays of less than 9 bp. In contrast, the SSR search program identifies direct repeat arrays with repeat lengths from 1 to 10 bp and was able to identify many additional SSRs that were not discovered via the Genequest analysis. This approach identified 950 SSR arrays per 10 kbp. This is notably higher than the 1.86 SSR arrays per 10 kbp found with Genequest (Fig. 1) and was largely due to the detection of many more short, mononucleotide arrays. The frequency of detected SSR arrays declines logarithmically with increasing repeat length (Fig. 1). The ORF detection program that accompanies the SSR search program was not available for determination of potential ORF bias patterns.

Clearly, the SSR search program is effective at identifying very small arrays that the Genequest program misses. The SSR search program, however, does not identify arrays with imperfect repeats or with repeat lengths larger than 10 bp (Fig. 1). In addition, the large number of short, mononucleotide repeat arrays identified by the SSR program are not generally useful as molecular markers because of the difficulty in scoring 1-bp fragment length polymorphisms. The Genequest direct repeat pattern recognition is primarily a function of the human user and, hence, is capable of identifying repeated sequences that are more difficult to define in an algorithm. We find that both programs are useful and complementary in the identification of repeated sequence arrays in large genomic sequence databases.

In order to discern whether direct repeat arrays (as detected with Genequest) occur randomly or in genomic clusters, their frequency in 10-, 25-, and 50-kbp intervals was compared to a model Poisson distribution (Fig. 2). We found that array frequency did not deviate from random at these three interval scales, as tested by a chi-square test for independence. The observed and Poisson expectations were so close that the chi-square probabilities were equal to 1.0 for all three intervals. Therefore, it appears that direct repeat arrays are randomly distributed in the Y. pestis chromosome over these genomic interval sizes, which are larger than most single genes. It would not be surprising if particular genes were intolerant of repeated sequence arrays.
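The comparison with a Poisson model can be sketched as follows. The observed interval counts in this Python example are invented for illustration and are not the data behind Fig. 2; the helper names are also assumptions. The sketch computes only the chi-square statistic, leaving the P value to a statistics package.

```python
import math


def poisson_expected_counts(mu, n_intervals, max_x):
    """Expected number of intervals holding 0..max_x-1 arrays, plus a pooled bin for >= max_x."""
    probs = [math.exp(-mu) * mu ** x / math.factorial(x) for x in range(max_x)]
    probs.append(1.0 - sum(probs))  # pooled upper tail
    return [p * n_intervals for p in probs]


def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all bins."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts of 10-kbp intervals containing 0, 1, 2, 3, 4, 5, and >= 6 arrays.
observed = [12, 25, 27, 19, 10, 5, 2]
expected = poisson_expected_counts(2.14, sum(observed), max_x=6)
print(round(chi_square_statistic(observed, expected), 3))
```

A small statistic relative to its degrees of freedom means the observed counts are close to the Poisson expectation, which is the pattern reported here for all three interval sizes.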

FIG. 2.

Tandem repeat array distribution in the Y. pestis chromosome. Array distribution observed per 10-kbp interval (black bars) versus the expected Poisson distribution (μ = 2.14 arrays per 10-kbp interval) (gray bars).

Sequence structure versus diversity.

Among 77 direct repeat loci PCR amplified from a group of 12 mixed-biovar strains of Y. pestis, 42 were polymorphic. In order to compare sequence structure parameters with VNTR variability we have determined the diversity and number of alleles at these 42 VNTR loci for 24 selected Y. pestis isolates (Table 2). For these studies, a group of 12 potentially closely related isolates from a restricted geographic range (Siskiyou County, Calif.) were chosen, in addition to the 12 isolates representing all three biovar types from a broad worldwide distribution (Table 1). Among these 42 VNTR loci, the number of alleles observed per locus ranged from 2 to 11 (Fig. 3). Nei's diversity index, calculated from allele frequencies observed in the 12 mixed-biovar strains, ranged widely from 0 (only 1 allele detected) to 0.86.

FIG. 3.

VNTR locus diversity. The 42 VNTR loci examined in this study are displayed based on the total number of alleles present in the 24 strains examined.

Variability at VNTR loci was compared with different sequence structure parameters in order to determine which component provides the greatest power for predicting and understanding VNTR diversity levels. Correlations were performed between copy number, repeat length or array length, and the number of alleles or the diversity index for all 42 VNTR loci. Maximum copy number and number of alleles are highly correlated (R value of 0.86) (Fig. 4). However, this strong relationship is problematic for predictive purposes, because maximum copy number is determined after screening a locus against a group of diverse strains. More useful in a VNTR discovery program is the correlation between the number of alleles and copy number in a reference genomic sequence, such as CO92. In this study this correlation is lower but still of great predictive power at 0.67.
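The correlation between copy number in the reference sequence and the number of alleles can be illustrated with a plain Pearson coefficient. The Python sketch below uses nine loci taken from Table 2 (CO92 copy number and number of alleles); because it is a subset, the result is not expected to reproduce the coefficients reported for all 42 loci. The function name is an assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Subset of Table 2: locus -> (CO92 copy number, number of alleles).
loci = {
    "M06": (6, 2), "M09": (6, 4), "M12": (10, 7),
    "M18": (6, 9), "M22": (12, 9), "M25": (20, 10),
    "M41": (2, 3), "M69": (2, 2), "M77": (4, 2),
}
copy_numbers = [v[0] for v in loci.values()]
allele_counts = [v[1] for v in loci.values()]
r = pearson_r(copy_numbers, allele_counts)
print(round(r, 2))
```

The same function applied to repeat length instead of copy number would give the kind of negative relationship described in the following paragraph.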

FIG. 4.

Maximum repeat copy number predicts the number of alleles. Correlation between copy number and allele number observed in 42 polymorphic loci. This is presented for 12 Siskiyou County isolates (open circles and line with short dashes), 12 globally dispersed isolates (open triangles and solid line), and all 24 isolates combined (filled circle and line with long dashes).

The two additional sequence parameters, array length and repeat length, moderately correlate with VNTR diversity. In neither case does the correlation coefficient exceed 0.6, and both are correlated with copy number. Interestingly, the repeat length correlates negatively with the number of alleles (−0.50); VNTR arrays with larger repeat lengths display fewer alleles.

Among the 35 direct repeat arrays that did not show variability, the SSR category (1- to 10-bp repeat length) exhibited a higher frequency of monomorphism than the larger repeat length arrays. BLAST searches in the NCBI database found that the majority of these monomorphic SSRs showed significant homology to known genes, and they very likely reside in related ORFs of Y. pestis (data not shown). The SSRs that exhibited polymorphism, however, showed moderate to very high diversity levels. The most diverse VNTRs tended to have repeat lengths between 6 and 9 bp. The arrays with repeat lengths larger than those of SSRs were more often polymorphic but generally showed low diversity levels.

Diversity in closely and distantly related isolates.

In order to understand the relationship between VNTR diversity and evolutionary time, we compared the VNTR diversity among the 12 Siskiyou County Y. pestis isolates with that among the 12 global strains. While all Siskiyou County isolates were of the biovar orientalis, the globally dispersed isolates had representatives from the orientalis, mediaevalis, and antiqua biovars. As expected, the mixed-biovar isolates are much more diverse, with an average of 4 alleles per locus versus 1.8 for the Siskiyou County isolates. In addition, the average Nei's diversity index for the mixed-biovar isolates is 0.54 versus 0.18 for the Siskiyou County samples. With all 24 samples combined, the total number of alleles per locus and the diversity index shift to 4.6 and 0.46, respectively.

The slight increase in total allele number when all 24 samples are combined is due to unique alleles in the Siskiyou samples not represented among the four orientalis strains in the global isolate group. Not surprisingly, this contrast indicates a geographic component to polymorphism due to the length of time separating diversifying isolates. Likewise, the maximum copy number effect upon allele number is depressed in the closely related Siskiyou County isolates (Fig. 4). The correlation is only 0.66 for Siskiyou County isolates but is 0.82 for the 12 global isolates. In this case, the geographic proximity may also be an indicator of evolutionary distance. In the Siskiyou County isolates, the VNTR loci have not had sufficient time to fully diversify and therefore exhibit reduced allele number. Because of the differences in diversity among the VNTR loci in this comparison, differential mutation rates among VNTR loci must also be important.

Genetic relationships among strains based upon VNTR analysis.

A phylogenetic analysis was performed on the 24 Y. pestis isolates using all 42 VNTR loci and the neighbor-joining method. The purpose of this analysis was not to construct an extensive phylogeny for many Y. pestis strains but to identify the utility of VNTRs as molecular markers in Y. pestis using biovar-classified isolates. The resulting phylogenetic tree (Fig. 5) grouped all orientalis biovar isolates into a single clade, supported with a high bootstrap value of 85. This strongly supported branch, of course, also separates the mediaevalis and antiqua biovar isolates from the orientalis isolates. Mediaevalis and antiqua isolates also were consistently categorized to their biovar, with the exception of the Pestoides F strain. Branch lengths among the orientalis isolates were relatively short and frequently well supported by bootstrap values. In contrast, branch lengths among mediaevalis and antiqua isolates were long and seldom supported by strong bootstrap values. This phylogenetic analysis is generally consistent with IS100-based analysis and common biovar evolutionary scenarios (2).
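The input to a neighbor-joining analysis of this kind is a pairwise distance matrix. The Python sketch below computes a distance based on the simple matching coefficient, treating each locus as a categorical character so that two strains either share an allele or do not. This is an assumption about how multistate VNTR data would be scored, not a description of PAUP's internals; the strain names and allele profiles are hypothetical and use only four loci.

```python
def simple_matching_distance(profile_a, profile_b):
    """Fraction of loci at which two allele profiles differ (1 - simple matching coefficient)."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same loci")
    mismatches = sum(a != b for a, b in zip(profile_a, profile_b))
    return mismatches / len(profile_a)


def distance_matrix(profiles):
    """Pairwise distances for a dict of strain name -> allele profile."""
    names = list(profiles)
    return {
        (s, t): simple_matching_distance(profiles[s], profiles[t])
        for s in names for t in names
    }


# Hypothetical copy-number profiles at four loci.
profiles = {
    "strain_1": [6, 10, 12, 2],
    "strain_2": [6, 10, 11, 2],
    "strain_3": [4, 7, 9, 2],
}
dist = distance_matrix(profiles)
print(dist[("strain_1", "strain_2")], dist[("strain_1", "strain_3")])
```

Under this scoring, a one-copy difference and a ten-copy difference at a locus count equally, which is one reason many loci are needed for stable estimates.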

FIG. 5.

Phylogenetic analysis using VNTR loci. Neighbor-joining analysis with midpoint rooting using 42 markers identified genetically similar and dissimilar strains. Biovar identities are indicated in parentheses following the strain names. O, orientalis; M, mediaevalis; A, antiqua. Bootstrap values based upon 1,000 simulations for individual branches are indicated. Branches with no numbers had values of less than 50.

DISCUSSION

Because Y. pestis is a recently emerged pathogen, very little (2) or no (1) nucleotide variation among strains has been detected in comparative sequencing studies. With nucleotide substitutions so rare, more frequent types of mutational changes are needed for strain identification and phylogenetic analysis. IS100-based analysis is very promising (2), as is multiple-locus VNTR analysis (MLVA), which uses the VNTR-based polymorphism reported here. In both cases, multiple-variable loci are used to provide genome-wide coverage and increase the precision in genetic relationship estimation. MLVA has the additional attraction of providing multiple character states at each locus (IS100 RFLPs are binary), which increases discriminatory potential between closely related isolates. Binary data have a maximum possible diversity index value of 0.5 per locus, whereas the multiple allelic VNTRs may approach 1.0 per locus. The greater diversity and, probably, higher mutation rates of VNTRs can provide high-resolution analysis of epidemics where isolates may be very closely related. MLVA is PCR based and requires only small amounts of low-quality DNA templates. With these favorable attributes, MLVA represents a promising approach for the characterization of Y. pestis isolates.
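The ceiling of 0.5 for binary markers follows from the diversity index defined in Materials and Methods. As a brief worked check, assume a locus with $k$ alleles at frequencies $p_i$ (the symbols $k$ and $p_i$ are introduced here for illustration):

\[
\mathrm{DI} = 1 - \sum_{i=1}^{k} p_i^{2}, \qquad \sum_{i=1}^{k} p_i = 1 .
\]

The sum of squares is smallest when all alleles are equally frequent, $p_i = 1/k$, so

\[
\mathrm{DI}_{\max} = 1 - k\left(\frac{1}{k}\right)^{2} = 1 - \frac{1}{k} .
\]

For a binary marker ($k = 2$) this gives 0.5. For a locus with 11 alleles, the largest number observed in this study, the ceiling is $1 - 1/11 \approx 0.91$, and it approaches 1.0 as $k$ grows.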

Our analysis of the genome has identified numerous potential VNTR loci and defined attributes related to their diversity in natural populations of Y. pestis. The high densities of both large repeat length arrays (ca. 1.86 per 10 kbp) and SSR arrays (ca. 980 per 10 kbp) in the genome offer a plentiful resource for marker development. The contrasting relationships between copy number and diversity and between repeat length and diversity will provide a guide for designing typing systems that match the research goals of particular projects. If phylogenetic analysis of global diversity is required, the less diverse, longer repeat length VNTR loci may be more fruitful for analysis due to slower evolutionary change maintained in the genome. If epidemiological or forensic analysis of a plague outbreak is the goal, the most rapidly evolving, high-copy-number SSR loci are best suited for inclusion in a typing system. Combining multiple loci in an analysis will provide great precision in genetic estimation and minimize the effect of convergent evolution on the analysis. An assumption intrinsic to VNTR analysis is that alleles of the same length are identical (homologous). However, this will not be universally true because of convergent evolution, so identity between isolates must be confirmed at multiple loci. The 42 marker loci described in this study will provide a high level of confidence in identifying genetic relationships.

The exceedingly high number of direct repeat arrays within the Y. pestis genome may be an important feature of its evolution. More than half of the tandem arrays we identified contain triplet (or multiples of 3 bp) repeat lengths and comprise the majority of direct repeats found in probable protein-coding ORFs. Variation in these VNTRs generated by insertion and/or deletion events, such as during mismatch repair (10, 21), will not affect the reading frame but rather will change the amino acid sequences in these proteins and may result in altered phenotypes. In addition, VNTRs found in intergenic regions, such as a previously described tetranucleotide VNTR in Y. pestis (2), have the capacity to modify expression of adjacent genes. For example, SSR-mediated phase variation of virulence factor expression in many pathogenic bacteria, including H. influenzae (24), Neisseria spp., and Moraxella catarrhalis (15), is well characterized.

Surprisingly, we found that over one quarter of the direct repeats detected in ORFs contain nontriplet repeat lengths, mostly of 7 or 8 bp. Nontriplet repeat array variation within coding regions has the potential to radically modify proteins and shift bacterial phenotypes. However, such variation is expected most often to produce a frameshift mutation, leading to loss-of-function or undetected lethal phenotypes. Nontriplet tandem repeat-associated loss-of-function mutations have been previously characterized for Y. pestis. The spontaneous loss of Psn, the yersiniabactin/pesticin receptor, arises from a 5-bp deletion, removing one of a pair of tandem pentameric repeats in psn (11, 20). The naturally nonureolytic state of Y. pestis has also been attributed to urease silencing by the addition of a single G residue at a poly(G) tract in ureD (19). Hence, the high frequency of monomorphism we observed among SSRs found in ORFs may reflect a selective bias against such high-stakes changes that accompany nontriplet repeat variation.
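The reading-frame argument reduces to simple arithmetic: a change in repeat copy number keeps the downstream frame only when the number of inserted or deleted base pairs is a multiple of 3. The sketch below applies that rule to the two cases cited above and to two generic cases. The function name and the generic cases are illustrative assumptions, and the check ignores other effects such as newly created stop codons.

```python
def preserves_reading_frame(repeat_length_bp, copies_changed):
    """True if gaining or losing `copies_changed` repeat units keeps the frame.

    `copies_changed` is negative for deletions and positive for insertions.
    """
    return (repeat_length_bp * copies_changed) % 3 == 0


# Cases described in the text.
psn_deletion = preserves_reading_frame(5, -1)   # one 5-bp repeat unit lost
ured_insertion = preserves_reading_frame(1, 1)  # one G added to a poly(G) tract

# Generic illustrations (hypothetical).
triplet_repeat = preserves_reading_frame(9, 1)    # 9-bp unit gained: 9 bp total
heptamer_triple = preserves_reading_frame(7, 3)   # three 7-bp units: 21 bp total

print(psn_deletion, ured_insertion, triplet_repeat, heptamer_triple)
```

Only every third copy-number change of a 7- or 8-bp repeat restores the frame, which fits the expectation that most variation at such loci would be disruptive.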

These observations have two consequences for our understanding of the Y. pestis genome. First, the mechanisms (e.g., mismatch repair) responsible for tandemly repeated sequences and their associated variation are potent. Their random distribution suggests that VNTR development can occur throughout the genome and that selection against debilitating mutation defines VNTR composition within genes. Second, there is tremendous potential for generating genetic diversity within protein-coding genes over a very short evolutionary time. The diversity associated with VNTRs is great, especially in contrast to the lack of nucleotide substitutions among Y. pestis strains. Evolutionary adaptation of Y. pestis as it has moved across the globe and into new hosts, vectors, and reservoirs would require genetic variation. Because Y. pestis is a recently emerged pathogen, VNTR diversity, along with other highly mutable loci (e.g., IS elements), may have played an important role in its recent evolution.

ACKNOWLEDGMENTS

This work was supported, in part, by funds from the U.S. Department of Energy—NN20/CBNP, National Institutes of Health, and the Cowden Endowment in Microbiology.

We thank Christine Keys and an anonymous referee for reviewing the manuscript and making critical suggestions.

REFERENCES

  • 1. Achtman M, Zurth K, Morelli G, Torrea G, Guiyoule A, Carniel E. Yersinia pestis, the cause of plague, is a recently emerged clone of Yersinia pseudotuberculosis. Proc Natl Acad Sci USA. 1999;96:14043–14048. doi: 10.1073/pnas.96.24.14043. (Erratum, 97:8192, 2000.)
  • 2. Adair D M, Worsham P L, Hill K K, Klevytska A M, Jackson P J, Friedlander A M, Keim P. Diversity in a variable-number tandem repeat from Yersinia pestis. J Clin Microbiol. 2000;38:1516–1519. doi: 10.1128/jcm.38.4.1516-1519.2000.
  • 3. Britten R J, Kohne D E. Repeated sequences in DNA. Hundreds of thousands of copies of DNA sequences have been incorporated into the genomes of higher organisms. Science. 1968;161:529–540. doi: 10.1126/science.161.3841.529.
  • 4. Brubaker R R. Factors promoting acute and chronic diseases caused by yersiniae. Clin Microbiol Rev. 1991;4:309–324. doi: 10.1128/cmr.4.3.309.
  • 5. Buchrieser C, Rusniok C, Frangeul L, Couve E, Billault A, Kunst F, Carniel E, Glaser P. The 102-kilobase pgm locus of Yersinia pestis: sequence analysis and comparison of selected regions among different Yersinia pestis and Yersinia pseudotuberculosis strains. Infect Immun. 1999;67:4851–4861. doi: 10.1128/iai.67.9.4851-4861.1999.
  • 6. Guiyoule A, Grimont F, Iteman I, Grimont P A, Lefevre M, Carniel E. Plague pandemics investigated by ribotyping of Yersinia pestis strains. J Clin Microbiol. 1994;32:634–641. doi: 10.1128/jcm.32.3.634-641.1994.
  • 7. Gur-Arie R, Cohen C J, Eitan Y, Shelef L, Hallerman E M, Kashi Y. Simple sequence repeats in Escherichia coli: abundance, distribution, composition, and polymorphism. Genome Res. 2000;10:62–71.
  • 8. Jeffreys A J, Wilson V, Thein S L. Hypervariable 'minisatellite' regions in human DNA. Bio/Technology. 1992;24:467–472.
  • 9. Keim P, Price L B, Klevytska A M, Smith K L, Schupp J M, Okinaka R, Jackson P J, Hugh-Jones M E. Multiple-locus variable-number tandem repeat analysis reveals genetic relationships within Bacillus anthracis. J Bacteriol. 2000;182:2928–2936. doi: 10.1128/jb.182.10.2928-2936.2000.
  • 10. Levinson G, Gutman G A. Slipped-strand mispairing: a major mechanism for DNA sequence evolution. Mol Biol Evol. 1987;4:203–221. doi: 10.1093/oxfordjournals.molbev.a040442.
  • 11. Lucier T S, Fetherston J D, Brubaker R R, Perry R D. Iron uptake and iron-repressible polypeptide in Yersinia pestis. Infect Immun. 1996;64:3023–3031. doi: 10.1128/iai.64.8.3023-3031.1996.
  • 12. Lucier T S, Brubaker R R. Determination of genome size, macrorestriction pattern polymorphism, and nonpigmentation-specific deletion in Yersinia pestis by pulsed-field gel electrophoresis. J Bacteriol. 1992;174:2078–2086. doi: 10.1128/jb.174.7.2078-2086.1992.
  • 13. Marshall D G, Coleman D C, Sullivan D J, Xia H, O'Morain C A, Smyth C J. Genomic DNA fingerprinting of clinical isolates of Helicobacter pylori using short oligonucleotide probes containing repetitive sequences. J Appl Bacteriol. 1996;81:509–517. doi: 10.1111/j.1365-2672.1996.tb03540.x.
  • 14. Nakamura Y, Leppert M, O'Connell P, Wolff R, Holm T, Culver M, Martin C, Fujimoto E, Hoff M, Kumlin E, et al. Variable number of tandem repeat (VNTR) markers for human gene mapping. Science. 1987;235:1616–1622. doi: 10.1126/science.3029872.
  • 15. Peak I R, Jennings M P, Hood D W, Bisercic M, Moxon E R. Tetrameric repeat units associated with virulence factor phase variation in Haemophilus also occur in Neisseria spp. and Moraxella catarrhalis. FEMS Microbiol Lett. 1996;137:109–114. doi: 10.1111/j.1574-6968.1996.tb08091.x.
  • 16. Perry R D, Fetherston J D. Yersinia pestis—etiologic agent of plague. Clin Microbiol Rev. 1997;10:35–66. doi: 10.1128/cmr.10.1.35.
  • 17. Poh C L, Ramachandran V, Tapsall J W. Genetic diversity of Neisseria gonorrhoeae IB-2 and IB-6 isolates revealed by whole-cell repetitive element sequence-based PCR. J Clin Microbiol. 1996;34:292–295. doi: 10.1128/jcm.34.2.292-295.1996.
  • 18. Price L B, Hugh-Jones M, Jackson P J, Keim P. Genetic diversity in the protective antigen gene of Bacillus anthracis. J Bacteriol. 1999;181:2358–2362. doi: 10.1128/jb.181.8.2358-2362.1999.
  • 19. Sebbane F, Devalckenaere A, Foulon J, Carniel E, Simonet M. Silencing and reactivation of urease in Yersinia pestis is determined by one G residue at a specific position in the ureD gene. Infect Immun. 2001;69:170–176. doi: 10.1128/IAI.69.1.170-176.2001.
  • 20. Sikkema D J, Brubaker R R. Resistance to pesticin, storage of iron, and invasion of HeLa cells by yersiniae. Infect Immun. 1987;55:572–578. doi: 10.1128/iai.55.3.572-578.1987.
  • 21. Strand M, Prolla T A, Liskay R M, Petes T D. Destabilization of tracts of simple repetitive DNA in yeast by mutations affecting DNA mismatch repair. Nature. 1993;365:274–276. doi: 10.1038/365274a0. (Erratum, 368:569, 1994.)
  • 22. van Belkum A, Scherer S, van Alphen L, Verbrugh H. Short-sequence DNA repeats in prokaryotic genomes. Microbiol Mol Biol Rev. 1998;62:275–293. doi: 10.1128/mmbr.62.2.275-293.1998.
  • 23. van Belkum A, Scherer S, van Leeuwen W, Willemse D, van Alphen L, Verbrugh H. Variable number of tandem repeats in clinical strains of Haemophilus influenzae. Infect Immun. 1997;65:5017–5027. doi: 10.1128/iai.65.12.5017-5027.1997.
  • 24. Weiser J N, Maskell D J, Butler P D, Lindberg A A, Moxon E R. Characterization of repetitive sequences controlling phase variation of Haemophilus influenzae lipopolysaccharide. J Bacteriol. 1990;172:3304–3309. doi: 10.1128/jb.172.6.3304-3309.1990.
  • 25. Young W P, Schupp J M, Keim P. DNA methylation and AFLP marker distribution in the soybean genome. Theor Appl Genet. 1999;99:785–790.

Articles from Journal of Clinical Microbiology are provided here courtesy of American Society for Microbiology (ASM)
