Abstract
Comparative genomics is a popular method for the identification of microbial virulence determinants, especially since the sequencing of a large number of whole bacterial genomes from pathogenic and non-pathogenic strains has become relatively inexpensive. The bioinformatics pipelines for comparative genomics usually include gene prediction and annotation and can require significant computer power. To circumvent this, we developed a rapid method for genome-scale in silico subtractive hybridization, based on blastn and independent of feature identification and annotation. Whole genome comparisons by in silico genome subtraction were performed to identify genetic loci specific to Streptococcus mutans strains associated with severe early childhood caries (S-ECC), compared to strains isolated from caries-free (CF) children.
The genome similarity of the 20 S. mutans strains included in this study, calculated by Simrank k-mer sharing, ranged from 79.5 to 90.9%, confirming this is a genetically heterogeneous group of strains. We identified strain-specific genetic elements in 19 strains, with sizes ranging from 200 bp to 39 kb. These elements contained protein-coding regions with functions mostly associated with mobile DNA. We did not, however, identify any genetic loci consistently associated with dental caries, i.e., shared by all the S-ECC strains and absent in the CF strains. Conversely, we did not identify any genetic loci specific to the healthy group. Comparison of previously published genomes from pathogenic and carriage strains of Neisseria meningitidis with our in silico genome subtraction yielded the same set of genes specific to the pathogenic strains, thus validating our method.
Our results suggest that S. mutans strains derived from caries-active or caries-free dentitions cannot be differentiated based on the presence or absence of specific genetic elements. Our in silico genome subtraction method is available as the Microbial Genome Comparison (MGC) tool, with a user-friendly Java graphical interface.
Keywords: Comparative genomics, Software, Streptococcus mutans, Dental caries, Virulence, Pathogenesis
1. Introduction
The study of the genetic basis of microbial pathogenesis has been guided by the molecular adaptation of Koch's postulates (Falkow, 1988), i.e., the notion that a disease phenotype should be associated with pathogenic strains of a species that harbor one or more genes associated with a virulence trait. Comparative genomics methods are particularly useful for the identification of genetic elements associated with virulence when pathogenic and non-pathogenic strains of a bacterial species can be isolated. For example, DNA subtractive hybridization has led to the identification of pathogenicity islands and clone-specific markers for several bacterial pathogens (Winstanley, 2002).
With high-throughput sequencing technologies, it is now possible to identify virulence determinants by interrogating in silico a large number of whole genomes from bacterial strains isolated from diseased or healthy hosts (Hu et al., 2011). The availability of multiple genome sequences for a bacterial species has also led to the pan-genome concept, i.e., the sum of the core genes shared among all sequenced strains of a species and the accessory genes that are present in at least one, but not all, strains studied (Tettelin et al., 2005). Popular programs for comparative genomics of whole genomes such as MAUVE (Darling et al., 2004), ACT (Carver et al., 2005), or MUMmer (Laing et al., 2011), can align entire genomes to highlight regions of similarity and synteny, but may not represent the most practical approach for the rapid identification of accessory sequences common to a large group of strains. Instead, a few in silico comparative genomics methods have been described that apply the rationale behind DNA subtractive hybridization to whole genome sequences for the identification of group-specific genes. mGenomeSubtractor (Shao et al., 2010) and FindTarget (Chetouani et al., 2001) only take into account protein coding regions. The novel region finder (NRF) module of Panseq (Laing et al., 2011) is available as a standalone version, but it requires knowledge of Perl and Unix-based systems.
We developed an in silico genome subtraction method for the rapid identification of genetic elements specific to a group of strains. The accompanying software, called the Microbial Genome Comparison (MGC) tool, is described in detail elsewhere (Chen et al., 2013), and is available as a Java executable from SourceForge. This tool performs in silico genomic comparisons independently of feature identification and annotation, an advantage over more comprehensive, yet time-consuming pipelines of comparative genomics. Instead, the MGC tool consists of the in silico fragmentation of the genome sequences followed by a series of between-group and within-group blastn queries (Altschul et al., 1997). In this study, we applied the in silico genome subtraction method as implemented in the MGC tool to the comparison of 20 genome sequences of Streptococcus mutans, commonly referred to as the main etiological agent of dental caries.
Dental caries remains the most common chronic disease of childhood in the United States, with a prevalence of 41% among children 2-11 years of age (Roberts, 2008). Severe early childhood caries (S-ECC) is an extremely destructive form of dental caries affecting the primary dentition of children six years and younger (AAPD, 2004). The association between S. mutans and S-ECC has been well documented both by culture-based (Loesche et al., 1975; Marchant et al., 2001; Milnes and Bowden, 1985; Tanner et al., 2011; van Houte et al., 1982) and culture-independent surveys (Becker et al., 2002; Corby et al., 2005; Kanasi et al., 2010) of the dental biofilm. Individuals free of detectable caries can, however, harbor S. mutans in their dental biofilm (Ge et al., 2008; Loesche, 1986; Marchant et al., 2001; Tanner et al., 2011), and the presence or total numbers of S. mutans are poor predictors of subsequent caries activity (Thenisch et al., 2006).
Whether specific genotypes of S. mutans are associated with S-ECC, and are different than genotypes colonizing caries-free (CF) children, has not been determined, but the evidence suggests that strains can differ in virulence (Fitzgerald et al., 1983; Kohler and Krasse, 1990). The remarkable intra-species genetic variability exhibited by S. mutans has been extensively documented by restriction enzyme fingerprinting (Caufield and Walker, 1989; Kulkarni et al., 1989), MLST (Do et al., 2010; Nakano et al., 2007), comparative genome hybridization (Zhang et al., 2009), and the comparison of two sequenced genomes (Ajdic et al., 2002; Maruyama et al., 2009). The virulence of S. mutans strains may be associated with the presence or absence of regions of accessory DNA, so we queried the genome sequences of 10 S-ECC and 10 CF S. mutans strains through in silico genome subtraction with the MGC tool to identify differences in their genetic repertoire that correlate with differences in caries experience.
2. Materials and Methods
2.1 Bacterial strains
This study included 20 S. mutans strains previously isolated, 10 from children diagnosed with severe early childhood caries (S-ECC) and scheduled for extensive caries restorative treatment under general anesthesia at the Bellevue Hospital, New York, NY, and 10 from children diagnosed as being free from detectable caries (caries-free [CF]) (Argimon and Caufield, 2011). Bacterial samples from saliva and pooled plaque of the S-ECC and CF children were collected. Additionally, for S-ECC children plaque samples were obtained from caries lesions. S. mutans isolates were selected from mitis salivarius-bacitracin agar medium (MSB) based upon colony morphology (Argimon and Caufield, 2011). The study protocol for human subjects was approved by the Institutional Review Board of New York University School of Medicine and Bellevue Hospital. Our study cohort of S-ECC children conformed to the recent reclassification of S-ECC as hypoplasia-associated severe early childhood caries (HAS-ECC) (Caufield et al., 2012). In most cases, only 1 S. mutans genotype was isolated from each subject, though 4 subjects presented 2 genotypes, and 2 subjects presented 3 genotypes. The 20 strains represented 20 different S. mutans genotypes by chromosomal DNA fingerprinting (CDF) and arbitrarily-primed PCR (AP-PCR), as previously reported (Argimon and Caufield, 2011). Total genomic DNA was obtained as previously described (Argimon and Caufield, 2011).
2.2 Genome sequencing and assembly
Twenty S. mutans genomes, 10 from S-ECC children and 10 from CF children, were multiplexed and library preps were generated using the Illumina TruSeq DNA Sample Prep Kit according to manufacturer's instructions. Libraries were sequenced on the Illumina HiSeq 2000 Genome Analyzer System (coverage of ~100x), and the 50 bp paired-end reads thus generated were assembled de novo into contigs using ABySS (Simpson et al., 2009). The contigs for each sample were reordered based on alignment to the reference genome sequence of strain UA159 with Mauve Contig Mover (Rissman et al., 2009).
2.3 Estimation of genome sequence similarity
We used a simple, rapid, computationally efficient and scalable method based on the sharing of short DNA words (k-mers) between genome sequences. The pairwise similarity between genome sequences was estimated with Simrank, which computes the similarity between two sequences as the number of unique k-mers shared, divided by the smallest total unique k-mer count in either sequence (DeSantis et al., 2011). This method requires no annotation of the genomes, treats all portions of the genome equally, and can even be applied to sequence reads that have not been assembled into contigs. The complete genome sequences of S. mutans strains UA159, NN2025, GS-5 and LJ23 (Table 1), and S. agalactiae strains NEM 316 (NC_004368.1) and 2603V/R (NC_004116.1) were included as a reference. A k-mer length of 10 was chosen empirically and validated by comparison of the results to previously published similarity values for S. agalactiae genomes (Tettelin et al., 2005). The contigs in each draft genome sequence were concatenated and separated by a stretch of Ns equal to the length of the k-mer. Dissimilarity (100-similarity) matrices of pairwise comparisons between genome sequences were employed for hierarchical clustering of the genomes based on the unweighted pair group method with arithmetic mean (UPGMA), as implemented in the seqlinkage tool in the Matlab bioinformatics toolbox (MathWorks). Dissimilarity matrices were further evaluated by Principal Coordinate Analysis (PCoA) using the ADE4 (Dray and Dufour, 2007) R package (Team, 2012). Association of the Simrank dissimilarity matrices with caries status was tested using a non-Euclidean multivariate analysis of variance method (Adonis) and by direct t-tests of the first two axes of the PCoA.
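The similarity measure described above can be illustrated with a minimal Python sketch. It is not the Simrank implementation: the function names are hypothetical, k-mers containing N are simply skipped (an assumption), and reverse complements are not considered.

```python
def unique_kmers(seq, k=10):
    """Return the set of unique k-mers in seq, skipping k-mers that contain N."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)
            if "N" not in seq[i:i + k]}


def concatenate_contigs(contigs, k=10):
    """Join contigs with a spacer of k Ns so that no counted k-mer spans two contigs."""
    return ("N" * k).join(contigs)


def kmer_similarity(seq_a, seq_b, k=10):
    """Shared unique k-mers as a percentage of the smaller unique k-mer set."""
    a, b = unique_kmers(seq_a, k), unique_kmers(seq_b, k)
    if not a or not b:
        return 0.0
    return 100.0 * len(a & b) / min(len(a), len(b))
```

The dissimilarity used for clustering would then be 100 minus this value for each pair of genomes.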
Table 1.
General features of 20 draft and 4 fully-sequenced S. mutans genomes.
Caries group | Strain ID | Genome ID | GenBank Accession | n | N50 | size (bp) | G+C (%) |
---|---|---|---|---|---|---|---|
S-ECC | B05Sm11 | 01 | ALYO00000000 | 172 | 89289 | 2038052 | 37.0 |
S-ECC | B13Sm1 | 02 | ALYP00000000 | 79 | 134265 | 2152906 | 36.8 |
S-ECC | B12Sm1 | 03 | ALYQ00000000 | 195 | 67744 | 2176605 | 36.9 |
S-ECC | B084SM-A | 04 | ALYR00000000 | 125 | 92931 | 2049790 | 36.9 |
S-ECC | B107SM-B | 05 | ALYS00000000 | 147 | 106823 | 2105230 | 36.8 |
S-ECC | B04Sm5 | 13 | ALYY00000000 | 221 | 62705 | 2079820 | 37.0 |
S-ECC | B082SM-A | 14 | ALYZ00000000 | 164 | 97499 | 2129726 | 37.0 |
S-ECC | B06Sm2 | 15 | ALZA00000000 | 254 | 72589 | 2221276 | 37.0 |
S-ECC | B85SM-B | 16 | ALZB00000000 | 109 | 96201 | 2103020 | 37.0 |
S-ECC | B88SM-A | 17 | ALZC00000000 | 306 | 72442 | 2064873 | 36.8 |
CF | B16-P-Sm1 | 18 | ALZD00000000 | 115 | 94212 | 2331519 | 36.9 |
CF | B23Sm1 | 19 | ALZE00000000 | 329 | 74707 | 2099531 | 37.1 |
CF | B111SM-A | 20 | ALZF00000000 | 78 | 123128 | 2125644 | 36.9 |
CF | B114SM-A | 21 | ALZG00000000 | 105 | 120718 | 2111086 | 37.0 |
CF | B115SM-A | 22 | ALZH00000000 | 90 | 100247 | 2284933 | 37.0 |
CF | B07Sm2 | 06 | ALYT00000000 | 103 | 89728 | 2224347 | 36.8 |
CF | B09Sm1 | 07 | ALYU00000000 | 169 | 82624 | 2098314 | 37.0 |
CF | B24Sm2 | 08 | ALYV00000000 | 114 | 89260 | 2290015 | 36.8 |
CF | B102SM-B | 09 | ALYW00000000 | 358 | 49104 | 2098790 | 36.8 |
CF | B112SM-A | 10 | ALYX00000000 | 83 | 100294 | 2259998 | 37.1 |
Reference strains | UA159 | | AE014133.1 | | | 2030936 | 36.8 |
Reference strains | NN2025 | | AP010655.1 | | | 2013587 | 36.8 |
Reference strains | GS-5 | | CP003686.1 | | | 2056048 | 36.8 |
Reference strains | LJ23 | | NC_017768.1 | | | 2044422 | 37.1 |
n: number of contigs ≥ 100 bp; N50: longest contig length such that at least 50% of all base-pairs are contained in contigs of this length or larger (Lander et al., 2001); size: sum of contig lengths for draft genomes, or size of the complete genome for the reference strains; G+C content.
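The N50 definition in the footnote can be made concrete with a short Python sketch. The function names and the choice to compute all three statistics over contigs of at least 100 bp are assumptions for illustration, not a description of how the values in Table 1 were produced.

```python
def n50(contig_lengths):
    """Longest length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def assembly_stats(contig_lengths, min_length=100):
    """Return (n, N50, size) over contigs of at least min_length bp (assumed cutoff)."""
    kept = [length for length in contig_lengths if length >= min_length]
    return len(kept), n50(kept), sum(kept)
```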
2.4 Comparative genomics by in silico genome subtraction with the MGC tool
Figure 1 depicts the in silico genome subtraction method implemented in the MGC tool applied to the identification of genetic elements specific to the S-ECC or CF strains from S. mutans. The average gene size for S. mutans has been reported as 885 bp for strain UA159 (Ajdic et al., 2002) and 903 bp for strain NN2025 (Maruyama et al., 2009). Therefore, a 500-bp fragment would be large enough to contain the coding sequence for a substantial portion of most genes, while a fragment size of 200 bp would better target the smallest protein coding genes. The draft genome sequences were fragmented in silico into consecutive 500-bp or 200-bp segments. The fragments were assigned identifiers to trace them back to their original position in the contigs. We built two BLAST databases, one containing the CF draft genomes, and one with the S-ECC draft genomes.
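The fragmentation step can be sketched in Python as follows. This is only an illustration under stated assumptions: the identifier scheme (strain, contig, fragment index, 1-based start) is hypothetical, and the final fragment of a contig is kept even when it is shorter than the chosen size, which may differ from what the MGC tool does.

```python
def fragment_contig(strain_id, contig_id, sequence, size=500):
    """Yield (identifier, fragment) pairs for consecutive, non-overlapping segments."""
    for index, start in enumerate(range(0, len(sequence), size)):
        fragment = sequence[start:start + size]
        # Hypothetical identifier: strain|contig|fragment index|1-based start position
        identifier = f"{strain_id}|{contig_id}|{index}|{start + 1}"
        yield identifier, fragment
```

Because the identifier carries the contig name and fragment index, any fragment can be traced back to its position in the draft assembly.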
Figure 1.
in silico genome subtraction strategy. Numbers in brackets indicate the number of fragments or genetic elements in each category.
The S-ECC-specific fragments present in each S-ECC strain were identified by querying by blastn the collection of fragments from each S-ECC genome against the CF database. The fragments that did not find a match in the CF database (no-hit fragments) were considered specific to that S-ECC strain. Contiguous fragments (adjacent on a single contig) were joined together to reconstruct entire genetic elements unique to each S-ECC strain. A subsequent blastn all-versus-all search of no-hit fragments identified fragments shared among the S-ECC strains. The same approach was used to identify genetic elements specific to CF strains.
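Joining contiguous no-hit fragments into genetic elements can be sketched as below. The input format (pairs of contig identifier and fragment index) and the function name are assumptions for illustration; the reported length is approximate, since the last fragment of a contig may be shorter than the fragment size.

```python
from itertools import groupby


def join_contiguous(no_hit, size=500):
    """Merge adjacent no-hit fragments on the same contig into elements.

    no_hit: iterable of (contig_id, fragment_index) pairs.
    Returns a list of (contig_id, first_index, last_index, approx_length_bp).
    """
    elements = []
    ordered = sorted(set(no_hit))  # sort by contig, then by fragment index
    for contig, group in groupby(ordered, key=lambda frag: frag[0]):
        indices = [index for _, index in group]
        start = prev = indices[0]
        for index in indices[1:]:
            if index != prev + 1:  # gap: close the current element
                elements.append((contig, start, prev, (prev - start + 1) * size))
                start = index
            prev = index
        elements.append((contig, start, prev, (prev - start + 1) * size))
    return elements
```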
The genetic elements thus singled out were characterized through a blastx search against a non-redundant protein database available from the National Center for Biotechnology Information (NCBI), also included in our MGC tool. Further characterization of hypothetical proteins was attempted through the search of conserved protein domains with InterProScan (Quevillon et al., 2005). Blastx matches were assigned to the functional categories defined by Clusters of Orthologous Groups of proteins (COGs) (Tatusov et al., 2003; Tatusov et al., 1997). A 39 kb genomic island from genome 08 was also annotated with xbase (Chaudhuri and Pallen, 2006) on the basis of the reference genome sequences from strains UA159 and NN2025.
2.5 Confirmation of in silico genome subtraction results by PCR
The distribution of some of the genetic elements identified through our in silico genome subtraction as being specific to the S-ECC or CF groups was tested by PCR. Primers (Supplementary File 1) were designed with Primer3 (Rozen and Skaletsky, 2000) to amplify 220-700 bp regions from each genetic element. PCR amplification was carried out in a Mastercycler Pro thermal cycler (Eppendorf, Hamburg, Germany) in a 25 μl reaction containing 1× PCR Buffer, 1.5 mM MgCl2, 200 nM of each 2’-deoxynucleoside 5’-triphosphate, 1 μM of each of the forward and reverse primers, 1.25 U of Taq DNA polymerase (Invitrogen, Carlsbad, CA) and 25 ng of genomic DNA. PCR conditions were typically as follows: one initial denaturation at 94 °C for 3 min, then 30 cycles of denaturation at 94 °C for 30 sec, annealing at 55-57 °C for 30 sec and extension at 72 °C for 1 min, followed by a final extension at 72 °C for 5 min. PCR products were resolved by electrophoresis on 1.5% agarose gels in Tris-acetate-EDTA (TAE) buffer, and stained with SybrSafe (Invitrogen). Results were captured with an Alpha IS-1000 digital imaging system (Alpha Innotech Corp., San Leandro, CA).
2.6 Validation of the in silico genome subtraction method implemented in the MGC tool
The genomes of Neisseria meningitidis strains Z2491 (Parkhill et al., 2000), MC58 (Tettelin et al., 2000) and FAM18 (Bentley et al., 2007) isolated from cases of bacterial meningitis, and of strains α14, α153, α275 (Schoen et al., 2008) isolated from healthy carriers, have been previously described. The genome sequences of the 6 meningococcal strains were obtained from the GenBank database (accession numbers AL157959, AE002098, AM421808, AM889136, AM889137 and AM889138). Genetic elements specific to the disease strains were identified by in silico genome subtraction as described above, and the results were compared to those published by Schoen and coworkers (Schoen et al., 2008).
3. Results and Discussion
3.1 General characteristics of the S. mutans genomes
The draft genomes obtained for the 20 S. mutans strains in our dataset consisted of between 78 and 358 contigs (average n=166), with an N50 between 49104 and 134265 (average N50=90825.5) (Table 1). The draft genome sequences revealed similar genome sizes and G+C content to four S. mutans genome sequences available at NCBI. The average G+C content of the 10 S-ECC and 10 CF genomes was very similar (S-ECC= 36.92±0.09 %; CF= 36.94±0.12 %), while the mean CF genome size appeared to be marginally larger than the mean S-ECC genome size (S-ECC=2.11±0.06 Mb; CF=2.19±0.12 Mb, p=0.03). Our results indicate that we obtained draft genome sequences of similar quality and characteristics between the S-ECC and CF groups, and that any potential genetic differences between them are unlikely to be the result of artifacts of sequencing.
3.2 Estimation of genetic diversity between S. mutans genomes
The genetic diversity of our 20 S. mutans strains was estimated by calculating the pairwise similarity between genomes with Simrank (DeSantis et al., 2011), which computes the similarity between two sequences based on shared k-mers, thus acting in an unbiased, annotation-independent manner on both coding and non-coding sequences. Pairwise comparisons between draft genome sequences were computed with a k-mer length of 10. The complete genomes of S. mutans strains UA159, NN2025, GS-5 and LJ23 were included as a reference, and S. agalactiae strains NEM 316 and 2603V/R were used as the outgroup. The similarity value between the two S. agalactiae genomes was 93.9% (Supplementary File 2), within the range of genome similarity previously reported for strains of this species (85-95%; Tettelin et al., 2005). Pairwise similarity values for our 20 S. mutans draft genomes ranged from 79.5 to 90.9%. The mean similarity was 84.8±2.1% for our 20 draft genomes, very similar to the mean similarity of 84.4±3.1% for the 4 S. mutans complete reference genomes, which supports the quality of our draft genomes. Importantly, analyses based on dissimilarity values by hierarchical clustering (UPGMA) and by PCoA showed that the genomes do not cluster according to caries status (Figure 2 A and B, respectively). No significant association with caries status was found for genome similarity using non-Euclidean multivariate analysis of variance (p>0.05) or for PCo1 and PCo2. These results are in agreement with our previous comparison of 33 genomes (including the 20 genomes in this study) by chromosomal DNA fingerprinting (CDF; Argimon and Caufield, 2011).
Figure 2.
Cluster analysis of genomes based on pairwise comparisons of sequence dissimilarity. A) Dendrogram of 20 S. mutans genomes from different caries status (this study) and 4 reference genomes. Two S. agalactiae genomes were included as an outgroup. B) Principal Coordinate Analysis (PCoA) of 20 S. mutans genomes from different caries status (this study).
3.3 Choice of E-value for the blastn searches
The Expect value (E) is a BLAST parameter that describes the number of hits expected by chance when searching a database of a particular size (Camacho et al., 2009). The choice of an E-value threshold that determines a good match between the query sequence and sequences in the database will influence the number of fragments identified as being unique to a strain or group of strains (i.e., failing to match). For example, the more stringent (lower) the E-value is, the more fragments from an S-ECC strain will not find a good match in the CF database (composed of sequences from all 10 CF strains) and will be singled out as unique to that S-ECC strain.
We analyzed the influence of the E-value on our strategy by plotting the number of 500-bp query sequences from S-ECC strains that did not find a match in the CF database (number of non-matches) versus the E-value. Figure 3 shows that the number of non-matches remained fairly constant over a wide range of E-values (1e-8 to 1e-20), supporting the robustness of our method. We chose a mid-range E-value of 1e-12 to conduct our analysis, with the expectation that this stringent threshold would identify DNA fragments that were unique, and exclude fragments that shared substantial regions of similarity with the database.
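The dependence of the non-match count on the E-value threshold can be sketched in Python. The sketch assumes the blastn results are available in BLAST tabular format with the 12 standard columns (query identifier first, E-value in the eleventh column); whether the MGC tool uses this format internally is not stated here, and the function names are hypothetical.

```python
def best_evalues(tabular_lines):
    """Map each query identifier to its lowest E-value in BLAST tabular output."""
    best = {}
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, evalue = fields[0], float(fields[10])
        if query not in best or evalue < best[query]:
            best[query] = evalue
    return best


def no_hit_fragments(all_queries, best, threshold=1e-12):
    """Return the queries with no hit at or below the E-value threshold."""
    return [q for q in all_queries if best.get(q, float("inf")) > threshold]
```

Calling `no_hit_fragments` over a range of thresholds (for example 1e-8 to 1e-20) gives the counts that a plot such as Figure 3 summarizes; a lower threshold can only keep the same number of non-matches or increase it.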
Figure 3.
Effect of the blastn E-value on the number of fragments specific to each group obtained by the in silico genome subtraction method.
3.4 Identification of genetic elements unique to the S-ECC and CF groups
A total of 254 500-bp fragments were identified as being unique to the S-ECC strains, i.e., absent in all 10 CF strains (Figure 1). From these, 187 fragments were strain-specific, i.e., present in only one S-ECC strain, and 67 fragments were shared by at least 2 S-ECC strains. Conversely, 1069 500-bp fragments were identified as being unique to the CF strains, i.e., absent in all 10 S-ECC strains. Out of these 827 were strain-specific, and 242 were shared by at least 2 CF strains. Our method identified four times more unique fragments for the CF group than for the S-ECC group. This could be explained by the fact that the CF genomes in our collection appear to be larger than the S-ECC genomes.
We found that, in many cases, S-ECC-specific or CF-specific 500-bp fragments mapped to adjacent positions on the genome of a particular strain. Contiguous 500-bp fragments were joined together to form genetic elements that ranged from 1 to 39 kb in size, and contained from 1 to 20 ORFs. The 254 S-ECC-specific fragments grouped into 48 genetic elements, out of which only 7 were present in more than one S-ECC strain (Table 2 and Figure 1). Their frequency was rather low, with the most frequent element present in only 4 out of 10 S-ECC strains. The analysis with a fragment size of 200 bp yielded largely similar results, with the exception that three additional elements unique to the S-ECC group were identified, SECC-8 to 10 (Table 2). These three elements were all 200 bp in size and they matched genes of 375, 885 and 261 bp in length, respectively. This suggests that, even though a 500 bp fragment is appropriate for the in silico genome subtraction method applied to S. mutans, a 200 bp fragment might be more sensitive for the detection of smaller genes. The 1069 CF-specific 500-bp fragments grouped into 113 genetic elements, out of which only 14 were found in two or three CF strains (Table 3 and Figure 1). Once the contiguous fragments were grouped into genetic elements, it became apparent that several of the CF strains contained large genomic islands (GI) of 10 to 39 kb in size (Table 3 and Supplementary File 3), but that the elements found in the S-ECC strains were smaller than 10 kb. This likely explains the difference in size observed between the S-ECC and CF genomes.
Table 2.
Genetic elements unique to S-ECC genomes and shared by at least two of them.
Element ID | Size (kb) | No. of ORFs | Query G+C % | blastx ident. | blastx description | Comments | Genome 01 | Genome 02 | Genome 03 | Genome 04 | Genome 05 | Genome 13 | Genome 14 | Genome 15 | Genome 16 | Genome 17 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Fragment size 500 bp | | | | | | | | | | | | | | | | |
SECC-1 | 1 | 2 | 28.2 | 99% | conserved hypothetical protein | preprotein translocase | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
SECC-2 | 0.5 | 1 | 36.0 | 74% | IS3-Spn1, transposase | insertion seq. ISSmu1 IS3 family | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
SECC-3 | 1 | 1 | 29.3 | 83% | hypothetical protein | no InterProScan hits | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
SECC-4 | 0.5 | 1 | 34.4 | 52% | putative membrane protein | signal peptide and TM domains | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
SECC-5 | 0.5 | 1 | 40.8 | 93% | transposase | insertion seq. ISScr1 IS982 family | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
SECC-6 | 2 | 1 | 26.7 | 40% | conserved hypothetical protein | no InterProScan hits | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
SECC-7 | 3-7 | 1-4 | 34.8 | 78% | Type I / Type III restriction enzyme | restriction modification systems | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
Fragment size 200 bp | | | | | | | | | | | | | | | | |
SECC-8 | 0.2 | 1 | 32.5 | 100% | hypothetical protein | no conserved domains detected | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
SECC-9 | 0.2 | 1 | 34.5 | 100% | hypothetical protein | no conserved domains detected | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
SECC-10 | 0.2 | 1 | 40.5 | 74% | addiction module toxin RelE/StbE family | plasmid stabilization system protein | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
The size, number of ORFs and G+C content for each genetic element are indicated. The description and % identity are for the top blastx hit. The distribution of the genetic elements across the genomes is indicated by (1) presence or (0) absence. Elements SECC-1 to 7 were identified with a fragment size of 500 bp. Additional elements SECC-8 to 10 were detected when the genomes were fragmented into 200 bp segments.
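The presence/absence columns of Table 2 lend themselves to a simple check of whether any element is shared by every strain in a group. The Python sketch below copies the 0/1 values from Table 2 (genome order 01, 02, 03, 04, 05, 13, 14, 15, 16, 17); the variable and function names are illustrative only.

```python
# Presence (1) / absence (0) per element, copied from Table 2.
# Column order: genomes 01, 02, 03, 04, 05, 13, 14, 15, 16, 17.
PRESENCE = {
    "SECC-1":  [1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    "SECC-2":  [0, 1, 0, 0, 1, 1, 0, 1, 0, 0],
    "SECC-3":  [0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
    "SECC-4":  [0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
    "SECC-5":  [0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
    "SECC-6":  [0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
    "SECC-7":  [0, 0, 0, 1, 0, 0, 1, 0, 0, 1],
    "SECC-8":  [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
    "SECC-9":  [0, 0, 0, 0, 0, 0, 1, 0, 0, 1],
    "SECC-10": [0, 0, 0, 0, 0, 0, 1, 0, 0, 1],
}


def frequencies(presence):
    """Number of genomes carrying each element."""
    return {element: sum(row) for element, row in presence.items()}


def core_elements(presence):
    """Elements present in every genome of the group."""
    return [element for element, row in presence.items() if all(row)]
```

With these values, the most widespread element (SECC-2) is found in 4 of the 10 S-ECC genomes, and no element is present in all 10.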
Table 3.
Genetic elements unique to CF genomes and shared by at least two of them.
Element ID | Size (kb) | No. of ORFs | Query G+C % | blastx ident. | blastx description | Comments | Genome 06 | Genome 07 | Genome 08 | Genome 09 | Genome 10 | Genome 18 | Genome 19 | Genome 20 | Genome 21 | Genome 22 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CF-1 | 0.5 | 1 | 29 | 83% | hypothetical protein | 1 transmembrane region predicted | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
CF-2 | 0.5 | 1 | 39.4 | 80% | hypothetical protein | signal peptide and 1 transmemb. region | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
CF-3 | 0.5 | 1 | 27.6 | 99% | YjbE family integral membrane protein | signal peptide and 4 transmemb. region | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
CF-4 | 0.5 | 1 | 42 | 52% | unnamed protein product | CHAP domain predicted (peptidase) | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
CF-5 | 0.5 | 1 | 37.4 | 89% | conserved hypothetical protein | EscC protein fam. Function unknown | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
CF-6 | 0.5 | 1 | 35.8 | 74% | integrase-like protein | Phage integrase family | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
CF-7 | 1 | 2 | 32.8 | 47% | conserved hypothetical proteins | putative bacterial toxin system | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
CF-8 | 1.5 | 1 | 32.6 | 77% | csn1 family CRISPR-associated protein | maintenance of CRISPR repeats | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
CF-9 | 1.5 | 2 | 40.9 | 78% | insertion sequence transposase | insertion seq. IS3 family group IS150 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
CF-10 | 2.5 | 3 | 30.2 | 38% | hypothetical proteins, prophage ps3 protein | SEC-C motif (DNA binding) | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
CF-11 | 4.0-5.0 | 5-6 | 36.5 | 52% | replication initiation, FtsK/SpoIIIE family prot | Plasmid replication | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
CF-12 | 6.5 | 6 | 34.4 | 51% | ABC transporter, membrane protease | GI with transporters and proteases | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
CF-13 | 12.5 | 12 | 45.4 | 80% | ABC transporter, hypothetical proteins | GI with transporters | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
CF-14 | 16 | 17 | 37.7 | 78% | FtsK/SpoIIIE fam, DNA methylase, glycosylase | GI with DNA modification enzymes | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
The size, number of ORFs and G+C content for each genetic element are indicated. The description and % identity are for the top blastx hit. The distribution of the genetic elements across the genomes is indicated by (1) presence or (0) absence.
Importantly, we found no S-ECC-specific fragment present in all 10 S-ECC strains and no CF-specific fragment present in all 10 CF strains. Hence, our in silico genome subtraction method did not identify any genetic elements that define the S-ECC or CF groups or serve as biomarkers for dental caries. This was confirmed by an independent combinatorial analysis of the genomic fragments shared between 3 and 18 genomes (Supplementary File 5). Our finding is also in agreement with previous observations from our group. We previously employed DNA suppression subtractive hybridization (Winstanley, 2002) to identify a set of genetic biomarkers that could classify S. mutans strains according to their caries status with 92% accuracy (Saxena et al., 2008). However, when these biomarkers were re-tested on a different set of isolates they failed to properly classify them (unpublished results). We also previously investigated the distribution of loci commonly linked in the literature to S. mutans virulence across a collection of S-ECC and CF strains, but we found no association with caries status (Argimon and Caufield, 2011). The results from the in silico genome subtraction are also in agreement with a comparative genomic hybridization study of 11 S. mutans isolates from children with high caries and low caries to reference strain UA159 (Zhang et al., 2009), as well as with studies on both group A streptococci (GAS; McMillan et al., 2006) and group B S. agalactiae (Hauge et al., 1996; Smith et al., 2007), which found no significant association between the presence of virulence genes and disease.
A limitation of our in silico subtraction method applied to the study of bacterial pathogenesis is that differences in virulence between strains could stem from differences in gene expression, or allelic differences within genes (SNPs), not from the presence or absence of particular loci. Other factors affecting virulence could be differences in host factors, and interactions with other members of the microbiota, all likely to play a role in caries pathogenesis.
3.5 Confirmation of in silico genome subtraction results by PCR
To validate the strain-specific fragments identified by our method, the distribution of a panel of 8 S-ECC and CF genetic elements was assessed by PCR. The results agreed with the distribution predicted by the in silico genome subtraction (Tables 2 and 3 and Supplementary File 4), indicating that our method was successful at identifying elements that were present only in either the S-ECC or the CF groups, and subsequently identifying these elements in other strains within the group.
Interestingly, we identified SECC-2 (Table 2), a genetic element absent in CF strains and present in four S-ECC strains, as ISSmu1, a known insertion sequence in S. mutans. A previous study in our group showed that this insertion sequence seemed to be over-represented in the S-ECC strains (Argimon and Caufield, 2011).
3.6 Characterization of genetic elements unique to the S-ECC and CF groups
The 48 S-ECC and 113 CF-specific genetic elements were characterized by a blastx search against the complete NCBI non-redundant protein database (nr), as implemented in the MGC tool. The blast matches were classified according to the functional categories defined by Clusters of Orthologous Groups of proteins (COGs) (Tatusov et al., 2003; Tatusov et al., 1997). Both S-ECC and CF-specific genetic elements matched mostly hypothetical proteins of unknown function (Figure 4), but also several proteins involved in biosynthesis of secondary metabolites, DNA metabolism and defense mechanisms.
Figure 4.
Comparison of COG functional categories between the S-ECC and CF specific genetic elements and the completely sequenced genomes of strains UA159 and NN2025. Each concentric ring represents, from outside to inside: NN2025, UA159, CF-specific elements, S-ECC-specific elements. The segments within each ring represent the relative contribution of a functional category as a percentage of the total COGs. For simplicity, Function unknown and General function prediction only were combined into one category under the name Function unknown.
An analysis of the relative contributions of each functional category to the CF and S-ECC pooled genomes (Figure 4) shows that, in comparison to the full genomes of strains UA159 and NN2025, they are enriched in proteins of unknown function, as well as in proteins involved in DNA replication, recombination and repair; secondary metabolites biosynthesis, transport and catabolism; and defense mechanisms. The genetic elements specific to the S-ECC group were rich in transposases and restriction modification systems, classified in the replication, recombination and repair category and in the defense mechanisms category, respectively. These were also present in the genetic elements specific to the CF group, but in this case the secondary metabolites category was more abundant, owing to the presence in some of the strains of large genomic islands for the biosynthesis of polyketides and non-ribosomal peptides.
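The relative contribution of each functional category, expressed as a percentage of the total COG assignments, can be illustrated with a minimal Python sketch. The function name, the merging table, and the example assignments below are hypothetical and are not taken from the MGC tool or from our data; the sketch only shows the arithmetic, including the merging of General function prediction only into Function unknown described for Figure 4.

```python
from collections import Counter

# Assumption: the two categories are merged as described in the Figure 4 legend.
MERGED_CATEGORIES = {"General function prediction only": "Function unknown"}


def category_percentages(cog_assignments):
    """Return each category's share of all assignments, in percent."""
    merged = [MERGED_CATEGORIES.get(c, c) for c in cog_assignments]
    counts = Counter(merged)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: 100.0 * n / total for category, n in counts.items()}


# Hypothetical COG assignments for a pooled set of group-specific elements.
example_assignments = [
    "Function unknown",
    "General function prediction only",
    "Defense mechanisms",
    "Replication, recombination and repair",
]
shares = category_percentages(example_assignments)
```

Computing the same percentages for a complete genome would give the baseline against which enrichment in the group-specific elements is judged.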
The G+C content of most of these genetic elements differed from that of the S. mutans genomes, suggesting that they were likely acquired by horizontal gene transfer from other species (Tables 2 and 3).
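As an illustration of this comparison, the following Python sketch computes the G+C content of a sequence and its deviation from a genome-wide value. The function names are hypothetical, the genome-wide G+C value is supplied by the user rather than assumed, and no particular deviation threshold for inferring horizontal transfer is implied.

```python
def gc_content(sequence):
    """Percent G+C among the unambiguous bases (A, C, G, T) of a DNA sequence."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Sequences with no unambiguous bases (e.g. all N) return 0.0 by convention.
    return 100.0 * gc / total if total else 0.0


def gc_deviation(element_sequence, genome_gc_percent):
    """Difference, in percentage points, between an element and its host genome."""
    return gc_content(element_sequence) - genome_gc_percent
```

A large positive or negative deviation is consistent with, but does not prove, acquisition from a species with a different base composition.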
3.7 Validation of the in silico genome subtraction method implemented in the MGC tool
Our in silico genome subtraction method did not identify any S-ECC-specific fragment present in all 10 S-ECC strains or any CF-specific fragment present in all 10 CF strains. To exclude the possibility that this is the result of a shortcoming of our method, we validated it using a previously published dataset of six Neisseria meningitidis genomes, three from strains isolated from bacterial meningitis and three from healthy carriers (Schoen et al., 2008). Our in silico genomic subtraction method identified virtually the same meningococcal core pathogenome as Schoen and coworkers, composed of 11 genes: 8 from the prophage Nf1 (NMA1792-NMA1799 according to the annotation of strain Z2491) plus 3 from the fha locus (NMA0692-NMA0694). We also identified an additional prophage Nf1 gene (NMA1800) by in silico genome subtraction. Our findings confirm that the in silico genome subtraction method as implemented in the MGC tool is able to detect genetic elements specific to a group of strains when they are indeed present.
Our in silico genome subtraction method implemented in the MGC tool differs from previously published programs for in silico subtractive hybridization in several ways. FindTarget (Chetouani et al., 2001) only takes into account protein-coding regions. Our method eliminates the need for feature prediction and annotation, thus simplifying the analysis and including both coding and non-coding regions. The web-based mGenomeSubtractor (Shao et al., 2010) implements two modes to search for common and strain-specific genomic regions between reference and query genomes: one is CDS-based and the other is DNA fragment-based. However, only one genome can be specified as the reference, and it requires a CDS annotation file as part of the input. It is also limited to 10 query genomes, whereas our in silico genome subtraction method implemented in the MGC tool can compare as many genomes as the computing power allows. The NRF module of Panseq (Laing et al., 2011) also uses a DNA fragment-based algorithm and is available both on the web and as a standalone version. The standalone version of the NRF module is capable of comparing a large number of genomes, but installing and running it is not trivial for those with basic or no knowledge of Unix systems. Our in silico subtraction method has been written into the MGC tool (Chen et al., 2013), a user-friendly, standalone Java executable that does not require knowledge of Unix systems. In addition, the output of the NRF module consists only of a fasta-formatted file with the sequences of novel regions. The distribution of these sequences across the strains can be obtained only after running the pan-genome analysis module and parsing a binary table containing information for both the core and accessory sequences across all strains. Panseq (Laing et al., 2010) uses the MUMmer alignment program (Kurtz et al., 2004) in an iterative process that adds each analyzed sequence to the database before analyzing the following sequence.
Our method, on the other hand, performs a series of between-group blastn searches of each query against a database composed of all the reference genomes, followed by within-group blastn searches to identify blocks common to each group (Figure 1). This streamlines the analysis and generates an output that simplifies the identification of sequences that are specific to a group of genomes of interest.
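The subtraction step of the between-group search can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the MGC implementation: the function name is hypothetical, the hits are assumed to be query start and end coordinates already parsed from blastn tabular output and already filtered by whatever identity or e-value criteria are in use, and the default minimum length of 200 bp is taken from the smallest elements reported here rather than from a documented tool setting.

```python
def uncovered_regions(query_length, hits, min_length=200):
    """Return (start, end) regions of the query not covered by any hit.

    Coordinates are 1-based and inclusive. `hits` is an iterable of
    (query_start, query_end) pairs; reversed pairs are normalized.
    Regions shorter than `min_length` are discarded.
    """
    regions = []
    next_uncovered = 1  # first query position not yet covered by a hit
    for start, end in sorted((min(s, e), max(s, e)) for s, e in hits):
        if start > next_uncovered and start - next_uncovered >= min_length:
            regions.append((next_uncovered, start - 1))
        next_uncovered = max(next_uncovered, end + 1)
    # Trailing region after the last hit, if long enough.
    if query_length - next_uncovered + 1 >= min_length:
        regions.append((next_uncovered, query_length))
    return regions


# Hypothetical 1000 bp query contig with three hits against the other group.
example_hits = [(1, 300), (250, 400), (700, 650)]
specific = uncovered_regions(1000, example_hits)
```

The regions returned for a query would then serve as queries for the within-group searches, which establish how many strains of the same group carry each candidate element.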
4. Conclusions
The methods proposed in this study for in silico genome subtraction and k-mer based estimation of genome similarity can be applied to the comparison of any groups of bacterial strains, and are independent of gene prediction, annotation, and synteny, thus circumventing the complex pipelines traditionally implemented in comparative genomics. The in silico genome subtraction method implemented in the MGC tool is particularly useful for the rapid identification of loci common to a group of strains, such as virulence or niche-specific loci. With this method it is possible to identify shared genetic elements ranging from fragments as small as 200 bp to large genomic islands.
We applied the in silico genome subtraction method to elucidate whether the virulence of S. mutans strains is associated with the presence/absence of regions of accessory DNA. The 20 S. mutans strains included in this study were genetically heterogeneous, as shown by our cluster analysis based on genome sequence dissimilarity, and previously by CDF. Our results suggest that no particular genetic element or group of genetic elements is associated with caries status. It is unlikely that a larger sample size would reveal such genetic elements with predictive power for disease status. Our results imply that S. mutans strains from markedly different oral environments, such as those from S-ECC or CF children, do not possess a common genetic repertoire specific to those environments.
Supplementary Material
Highlights.
- Development of an in silico genome subtraction method and the MGC tool software.
- Application of Simrank to the estimation of genome sequence similarity.
- Comparative genomics of whole genome sequences of Streptococcus mutans.
- No S. mutans loci found associated with severe early childhood caries.
Acknowledgements
This work was supported by grant R01 DE13937 from the National Institute of Dental and Craniofacial Research, National Institutes of Health, and by grant 1UL1RR029893 from the National Center for Research Resources, National Institutes of Health.
Abbreviations
- S-ECC: Severe early childhood caries
- CF: Caries-free
- ORFs: Open-reading frames
- CDF: Chromosomal DNA fingerprinting
- PCoA: Principal coordinate analysis
- UPGMA: Unweighted pair group method with arithmetic mean
References
- AAPD. Definition of early childhood caries (ECC). In: AAPD, editor. Reference manual. AAPD; 2004. p. 13.
- Ajdic D, McShan WM, McLaughlin RE, Savic G, Chang J, Carson MB, Primeaux C, Tian R, Kenton S, Jia H, Lin S, Qian Y, Li S, Zhu H, Najar F, Lai H, White J, Roe BA, Ferretti JJ. Genome sequence of Streptococcus mutans UA159, a cariogenic dental pathogen. Proc Natl Acad Sci U S A. 2002;99:14434–14439. doi: 10.1073/pnas.172501299.
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389.
- Argimon S, Caufield PW. Distribution of putative virulence genes in Streptococcus mutans strains does not correlate with caries experience. J Clin Microbiol. 2011;49:984–992. doi: 10.1128/JCM.01993-10.
- Becker MR, Paster BJ, Leys EJ, Moeschberger ML, Kenyon SG, Galvin JL, Boches SK, Dewhirst FE, Griffen AL. Molecular analysis of bacterial species associated with childhood caries. J Clin Microbiol. 2002;40:1001–1009. doi: 10.1128/JCM.40.3.1001-1009.2002.
- Bentley SD, Vernikos GS, Snyder LA, Churcher C, Arrowsmith C, Chillingworth T, Cronin A, Davis PH, Holroyd NE, Jagels K, Maddison M, Moule S, Rabbinowitsch E, Sharp S, Unwin L, Whitehead S, Quail MA, Achtman M, Barrell B, Saunders NJ, Parkhill J. Meningococcal genetic variation mechanisms viewed through comparative analysis of serogroup C strain FAM18. PLoS Genet. 2007;3:e23. doi: 10.1371/journal.pgen.0030023.
- Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10:421. doi: 10.1186/1471-2105-10-421.
- Carver TJ, Rutherford KM, Berriman M, Rajandream MA, Barrell BG, Parkhill J. ACT: the Artemis Comparison Tool. Bioinformatics. 2005;21:3422–3423. doi: 10.1093/bioinformatics/bti553.
- Caufield PW, Li Y, Bromage TG. Hypoplasia-associated severe early childhood caries--a proposed definition. J Dent Res. 2012;91:544–550. doi: 10.1177/0022034512444929.
- Caufield PW, Walker TM. Genetic diversity within Streptococcus mutans evident from chromosomal DNA restriction fragment polymorphisms. J Clin Microbiol. 1989;27:274–278. doi: 10.1128/jcm.27.2.274-278.1989.
- Chaudhuri RR, Pallen MJ. xBASE, a collection of online databases for bacterial comparative genomics. Nucleic Acids Res. 2006;34:D335–337. doi: 10.1093/nar/gkj140.
- Chen H, Argimon S, Brown S. A JAVA tool for Microbial Genome Comparisons. 2013. Submitted for publication.
- Chetouani F, Glaser P, Kunst F. FindTarget: software for subtractive genome analysis. Microbiology. 2001;147:2643–2649. doi: 10.1099/00221287-147-10-2643.
- Corby PM, Lyons-Weiler J, Bretz WA, Hart TC, Aas JA, Boumenna T, Goss J, Corby AL, Junior HM, Weyant RJ, Paster BJ. Microbial risk indicators of early childhood caries. J Clin Microbiol. 2005;43:5753–5759. doi: 10.1128/JCM.43.11.5753-5759.2005.
- Darling AC, Mau B, Blattner FR, Perna NT. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res. 2004;14:1394–1403. doi: 10.1101/gr.2289704.
- DeSantis TZ, Keller K, Karaoz U, Alekseyenko AV, Singh NN, Brodie EL, Pei Z, Andersen GL, Larsen N. Simrank: Rapid and sensitive general-purpose k-mer search tool. BMC Ecol. 2011;11:11. doi: 10.1186/1472-6785-11-11.
- Do T, Gilbert SC, Clark D, Ali F, Fatturi Parolo CC, Maltz M, Russell RR, Holbrook P, Wade WG, Beighton D. Generation of diversity in Streptococcus mutans genes demonstrated by MLST. PLoS One. 2010;5:e9073. doi: 10.1371/journal.pone.0009073.
- Dray S, Dufour AB. The ade4 package: implementing the duality diagram for ecologists. J Stat Softw. 2007;22:1–20.
- Falkow S. Molecular Koch's postulates applied to microbial pathogenicity. Rev Infect Dis. 1988;10(Suppl 2):S274–276. doi: 10.1093/cid/10.supplement_2.s274.
- Fitzgerald DB, Fitzgerald RJ, Adams BO, Morhart RE. Prevalence, distribution of serotypes, and cariogenic potential in hamsters of mutans streptococci from elderly individuals. Infect Immun. 1983;41:691–697. doi: 10.1128/iai.41.2.691-697.1983.
- Ge Y, Caufield PW, Fisch GS, Li Y. Streptococcus mutans and Streptococcus sanguinis colonization correlated with caries experience in children. Caries Res. 2008;42:444–448. doi: 10.1159/000159608.
- Hauge M, Jespersgaard C, Poulsen K, Kilian M. Population structure of Streptococcus agalactiae reveals an association between specific evolutionary lineages and putative virulence factors but not disease. Infect Immun. 1996;64:919–925. doi: 10.1128/iai.64.3.919-925.1996.
- Hu B, Xie G, Lo CC, Starkenburg SR, Chain PS. Pathogen comparative genomics in the next-generation sequencing era: genome alignments, pangenomics and metagenomics. Brief Funct Genomics. 2011;10:322–333. doi: 10.1093/bfgp/elr042.
- Kanasi E, Dewhirst FE, Chalmers NI, Kent R Jr, Moore A, Hughes CV, Pradhan N, Loo CY, Tanner AC. Clonal analysis of the microbiota of severe early childhood caries. Caries Res. 2010;44:485–497. doi: 10.1159/000320158.
- Kohler B, Krasse B. Human strains of mutans streptococci show different cariogenic potential in the hamster model. Oral Microbiol Immunol. 1990;5:177–180. doi: 10.1111/j.1399-302x.1990.tb00642.x.
- Kulkarni GV, Chan KH, Sandham HJ. An investigation into the use of restriction endonuclease analysis for the study of transmission of mutans streptococci. J Dent Res. 1989;68:1155–1161. doi: 10.1177/00220345890680070401.
- Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL. Versatile and open software for comparing large genomes. Genome Biol. 2004;5:R12. doi: 10.1186/gb-2004-5-2-r12.
- Laing C, Buchanan C, Taboada EN, Zhang Y, Kropinski A, Villegas A, Thomas JE, Gannon VP. Pan-genome sequence analysis using Panseq: an online tool for the rapid analysis of core and accessory genomic regions. BMC Bioinformatics. 2010;11:461. doi: 10.1186/1471-2105-11-461.
- Laing C, Villegas A, Taboada EN, Kropinski A, Thomas JE, Gannon VP. Identification of Salmonella enterica species- and subgroup-specific genomic regions using Panseq 2.0. Infect Genet Evol. 2011;11:2151–2161. doi: 10.1016/j.meegid.2011.09.021.
- Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann N, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs RA, Muzny DM, Scherer SE, Bouck JB, Sodergren EJ, Worley KC, Rives CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS, Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M, Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y, Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Smith DR, Doucette-Stamm L, Rubenfield M, Weinstock K, Lee HM, Dubois J, Rosenthal A, Platzer M, Nyakatura G, Taudien S, Rump A, Yang H, Yu J, Wang J, Huang G, Gu J, Hood L, Rowen L, Madan A, Qin S, Davis RW, Federspiel NA, Abola AP, Proctor MJ, Myers RM, Schmutz J, Dickson M, Grimwood J, Cox DR, Olson MV, Kaul R, Shimizu N, Kawasaki K, Minoshima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F, Pan H, Ramser J, Lehrach H, Reinhardt R, McCombie WR, de la Bastide M, Dedhia N, Blocker H, Hornischer K, Nordsiek G, Agarwala R, Aravind L, Bailey JA, Bateman A, Batzoglou S, Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC, Church D, Clamp M, Copley RR, Doerks T, Eddy SR, Eichler EE, Furey TS, Galagan J, Gilbert JG, Harmon C, Hayashizaki Y, Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson LS, Jones TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P, Koonin EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght A, Mikkelsen T, Moran JV, Mulder N, Pollara VJ, Ponting CP, Schuler G, Schultz J, Slater G, Smit AF, Stupka E, Szustakowski J, Thierry-Mieg D, Thierry-Mieg J, Wagner L, Wallis J, Wheeler R, Williams A, Wolf YI, Wolfe KH, Yang SP, Yeh RF, Collins F, Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA, Patrinos A, Morgan MJ, de Jong P, Catanese JJ, Osoegawa K, Shizuya H, Choi S, Chen YJ. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921. doi: 10.1038/35057062.
- Loesche WJ. Role of Streptococcus mutans in human dental decay. Microbiol Rev. 1986;50:353–380. doi: 10.1128/mr.50.4.353-380.1986.
- Loesche WJ, Rowan J, Straffon LH, Loos PJ. Association of Streptococcus mutants with human dental decay. Infect Immun. 1975;11:1252–1260. doi: 10.1128/iai.11.6.1252-1260.1975.
- Marchant S, Brailsford SR, Twomey AC, Roberts GJ, Beighton D. The predominant microflora of nursing caries lesions. Caries Res. 2001;35:397–406. doi: 10.1159/000047482.
- Maruyama F, Kobata M, Kurokawa K, Nishida K, Sakurai A, Nakano K, Nomura R, Kawabata S, Ooshima T, Nakai K, Hattori M, Hamada S, Nakagawa I. Comparative genomic analyses of Streptococcus mutans provide insights into chromosomal shuffling and species-specific content. BMC Genomics. 2009;10:358. doi: 10.1186/1471-2164-10-358.
- McMillan DJ, Beiko RG, Geffers R, Buer J, Schouls LM, Vlaminckx BJ, Wannet WJ, Sriprakash KS, Chhatwal GS. Genes for the majority of group a streptococcal virulence factors and extracellular surface proteins do not confer an increased propensity to cause invasive disease. Clin Infect Dis. 2006;43:884–891. doi: 10.1086/507537.
- Milnes AR, Bowden GH. The microflora associated with developing lesions of nursing caries. Caries Res. 1985;19:289–297. doi: 10.1159/000260858.
- Nakano K, Lapirattanakul J, Nomura R, Nemoto H, Alaluusua S, Grönroos L, Vaara M, Hamada S, Ooshima T, Nakagawa I. Streptococcus mutans clonal variation revealed by multilocus sequence typing. J Clin Microbiol. 2007;45:2616–2625. doi: 10.1128/JCM.02343-06.
- Parkhill J, Achtman M, James KD, Bentley SD, Churcher C, Klee SR, Morelli G, Basham D, Brown D, Chillingworth T, Davies RM, Davis P, Devlin K, Feltwell T, Hamlin N, Holroyd S, Jagels K, Leather S, Moule S, Mungall K, Quail MA, Rajandream MA, Rutherford KM, Simmonds M, Skelton J, Whitehead S, Spratt BG, Barrell BG. Complete DNA sequence of a serogroup A strain of Neisseria meningitidis Z2491. Nature. 2000;404:502–506. doi: 10.1038/35006655.
- Quevillon E, Silventoinen V, Pillai S, Harte N, Mulder N, Apweiler R, Lopez R. InterProScan: protein domains identifier. Nucleic Acids Res. 2005;33:W116–120. doi: 10.1093/nar/gki442.
- Rissman AI, Mau B, Biehl BS, Darling AE, Glasner JD, Perna NT. Reordering contigs of draft genomes using the Mauve aligner. Bioinformatics. 2009;25:2071–2073. doi: 10.1093/bioinformatics/btp356.
- Roberts MW. Dental health of children: where we are today and remaining challenges. J Clin Pediatr Dent. 2008;32:231–234. doi: 10.17796/jcpd.32.3.d5180888m8gmm282.
- Rozen S, Skaletsky H. Primer3 on the WWW for general users and for biologist programmers. Methods Mol Biol. 2000;132:365–386. doi: 10.1385/1-59259-192-2:365.
- Saxena D, Caufield PW, Li Y, Brown S, Song J, Norman R. Genetic classification of severe early childhood caries by use of subtracted DNA fragments from Streptococcus mutans. J Clin Microbiol. 2008;46:2868–2873. doi: 10.1128/JCM.01000-08.
- Schoen C, Blom J, Claus H, Schramm-Gluck A, Brandt P, Muller T, Goesmann A, Joseph B, Konietzny S, Kurzai O, Schmitt C, Friedrich T, Linke B, Vogel U, Frosch M. Whole-genome comparison of disease and carriage strains provides insights into virulence evolution in Neisseria meningitidis. Proc Natl Acad Sci U S A. 2008;105:3473–3478. doi: 10.1073/pnas.0800151105.
- Shao Y, He X, Harrison EM, Tai C, Ou HY, Rajakumar K, Deng Z. mGenomeSubtractor: a web-based tool for parallel in silico subtractive hybridization analysis of multiple bacterial genomes. Nucleic Acids Res. 2010;38:W194–200. doi: 10.1093/nar/gkq326.
- Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I. ABySS: a parallel assembler for short read sequence data. Genome Res. 2009;19:1117–1123. doi: 10.1101/gr.089532.108.
- Smith TC, Roehl SA, Pillai P, Li S, Marrs CF, Foxman B. Distribution of novel and previously investigated virulence genes in colonizing and invasive isolates of Streptococcus agalactiae. Epidemiol Infect. 2007;135:1046–1054. doi: 10.1017/S0950268806007515.
- Tanner AC, Mathney JM, Kent RL, Chalmers NI, Hughes CV, Loo CY, Pradhan N, Kanasi E, Hwang J, Dahlan MA, Papadopolou E, Dewhirst FE. Cultivable anaerobic microbiota of severe early childhood caries. J Clin Microbiol. 2011;49:1464–1474. doi: 10.1128/JCM.02427-10.
- Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA. The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 2003;4:41. doi: 10.1186/1471-2105-4-41.
- Tatusov RL, Koonin EV, Lipman DJ. A genomic perspective on protein families. Science. 1997;278:631–637. doi: 10.1126/science.278.5338.631.
- R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; Vienna, Austria: 2012.
- Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL, Angiuoli SV, Crabtree J, Jones AL, Durkin AS, Deboy RT, Davidsen TM, Mora M, Scarselli M, Margarit y Ros I, Peterson JD, Hauser CR, Sundaram JP, Nelson WC, Madupu R, Brinkac LM, Dodson RJ, Rosovitz MJ, Sullivan SA, Daugherty SC, Haft DH, Selengut J, Gwinn ML, Zhou L, Zafar N, Khouri H, Radune D, Dimitrov G, Watkins K, O'Connor KJ, Smith S, Utterback TR, White O, Rubens CE, Grandi G, Madoff LC, Kasper DL, Telford JL, Wessels MR, Rappuoli R, Fraser CM. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome”. Proc Natl Acad Sci U S A. 2005;102:13950–13955. doi: 10.1073/pnas.0506758102.
- Tettelin H, Saunders NJ, Heidelberg J, Jeffries AC, Nelson KE, Eisen JA, Ketchum KA, Hood DW, Peden JF, Dodson RJ, Nelson WC, Gwinn ML, DeBoy R, Peterson JD, Hickey EK, Haft DH, Salzberg SL, White O, Fleischmann RD, Dougherty BA, Mason T, Ciecko A, Parksey DS, Blair E, Cittone H, Clark EB, Cotton MD, Utterback TR, Khouri H, Qin H, Vamathevan J, Gill J, Scarlato V, Masignani V, Pizza M, Grandi G, Sun L, Smith HO, Fraser CM, Moxon ER, Rappuoli R, Venter JC. Complete genome sequence of Neisseria meningitidis serogroup B strain MC58. Science. 2000;287:1809–1815. doi: 10.1126/science.287.5459.1809.
- Thenisch NL, Bachmann LM, Imfeld T, Leisebach Minder T, Steurer J. Are mutans streptococci detected in preschool children a reliable predictive factor for dental caries risk? A systematic review. Caries Res. 2006;40:366–374. doi: 10.1159/000094280.
- van Houte J, Gibbs G, Butera C. Oral flora of children with “nursing bottle caries”. J Dent Res. 1982;61:382–385. doi: 10.1177/00220345820610020201.
- Winstanley C. Spot the difference: applications of subtractive hybridisation to the study of bacterial pathogens. J Med Microbiol. 2002;51:459–467. doi: 10.1099/0022-1317-51-6-459.
- Zhang L, Foxman B, Drake DR, Srinivasan U, Henderson J, Olson B, Marrs CF, Warren JJ, Marazita ML. Comparative whole-genome analysis of Streptococcus mutans isolates within and among individuals of different caries status. Oral Microbiol Immunol. 2009;24:197–203. doi: 10.1111/j.1399-302X.2008.00495.x.