Abstract
The difficulty associated with the cultivation of most microorganisms and the complexity of natural microbial assemblages, such as marine plankton or human microbiome, hinder genome reconstruction of representative taxa using cultivation or metagenomic approaches. Here we used an alternative, single cell sequencing approach to obtain high-quality genome assemblies of two uncultured, numerically significant marine microorganisms. We employed fluorescence-activated cell sorting and multiple displacement amplification to obtain hundreds of micrograms of genomic DNA from individual, uncultured cells of two marine flavobacteria from the Gulf of Maine that were phylogenetically distant from existing cultured strains. Shotgun sequencing and genome finishing yielded 1.9 Mbp in 17 contigs and 1.5 Mbp in 21 contigs for the two flavobacteria, with estimated genome recoveries of about 91% and 78%, respectively. Only 0.24% of the assembling sequences were contaminants and were removed from further analysis using rigorous quality control. In contrast to all cultured strains of marine flavobacteria, the two single cell genomes were excellent Global Ocean Sampling (GOS) metagenome fragment recruiters, demonstrating their numerical significance in the ocean. The geographic distribution of GOS recruits along the Northwest Atlantic coast coincided with ocean surface currents. Metabolic reconstruction indicated diverse potential energy sources, including biopolymer degradation, proteorhodopsin photometabolism, and hydrogen oxidation. Compared to cultured relatives, the two uncultured flavobacteria have small genome sizes, few non-coding nucleotides, and few paralogous genes, suggesting adaptations to narrow ecological niches. These features may have contributed to the abundance of the two taxa in specific regions of the ocean, and may have hindered their cultivation. 
We demonstrate the power of single cell DNA sequencing to generate reference genomes of uncultured taxa from a complex microbial community of marine bacterioplankton. A combination of single cell genomics and metagenomics enabled us to analyze the genome content, metabolic adaptations, and biogeography of these taxa.
Introduction
The metabolism of bacteria and archaea drives most of the biogeochemical cycles on Earth [1], has a tremendous effect on human health [2], and constitutes a largely untapped source of novel natural products [3]. Recent advances in metagenomics revealed enormous diversity of previously unknown, uncultured microorganisms that predominate in the ocean, soil, deep subsurface, human body, and other environments [2], [4], [5], [6]. However, the recalcitrance to cultivation of the vast majority of environmental prokaryotes makes whole genome studies very challenging, if not impossible. Metagenomic sequencing of microbial communities enabled genome reconstruction of only the most abundant members [7], [8], [9]. While novel isolation approaches resulted in significant progress [10], [11], [12], they remain unsuited for high-throughput recovery of representative microbial taxa from their environment. The paucity of suitable reference genomes is a major obstacle in the interpretation of metagenomic data. For example, the first leg of the Global Ocean Sampling (GOS) expedition produced 6.3 Gbp of shotgun DNA sequence data from surface ocean microbial communities, but only a small fraction of the reads were closely related to known genomes, while no novel genomes were assembled [6]. These limitations of current methods in microbiology are illustrated by the difficulty in determining the predominant carriers of proteorhodopsins, which are abundant in marine metagenomic libraries and likely provide a significant source of energy to the ocean food web [6], [13], [14]. Thus, novel research tools are necessary to complement cultivation and metagenomics-based studies for the reconstruction of genomes, metabolic pathways, ecological niches, and evolutionary histories of microorganisms that are representative of complex environments.
To overcome current methodological limitations, we developed robust protocols for genomic sequencing from individual microbial cells. We used these novel tools to reconstruct genomes of two uncultured, proteorhodopsin-containing marine flavobacteria, MS024-2A and MS024-3C, which were isolated from the Gulf of Maine as previously described [15]. The 16S rRNA sequences of these two cells are distant from cultured strains, but closely related to several community PCR clones from diverse marine and Antarctic locations (Fig. S1). We demonstrate that, in contrast to their cultured relatives, these cells represent genetic material from numerically significant microbial taxa, which possess unique adaptations to the marine environment.
Results and Discussion
Single cell genome reconstruction
Shotgun sequencing and genome finishing resulted in 1.9 Mbp in 17 contigs and 1.5 Mbp in 21 contigs for the single amplified genomes (SAGs) MS024-2A and MS024-3C respectively, with contig length ranging 3–684 Kbp (Table 1). Based on the analysis of conserved single copy genes (CSCGs), these major contigs recovered about 91% and 78% of the two genomes (Figs. S2, S3, S4A, see Materials and Methods for more details). The uneven distribution of CSCGs on the genomes (Fig. S3) may introduce biases in this estimate. However, even considering such biases, complete genome sizes of MS024-2A and MS024-3C are likely to be within the relatively narrow ranges of 2.0–2.2 and 1.9–2.4 Mbp, respectively (Fig. S2B).
Table 1. General features of the single cell genome assemblies.
Assembly statistic | MS024-2A | MS024-3C |
Assembly size [Mbp] | 1.905 | 1.515 |
Estimated genome size [Mbp] | 2.095 | 1.947 |
Estimated genome recovery [%] | 91 | 78 |
Number of contigs | 17 | 21 |
Largest contig [kbp] | 684 | 549 |
GC content [%] | 36 | 39 |
Mean total read depth±sd | 56±63 | 83±110 |
Mean 454 read depth | 47 | 68 |
Mean Sanger read depth | 9 | 14.3 |
Total genes | 1,815 | 1,413 |
rRNA operons | 2 | 1 |
tRNA genes | 33 | 24 |
Protein-coding genes | 1,780 | 1,388 |
Genes with no function prediction | 443 | 328 |
While 454 shotgun pyrosequencing provided lower-cost, high coverage depth data without cloning biases, the addition of paired-end Sanger sequencing assisted resolving homopolymer regions and improved genome assemblies (Fig. S4). Further shotgun sequencing would be ineffective due to the significant overrepresentation of certain genome regions in multiple displacement amplification (MDA) products (Fig. S5A).
On average, we detected one chimera per 13–27 Kbp of single cell whole genome MDA products (Table S1), which is comparable to prior reports [16], [17]. As single stranded DNA molecules represent intermediates in the chimera formation during MDA, S1 nuclease treatment has been suggested and shown to reduce chimerism [18]. We evaluated S1 nuclease mediated debranching effects by comparing 3 Kbp library clones from branched and unbranched MDA DNA for MS024-2A. No notable reduction in chimeric rearrangements was detected in the S1-treated DNA samples (Table S1). While the presence of MDA-produced chimeras added a challenge to genome assembly and finishing efforts, sufficient sequencing coverage in most parts of the genomes allowed the identification and removal of chimeric reads from the assembly, as well as the identification of chimeric clones, which were avoided for primer walking.
Rigorous quality controls, using nucleotide pattern analysis and phylogenomics, were implemented to detect potential contaminants and amplification artifacts (Table S1, Figs. S5B, S6). Only 0.7% of all sequence reads and only 0.24% of the assembling reads were identified as contaminants or self-primed amplification products, and were removed from further analysis. In prior single cell genome sequencing attempts, Zhang et al. [18] recovered 66% of a genome of a cultured Prochlorococcus strain in 477 contigs, while Marcy et al. [19] recovered an unknown fraction of a genome of an uncultured representative of the TM7 phylum in 288 contigs, with up to 10% Leptotrichia contamination. Here we demonstrate that improved laboratory and bioinformatics protocols enable high-quality de novo draft reconstruction of genomes of uncultured taxa from complex microbial communities.
Global Ocean Sampling (GOS) fragment recruitment
We searched for the presence of MS024-2A- and MS024-3C-like DNA in the GOS data using metagenome fragment recruitment [6]. The number of GOS fragments recruited by the two SAGs was higher, by at least one order of magnitude, than the recruitment by any of the eleven available genomes of cultured marine flavobacteria strains (Figs. 1, S7). The GOS read recruitment by marine cultures, including those collected at or near GOS stations, was as low as the recruitment by the soil isolate Flavobacterium johnsoniae. This suggests that currently sequenced flavobacteria cultures are poor representations of the predominant marine taxa, at least in the regions of the ocean represented by GOS data. In contrast, the number of recruits at high DNA identity level (>97%) was comparable for our two flavobacteria SAGs and the representatives of the ubiquitous marine genera Prochlorococcus, Synechococcus, and Pelagibacter, which were previously identified as the only significant GOS fragment recruiters [6]. This is quite remarkable, considering that the two SAGs are non-redundant genomes from a relatively small, pilot marine SAG library [15]. Our results demonstrate the power of single cell genomics to reconstruct representative microbial genomes from complex communities, independent of their cultivability.
We further focused on the GOS recruits with >95% DNA identity to the two SAGs, as an operational demarcation of bacterial species [20]. A total of 1,505 and 467 of >95% DNA identity recruits were obtained for MS024-2A and MS024-3C. Of these, only nine recruits encoding only two genes were shared by the two SAGs, demonstrating significant evolutionary distance between the two genomes. Interestingly, >99% of the recruits and the two SAGs themselves came from a distinct biogeographic region along the coast of the northwest Atlantic Ocean (Fig. 2A). The fraction of SAG-like DNA did not correlate with ambient temperature, salinity, and chlorophyll a concentrations but was highest at the two northern-most GOS stations. In Bedford Basin, Nova Scotia (GOS station #5), 1.2% and 0.8% of all metagenomic reads were >95% identical to MS024-2A and MS024-3C DNA respectively, and in the Bay of Fundy (GOS station #6) 0.4% GOS reads matched to MS024-2A. No SAG recruits were found south of the GOS station #13 off Nags Head, North Carolina, including many tropical stations. This GOS recruit distribution correlates with the coastal transport of the remnants of the Labrador Current, as illustrated by the ocean surface temperature during the GOS sampling (Fig. 2B). It appears that close relatives of MS024-2A and MS024-3C are most abundant in the coastal northwest Atlantic waters and may be transported southward and mixed into local bacterioplankton assemblages by surface currents along the coastline. Single cells and the GOS Atlantic coast stations were sampled over 2 years apart (March 2006 and August-December 2003, respectively). Thus, MS024-2A and MS024-3C appear to represent two numerically significant marine flavobacteria taxa, which persist in particular geographic areas.
Genome streamlining
The numerical significance of MS024-2A- and MS024-3C-like bacterioplankton in the intensely studied Atlantic coastal waters of U.S. and Canada raises two intriguing questions: 1) what makes these organisms competitive in their natural environment and 2) why are they not represented in cultures? Here we propose plausible explanations, as based on the SAGs' genome composition, including genome streamlining, energy-conserving metabolism, and diversified mixotrophy.
Genome streamlining was suggested as a nutrient and energy conserving adaptation in the ubiquitous and hard-to-culture marine alphaproteobacteria clade SAR11 [21]. Accordingly, MS024-2A and MS024-3C have among the smallest genomes, the lowest fraction of paralogous genes, and the lowest fraction of non-coding nucleotides amongst the sequenced taxa of the Bacteroidetes phylum (Fig. 3). The significantly reduced number of paralogs indicates that genome streamlining comes at a cost of reduced biochemical plasticity. Thus, MS024-2A and MS024-3C may represent taxa adapted to a narrow ecological niche, which may be one of the reasons behind their significant presence in a specific geographic area and difficulties in their laboratory cultivation.
Both MS024-2A and MS024-3C lack recognizable genes involved in the assimilation of sulfate, sulfite, nitrate or nitrite. The lack of nitrate and nitrite reductases is a common feature in all currently available Flavobacteria class genomes, while inorganic sulfur assimilation pathways appear to be missing in two sequenced flavobacteria isolates, Flavobacterium psychrophilum JIP02/86 and the isolate BAL38. Assuming the genome recovery of the two SAGs is 91% and 78% (Table 1; Figs. S2, S4), the probability for a single gene being missing from MS024-2A and MS024-3C due to the incomplete assemblies is 9% and 22%, respectively. Assuming that MDA bias and the resulting genome coverage by shotgun sequencing are random [16], the probability of a gene encoding the same metabolic function being missing from both SAGs due to incomplete genome recoveries is equal to 0.09×0.22 = 0.02, i.e. only 2%. With these qualifications, we hypothesize that the SAG-represented taxa rely solely on reduced N and S forms as an energy-saving strategy in organic C rather than inorganic N or S limited environment, as was recently described for “Candidatus Pelagibacter ubique” [21]. Experimental verification and field studies would be necessary to validate the inability for oxidized inorganic N and S utilization by some marine flavobacteria, as suggested by their genome features.
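The joint-absence argument above can be reproduced with a few lines of Python. This is a minimal sketch, assuming (as the text does) that assembly gaps in the two SAGs are random and independent; the function name `prob_missing_from_all` is hypothetical and used only for illustration.

```python
def prob_missing_from_all(recoveries):
    """Probability that a given gene is absent from every assembly,
    assuming gaps are random and independent between assemblies."""
    p = 1.0
    for recovery in recoveries:
        p *= (1.0 - recovery)  # chance the gene falls in a gap of this assembly
    return p

# Estimated genome recoveries from Table 1: 91% (MS024-2A) and 78% (MS024-3C).
p_both = prob_missing_from_all([0.91, 0.78])
print(f"{p_both:.4f}")  # 0.0198, i.e. about 2%
```

The same function applies to more than two assemblies, which shows how quickly the chance of a false "missing gene" call shrinks as additional independent SAGs of related taxa become available.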
Although DMSP sulfur utilization was suggested by microautoradiography for a subset of marine flavobacteria in a community-level study [22], no significant homologs to known DMSP demethylases or lyases were detected on any of the available Flavobacteria class genomes, including these two SAGs. All available marine flavobacteria genomes also lack recognizable ureases, but most, including MS024-2A, encode allophanate hydrolases. Thus, allophanate, a breakdown product of urea, is a likely supplementary source of N to many marine flavobacteria. Both SAGs contain phosphate permeases and polyphosphate kinases, indicating their capacity for import and intracellular storage of inorganic phosphorus (Table S2).
Proteorhodopsin photometabolism
The presence of proteorhodopsin genes (Flav2A_or1462, Flav3C_or0805) is yet another similarity between the two SAGs and the Pelagibacter genomes. Proteorhodopsins are light-driven proton pumps, which have recently been recognized for their abundance and likely biogeochemical significance in surface oceans [6], [13], [14]. However, the hosts of the majority of marine proteorhodopsins remain unidentified. Three recent studies, utilizing metagenomics, single cell genomics and cultivation, demonstrated the presence of proteorhodopsins in marine bacteroidetes [14], [15], [23]. Bacteroidetes-like proteorhodopsin genes are also abundant in diverse freshwater bacterioplankton communities [24]. Proteorhodopsin-containing microbial cultures currently include three alphaproteobacteria [21], [25], four bacteroidetes [23], and four SAR92 gammaproteobacteria [26]. Despite extensive tests, light stimulation likely attributable to proteorhodopsin activity has been detected in only one of these isolates, flavobacterium Dokdonia sp. MED134 [23]. Thus, the ecological roles and expression conditions of marine proteorhodopsins remain enigmatic.
Intriguingly, marine planktonic bacteroidetes with proteorhodopsins have smaller genomes and fewer paralogs compared to marine bacteroidetes without proteorhodopsins, while non-marine bacteroidetes have more paralogs and more non-coding DNA than their marine counterparts (Fig. 3; p<0.01, t-test). Although the causality of this relationship is unclear, the presence of proteorhodopsins in the streamlined genomes provides indirect evidence for their adaptive significance.
To examine proteorhodopsin relationships to other biochemical pathways, we investigated what genes are present in all six available proteorhodopsin-containing flavobacteria genomes but are absent in the remaining 13 flavobacteria genomes. Only three such genes were detected: proteorhodopsin, blh (encoding β-carotene dioxygenase, which produces proteorhodopsin chromophore retinal) and genes encoding DNA photolyase-like flavoproteins. The latter formed a distinct phylogenetic cluster among photolyase-like genes of flavobacteria (Fig. S8). It may be speculated that photolyase-like flavoproteins regulate rhodopsin proton pump expression or that both photometabolic systems are involved in synchronized photosensing or energy production. These hypotheses may be experimentally tested using pure cultures, metatranscriptome studies, or heterologous expression of SAG genes.
Other metabolic features
Uniquely among marine flavobacteria, MS024-2A possesses [NiFe]-hydrogenase genes hyaA and hyaB (Flav2A_or1764, Flav2A_or1770), raising the possibility that this organism utilizes hydrogen as a supplementary source of energy. Potential sources of hydrogen in the ocean photic zone include photochemical reactions [27], algal metabolism [28], and heterotroph activity in anoxic microenvironments [29]. Hydrogenase-like genes are also harbored by the marine plankton Roseobacter clade isolates Roseovarius sp. HTCC2601, Roseovarius sp. TM1035 and Sagittula stellata E-37, and are abundant in GOS sequence data, which suggests a potentially widespread hydrogen metabolism in the ocean photic zone. The potential physiological and ecological significance of hydrogenases in marine bacterioplankton is intriguing and requires experimental verification.
Hydrogen oxidation and proteorhodopsin photometabolism may provide supplementary energy and a competitive advantage in a carbon-limited environment. However, the primary sources of carbon and energy for MS024-2A and MS024-3C likely are organic compounds. The two SAGs contain many genes involved in biopolymer hydrolysis (Table S3) and the import and degradation of hydrolysis products (Table S4). Both SAGs possess a substantial number of predicted proteins with domains that have been implicated in cell-surface and cell-cell interaction (Table S5). The characteristic repetitive domain structures in adhesion proteins are known to bind calcium ions, such as Cadherin, FG-GAP and Thrombospondin type 3 repeats; or to bind cell receptors and metal ions, such as Fasciclin and Von Willebrand factor type A. These cell surface repetitive structures could play an important role in adhering to algal surface mucilage, in attaching to the nutrient-rich marine snow particles, and in biofilm formation. These features are consistent with the genome composition of other marine flavobacteria [30], with the community-level evidence of marine flavobacteria proficiency in biopolymer hydrolysis [31], and with the relative abundance of flavobacteria in algal blooms and in physical associations with algal cells - the likely sources of these biopolymers [31].
In contrast to all currently available Flavobacteria class genomes, MS024-2A contains an anti-sigma factor rsbW, its antagonist rsbV, an associated gene rsbU, and a PAS domain S-box (Fig. S9). The MS024-3C genome also contains rsbW and a fragment of rsbU at the end of a contig, while other genes of the operon are missing, possibly due to the incomplete MS024-3C assembly. It is likely that the rsbW cluster is involved in the global cellular response to changing environmental conditions, as in the model organism Bacillus subtilis with homologous genes [32].
Conclusions
We demonstrate the power of single cell DNA sequencing to generate representative reference genomes of uncultured taxa from a complex community of marine bacterioplankton. A combination of single cell genomics and metagenomics enabled us to analyze the genome content, metabolic adaptations, and biogeography of numerically significant, uncultured microorganisms.
Materials and Methods
Environmental sample collection, cell sorting, and first round of whole genome amplification
A coastal water sample was collected from Boothbay Harbor, Maine, from 1 m depth at the Bigelow Laboratory dock (43°50′39.87″N 69°38′27.49″W) on March 28, 2006. Bacterioplankton were stained with a generic live-cell DNA stain SYTO-9 (Invitrogen), and individual, high nucleic acid cells were selected at random and sorted into 96-well plates using a MoFlo™ (Dako-Cytomation) flow cytometer, as previously described [15]. Protocols for single cell lysis, whole genome MDA, and PCR-based screening have been described previously [15]. Of the eleven marine single amplified genomes (SAGs) obtained from this sample, two were identified as proteorhodopsin-containing flavobacteria [15]. The original MDA products of these two SAGs, named MS024-2A and MS024-3C, were re-amplified using REPLI-g MIDI kit (Qiagen) following manufacturer's instructions. To minimize biases of the second MDA reaction, 14 replicate 100 µL reactions were performed and then pooled together, resulting in 700–800 µg of genomic dsDNA from each SAG. MDA products were debranched using S1 nuclease (Fermentas) digestion with 10 U/µL at 37°C for 1 h. The enzyme was heat-inactivated in the presence of EDTA and the DNA was phenol-chloroform-isoamyl alcohol extracted and ethanol-precipitated.
16S rRNA clone libraries
Bacterial 16S rDNA PCR libraries were created from debranched MDA products using primers 27f and 1391r [33]. Ribosomal RNA gene PCR amplification using universal archaeal 16S primers as well as eukaryotic 18S primers was attempted but did not yield any PCR products. PCR amplicons of five replicate reactions were combined and ligated into the pCR4-TOPO vector using the TOPO TA Cloning Kit (Invitrogen). Ligations were then electroporated into One Shot TOP10 Electrocomp™ E. coli cells and plated on selective media agar plates. The bi-directional 16S rDNA sequence reads were end-paired, trimmed for PCR primer sequence and quality and analyzed using BLASTn [34]. Three out of 332 16S rDNA clone sequences were not identical to the MS024-2A 16S gene. Based on their phylogeny (one Pseudomonas and two Crenarchaea), the three clones were most likely introduced as contaminants during the cloning/sequencing process. For flavobacteria bacterium MS024-3C, all 267 16S rDNA clone sequences were target-specific, suggestive of MDA product purity. Previously, 16S rRNA fingerprinting analysis by terminal restriction fragment length polymorphism (T-RFLP) inferred that the flavobacterial SAGs MS024-2A and MS024-3C are lacking evident contamination [15].
Genome sequencing
A combination of Sanger shotgun sequencing and 454 pyrosequencing was performed on the single cell MDA products. For Sanger sequencing, 3 Kbp and 8 Kbp shotgun libraries were constructed using debranched MDA products. To evaluate the debranching effects, an additional 3 Kbp library was constructed using untreated MDA products of MS024-2A. For shotgun library construction, MDA products were randomly sheared to 2–4 Kbp and 6–10 Kbp fragments using HydroShear (GeneMachines). The sheared DNA was separated on an agarose gel, gel-purified using the QIAquick Gel Extraction Kit (Qiagen) and blunt-ended using T4 DNA polymerase (Roche) and Klenow Fragment (New England Biolabs) in the presence of dNTPs and NEB2 buffer. The 2–4 Kbp and 6–10 Kbp DNA fragments were ligated in pUC19 vector (Fermentas) and pMCL200 vector, respectively, overnight at 16°C using T4 DNA ligase (Roche Applied Science) and 4.5% polyethylene glycol (Sigma). The ligation products were phenol-chloroform extracted and ethanol precipitated. According to the manufacturer's instructions, ligations were electroporated into ElectroMAX DH10B™ Cells (Invitrogen) and clones prepared and sequenced on an ABI PRISM 3730 capillary DNA sequencer (Applied Biosystems) according to the JGI standard protocols (www.jgi.doe.gov). End-sequencing yielded 7,680 reads (totaling 4.58 Mbp) of 3 Kbp clone sequence and 19,968 reads (totaling 12.87 Mbp) of 8 Kbp clone sequence for MS024-2A. We generated 7,680 3 Kbp library reads (totaling 5.05 Mbp) and 29,952 8 Kbp library clones (totaling 17.67 Mbp) for MS024-3C. Pyrosequencing was performed on debranched MDA products using the Genome Sequencer FLX System (454 Life Sciences, http://www.454.com/) [35] according to the manufacturer protocol. The sequencing runs generated ∼95 Mbp (MS024-2A) and 108 Mbp (MS024-3C).
Tetramer analysis
To detect possible DNA contamination, we designed a novel test using oligonucleotide frequencies, similar to CompostBin [36]. The frequencies of tetramers were extracted from each Sanger sequence read and used to represent the data as an N×256 feature matrix, where N is the number of sequence reads and each column of the matrix corresponds to the frequency of one of the 256 possible tetramers. Principal Component Analysis [37] was then used to extract the most important components of this high dimensional feature matrix, and the projections onto the first two Principal Components were analyzed based on their modality and visualized in a scatter plot. The Matlab and C code for the oligonucleotide tests are freely available at http://bobcat.genomecenter.ucdavis.edu/souravc/singlecell/. Lastly, the Sanger reads were analyzed by blastx [34] against the Genbank nr database. Reads were taxonomically assigned using MEGAN [38].
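The tetramer test can be illustrated with a short Python sketch. It is not the Matlab/C code referenced above; it assumes NumPy is available, computes PCA through a singular value decomposition of the centered matrix, and uses four made-up toy reads. The names `tetramer_frequencies` and `pca_project` are hypothetical.

```python
from itertools import product

import numpy as np

# All 256 possible tetramers, in a fixed column order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {t: i for i, t in enumerate(TETRAMERS)}

def tetramer_frequencies(read):
    """Return the 256-element tetramer frequency vector of one read."""
    counts = np.zeros(len(TETRAMERS))
    read = read.upper()
    for i in range(len(read) - 3):
        j = INDEX.get(read[i:i + 4])  # windows with ambiguous bases are skipped
        if j is not None:
            counts[j] += 1
    total = counts.sum()
    return counts / total if total else counts

def pca_project(reads, n_components=2):
    """Build the N x 256 feature matrix and project it onto the leading PCs."""
    X = np.array([tetramer_frequencies(r) for r in reads])
    Xc = X - X.mean(axis=0)                      # center each column
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T              # N x n_components coordinates

# Toy reads (illustrative only): two AT-rich and two GC-rich sequences.
reads = [
    "ATATATATATATATTTAAAT",
    "TTATATAAATATATATTATA",
    "GCGCGGCCGCGCGGGCCCGC",
    "CCGCGCGGCGCGCCGGGCGC",
]
coords = pca_project(reads)
print(coords.shape)  # (4, 2)
```

In a scatter plot of `coords`, reads from a single genome are expected to form one cloud, whereas a second, well-separated cloud (as produced here by the AT-rich and GC-rich toy reads) would flag a possible contaminant.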
Genome assembly and finishing
The pyrosequence reads were assembled using the 454 Newbler assembler version 1.1.02.15 and the consensus sequence shredded into 2 Kbp pieces with 100 bp overlaps. The 454 shred data was assembled with the Sanger sequences using parallel Phrap (High Performance Software, LLC). The Phred/Phrap/Consed software package (www.phrap.com) was used for sequence assembly and quality assessment [39], [40]. Chimeric reads were detected and excluded from the assemblies using local Perl scripts. Possible mis-assemblies were corrected with Dupfinisher [41] and manual editing. To close the gaps and to raise the quality of the sequences, primer walking on the medium and small insert size clones, and PCR/adapter PCR [42] on the MDA products were performed. A total of 4,494 primer walk reads and 220 PCR/adapter PCR reads were generated during finishing for MS024-2A. For MS024-3C, we generated a total of 2,076 primer walk reads and 197 PCR/adapter PCR reads. The smallest two MS024-2A contigs (∼0.24% of the assembly) were identified as contamination as based on GC-content, tetramer binning, and BLAST analysis, and were thus excluded from this draft genome. Final assembly sizes were 1,905,484 bp (17 contigs) for MS024-2A and 1,515,248 bp (21 contigs) for MS024-3C.
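The shredding of the 454 consensus into overlapping pseudo-reads can be sketched as follows. This is an illustrative Python version, not the script used in the study; the function name `shred` and the toy consensus string are assumptions, while the 2 Kbp piece length and 100 bp overlap come from the text.

```python
def shred(sequence, piece_len=2000, overlap=100):
    """Cut a consensus sequence into pieces of piece_len that overlap
    their neighbors by `overlap` bases; the last piece may be shorter."""
    if not 0 <= overlap < piece_len:
        raise ValueError("overlap must be non-negative and smaller than piece_len")
    if not sequence:
        return []
    step = piece_len - overlap
    pieces = []
    start = 0
    while True:
        pieces.append(sequence[start:start + piece_len])
        if start + piece_len >= len(sequence):  # this piece reached the end
            break
        start += step
    return pieces

consensus = "ACGT" * 1250  # toy 5,000 bp consensus
pieces = shred(consensus)
print([len(p) for p in pieces])  # [2000, 2000, 1200]
```

The overlaps let the downstream assembler rejoin adjacent shreds, so the pyrosequencing consensus can be co-assembled with Sanger reads as if it were a set of long, error-corrected reads.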
Estimates of complete genome sizes were obtained for MS024-2A and MS024-3C using conserved single copy gene (CSCG) analysis. To identify relevant CSCGs, 16 Flavobacteria class genomes, currently available at the Joint Genome Institute Integrated Microbial Genomes site (IMG; http://img.jgi.doe.gov/cgi-bin/pub/main.cgi) [43], were included in the analysis: Capnocytophaga ochracea DSM 7271, Dokdonia sp. MED134, Croceibacter atlanticus HTCC2559, Flavobacteria bacterium BAL38, Flavobacteria bacterium BBFL7, Flavobacteriales bacterium ALC-1, Flavobacteriales bacterium HTCC2170, Flavobacterium johnsoniae UW101, Flavobacterium psychrophilum JIP02/86, Gramella forsetii KT0803, Kordia algicida OT-1, Leeuwenhoekiella blandensis MED217, Polaribacter irgensii 23-P, Robiginitalea biformata HTCC2501, Polaribacter sp. MED152, and Ulvibacter sp. SCB49. The genome of Psychroflexus torquis ATCC 700755 was excluded due to its poor assembly quality. The pre-computed COG function distribution from the 16 genomes was retrieved from IMG using Function Profile feature. First, one of the 16 genomes was randomly selected as a “seed” to quantify single copy genes. Then additional, randomly selected genomes were sequentially added, until all 16 genomes were included, and the number of CSCGs, i.e. single copy genes shared among the analyzed group of genomes, was quantified for each genome combination. The entire process was reiterated 1000 times, using new, randomly selected “seed” genomes. In this way we identified 268 CSCGs that were shared by all 16 available Flavobacteria class genomes (Fig. S2A). The number of CSCGs was plotted against the number of genomes analyzed, and a power function fit was applied to the data. We extrapolated this regression curve to predict that the number of CSCGs remaining after adding one more, 17th genome is 262, i.e. 6 genes (2.2%) fewer than with only 16 genomes involved.
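The random genome-addition procedure and the power function extrapolation can be sketched in Python as below. This is a minimal illustration under stated assumptions: the COG copy-number profiles are a made-up toy data set (three hypothetical genomes `G1` to `G3` and placeholder COG labels), the function names are hypothetical, and the power function is fitted by linear regression in log-log space, which may differ from the fitting method used in the study.

```python
import random

import numpy as np

def rarefaction(profiles, iterations=1000, seed=0):
    """Mean number of shared single copy genes after adding 1..N genomes
    in random order. `profiles` maps genome -> {COG id: copy number}."""
    rng = random.Random(seed)
    names = sorted(profiles)
    totals = np.zeros(len(names))
    for _ in range(iterations):
        order = names[:]
        rng.shuffle(order)            # the first genome acts as the random "seed"
        shared = None
        for k, genome in enumerate(order):
            single = {cog for cog, n in profiles[genome].items() if n == 1}
            shared = single if shared is None else shared & single
            totals[k] += len(shared)
    return totals / iterations

def power_fit(curve):
    """Fit y = a * x**b by least squares on log-transformed data."""
    x = np.arange(1, len(curve) + 1)
    b, log_a = np.polyfit(np.log(x), np.log(curve), 1)
    return np.exp(log_a), b

# Toy COG copy-number profiles (illustrative only).
profiles = {
    "G1": {"COG0001": 1, "COG0002": 1, "COG0003": 2, "COG0004": 1},
    "G2": {"COG0001": 1, "COG0002": 1, "COG0003": 1, "COG0004": 1},
    "G3": {"COG0001": 1, "COG0002": 2, "COG0004": 1},
}
curve = rarefaction(profiles, iterations=200)
a, b = power_fit(curve)
predicted_next = a * (len(profiles) + 1) ** b  # extrapolate to one more genome
print(curve[-1])  # 2.0 genes remain single copy in all three toy genomes
```

With the real 16-genome profiles, the last element of `curve` would correspond to the number of CSCGs shared by all genomes, and `predicted_next` to the extrapolated count for a 17th genome.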
Of the 268 identified CSCGs, 239 (89%) and 204 (76%) were present on the assemblies of MS024-2A and MS024-3C, respectively. We used this information to estimate the expected complete genome sizes of MS024-2A and MS024-3C as follows:
GS = AS × 0.98 / RCSCG
where GS is the expected complete genome size; AS is the size of current genome assemblies (1.9 Mbp and 1.5 Mbp for MS024-2A and MS024-3C); RCSCG is the recovery of CSCGs (0.89 and 0.76 for MS024-2A and MS024-3C); and 0.98 is the correction coefficient to compensate for the expected lower number of CSCGs shared by 17 relative to 16 genomes. The application of this model resulted in the expected complete genome sizes of MS024-2A and MS024-3C to be 2.1 Mbp and 1.9 Mbp.
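As a worked check of these numbers, the short Python sketch below assumes the estimate takes the form GS = AS × 0.98 / RCSCG, which is inferred from the variable definitions above and from the recoveries reported in Table 1. The function name `estimate_genome_size` is hypothetical.

```python
def estimate_genome_size(assembly_mbp, cscg_found, cscg_total=268, correction=0.98):
    """Return (expected genome size in Mbp, estimated genome recovery).

    Assumed model: recovery = (CSCGs found / CSCGs total) / correction,
    and genome size = assembly size / recovery.
    """
    recovery = (cscg_found / cscg_total) / correction
    return assembly_mbp / recovery, recovery

# Assembly sizes and CSCG counts as given in the text and Table 1.
for name, assembly_mbp, found in [("MS024-2A", 1.905, 239), ("MS024-3C", 1.515, 204)]:
    size, recovery = estimate_genome_size(assembly_mbp, found)
    print(f"{name}: {size:.1f} Mbp, recovery {recovery:.0%}")
```

Under this assumption the sketch gives roughly 2.1 Mbp and 1.9 Mbp, with recoveries of about 91% and 78%, in line with Table 1; small differences in the third decimal place are expected from rounding of the inputs.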
Many CSCGs are arranged in clusters (Fig. S3), which may lead to biases in the CSCG-based genome size estimates. For example, preferential recovery of CSCG-rich regions would lead to an underestimate of the genome size, and vice versa. To evaluate these potential biases, we used the three closed Flavobacteria class genomes as references: Gramella forsetii KT0803, Flavobacterium johnsoniae UW101 and Flavobacterium psychrophilum JIP02/86. Each of these genomes was sequentially divided into various numbers of equal-sized segments, from 1 to 360 segments per genome. The segmentation was repeated 18 times for each genome, by rotating the segmentation at 20° increments. Genome sizes of SAGs were then estimated based on the recovery of genes representing each of these reference genome segments, as follows (with AS denoting the assembly size, as above):
GS = AS × 0.98 / [(1/n) × Σ (SCSCG / TCSCG)], with the sum taken over all n segments
where GS is the expected complete genome size; TCSCG is the total number of CSCGs in a given reference genome segment; SCSCG is the number of those genes recovered in a SAG, n is the total number of segments, and 0.98 is the correction coefficient to compensate for the expected lower number of CSCGs shared by 17 relative to 16 genomes. Genome size estimates varied somewhat, depending on the reference genome, the number of segments, and, to a lesser extent, rotation of the segmentation (Fig. S2B). The segmentation-based genome size estimates for MS024-2A and MS024-3C were 2.0–2.2 Mbp and 1.9–2.4 Mbp.
Genome annotation and comparative analysis
Automated gene prediction was performed by JGI-ORNL using the output of Critica [44] complemented with the output of Generation and Glimmer [45]. The tRNAScanSE tool [46] was used to find tRNA genes, whereas ribosomal RNAs were found by using BLASTn vs. the 16S and 23S ribosomal RNA databases. Other “standard” structural RNAs (e.g., 5S rRNA, rnpB, tmRNA, SRP RNA) were found by using covariance models with the internal search tool [47]. The assignment of product descriptions was made by using search results of the following curated databases in this order: TIGRFam; PRIAM (E<10^−30 cutoff); Pfam; Smart; COGs; Swissprot/TrEMBL (SPTR); and KEGG. If there was no significant similarity to any protein in another organism, it was described as “hypothetical protein”. “Conserved hypothetical protein” was used if at least one match was found to a hypothetical protein in another organism. EC numbering was based on searches in PRIAM at E<10^−10 cutoff; COG and KEGG functional classifications were based on homology searches in the respective databases.
Comparative analyses were performed using a set of tools available in the Joint Genome Institute Integrated Microbial Genomes system (IMG; http://img.jgi.doe.gov/cgi-bin/pub/main.cgi) [43]. Unique and orthologous MS024-2A and MS024-3C genes were identified by using BLASTp (cutoffs of E < 10^-2 and 20% identity for unique genes, and reciprocal hits with cutoffs of E < 10^-5 and 30% identity for orthologous genes). Signal peptides and transmembrane helices were predicted with SignalP 3.0 [48] and TMHMM 2.0 [49] set at default values. Protein localizations were predicted with PSORTb [50], and twin-arginine translocation systems were identified using the TatP program [51]. Insertion sequence (IS) elements were identified using the ISfinder database [52]. Metabolic pathways were constructed with MetaCyc as a reference [53].
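The classification of genes as orthologous or unique can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the IMG implementation: the `Hit` record, the function names, and the toy hits are hypothetical; "reciprocal hits" is read literally (any reciprocal pair passing the cutoffs, not necessarily the best hit); and the identity cutoffs are treated as inclusive lower bounds.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    evalue: float
    identity: float  # percent identity


def passing_pairs(hits, max_evalue, min_identity):
    """Return (query, subject) pairs that satisfy both cutoffs."""
    return {(h.query, h.subject) for h in hits
            if h.evalue < max_evalue and h.identity >= min_identity}


def classify_genes(genes_a, a_vs_b, b_vs_a):
    """Split genome A genes into orthologous and unique sets relative to genome B."""
    forward = passing_pairs(a_vs_b, 1e-5, 30.0)
    reverse = passing_pairs(b_vs_a, 1e-5, 30.0)
    # Orthologous: the hit is confirmed in the reverse search.
    orthologous = {q for (q, s) in forward if (s, q) in reverse}
    # Unique: no hit at all, even at the relaxed cutoffs.
    with_any_hit = {q for (q, _) in passing_pairs(a_vs_b, 1e-2, 20.0)}
    unique = set(genes_a) - with_any_hit
    return orthologous, unique


# Toy data (hypothetical gene identifiers).
genes_a = ["a1", "a2", "a3"]
a_vs_b = [Hit("a1", "b1", 1e-40, 65.0), Hit("a2", "b2", 1e-3, 25.0)]
b_vs_a = [Hit("b1", "a1", 1e-38, 65.0)]
orthologous, unique = classify_genes(genes_a, a_vs_b, b_vs_a)
```

In this toy case `a2` falls into neither set: it has a weak hit, so it is not unique, but the hit does not pass the stricter reciprocal cutoffs.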
The sequence data have been deposited in GenBank (http://www.ncbi.nlm.nih.gov/Genbank) under project accessions ABVV00000000 (MS024-2A) and ABVW00000000 (MS024-3C).
Metagenome fragment recruitment
Sequence reads from the Global Ocean Sampling (GOS) expedition were downloaded from the CAMERA website (http://camera.calit2.net/about-camera/full_datasets.php). Microbial isolate genome sequences were downloaded from the NCBI website (http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi). Contigs from the two flavobacteria, MS024-2A and MS024-3C, and selected isolate genomes were used as reference sequences and aligned against sequence reads from the GOS data using MUMmer [54], [55] with the following parameters: nucmer -minmatch 10 -breaklen 400 -maxgap 400 -mincluster 400. The ≥400 bp alignment threshold was introduced to ensure that fragment recruitment is based on nucleotide homology over at least half of an average-length microbial gene. These fragment recruitment criteria were more stringent than those applied by Rusch et al. [6], which explains the overall lower recruitment numbers in our study compared with those reported by Rusch et al. [6] for the same reference genomes. Coordinate files produced from MUMmer alignments were parsed using an in-house Java program, and alignment plots of the GOS reads against the reference sequences were created using an R script.
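The coordinate-file parsing step can be illustrated with a Python sketch (the in-house Java program is not reproduced here). The column layout is an assumption: tab-separated rows of reference start, reference end, read start, read end, aligned length on the reference, aligned length on the read, percent identity, reference name, and read name. This resembles tabular output from MUMmer's show-coords utility but should be checked against the actual files. The function name and toy rows are hypothetical.

```python
import csv
import io
from collections import defaultdict

MIN_ALIGNMENT_BP = 400  # alignment length threshold stated in the text


def recruited_reads(coords_text, min_len=MIN_ALIGNMENT_BP):
    """Count distinct reads recruited to each reference sequence.

    coords_text: tab-separated alignment rows in the assumed 9-column layout.
    Rows that are too short or have non-numeric lengths (e.g. headers) are skipped.
    """
    recruits = defaultdict(set)
    for row in csv.reader(io.StringIO(coords_text), delimiter="\t"):
        if len(row) < 9:
            continue
        try:
            ref_len, read_len = int(row[4]), int(row[5])
        except ValueError:
            continue
        if min(ref_len, read_len) >= min_len:
            recruits[row[7]].add(row[8])  # a set, so each read counts once
    return {ref: len(reads) for ref, reads in recruits.items()}


# Toy input: read2 is dropped (251 bp alignment); read1 aligns twice but counts once.
toy_coords = (
    "1\t450\t1\t450\t450\t450\t92.0\tcontig1\tread1\n"
    "100\t350\t1\t251\t251\t251\t99.0\tcontig1\tread2\n"
    "500\t1000\t1\t501\t501\t501\t88.0\tcontig2\tread3\n"
    "1\t450\t10\t459\t450\t450\t91.0\tcontig1\tread1\n"
)
counts = recruited_reads(toy_coords)
```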
Supporting Information
Acknowledgments
We thank Lynne Goodwin for her efforts in coordinating the sequencing project at JGI, Hank Tu for input on the chimera detection analysis, Natalia Ivanova for annotation advice, Nicole Poulton for flow cytometry, and Wendy Bellows for PCR analyses.
Footnotes
Competing Interests: The authors have declared that no competing interests exist.
Funding: The study was supported by the DOE 2007 Microbes program grant DOEM-78201 to RS; NSF grants EF-0633142 and MCB-0738232 to RS and MS; Maine Technology Institute grant to MS; LANL-20080662DR grant to GX, JS and PS; Spanish Ministry of Science and Innovation grant CTM2007-63753-C02-01/MAR to JMG; DARPA grants HR0011-05-1-0057 and FA9550-06-1-0478 to SC; and Taiwan National Science Council grant NSC-97-3112-B-010-019 to CY. Part of this work was performed under the auspices of the US Department of Energy's Office of Science, Biological and Environmental Research Program, and by the University of California, Lawrence Berkeley National Laboratory under contract No. DE-AC02-05CH11231, Lawrence Livermore National Laboratory under contract No. DE-AC52-07NA27344, and Los Alamos National Laboratory under contract No. DE-AC02-06NA25396. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1. Falkowski PG, Fenchel T, DeLong EF. The Microbial Engines that Drive Earth's Biogeochemical Cycles. Science. 2008;320:1034–1039. doi: 10.1126/science.1153213.
- 2. Turnbaugh PJ, Ley RE, Hamady M, Fraser-Liggett CM, Knight R, et al. The Human Microbiome Project. Nature. 2007;449:804–810. doi: 10.1038/nature06244.
- 3. Haefner B. Drugs from the deep: Marine natural products as drug candidates. Drug Discovery Today. 2003;8:536–544. doi: 10.1016/s1359-6446(03)02713-2.
- 4. Pace NR, Stahl DA, Lane DJ, Olsen GJ. The Analysis of Natural Microbial-Populations by Ribosomal-RNA Sequences. Advances in Microbial Ecology. 1986;9:1–55.
- 5. Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, et al. Comparative metagenomics of microbial communities. Science. 2005;308:554–557. doi: 10.1126/science.1107851.
- 6. Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, et al. The Sorcerer II Global Ocean Sampling Expedition: Northwest Atlantic through Eastern Tropical Pacific. PLoS Biology. 2007;5:e77. doi: 10.1371/journal.pbio.0050077.
- 7. Hallam SJ, Konstantinidis KT, Putnam N, Schleper C, Watanabe YI, et al. Genomic analysis of the uncultivated marine crenarchaeote Cenarchaeum symbiosum. Proceedings of the National Academy of Sciences of the United States of America. 2006;103:18296–18301. doi: 10.1073/pnas.0608549103.
- 8. Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, et al. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004;428:37–43. doi: 10.1038/nature02340.
- 9. Woyke T, Teeling H, Ivanova NN, Huntemann M, Richter M, et al. Symbiosis insights through metagenomic analysis of a microbial consortium. Nature. 2006;443:950–955. doi: 10.1038/nature05192.
- 10. Zengler K, Toledo G, Rappe M, Elkins J, Mathur EJ, et al. Cultivating the uncultured. Proceedings of the National Academy of Sciences of the United States of America. 2002;99:15681–15686. doi: 10.1073/pnas.252630999.
- 11. Rappe MS, Connon SA, Vergin KL, Giovannoni SJ. Cultivation of the ubiquitous SAR11 marine bacterioplankton clade. Nature. 2002;418:630–633. doi: 10.1038/nature00917.
- 12. Stingl U, Tripp HJ, Giovannoni SJ. Improvements of high-throughput culturing yielded novel SAR11 strains and other abundant marine bacteria from the Oregon coast and the Bermuda Atlantic Time Series study site. ISME Journal. 2007;1:361–371. doi: 10.1038/ismej.2007.49.
- 13. Beja O, Spudich EN, Spudich JL, Leclerc M, DeLong EF. Proteorhodopsin phototrophy in the ocean. Nature. 2001;411:786–789. doi: 10.1038/35081051.
- 14. Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, et al. Environmental genome shotgun sequencing of the Sargasso Sea. Science. 2004;304:66–74. doi: 10.1126/science.1093857.
- 15. Stepanauskas R, Sieracki ME. Matching phylogeny and metabolism in the uncultured marine bacteria, one cell at a time. Proceedings of the National Academy of Sciences of the United States of America. 2007;104:9052–9057. doi: 10.1073/pnas.0700496104.
- 16. Lasken RS, Stockwell TB. Mechanism of chimera formation during the Multiple Displacement Amplification reaction. BMC Biotechnology. 2007;7 doi: 10.1186/1472-6750-7-19.
- 17. Marcy Y, Ishoey T, Lasken RS, Stockwell TB, Walenz BP, et al. Nanoliter reactors improve multiple displacement amplification of genomes from single cells. PLoS Genetics. 2007;3:1702–1708. doi: 10.1371/journal.pgen.0030155.
- 18. Zhang K, Martiny AC, Reppas NB, Barry KW, Malek J, et al. Sequencing genomes from single cells by polymerase cloning. Nature Biotechnology. 2006;24:680–686. doi: 10.1038/nbt1214.
- 19. Marcy Y, Ouverney C, Bik EM, Losekann T, Ivanova N, et al. Dissecting biological “dark matter” with single-cell genetic analysis of rare and uncultivated TM7 microbes from the human mouth. Proceedings of the National Academy of Sciences of the United States of America. 2007;104:11889–11894. doi: 10.1073/pnas.0704662104.
- 20. Goris J, Konstantinidis KT, Klappenbach JA, Coenye T, Vandamme P, et al. DNA-DNA hybridization values and their relationship to whole-genome sequence similarities. International Journal of Systematic and Evolutionary Microbiology. 2007;57:81–91. doi: 10.1099/ijs.0.64483-0.
- 21. Giovannoni SJ, Tripp HJ, Givan S, Podar M, Vergin KL, et al. Genome streamlining in a cosmopolitan oceanic bacterium. Science. 2005;309:1242–1245. doi: 10.1126/science.1114057.
- 22. Vila M, Simo R, Kiene RP, Pinhassi J, Gonzalez JM, et al. Use of microautoradiography combined with fluorescence in situ hybridization to determine dimethylsulfoniopropionate incorporation by marine bacterioplankton taxa. Applied and Environmental Microbiology. 2004;70:4648–4657. doi: 10.1128/AEM.70.8.4648-4657.2004.
- 23. Gomez-Consarnau L, Gonzalez JM, Coll-Llado M, Gourdon P, Pacher T, et al. Light stimulates growth of proteorhodopsin-containing marine Flavobacteria. Nature. 2007;445:210–213. doi: 10.1038/nature05381.
- 24. Atamna-Ismaeel N, Sabehi G, Sharon I, Witzel KP, Labrenz M, et al. Widespread distribution of proteorhodopsins in freshwater and brackish ecosystems. The ISME Journal. 2008;2:656–662. doi: 10.1038/ismej.2008.27.
- 25. Moran MA, Miller WL. Resourceful heterotrophs make the most of light in the coastal ocean. Nature Reviews Microbiology. 2007;5:792–800. doi: 10.1038/nrmicro1746.
- 26. Stingl U, Desiderio RA, Cho JC, Vergin KL, Giovannoni SJ. The SAR92 clade: an abundant coastal clade of culturable marine bacteria possessing proteorhodopsin. Applied and Environmental Microbiology. 2007;73:2290–2296. doi: 10.1128/AEM.02559-06.
- 27. Punshon S, Moore RM. Photochemical production of molecular hydrogen in lake water and coastal seawater. Marine Chemistry. 2008;108:215–220.
- 28. Melis A, Zhang L, Forestier M, Ghirardi ML, Seibert M. Sustained photobiological hydrogen gas production upon reversible inactivation of oxygen evolution in the green alga Chlamydomonas reinhardtii. Plant Physiology. 2000;122:127–135. doi: 10.1104/pp.122.1.127.
- 29. Braun ST, Proctor LM, Zani S, Mellon MT, Zehr JP. Molecular evidence for zooplankton-associated nitrogen-fixing anaerobes based on amplification of the nifH gene. FEMS Microbiology Ecology. 1999;28:273–279.
- 30. Gonzalez JM, Fernandez-Gomez B, Fernandez-Guerra A, Gomez-Consarnau L, Sanchez O, et al. Genome analysis of the proteorhodopsin-containing marine bacterium Polaribacter sp. MED152 (Flavobacteria). Proceedings of the National Academy of Sciences of the United States of America. 2008;105:8724–8729. doi: 10.1073/pnas.0712027105.
- 31. Kirchman DL. The ecology of Cytophaga-Flavobacteria in aquatic environments. FEMS Microbiology Ecology. 2002;39:91–100. doi: 10.1111/j.1574-6941.2002.tb00910.x.
- 32. Petersohn A, Brigulla M, Haas S, Hoheisel JD, Volker U, et al. Global analysis of the general stress response of Bacillus subtilis. Journal of Bacteriology. 2001;183:5617–5631. doi: 10.1128/JB.183.19.5617-5631.2001.
- 33. Lane DJ. 16S/23S rRNA sequencing. In: Stackebrandt E, Goodfellow M, editors. Nucleic acid techniques in bacterial systematics. Chichester, UK: John Wiley; 1991.
- 34. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. Journal of Molecular Biology. 1990;215:403–410. doi: 10.1016/S0022-2836(05)80360-2.
- 35. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437:376–380. doi: 10.1038/nature03959.
- 36. Chatterji S, Yamazaki I, Bai Z, Eisen JA. CompostBin: A DNA composition-based algorithm for binning environmental shotgun reads. In: Vingron M, Wong L, editors. Research in computational molecular biology. Berlin/Heidelberg: Springer; 2008. pp. 17–28.
- 37. Jolliffe IT. Principal component analysis. New York, NY: 2002.
- 38. Huson DH, Auch AF, Qi J, Schuster SC. MEGAN analysis of metagenomic data. Genome Research. 2007;17:377–386. doi: 10.1101/gr.5969107.
- 39. Ewing B, Green P. Base-calling of automated sequencer traces using Phred. II. Error probabilities. Genome Research. 1998;8:186–194.
- 40. Gordon D, Abajian C, Green P. Consed: A graphical tool for sequence finishing. Genome Research. 1998;8:195–202. doi: 10.1101/gr.8.3.195.
- 41. Han CS, Xie G, Challacombe JF, Altherr MR, Bhotika SS, et al. Pathogenomic sequence analysis of Bacillus cereus and Bacillus thuringiensis isolates closely related to Bacillus anthracis. Journal of Bacteriology. 2006;188:3382–3390. doi: 10.1128/JB.188.9.3382-3390.2006.
- 42. Rogers YC, Munk AC, Meincke LJ, Han CS. Closing bacterial genomic sequence gaps with adaptor-PCR. Biotechniques. 2005;39:31–34. doi: 10.2144/05391BM01.
- 43. Markowitz VM, Korzeniewski F, Palaniappan K, Szeto E, Werner G, et al. The integrated microbial genomes (IMG) system. Nucleic Acids Research. 2006;34:D344–348. doi: 10.1093/nar/gkj024.
- 44. Badger JH, Olsen GJ. CRITICA: Coding region identification tool invoking comparative analysis. Molecular Biology and Evolution. 1999;16:512–524. doi: 10.1093/oxfordjournals.molbev.a026133.
- 45. Delcher AL, Harmon D, Kasif S, White O, Salzberg SL. Improved microbial gene identification with GLIMMER. Nucleic Acids Research. 1999;27:4636–4641. doi: 10.1093/nar/27.23.4636.
- 46. Lowe TM, Eddy SR. tRNAscan-SE: A program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Research. 1997;25:955–964. doi: 10.1093/nar/25.5.955.
- 47. Eddy SR. A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure. BMC Bioinformatics. 2002;3:18. doi: 10.1186/1471-2105-3-18.
- 48. Bendtsen JD, Nielsen H, Von Heijne G, Brunak S. Improved prediction of signal peptides: SignalP 3.0. Journal of Molecular Biology. 2004;340:783–795. doi: 10.1016/j.jmb.2004.05.028.
- 49. Krogh A, Larsson B, Von Heijne G, Sonnhammer ELL. Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes. Journal of Molecular Biology. 2001;305:567–580. doi: 10.1006/jmbi.2000.4315.
- 50. Gardy JL, Laird MR, Chen F, Rey S, Walsh CJ, et al. PSORTb v.2.0: Expanded prediction of bacterial protein subcellular localization and insights gained from comparative proteome analysis. Bioinformatics. 2005;21:617–623. doi: 10.1093/bioinformatics/bti057.
- 51. Bendtsen JD, Nielsen H, Widdick D, Palmer T, Brunak S. Prediction of twin-arginine signal peptides. BMC Bioinformatics. 2005;6:167. doi: 10.1186/1471-2105-6-167.
- 52. Siguier P, Perochon J, Lestrade L, Mahillon J, Chandler M. ISfinder: the reference centre for bacterial insertion sequences. Nucleic Acids Research. 2006;34:D32–36. doi: 10.1093/nar/gkj014.
- 53. Caspi R, Foerster H, Fulcher CA, Hopkinson R, Ingraham J, et al. MetaCyc: a multiorganism database of metabolic pathways and enzymes. Nucleic Acids Research. 2006;34:D511–516. doi: 10.1093/nar/gkj128.
- 54. Delcher AL, Phillippy A, Carlton J, Salzberg SL. Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Research. 2002;30:2478–2483. doi: 10.1093/nar/30.11.2478.
- 55. Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, et al. Versatile and open software for comparing large genomes. Genome Biology. 2004;5:R12. doi: 10.1186/gb-2004-5-2-r12.