Applied and Environmental Microbiology. 2011 Feb 11;77(7):2513–2521. doi: 10.1128/AEM.02167-10

Quantitative Metagenomic Analyses Based on Average Genome Size Normalization

Jeremy A Frank 1,*, Søren J Sørensen 1
PMCID: PMC3067418  PMID: 21317268

Abstract

Over the past quarter-century, microbiologists have used DNA sequence information to aid in the characterization of microbial communities. During the last decade, this has expanded from single genes to microbial community genomics, or metagenomics, in which the gene content of an environment can provide not just a census of the community members but direct information on metabolic capabilities and potential interactions among community members. Here we introduce a method for the quantitative characterization and comparison of microbial communities based on the normalization of metagenomic data by estimating average genome sizes. This normalization can relieve comparative biases introduced by differences in community structure, number of sequencing reads, and sequencing read lengths between different metagenomes. We demonstrate the utility of this approach by comparing metagenomes from two different marine sources using both conventional small-subunit (SSU) rRNA gene analyses and our quantitative method to calculate the proportion of genomes in each sample that are capable of a particular metabolic trait. With both environments, to determine what proportion of each community they make up and how differences in environment affect their abundances, we characterize three different types of autotrophic organisms: aerobic, photosynthetic carbon fixers (the Cyanobacteria); anaerobic, photosynthetic carbon fixers (the Chlorobi); and anaerobic, nonphotosynthetic carbon fixers (the Desulfobacteraceae). These analyses demonstrate how genome proportionality compares to SSU rRNA gene relative abundance and how factors such as average genome size and SSU rRNA gene copy number affect sampling probability and therefore both types of community analysis.


Databases of randomly sequenced DNA fragments from an environment (the metagenome) provide a broad view of community function by giving access not only to small-subunit (SSU) rRNA gene sequences for organism identification but also to other regions of genomes harboring metabolic genes (21). Along with providing a means for potentially linking functional roles to specific organisms, either directly by finding a universal gene, such as an rRNA gene, linked to a specific gene (16) or indirectly by correlating abundances of metabolic genes and SSU rRNA genes in the metagenomic data, they also provide a means for comparing between samples or environments (4, 35, 37).

Since there are often differences between metagenomes in terms of the number and average length of sequencing reads, it is hard to determine if the differences in the observed number of reads corresponding to a particular trait are significant. As addressed by Beszteri et al., these evaluations are heavily influenced by the average genome size of the organisms sampled and the effects of genome size (and to a lesser extent, abundance) on the probability of sampling from a particular genome (5), limiting analysis to qualitative assessments.

These factors can be taken into account through normalizing each metagenome by the average genome size. Calculating the average genome size allows one to compare metagenomic data not only qualitatively, but also quantitatively, in terms of what proportion of genomes harbor a particular trait (e.g., the proportion of genomes capable of photosynthetic carbon fixation). The application of this method takes into account differences in sequencing read length and number, along with differences in community composition. Furthermore, it can become the foundation for other quantitative features, such as the number of genomes sampled and the relative importance of particular metabolisms (and possibly nutrient abundances) in an environment.

The following study details the evaluation and comparison of organism types and the abundances of aerobic photosynthetic, anaerobic photosynthetic, and anaerobic nonphotosynthetic carbon-fixing organisms in two different sets of marine metagenomes. The metagenomes used consist of three samples from the Global Open Ocean Sampling Expedition (GS) (35) and six from Ace Lake in Antarctica (see Materials and Methods). The GS samples were collected from coastal or open ocean tropical waters at the same depth, and the samples from Ace Lake were collected in the same general location but at different depths.

For each of the metagenomes evaluated, we characterized the organisms present through SSU rRNA gene analyses, estimated the average genome size through universal, single-copy gene analyses, and quantified the different types of carbon-fixing organisms (Cyanobacteria, Chlorobi, and Desulfobacteraceae) by determining what proportion of the genomes they make up. In the GS samples, we were able to show possible effects of the proximity to land on community composition by comparing the proportion of genomes capable of photosynthesis, average genome size, and SSU rRNA gene copy number between samples. In the Ace Lake samples, we were able to demonstrate the effects of depth and oxygen levels on the abundances and types of autotrophs present.

MATERIALS AND METHODS

Sequence data.

The metagenomic data used are from two different projects hosted on CAMERA (http://camera.calit2.net) (see Table S1 in the supplemental material). Three metagenomes from the Global Ocean Sampling Expedition were used for the evaluation of temperate waters from the tropics: GS017 (Yucatan channel), GS026 (Galapagos open ocean waters), and GS027 (Galapagos coastal waters). Six metagenomes collected from Ace Lake sites 227 through 232 (filter pore size of 0.8 μm) in Antarctica were used for characterization of cold-water communities.

The database of bacterial and archaeal SSU rRNA gene reference sequences was compiled as previously described (15). It is comprised of 111 full-length SSU rRNA genes collected from sequenced genomes and other cultivated organisms. Sequences were selected to cover the known phylogenetic diversity and to have no more than 85% identity to any of the other selected sequences. The sequence collection used for providing phylogenetic context to environmental SSU rRNA sequences consists of 6,040 sequences from cultivated, named organisms gathered from Ribosomal Database Project II (10, 24). When available, the sequences are from the type strain. The RNA polymerase α subunit (RpoA), RNA polymerase β subunit (RpoB), ribosomal protein L1 (RplA), ribosomal protein L3 (RplC), ribosomal protein L4 (RplD), ribosomal protein S7 (RpsG), ribosomal protein S10 (RpsJ), and ribosomal protein S17 (RpsQ) reference sequence databases consist of 163, 122, 170, 153, 217, 186, 172, and 131 sequences, respectively, collected from a diverse set of sequenced archaeal and bacterial genomes available through The SEED (http://theseed.uchicago.edu/FIG/index.cgi) (29). The phosphoribulokinase (Prk) and ribulose 1,5-bisphosphate carboxylase large subunit (RbcL) reference sequence databases consist of 12 and 20 diverse protein sequences, respectively, from the genomes of carbon-fixing Bacteria and Archaea available through The SEED. The 2-oxoglutarate oxidoreductase alpha (2oxoalpha) and beta (2oxobeta) subunits, ATP-citrate lyase subunits (CL1 and CL2), and pyruvate-flavodoxin oxidoreductase (NifJ) reference sequence databases consist of 5, 5, 5, 5, and 4 protein sequences, respectively, from the genomes of anaerobic carbon-fixing Bacteria available through The SEED. 
The metabolism-specific protein databases are relatively small due to the dearth of sequenced and characterized representatives, but the effect on our ability to identify these traits is limited due to the tight phylogenetic groupings of the harboring organisms (see Results).

Programs from other sources.

The BLASTALL and FORMATDB programs (1) have been used for most analyses and are available from NCBI.

Metagenomic sequence analysis.

Each sequencing read was evaluated using a Perl software pipeline to iterate through a metagenomic library, identify reads that match genes of interest, orient them in the direction of gene transcription, and extract them for subsequent analysis (described below). Each of these processes depends on the use of a gene-specific reference sequence database (using either DNA or amino acid sequences). The sequence collections used in these studies are described above. Our software uses FORMATDB (1) to prepare the data for searching with BLASTALL.

To identify the portions of the input sequences that are similar to a gene of interest, our software uses the BLASTN or BLASTX function of the BLASTALL program to search the reference sequence databases. The BLAST program parameters were set differently depending on the type of BLAST and database used. When identifying rRNA gene fragments, a relatively high stringency is used to minimize short random similarities (BLAST parameters: -r 1 -q 1 -F F -e 1e-20). When seeking protein-coding genes, BLASTX is used with a protein database and the relaxed parameters (-F F -e 1e-5), along with additional screens for at least 30% amino acid identity and 50% similarity (see below). The BLAST results are used to extract the matching region (plus an additional 21 nucleotides [nt] on each side) of a sequencing read and orient it in the same direction as the reference sequences.
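The extraction and orientation step can be illustrated with a short Python sketch. This is not the authors' Perl pipeline; the function names, the 1-based inclusive coordinate convention, and the strand encoding (+1 or -1) are assumptions made for illustration. Only the 21-nt flank comes from the text.

```python
# Hypothetical helpers; not part of the published pipeline.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_oriented(read: str, start: int, end: int, strand: int,
                     flank: int = 21) -> str:
    """Cut the matched region plus `flank` nt on each side out of a read.

    `start` and `end` are assumed to be 1-based, inclusive positions on the
    read as sequenced; `strand` is +1 if the read runs in the same direction
    as the reference gene and -1 otherwise.
    """
    lo = max(start - 1 - flank, 0)     # clamp at the start of the read
    hi = min(end + flank, len(read))   # clamp at the end of the read
    region = read[lo:hi]
    return region if strand >= 0 else reverse_complement(region)


# Toy read: a 9-nt "gene" fragment embedded in 30 nt of padding on each side.
read = "A" * 30 + "CCCGGGTTT" + "A" * 30
forward = extract_oriented(read, 31, 39, +1)
reverse = extract_oriented(read, 31, 39, -1)
```

Clamping at the read ends means a match near the edge of a read simply yields a shorter flank rather than an error.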

For each of the protein databases used, we required various levels of percent identity and similarity to ensure specific extraction of sequences matching the desired gene type (to the exclusion of similar but unrelated proteins). The levels of stringency were decided upon empirically. Each metagenomic analysis produced a log file, listing (by percent identity and similarity) the metagenomic reads with a significant match to one of the genes of interest. Starting from the lowest percent identity matches, we evaluated the identified metagenomic reads by using the NCBI database (39) to determine if they were more similar to other genes than the ones of interest. This was continued until the reads matched only the genes of interest, and percent identity and similarity cutoffs were set at the lowest differentiating level for each gene type.

For all eight universal single-copy proteins (mentioned above), only matching sequences of 30% or higher identity and 50% or higher similarity were considered. Both 50% identity and 60% similarity were used for phosphoribulokinase and both subunits of 2-oxoglutarate oxidoreductase, 60% identity and similarity for pyruvate-flavodoxin oxidoreductase, 50% and 75% for both ATP-citrate lyase subunits, and 50% and 70% for the ribulose bisphosphate carboxylase large chain.
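The secondary screen described above amounts to a per-gene lookup of two cutoffs. A minimal Python sketch follows; the threshold values are those listed in the text, while the dictionary layout, gene labels used as keys, and the function name are assumptions for illustration.

```python
# Minimum (percent identity, percent similarity) per gene type, as stated
# in the text. Key names are illustrative labels, not pipeline identifiers.
THRESHOLDS = {
    "universal": (30.0, 50.0),
    "prk": (50.0, 60.0),
    "2oxoalpha": (50.0, 60.0),
    "2oxobeta": (50.0, 60.0),
    "nifJ": (60.0, 60.0),
    "CL1": (50.0, 75.0),
    "CL2": (50.0, 75.0),
    "rbcL": (50.0, 70.0),
}

# The eight universal single-copy genes share one pair of cutoffs.
UNIVERSAL = {"rplA", "rplC", "rplD", "rpsG", "rpsJ", "rpsQ", "rpoA", "rpoB"}


def passes_screen(gene: str, pct_identity: float, pct_similarity: float) -> bool:
    """Return True if a BLASTX hit meets both cutoffs for its gene type."""
    key = "universal" if gene in UNIVERSAL else gene
    min_identity, min_similarity = THRESHOLDS[key]
    return pct_identity >= min_identity and pct_similarity >= min_similarity
```

For example, a hit at 26% identity and 35% similarity (the range reported in the Discussion for uridine kinase fragments) fails the prk screen, while the same values would also fail the more relaxed universal-gene screen.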

Identification of phylogenetic groups.

To identify which phylogenetic group an extracted SSU rRNA gene fragment is associated with, the fragment was used as a BLASTN query against the database of 111 reference SSU rRNA genes (above) and assigned the phylum name (except in the case of the Proteobacteria and Bacteroidetes) of the best matching representative. For more detailed analysis of the cyanobacterial, Chlorobi, and Desulfobacteraceae sequences, the identification assigned was that of the best matching sequence in the compilation of 6,040 SSU rRNA gene sequences from named species (above). The sequences associating with the Proteobacteria were split up into the different classes, and the sequences associating with Bacteroidetes (commonly referred to as Cytophaga/Flavobacteria/Bacteroides [CFB]) were split into Chlorobi and other CFB to specifically identify and characterize the Chlorobi component of each metagenome.

Estimation of average genome size and gene frequency in a community.

If every genome in a metagenomic sample includes exactly one copy of a universal gene U of length g and the average size of the genomes is G, then the probability that a metagenomic sequence (typically a sequence read) of length L will include all or part of gene U is (g + L)/G. In order to detect the portion of the gene in the sequence (in our case, using BLASTX and a database containing diverse representatives of U), it is necessary that some minimum amount of the gene be present, which we will call m. The probability then that a metagenomic sequence will match a sequence in the database is (g + L − 2m)/G. As long as all metagenomic sequences are at least length m, we can take L to be the average length of the metagenomic sequences, and we have f = (g + L − 2m)/G, where f is the fraction of the metagenomic sequences that match a gene of type U. Rearranging, we have G = (g + L − 2m)/f. The value m also takes into account the inverse relationship between the BLAST E value and minimum word (i.e., gene-like sequence) length. As the E value decreases, the minimum word length must increase for a BLAST match to be reported. For the work described here (with an E value of at most 10^-5), we take the value of m to be 90 nt (30 amino acids). General characteristics of each metagenome used are found in Table S1 in the supplemental material.
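The rearranged formula G = (g + L − 2m)/f translates directly into code. The Python sketch below is illustrative only; the function and parameter names are assumptions, and the numbers in the example are made up rather than taken from any of the metagenomes analyzed here.

```python
def average_genome_size(n_matching: int, n_reads: int, gene_len: float,
                        mean_read_len: float, min_match: float = 90.0) -> float:
    """Estimate average genome size G = (g + L - 2m) / f, in nucleotides.

    n_matching / n_reads is f, the fraction of reads matching the universal
    gene; gene_len is g; mean_read_len is L; min_match is m (90 nt in the text).
    """
    if n_matching <= 0 or n_reads <= 0:
        raise ValueError("read counts must be positive")
    # Multiply before dividing so that f is never formed as a tiny float.
    return (gene_len + mean_read_len - 2 * min_match) * n_reads / n_matching


# Hypothetical example: a 700-nt gene, 1,000-nt reads, and 50 matching reads
# out of 100,000 give (700 + 1000 - 180) / 0.0005 = 3,040,000 nt.
example = average_genome_size(50, 100_000, gene_len=700, mean_read_len=1000)
```

Because f sits in the denominator, genes with few matching reads give noisy estimates, which is one reason for averaging over several universal genes.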

When this formula is applied to a gene that is not universal, then G is not the genome size but the amount of genomic DNA in the community per copy of the gene. If G_U is the value for a universal gene, and G_S is the value for a metabolism-specific gene (e.g., a photosynthesis gene), then G_U/G_S is the fraction of the genomes in the community with the metabolism-specific gene.
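Writing the two quantities out shows why the ratio is insensitive to the number of reads. In the sketch below, subscripts U and S mark the universal and metabolism-specific genes; the read counts n_U and n_S and the total read count N are symbols introduced here for illustration, with f = n/N.

```latex
\[
G_U = \frac{g_U + L - 2m}{f_U}, \qquad
G_S = \frac{g_S + L - 2m}{f_S}
\]
\[
\frac{G_U}{G_S}
  = \frac{f_S}{f_U}\cdot\frac{g_U + L - 2m}{g_S + L - 2m}
  = \frac{n_S}{n_U}\cdot\frac{g_U + L - 2m}{g_S + L - 2m}
\]
```

The total read count N cancels, and the remaining factor corrects for the two genes having different lengths relative to the average read length L.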

Artificial metagenome creation.

To evaluate how well the eight chosen universal, single-copy protein coding genes estimate the average genome size of organisms sampled in a metagenome, artificial metagenomes of known composition were used. Twelve different metagenomes were constructed and separated into two groups based on read length. One group consisted of six metagenomes each containing 100,000 400-nt reads to simulate titanium 454 pyrosequencing, and the other consisted of six metagenomes each containing 100,000 1,026-nt reads to simulate Sanger sequencing. Each artificial metagenome consisted of a random combination of five different marine microbial genomes, each sampled from zero to a maximum of 60 times, totaling 100. The genomes used were Nitrosopumilus maritimus SCM1 (gi|161527512), Chloroflexus sp. Y-400-fl (gi|222523302), Nitrosomonas eutropha C91 (gi|114330036), Synechocystis sp. PCC 6803 (gi|16329170), and “Candidatus Pelagibacter ubique” HTCC1062 (gi|71082709).

After assigning abundances to each genome in the artificial metagenome, the genomes were made to appear circular by extending each end with a read-length number of nucleotides from the opposite end and randomly assembling them into a contiguous sequence (separated by a string of X's the size of the read length). The assembled sequence was then randomly sampled 100,000 times, extracting genomic regions of the desired length. If a random sampling contained any X's, the sample was shifted accordingly to make sure it consisted only of genomic data.
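The sampling scheme can be approximated in a few lines of Python. This sketch is a simplification and not the procedure actually used: instead of concatenating genome copies separated by X's and shifting reads that overlap a separator, it picks a genome with probability proportional to abundance times length and then takes a read from a random position on the circularized genome. All names are hypothetical, and the toy sequences stand in for real genomes.

```python
import random


def average_genome_size(genomes: dict, abundances: dict) -> float:
    """Abundance-weighted mean genome length of the artificial community."""
    total = sum(abundances.get(name, 0) for name in genomes)
    weighted = sum(abundances.get(name, 0) * len(seq)
                   for name, seq in genomes.items())
    return weighted / total


def sample_reads(genomes: dict, abundances: dict, read_len: int,
                 n_reads: int, seed: int = 0) -> list:
    """Draw fixed-length reads; assumes read_len <= every genome length."""
    rng = random.Random(seed)
    names = [n for n in genomes if abundances.get(n, 0) > 0]
    # A base is equally likely to be sampled wherever it sits, so a genome's
    # chance of being hit scales with (copies in community) x (genome length).
    weights = [abundances[n] * len(genomes[n]) for n in names]
    # Extend each genome with its own first read_len bases to mimic circularity.
    circular = {n: genomes[n] + genomes[n][:read_len] for n in names}
    reads = []
    for _ in range(n_reads):
        name = rng.choices(names, weights=weights, k=1)[0]
        start = rng.randrange(len(genomes[name]))
        reads.append((name, circular[name][start:start + read_len]))
    return reads


# Toy community: abundances sum to 100, as in the text; genome "c" is absent.
toy_genomes = {"a": "ACGT" * 100, "b": "GGCC" * 250, "c": "TTAA" * 50}
toy_abundances = {"a": 60, "b": 40, "c": 0}
toy_reads = sample_reads(toy_genomes, toy_abundances, read_len=40, n_reads=500)
toy_average = average_genome_size(toy_genomes, toy_abundances)
```

The known average genome size from such a community is the value against which the estimates from the universal genes can be checked.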

A log file was produced for each of the artificial metagenomes, providing a readout of which genomes were included, what their abundances were, and the average genome size. This information is included in Table S2 in the supplemental material.

RESULTS

Characterizing microbial community structure using SSU rRNA genes.

In most studies, the questions of interest are defined in terms of comparisons of communities, over space, time, or both. To illustrate this we focus on three GS metagenomes (GS017, GS026, and GS027), giving us the ability to study samples from the same depth but different locations, and six Ace Lake metagenomes (site 227 through site 232), giving us the ability to study samples from the same general location but at different depths. Figures S1 and S2 in the supplemental material show the phylogenetic distribution of the SSU rRNA genes observed with each of these samples. One of the abundant groups, in terms of raw SSU rRNA gene-like sequence numbers, in five of the nine metagenomes is that of the Cyanobacteria, a group of aerobic photoautotrophic carbon fixers who are important players in the global carbon cycle (41). In all six of the Ace Lake samples, one of the more abundant phylogenetic groups is that of CFB, which includes the group Chlorobi, made up of anaerobic, photoautotrophic carbon fixers (17, 25).

In GS017, 53 of the 368 SSU rRNA gene fragments are associated with the Cyanobacteria, or just over 14%, while they make up 16% (21 out of 130) in GS026 and 9% (34 out of 374) in GS027. Upon closer inspection, 96 out of the total 108 cyanobacterial sequences share between 96 and 97% identity with a Prochlorococcus marinus strain over an average of 720 nucleotides, or nearly half the gene, indicating that they are probably from the genus Prochlorococcus. The other sequences display lower levels of identity with their closest named relatives and might be from species of Cyanobacteria that do not have cultured representatives. While many sequences from each GS metagenome were found to match a representative from CFB, none of them were within the family Chlorobi.

Of the site 227 sequences, taken from 23 m below the surface, 2% (7 of the 426 SSU rRNA gene fragments) are associated with the Chlorobi and 1% (5 fragments) is associated with the Cyanobacteria. The data from other sites were as follows: from site 228, 18 m, 4% (16 of 447 SSU rRNA gene fragments) are Chlorobi and 4% (18 fragments) are Cyanobacteria; from site 229, 14 m, 15% (48 of 317) and 1% (3 fragments), respectively; from site 230, 12.7 m, 68% (318 of 468) and 0% (0 fragments), respectively; from site 231, 11.5 m, <1% (3 of 767) and 60% (461 fragments), respectively; and from site 232, 5 m, 2% (11 out of 518) and 33% (171 fragments), respectively. Upon closer inspection, 200 of the 403 Chlorobi sequences share between 97 and 98% identity with a Chlorobium phaeovibrioides strain over an average of 408 nucleotides, while 43 of the remaining 203 share between 96 and 98% identity with a Chlorobium luteum strain over an average of 410 nucleotides. For the Cyanobacteria, 381 of the 660 cyanobacterial sequences share between 90 and 95% sequence identity with a Cyanothece strain over an average of 375 nucleotides, while 270 of the remaining 279 share between 93 and 96% identity with a Prochlorococcus marinus strain over an average of 367 nucleotides. The remaining 160 Chlorobi and nine cyanobacterial sequences display lower levels of identity with their closest named relatives.

Average genome size as a foundation for quantitative analyses.

While SSU rRNA genes are good references for identifying the types of organisms present in an environment, there are caveats associated with using them to determine their relative abundances (see Discussion). As an alternative, one can use single-copy, universal genes (most of which are protein coding) (7).

To illustrate the analysis of protein coding genes in metagenomic data, we examine eight examples: six from the translation machinery—rplA, rplC, and rplD, encoding the large ribosomal subunit proteins L1, L3, and L4, respectively, and rpsG, rpsJ, and rpsQ, encoding the small subunit ribosomal proteins S7, S10, and S17, respectively—and two from the transcription machinery, rpoA and rpoB, encoding the RNA polymerase alpha and beta subunits, respectively.

To evaluate how well the abundances of each of the eight gene sequences estimate the average genome size of the organisms sampled in a metagenome, they were used to characterize 12 different artificial metagenomes (see Materials and Methods). Table S3 in the supplemental material summarizes an analysis of each of these metagenomes, in terms of the number of reads containing sequences similar to those of each of the target genes (described in the previous paragraph), and the corresponding estimates of average genome size (see Materials and Methods). Averaging the estimates based on the eight genes resulted in values that are, except for two, all less than 10% (averaging just over 5%) different than the actual average genome sizes (see Table S3).

To further substantiate our method, we compared it to the methods published by Angly et al. for average genome size (2) and Raes et al. for effective genome size (30), and our results for the six Sargasso Sea metagenomes are very similar (see Table S4 in the supplemental material). Table 1, based on the data in Table S5 in the supplemental material, summarizes the average genome size estimates for all nine metagenomes. The average genome size of the Ace Lake samples (3.1 Mbp) is higher than that of the GS samples (2.1 Mbp), while the number of genomes sampled ranges from 53 to 128.

TABLE 1.

Estimation of average genome size for nine marine metagenomes using the number of metagenomic reads containing sequences similar to genes encoding six translation-associated proteins (rplA, rplC, rplD, rpsG, rpsJ, and rpsQ) and two transcription-associated proteins (rpoA and rpoB)

Sample      Avg genome size calculations (kb)(a)                             Total avg (kb)
            rplA    rplC    rplD    rpoA    rpoB    rpsG    rpsJ    rpsQ
GS017       2,409   2,340   2,755   1,681   1,768   2,400   1,872   2,391   2,202
GS026       1,831   2,411   3,370   1,714   1,807   1,992   1,973   1,383   2,060
GS027       2,147   2,011   2,457   1,611   1,550   2,091   1,874   2,250   1,999
Site 227    4,000   3,926   4,592   1,931   3,094   3,389   3,209   3,007   3,394
Site 228    3,472   2,970   3,548   1,653   2,571   2,786   2,711   3,039   2,844
Site 229    3,935   3,218   3,845   1,533   3,245   3,194   3,081   2,868   3,115
Site 230    2,705   2,096   1,935   1,311   2,254   2,202   1,857   2,441   2,100
Site 231    4,566   3,182   2,719   2,950   3,014   2,566   2,896   3,285   3,147
Site 232    4,131   4,530   5,051   3,280   2,567   3,325   3,401   5,053   3,917

(a) Values based on data shown in Table S5 in the supplemental material.
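The total average in Table 1 is the arithmetic mean of the eight per-gene estimates. As a quick check, the Python snippet below recomputes it for the three GS samples from the values printed in the table; the variable names are illustrative.

```python
from statistics import mean

# Per-gene estimates (kb) from Table 1, in the order
# rplA, rplC, rplD, rpoA, rpoB, rpsG, rpsJ, rpsQ.
estimates_kb = {
    "GS017": [2409, 2340, 2755, 1681, 1768, 2400, 1872, 2391],
    "GS026": [1831, 2411, 3370, 1714, 1807, 1992, 1973, 1383],
    "GS027": [2147, 2011, 2457, 1611, 1550, 2091, 1874, 2250],
}

# Mean of the eight estimates, rounded to the nearest kb.
totals_kb = {sample: round(mean(values))
             for sample, values in estimates_kb.items()}
# totals_kb == {"GS017": 2202, "GS026": 2060, "GS027": 1999}
```

The individual genes differ noticeably within a sample (for GS026, from 1,383 kb to 3,370 kb), which is why the averaged value is the one carried forward.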

Assessing the biological significance of metagenomic sequence data through quantitative analyses.

Although the average genome size tends to increase with the metabolic flexibility of organisms in a community, it is rarely of interest in and of itself. On the other hand, it can serve as the basis for quantitatively characterizing the genomes sampled in a metagenome. In this section we extend this approach to evaluate the proportion of the organisms in different marine microbial communities capable of different types of carbon fixation, represented by the Cyanobacteria (aerobic, photosynthetic, carbon fixers), Chlorobi (anaerobic, photosynthetic carbon fixers), and Desulfobacteraceae (anaerobic, nonphotosynthetic carbon fixers) identified by our SSU rRNA gene analyses.

To calculate the abundances of Cyanobacteria in each metagenome, we evaluated the frequencies of two genes unique to the Calvin-Benson cycle: those encoding phosphoribulokinase (prk) (14) and the large subunit of ribulose bisphosphate carboxylase (rbcL) (6, 9, 20). The Chlorobi and the Desulfobacteraceae, on the other hand, utilize the reductive tricarboxylic acid (rTCA) cycle to fix carbon instead (12, 17, 25, 31). To determine the proportion of genomes carrying out carbon fixation by rTCA, we evaluated the abundances of five unique, single-copy genes: the alpha and beta subunits of 2-oxoglutarate oxidoreductase (2oxoalpha and 2oxobeta), pyruvate-flavodoxin oxidoreductase (NifJ), and the two subunits of ATP-citrate lyase (CL1 and CL2) (12, 36). Interestingly, CL1 and CL2 fall into two distinct groups, that of the Chlorobi/plant/fungus type and that of the Desulfobacter/Hydrogenobacter/Archaea type (22, 40). Because of this, the first three genes were used to determine the total proportion of genomes capable of rTCA, and Chlorobi/plant/fungus-type CL1 and CL2 (along with relatively high stringency) were used as a means to determine what fraction of the rTCA cycle-capable genomes are Chlorobi. The difference between the total number of rTCA-capable genomes and those harboring Chlorobi/plant/fungus-type CL1 and CL2 was used to determine the proportion of genomes that are desulfobacterial.
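The subtraction in the last step is simple enough to show with numbers. The Python sketch below applies it to the Ace Lake rows of Table 2 for which both a total rTCA and a Chlorobi-type value are reported; the function name and the clamp at zero are assumptions added for illustration.

```python
def desulfobacterial_fraction(total_rtca: float, chlorobi_type: float) -> float:
    """Fraction of genomes that are rTCA-capable but not Chlorobi type."""
    # Clamp at zero in case sampling noise makes the Chlorobi-type
    # estimate slightly exceed the total.
    return max(total_rtca - chlorobi_type, 0.0)


# (total rTCA, Chlorobi-type rTCA) fractions as printed in Table 2.
table2_rtca = {
    "Site 227": (0.12, 0.04),
    "Site 228": (0.17, 0.06),
    "Site 229": (0.26, 0.14),
    "Site 230": (0.69, 0.65),
    "Site 232": (0.06, 0.04),
}

by_difference = {
    site: round(desulfobacterial_fraction(total, chlorobi), 2)
    for site, (total, chlorobi) in table2_rtca.items()
}
# Site 227: 0.08, Site 228: 0.11, Site 229: 0.12, Site 230: 0.04, Site 232: 0.02
```

Because the result is a difference of two estimates, its relative uncertainty is largest where the two values are close, as at site 230.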

Analysis of GS samples.

The numbers of metagenomic reads with significant similarity to prk or rbcL are documented in Table S6 in the supplemental material. Taking the average between the results for their frequencies resulted in cyanobacterial genome proportion estimates of 15%, 20%, and 6% in GS017, GS026, and GS027, respectively (Table 2). These results are similar to the proportions of rRNA gene fragments (above). For the evaluations of rTCA-capable organisms, no sequencing reads with similarity to those genes were identified (see Table S6).

TABLE 2.

Estimation of the fraction of genomes in nine metagenomes containing genes specific to Cyanobacteria, Chlorobi, and Desulfobacteraceae, based on the frequencies of protein-coding genes unique to aerobic or anaerobic carbon fixation and SSU rRNA gene copy number per average genome(a)

Sample      Proportion (fraction) of genomes carrying trait(b)            SSU rRNA gene copy no.
            Calvin-Benson cycle   Total rTCA   Chlorobi-type rTCA
GS017       0.15                  —            —                          1.29
GS026       0.20                  —            —                          1.10
GS027       0.06                  —            —                          1.39
Site 227    0.04                  0.12         0.04                       1.38
Site 228    0.03                  0.17         0.06                       1.20
Site 229    0.02                  0.26         0.14                       1.14
Site 230    0.004                 0.69         0.65                       0.93
Site 231    0.46                  0.07         —                          2.40
Site 232    0.21                  0.06         0.04                       2.17

(a) Values based on data from Table S6 in the supplemental material. rTCA stands for reductive tricarboxylic acid cycle.

(b) —, not detected.

Analysis of Ace Lake samples.

The chemical and physical data gathered for each of the samples from Ace Lake display trends consistent with meromictic lakes, which commonly consist of three layers: the upper mixolimnion layer, a pycnocline (providing a physical barrier to mixing), and a lower monimolimnion layer (18). Figure 1A shows the characteristics of the three layers. Between the surface and 11.5-m depth, oxygen concentrations are high, temperature is low, and salinity is high. Between 11.5 and 14 m, there is a steep gradient in the decrease of oxygen and the increase of both salinity and temperature. Between 14 and 23 m, the amount of change levels off for each trait.

FIG. 1.


Characterization of the physical, chemical, and biological characteristics of the Ace Lake samples according to depth. (A) The oxygen concentrations, salinity, and temperature for each depth. (B) A comparison of the relative abundances of relevant organism types, in terms of both the proportion of SSU rRNA gene fragments and genomes that they make up at each depth. For the genome calculations, the organisms falling into the “other Deltaproteobacteria” group are placed into the “Other” group. Data displayed in panel B for proportion of genomes are taken from Tables S3 and S6 in the supplemental material.

Figure 1B displays both the percentages of SSU rRNA gene fragments made up of Cyanobacteria, Chlorobi, and Deltaproteobacteria according to depth and the percentages of genomes capable of aerobic photosynthetic carbon fixation, corresponding to the Cyanobacteria, and either anaerobic photosynthetic or nonphotosynthetic carbon fixation, corresponding to the Chlorobi and Desulfobacteraceae, respectively. For both relative SSU rRNA gene fragment abundances and proportions of genomes capable of the Calvin-Benson cycle, the Cyanobacteria dominate at 5 (33% of SSU rRNA gene fragments and 21% of genomes) and 11.5 m (60% of SSU rRNA gene fragments and 46% of genomes), peaking at 11.5 m. The Chlorobi dominate at 12.7 m and start decreasing at 14 m in both SSU rRNA gene fragment relative abundances (68% and 15% of SSU rRNA gene fragments, respectively) and proportion of genomes carrying Chlorobi/plant/fungus-type ATP-citrate lyase subunits (65% and 14% of genomes, respectively). The proportion of genomes capable of the rTCA cycle peaks in line with the peak of the Chlorobi but maintains abundance at lower depths, corresponding with the presence of the Desulfobacteraceae (69, 26, 17, and 12% of genomes at depths 12.7, 14, 18, and 23 m, respectively). The proportions of total SSU rRNA gene fragments from each sample that fall into the Desulfobacteraceae group for the depths between 12.7 and 23 m are 3, 12, 5, and 2%, respectively. Unlike with the GS samples, there is little correlation between SSU rRNA gene fragment and genome proportions (see Discussion).

DISCUSSION

As the number of sequencing projects increases, the amount of sequence data grows exponentially, putting great pressure on the ability to extract meaning from it. Many questions can be efficiently answered by a directed approach, in which one particular gene type is analyzed at a time. In the case of amplified genes, this is clearly appropriate. In the case of extracting a particular type of gene from a genome or metagenome, this is also more efficient than analyzing everything and then filtering the results for what was being sought. An additional benefit of this approach is a provision of quality control that is more difficult to achieve with an inspection of the end product of an annotation pipeline (a situation that mixes the best annotations with the worst).

The evaluation of specific genes in metagenomes also provides a method for comparison within or between samples. Without a means to normalize the sequence data, however, such as by the number of genomes sampled, one is limited to a superficial comparison based on counting the sequencing reads that correspond to genes of interest (5). The estimation of average genome size, which is strongly influenced by the fraction of reads containing a gene-like sequence, provides not only a way to estimate the number of genomes sampled in a metagenome but also a means for enumerating particular organism types and a way to compare between samples, regardless of differences in sequencing read numbers or length (5). By comparing the average genome size to the amount of DNA per instance of a metabolism-specific gene (or other gene unique to the organisms of interest), one can determine the proportion of genomes, and hence organisms, from an environment that are capable of that metabolism.

We have illustrated some of the quantitative analyses that can be performed on metagenomic data based on average genome size calculations. In choosing the examples, we have included alternative approaches to the same, or similar, questions. Thus, the organisms present in a metagenome have been surveyed by the rRNA genes (for which there is a vast database for comparison) and by universal protein-coding genes (which are present in only one copy per genome and hence potentially provide better enumerations). Similarly, the aerobic and anaerobic autotrophic carbon fixing components of the nine characterized communities have been surveyed by the phylogeny of the rRNA genes and by the frequencies of specific metabolic genes.

Some features that were encountered in these analyses, but were not elaborated upon, are the effects of gene length, sequence read length, and the length of sequence required for identifying genes of a given type (Materials and Methods). A larger gene covers a larger fraction of a genome and hence is easier to find and provides better statistics; longer sequencing reads provide a larger sample of the genome and hence increase the chance of finding any given target gene. The amount of data required to recognize a gene of interest is frequently ignored but has a very direct impact on the analyses. There are three particularly favorable factors for minimizing this requirement: (i) looking for a more conserved gene, (ii) incorporating many diverse examples of the gene into the reference sequence database, and (iii) looking for a gene with few if any paralogs. The first two are fairly obvious. They can also be addressed by using profile-based searches rather than BLASTX. The last factor is less obvious but is very important. Resolving members of a gene family for the particular gene of interest can be difficult and hence requires more data. Profile-based searches might be better at finding diverse genes, but they also increase the chances of finding paralogs. One of the reasons to choose a modular approach to our methods (by screening the output against percent identity and similarity thresholds) is that it facilitates the extraction of candidates and then the exploration of the impact of the secondary screens of the gene sequences extracted, without repeating the search of the primary data (which can be very large).

The issue of paralogs is illustrated by our search for prk genes. Our parameters for probing for specific protein-coding genes were very stringent: at least 50% sequence identity to one of the reference sequences and at least 60% similarity. The reason for this stringency is that even with a BLASTX E-value maximum of 10^−100, we were obtaining gene fragments of uridine kinase in our probing for prk because the proteins have extensive regions of similarity, consisting mainly of an ATP-binding domain. Even though the match qualities between the two proteins were low, usually around 26% identity and 35% similarity, the similarity was extensive enough that the alignment scores consistently had E values as low as 10^−142. Thus, it is clear that the common practice of relying on a low E value to identify a particular gene in environmental samples is not sufficient. Although not used here, one easily implemented alternative to requiring such a high percentage of identity is to explicitly require a better match to one group of reference sequences (e.g., phosphoribulokinase sequences) than to another (e.g., uridine kinase sequences).
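The two screening strategies described above can be sketched as follows. This is a minimal illustration, assuming the search output has already been parsed into records. The `Hit` fields, the group labels, and the use of bit score to decide the "better match" are assumptions made for the example; the second strategy was not used in this study.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    read_id: str
    ref_group: str        # hypothetical labels, e.g. "prk" or "uridine_kinase"
    pct_identity: float
    pct_similarity: float
    bit_score: float


def threshold_screen(hits, min_identity=50.0, min_similarity=60.0):
    """Keep hits that meet the identity and similarity thresholds in the text."""
    return [h for h in hits
            if h.pct_identity >= min_identity
            and h.pct_similarity >= min_similarity]


def best_group_screen(hits, target_group):
    """Keep reads whose best-scoring hit belongs to the target reference group."""
    best = {}
    for h in hits:
        if h.read_id not in best or h.bit_score > best[h.read_id].bit_score:
            best[h.read_id] = h
    return sorted(r for r, h in best.items() if h.ref_group == target_group)


# Hypothetical hits: read2 is a uridine kinase fragment that also matches prk.
hits = [
    Hit("read1", "prk", 72.0, 85.0, 310.0),
    Hit("read1", "uridine_kinase", 27.0, 36.0, 95.0),
    Hit("read2", "prk", 26.0, 35.0, 120.0),
    Hit("read2", "uridine_kinase", 64.0, 78.0, 280.0),
]
passed = [h.read_id for h in threshold_screen(hits) if h.ref_group == "prk"]
assigned = best_group_screen(hits, "prk")
print(passed, assigned)
```

Because both screens operate on the extracted hits, either can be rerun with different thresholds without repeating the search of the primary data.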

SSU rRNA gene analysis.

The three GS samples analyzed in detail were chosen specifically because they allow us to compare between samples of the same depth but from different locations. GS017 was taken from the open waters of the Yucatan strait between the Yucatan peninsula of Mexico and the west coast of Cuba, GS026 was collected from the open ocean waters between the Galapagos Islands and Ecuador, and GS027 was taken from coastal waters of Floreana Island (part of the Galapagos). It is thought that terrestrial nutrient runoff has an effect on microbial community structure, in that coastal communities have a higher ratio of heterotrophs to autotrophs than open ocean communities (26, 27), and these three locations give the opportunity to, at some level, evaluate proximity-to-land effects on photoautotrophic organism abundances.

We see evidence of proximity-to-land effects, in terms of both relative abundances of SSU rRNA gene-like fragments and calculations of average SSU rRNA gene copy number per average genome. In GS017 and GS027, there is a higher number of SSU rRNA gene-like fragments that associate with the Gammaproteobacteria (making up about 17% of the SSU rRNA gene fragments in both GS017 and GS027 compared to only 5% in GS026), indicative of the presence of heterotrophic (and possibly intestinal) microbes. The presence of these groups also corresponds with a slight increase in the average SSU rRNA gene copy number, relative to the open ocean sample of GS026, of 0.2 and 0.3 for GS017 and GS027, respectively. Though not greatly significant, this increase could indicate a higher proportion of metabolically flexible organisms, based on the idea that they have a higher copy number of SSU rRNA genes per genome (23, 34).

The Ace Lake samples give the opportunity for comparisons between samples from the same general location but at different depths. This, combined with the recorded values for oxygen concentrations, can allow us to evaluate the influence of oxygen levels on the types of photoautotrophs (Cyanobacteria and Chlorobi) at the different depths. Frigaard and Masuura (17) showed that the ability of the Chlorobi to carry out photosynthesis is retarded in aerobic conditions, but more recent studies testing predicted superoxide dismutase-like proteins showed that the Chlorobi have some tolerance toward the presence of oxygen and reactive oxygen species (25). By evaluating the different samples, we can get an idea for the range of oxygen concentrations in nature that is acceptable for their growth.

The phylogenetic distributions in the different samples (see Fig. S2 in the supplemental material) show a distinct fractionation of dominant functional roles (or the organisms that carry them out) according to depth. In the shallowest samples, the Cyanobacteria dominate based on SSU rRNA gene fragment abundances. Further down, their numbers decrease, while the Chlorobi dominate, and in the deepest samples, the dominance changes in favor of the Deltaproteobacteria (some of which belong to the Desulfobacteraceae family) and heterotrophs, mainly Clostridia.

It is clear from Fig. S2 in the supplemental material and Fig. 1B that the Cyanobacteria in these samples do not tolerate oxygen levels below 12 μg/liter, while the Chlorobi can tolerate oxygen levels of up to 6 μg/liter, which may be too high for the Desulfobacteraceae. These comparisons are also made through evaluating unique metabolic pathways (see below).

Interestingly, the types of Cyanobacteria found in the GS samples appear to be mainly from the genus Prochlorococcus, while less than half of the cyanobacterial sequences from Ace Lake share similarity to that group. Also of interest is that the ratio of Cyanothece-like to Prochlorococcus-like sequences changes from 1.65 to 1.28 between the depths of 5 and 11.5 m (see below) in the Ace Lake samples.

Average genome size calculation.

As mentioned above, SSU rRNA genes are good references for identifying the types of organisms present but, due to differences in copy numbers per genome (23), are not the best method of enumerating those organisms. Thus, the data in Fig. S1 and S2 in the supplemental material are not a direct measure of the organismal abundances per se. Although it is possible to partially correct for this in assessing each community's composition (in essence by inversely weighting each observed gene by an estimate of the rRNA gene copy number in its phylogenetic group), it would be very tedious and is limited to those phylogenetic groups for which data are available.
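The partial correction mentioned above, inversely weighting each observed gene by an estimated copy number for its phylogenetic group, can be sketched as follows. This is a minimal illustration; the group names, counts, and copy numbers are hypothetical.

```python
def copy_number_corrected_abundance(fragment_counts, copy_numbers):
    """Weight each group's SSU rRNA fragment count by 1 / copy number.

    Returns relative abundances that sum to 1. A group with no copy number
    estimate raises KeyError, reflecting the limitation noted in the text.
    """
    weighted = {g: n / copy_numbers[g] for g, n in fragment_counts.items()}
    total = sum(weighted.values())
    return {g: w / total for g, w in weighted.items()}


# Hypothetical example: equal fragment counts, unequal copy numbers.
counts = {"group_A": 100, "group_B": 100}
copies = {"group_A": 1, "group_B": 4}
corrected = copy_number_corrected_abundance(counts, copies)
print(corrected)
```

In this example the two groups contribute equal numbers of fragments, but the corrected abundance of the single-copy group is four times that of the four-copy group.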

As an alternative, the estimation of the average genome size for the organisms sampled in a metagenome can serve as the basis for quantitative analyses and the direct comparison between metagenomes. It can be determined by calculating the amount of DNA in a metagenome per instance of a target universal, single-copy gene. Since there is one copy of each gene per genome, we can use their frequency in the sequence data to directly measure the abundance of the corresponding genomes in the community. This can be used to characterize the relative abundances of phylogenetic groups or to evaluate more general parameters (e.g., the average genome size in the community). In the case of phylogenetic analysis, the greatest limitation is that available sequences from named species are much more limited than those in the data available for SSU rRNA genes. Even then, it is possible to assess the abundance of groups, but they remain “anonymous.”
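The core of this calculation can be sketched as follows. This is a simplified illustration, not the full procedure: it treats each marker count as a number of complete gene instances and does not reproduce the adjustments for gene length and read length described in Materials and Methods. Averaging the per-marker estimates, the marker names, and all numbers are assumptions made for the example.

```python
def average_genome_size(total_bp, marker_hits):
    """Estimate average genome size as DNA (bp) per single-copy marker instance.

    marker_hits maps a marker gene name to its number of observed instances.
    Returns the mean estimate and the per-marker estimates.
    """
    estimates = {gene: total_bp / n for gene, n in marker_hits.items() if n > 0}
    if not estimates:
        raise ValueError("no marker gene was observed")
    mean = sum(estimates.values()) / len(estimates)
    return mean, estimates


# Hypothetical metagenome: 300 Mbp of sequence and two single-copy markers.
mean_size, per_marker = average_genome_size(
    300_000_000, {"marker_1": 100, "marker_2": 120}
)
print(mean_size, per_marker)
```

Because the estimate is a ratio, doubling both the amount of sequence and the marker counts leaves it unchanged, which is why this quantity is useful for comparing metagenomes of different sequencing depths.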

This approach to evaluating the data can also remove variables that might differ between metagenomes, such as the average sequencing read length and the number of sequencing reads, which greatly influence the number of reads containing a sequence similar to that of a specific gene (5). Table S5 in the supplemental material demonstrates these effects through comparisons of average genome size calculations and the number of sequencing reads matching the gene used.

Metabolic analyses.

Calculating the proportion of genomes capable of a particular metabolism requires a slight modification to the calculation of average genome size. As with calculating the average genome size, one calculates the amount of DNA per instance of a metabolism-specific gene, but the proportion is determined by dividing the amount of DNA per genome by the amount of DNA per metabolism-specific gene. Table 2, based on the data in Table S6 in the supplemental material, shows what proportion of genomes in the nine marine environments are capable of aerobic, photosynthetic carbon fixation; anaerobic, photoautotrophic carbon fixation; and anaerobic, lithoautotrophic carbon fixation. These observations can be made with other gene types as well, and the case with SSU rRNA genes is also demonstrated in Table 2.
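The division described above can be written as a short sketch. As a simplification, the trait-gene count is treated as a number of complete gene instances; the function name and the numbers are hypothetical.

```python
def genome_proportion(total_bp, avg_genome_size, trait_gene_hits):
    """Proportion of genomes carrying a trait gene.

    Computed as (DNA per genome) / (DNA per trait-gene instance).
    """
    if trait_gene_hits == 0:
        return 0.0
    dna_per_trait_gene = total_bp / trait_gene_hits
    return avg_genome_size / dna_per_trait_gene


# Hypothetical metagenome: 300 Mbp of sequence, 3-Mbp average genome size,
# and 20 instances of a metabolism-specific gene.
proportion = genome_proportion(300_000_000, 3_000_000, 20)
print(proportion)
```

Here 300 Mbp at 3 Mbp per genome corresponds to 100 genome equivalents, so 20 gene instances give a proportion of 0.2. For a multicopy gene such as the SSU rRNA gene, the same ratio gives the average copy number per genome rather than a proportion.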

It is important to note that the genes chosen to quantify each type of autotroph are not unique to their respective identified bacterial lineages. The genes used to quantify the Cyanobacteria, encoding phosphoribulokinase (prk) and the large subunit of RuBisCO (rbcL), are not unique to the Cyanobacteria (13, 28), though none of the other organisms known to carry them are expected in these euphotic-zone, marine samples. Similarly, the rTCA cycle is found not only in the Chlorobi and Desulfobacteraceae (a group within the Deltaproteobacteria) but also in Hydrogenobacter (a genus within the Aquificales), and some rTCA cycle enzymes have been found in particular archaeal lineages (3, 12, 32). As with the Calvin-Benson cycle genes, the rTCA-capable archaeal and Aquificales lineages are not found at these locations (supported by their physiology and the SSU rRNA gene sequence data).

In the GS samples, the Cyanobacteria make up, at most, around 20% of the genomes sampled (Table 2). It is interesting that the Cyanobacteria are not more abundant. This may be indicative of the specific environments and their nutrient content, due to the relatively high abundance of Alphaproteobacteria (primarily “Ca. Pelagibacter,” which metabolizes dissolved organic carbon [19]). As mentioned above, it has been suggested that terrestrial nutrient runoff has an effect on microbial community structure, and the three locations chosen give the opportunity to evaluate proximity-to-land effects on autotrophic organism abundances.

Sample GS026 has the highest proportion of photosynthetic carbon fixers, at 20% of genomes, and is about 216 km from the Galapagos Islands and nearly 580 km from Ecuador, which indicates the lowest possible influence of terrestrial animals for all three samples. Sample GS017, from the Yucatan Channel, which is about 193 km from Mexico and Cuba (two relatively heavily populated areas), has the second highest proportion of carbon fixers, at 15% of the genomes. The presence of Gammaproteobacteria might indicate anthropogenic pollution or sewage. Sample GS027, from about 64 km northeast of the Galapagos Islands, has the lowest proportion of photosynthetic carbon fixers, at 6%, about a 3-fold drop. The number of heterotrophs here is also the highest and might be impacted the most by terrestrial runoff.

In the Ace Lake samples, the proportion of genomes made up of certain types of autotrophs is not dependent on proximity to land but on depth and physical boundaries between water layers. The Cyanobacteria make up from 21 to 46% of the genomes between 5 and 11.5 m in depth, and this might indicate a lack of nutrient content in the mixolimnion available to foster heterotrophic growth. However, with the drop in oxygen concentration (from 12.09 to 5.59 μg/liter) in the pycnocline, the Chlorobi, anaerobic photosynthetic carbon fixers, become numerically dominant, making up 65% of the genomes at a depth of 12.7 m. The abundance of Chlorobi from 18 to 12.7 m, corroborating our SSU rRNA gene data, indicates an oxygen tolerance of up to approximately 6 μg/liter.

Since the proportion of genomes capable of the rTCA cycle includes not only the Chlorobi but also the Desulfobacteraceae, we used databases of the Chlorobi/plant/fungus-type ATP-citrate lyase genes as a way to differentiate the two populations. For the samples from 12.7 m and deeper, the proportions of genomes capable of rTCA are 69, 26, 17, and 12% (Table 2), but the proportions of genomes capable of Chlorobi-type rTCA at those depths are 66, 15, 6, and 4%, indicating that the remaining 4, 12, 11, and 8% are made up of Desulfobacteraceae-type rTCA. Though this may not be a legitimate comparison (see below), it is quite interesting not only that the proportion of genomes capable of Chlorobi-type rTCA matches the proportions of Chlorobi SSU rRNA gene-like fragments in the different samples but that the latter percentages are very similar to the proportions of SSU rRNA gene-like fragments in the different samples that correspond to the Desulfobacteraceae group (3, 12, 5, and 2%, respectively) for the samples at depths 12.7 and 14 m (also visible in Fig. 1B). The relatively high abundances of rTCA-capable genomes also indicate, through the necessity of reducing oxidized sulfur compounds (33), that sulfur is abundant in Ace Lake.

The comparison of these data to those of the SSU rRNA gene sequence abundances can also help explain similarities and differences between the data types. In the case of the Cyanobacteria for the Ace Lake samples, when they are numerically dominant, the proportion of cyanobacterial SSU rRNA gene fragments is always greater than the proportion of genomes harboring the traits associated with aerobic photosynthetic carbon fixation. This could be due to the types of Cyanobacteria found at Ace Lake, which are a mixture of Cyanothece and Prochlorococcus. Sequenced representatives of the Prochlorococcus genus typically have one SSU rRNA gene per genome (11), while Cyanothece has two (38). The ratios of Cyanothece-like to Prochlorococcus-like SSU rRNA gene sequences in the Ace Lake samples from 5 and 11.5 m are 1.65 (104:63) and 1.28 (258:202), respectively. When these ratios are used as a correction for the SSU rRNA gene copy number, the proportion of SSU rRNA gene fragments made up of Cyanobacteria for those samples changes from 33 and 60% to 20 and 46.88%, respectively, which are close to the values obtained for the proportion of genomes carrying prk and rbcL (21% at 5 m and 47% at 11.5 m). This line of logic also fits with the cyanobacterial data from the GS samples (dominated by Prochlorococcus) in contrast to samples dominated by Chlorobi. Unlike the Cyanobacteria, the Chlorobi have been less studied and there are fewer completely sequenced representatives, making such analyses nearly impossible. Furthermore, these calculations can safely be done only for communities that have one clearly dominant organism type (as in the Ace Lake samples from 5 and 11.5 m) or are made up of organisms with similar genomes, in terms of size and SSU rRNA gene copy number (as with the GS samples).
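The arithmetic behind the corrected percentages can be reproduced with a short sketch. It divides the SSU rRNA gene fragment percentage by the Cyanothece-to-Prochlorococcus ratio rounded to two decimals, which is the calculation that yields the values quoted above; the function name is hypothetical.

```python
def ratio_corrected_percentage(ssu_percentage, cyanothece_to_prochlorococcus):
    """Divide the SSU rRNA fragment percentage by the sequence ratio."""
    return ssu_percentage / cyanothece_to_prochlorococcus


# Ratios from the sequence counts given in the text, rounded to two decimals.
ratio_5m = round(104 / 63, 2)       # Cyanothece:Prochlorococcus at 5 m
ratio_11_5m = round(258 / 202, 2)   # Cyanothece:Prochlorococcus at 11.5 m

corrected_5m = ratio_corrected_percentage(33, ratio_5m)
corrected_11_5m = ratio_corrected_percentage(60, ratio_11_5m)
print(round(corrected_5m, 2), round(corrected_11_5m, 2))
```

The corrected values of about 20 and 46.88% can then be compared directly with the proportions of genomes carrying prk and rbcL.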

Implications.

The calculation of community evenness by comparing relative abundances of SSU rRNA gene-like sequences may be an oversimplification of the available data. It is a safe assumption that nearly all communities are made up of organisms containing different SSU rRNA gene copy numbers, and this, along with differences in genome size, is where the complications lie. The probability of sequencing a particular type of SSU rRNA gene is dependent upon not only the abundances of those specific organisms but also the SSU rRNA gene copy number and genome size.

Having a genome of a relatively large size increases the probability of sampling from that genome, and the same logic applies to genomes that contain a higher copy number of SSU rRNA genes. Larger genomes with fewer SSU rRNA gene copies will not only increase the average genome size for a particular metagenome but also decrease the average number of SSU rRNA gene copies, since there is a higher probability of sequencing from a region with no SSU rRNA genes. This is demonstrated above in the comparisons between the GS and Ace Lake samples.

While, on average, the three GS and six Ace Lake samples have similar SSU rRNA gene copy numbers, the average genome size is nearly 1 megabase larger in the Ace Lake samples. In every case of the GS samples, the dominant phylogenetic group, in terms of SSU rRNA gene-like sequence numbers, is the Alphaproteobacteria, more specifically, the organisms from the “Ca. Pelagibacter” genus, with a genome size of around 1.3 Mbp and only one SSU rRNA gene. Because of their small genome size and SSU rRNA gene copy number, the only reason that they make up such a large proportion of the SSU rRNA gene fragments is their abundance. While some of the GS samples have average genome sizes higher than that of dominating “Ca. Pelagibacter,” this is due to the abundance of organisms from the Cytophaga/Flavobacteria/Bacteroides group, which tend to have a genome size around 4 Mbp and between 6 and 11 SSU rRNA gene copies, the Cyanobacteria, which tend to have genomes over 3 Mbp and either 1 or 2 SSU rRNA gene copies, and the Gammaproteobacteria, with genomes around 3.5 Mbp and between 4 and 7 SSU rRNA gene copies. When looking at the three GS samples evaluated in detail here, these last three phylogenetic groups appear to have a high influence on the overall community characterizations. This again might be related to the land proximity of two of the three samples and the higher proportions of heterotrophs with larger genomes. On average, the last three phylogenetic groups have genome sizes and SSU rRNA gene copy numbers around 3-fold greater than that of “Ca. Pelagibacter,” meaning that their actual abundances could be one-third that of “Ca. Pelagibacter” but be reflected in equal abundances due to sampling probability.
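The sampling-probability argument can be illustrated with a small sketch. It assumes that a group's share of all reads is proportional to its cell abundance times its genome size, and that its share of SSU rRNA gene fragments is proportional to its abundance times its copy number. The second group and all of its values are hypothetical, chosen to be 3-fold those of a 1.3-Mbp, single-copy genome.

```python
def sampling_shares(groups):
    """Return (share of all reads, share of SSU rRNA fragments) per group.

    groups maps a name to (relative cell abundance, genome size in Mbp,
    SSU rRNA gene copies per genome).
    """
    read_w = {g: a * size for g, (a, size, _) in groups.items()}
    ssu_w = {g: a * copies for g, (a, _, copies) in groups.items()}
    read_total = sum(read_w.values())
    ssu_total = sum(ssu_w.values())
    return {g: (read_w[g] / read_total, ssu_w[g] / ssu_total) for g in groups}


# Hypothetical community: the small-genome group is 3 times as abundant,
# but the other group has a 3-fold larger genome and 3-fold more copies.
community = {
    "small_genome_group": (3, 1.3, 1),
    "large_genome_group": (1, 3.9, 3),
}
shares = sampling_shares(community)
print(shares)
```

Under these assumptions both groups account for half of the reads and half of the SSU rRNA gene fragments, even though one is three times as abundant as the other.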

Concerning the Ace Lake samples, since the community structures differ, the average genome sizes and SSU rRNA gene copy numbers change also, better depicting the influence of organisms with either (or both) of these traits. The samples at 5 and 11.5 m are both dominated by Cyanobacteria (though at 5 m, it is closer to codominance with Alphaproteobacteria); the average genome sizes are around 3.5 Mbp, and the SSU rRNA gene copy number is around 2.2 (because of the influence of Cyanothece). At 12.7 m, the dominant group is by far the Chlorobi, which is reflected in the average genome size drop to 2.1 Mbp and the SSU rRNA gene copy number of around 1. The three deepest samples have average genome sizes around 3 Mbp, corresponding to the increase of anaerobic heterotrophs, though the SSU rRNA gene copy number increases only from 1 to around 1.25. The SSU rRNA gene fragment abundance ratios of Cyanothece to Prochlorococcus also demonstrate the effects of the SSU rRNA gene copy number by overestimating the number of Cyanobacteria relative to the proportion of genomes capable of aerobic, photosynthetic carbon fixation.

Figure 1B shows how the different data types (SSU rRNA genes and proportion of genomes) can affect our characterization of the different samples. It appears that for the samples at 5 and 11.5 m, the abundance of Cyanobacteria is overestimated by SSU rRNA gene fragment abundances relative to the proportion of genomes capable of aerobic, photosynthetic carbon fixation. As mentioned above, this overestimation is due to the ratio of Cyanothece (which has two SSU rRNA genes) to Prochlorococcus (which has one) but can be corrected for by normalization. Similarly, the samples at 18 and 23 m appear to underestimate the number of the Desulfobacteraceae in the SSU rRNA gene characterization compared to the proportion of genomes capable of anaerobic, nonphotosynthetic, autotrophic carbon fixation. This could be due to the large number of heterotrophs, such as the Clostridia, which tend to have larger genome sizes and higher SSU rRNA gene copy numbers (ranging from 6 to 11), which will increase the probability of them being sampled over organisms with smaller genomes, such as the Desulfobacteraceae (5).

Because SSU rRNA gene fragment abundances are influenced not only by organism abundances but also by the genome sizes and SSU rRNA gene copy numbers of the abundant organisms, communities can be evaluated solely by these abundances only in certain circumstances. These circumstances appear to be when the community members have genome sizes and SSU rRNA gene copy numbers close to the community averages, which also results in the averages being closer to the median.

Caveats.

The above comparative analyses of SSU rRNA gene and metabolic gene data between samples must be taken as examples of what can be done through the use of available SSU rRNA gene data and average genome size normalizations. While the data from the GS samples provide evidence supporting the proximity-to-land effect and the Ace Lake samples show differences in community structure corresponding to the different layers of a meromictic lake and differences in metadata, the samples were not replicated prior to sequencing; therefore, there is no way to determine with confidence whether the differences between samples are within the margin of variability for each particular location or reflect actual differences between locations.

Supplementary Material

[Supplemental material]

Acknowledgments

We thank Gary Olsen and Abigail Salyers for their helpful discussions. We also thank Jim Cole and Jim Garrity at Michigan State University for invaluable help in preparing the collection of SSU rRNA gene context sequences.

Footnotes

Published ahead of print on 11 February 2011.

Supplemental material for this article may be found at http://aem.asm.org/.

REFERENCES

  • 1. Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403-410.
  • 2. Angly, F. E., et al. 2009. The GAAS metagenomic tool and its estimations of viral and microbial average genome size in four major biomes. PLoS Comput. Biol. 5:e1000593.
  • 3. Beh, M., G. Strauss, R. Huber, K. O. Stetter, and G. Fuchs. 1993. Enzymes of the reductive citric acid cycle in the autotrophic eubacterium Aquifex pyrophilus and the archaebacterium Thermoproteus neutrophilus. Arch. Microbiol. 160:306-311.
  • 4. Béjà, O., E. N. Spudich, J. L. Spudich, M. Leclerc, and E. F. DeLong. 2001. Proteorhodopsin phototrophy in the ocean. Nature 411:786-789.
  • 5. Beszteri, B., B. Temperton, S. Frickenhaus, and S. J. Giovannoni. 2010. Average genome size: a potential source of bias in comparative metagenomics. ISME J. 4:1075-1077.
  • 6. Buchanan, B. B., and R. Sirevåg. 1976. Ribulose 1,5-diphosphate carboxylase and Chlorobium thiosulfatophilum. Arch. Microbiol. 109:15-19.
  • 7. Case, R. J., et al. 2007. Use of 16S rRNA and rpoB genes as molecular markers for microbial ecology studies. Appl. Environ. Microbiol. 73:278-288.
  • 8. Reference deleted.
  • 9. Codd, G. A., and D. Vakeria. 1987. Enzymes and genes of microbial autotrophy. Microbiol. Sci. 4:154-159.
  • 10. Cole, J. R., et al. 2003. The Ribosomal Database Project (RDP-II): previewing a new autoaligner that allows regular updates and the new prokaryotic taxonomy. Nucleic Acids Res. 31:442-443.
  • 11. Dufresne, A., et al. 2003. Genome sequence of the cyanobacterium Prochlorococcus marinus SS120, a nearly minimal oxyphototrophic genome. Proc. Natl. Acad. Sci. U. S. A. 100:10020-10025.
  • 12. Eisen, J. A., et al. 2002. The complete genome sequence of Chlorobium tepidum TLS, a photosynthetic, anaerobic, green-sulfur bacterium. Proc. Natl. Acad. Sci. U. S. A. 99:9509-9514.
  • 13. Finn, M. W., and F. R. Tabita. 2004. Modified pathway to synthesize ribulose 1,5-bisphosphate in methanogenic Archaea. J. Bacteriol. 186:6360-6366.
  • 14. Flügge, U. I., M. Stitt, M. Freisl, and H. W. Heldt. 1982. On the participation of phosphoribulokinase in the light reaction of CO2 fixation. Plant Physiol. 69:263-267.
  • 15. Frank, J. A., et al. 2008. Critical evaluation of two primers commonly used for amplification of bacterial 16S rRNA genes. Appl. Environ. Microbiol. 74:2461-2470.
  • 16. Frigaard, N. U., A. Martinez, T. J. Mincer, and E. F. DeLong. 2006. Proteorhodopsin lateral gene transfer between marine planktonic Bacteria and Archaea. Nature 439:847-850.
  • 17. Frigaard, N. U., and K. Masuura. 1999. Oxygen uncouples light absorption by the chlorosome antenna and photosynthetic electron transfer in the green sulfur bacterium Chlorobium tepidum. Biochim. Biophys. Acta 1412:108-117.
  • 18. Gibson, J. A. E. 1999. The meromictic lakes and stratified marine basins of the Vestfold Hills, East Antarctica. Antarctic Sci. 11:175-192.
  • 19. Giovannoni, S. J., et al. 2005. Proteorhodopsin in the ubiquitous marine bacterium SAR11. Nature 438:82-85.
  • 20. Gontero, B., M. L. Cárdenas, and J. Ricard. 1988. A functional five-enzyme complex of chloroplasts involved in the Calvin cycle. Eur. J. Biochem. 173:437-443.
  • 21. Handelsman, J., M. R. Rondon, S. F. Brady, J. Clardy, and R. M. Goodman. 1998. Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chem. Biol. 5:R245-R249.
  • 22. Kanao, T., T. Fukui, H. Atomi, and T. Imanaka. 2001. ATP-citrate lyase from the green sulfur bacterium Chlorobium limicola is a heteromeric enzyme composed of two distinct gene products. Eur. J. Biochem. 268:1670-1678.
  • 23. Klappenbach, J. A., J. M. Dunbar, and T. M. Schmidt. 2000. rRNA operon copy number reflects ecological strategies of bacteria. Appl. Environ. Microbiol. 66:1328-1333.
  • 24. Larsen, N., et al. 1993. The Ribosomal Database Project. Nucleic Acids Res. 21:3021-3023.
  • 25. Li, H., S. Jubelirer, A. M. Garcia Costas, N. U. Frigaard, and D. A. Bryant. 2009. Multiple antioxidant proteins protect Chlorobaculum tepidum against oxygen and reactive oxygen species. Arch. Microbiol. 191:853-867.
  • 26. Moran, M. A., and W. L. Miller. 2007. Resourceful heterotrophs make the most of light in the coastal ocean. Nat. Rev. Microbiol. 5:792-800.
  • 27. Mou, X., S. Sun, R. A. Edwards, R. E. Hodson, and M. A. Moran. 2008. Bacterial carbon processing by generalist species in the coastal ocean. Nature 451:708-711.
  • 28. Mueller-Cajar, O., and M. R. Badger. 2007. New roads lead to Rubisco in Archaebacteria. BioEssays 29:722-724.
  • 29. Overbeek, R., et al. 2005. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res. 33:5691-5702.
  • 30. Raes, J., J. O. Korbel, M. J. Lercher, C. von Mering, and P. Bork. 2007. Prediction of effective genome size in metagenomic samples. Genome Biol. 8:R10.
  • 31. Schauder, R., F. Widdel, and G. Fuchs. 1987. Carbon assimilation pathways in sulfate-reducing bacteria II. Enzymes of a reductive citric acid cycle in the autotrophic Desulfobacter hydrogenophilus. Arch. Microbiol. 148:218-225.
  • 32. Shieh, J. S., and W. B. Whitman. 1987. Pathway of acetate assimilation in autotrophic and heterotrophic methanococci. J. Bacteriol. 169:5327-5329.
  • 33. Sirevåg, R., and J. G. Ormerod. 1970. Carbon dioxide fixation in green sulphur bacteria. Biochem. J. 120:399-408.
  • 34. Stevenson, B. S., and T. M. Schmidt. 2004. Life history implications of rRNA gene copy number in Escherichia coli. Appl. Environ. Microbiol. 70:6670-6677.
  • 35. Venter, J. C., et al. 2004. Environmental genome shotgun sequencing of the Sargasso Sea. Science 304:66-74.
  • 36. Wahlund, T. M., and F. R. Tabita. 1997. The reductive tricarboxylic acid cycle of carbon dioxide assimilation: initial studies and purification of ATP-citrate lyase from the green sulfur bacterium Chlorobium tepidum. J. Bacteriol. 179:4859-4867.
  • 37. Wegley, L., R. Edwards, B. Brito-Rodriguez, and F. Rohwer. 2007. Metagenomic analysis of the microbial community associated with the coral Porites asteoides. Environ. Microbiol. 9:2707-2719.
  • 38. Welsh, E. A., et al. 2008. The genome of Cyanothece 51142, a unicellular diazotrophic cyanobacterium important in the marine nitrogen cycle. Proc. Natl. Acad. Sci. U. S. A. 105:15094-15099.
  • 39. Wheeler, D. L., et al. 2007. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 35(Database issue):D5-D12.
  • 40. Yoon, K. S., R. Hille, C. Hemann, and F. R. Tabita. 1999. Rubredoxin from the green sulfur bacterium Chlorobium tepidum functions as an electron acceptor for pyruvate ferredoxin oxidoreductase. J. Biol. Chem. 274:29772-29778.
  • 41. Zwirglmaier, K., et al. 2008. Global phylogeography of marine Synechococcus and Prochlorococcus reveals distinct partitioning of lineages among oceanic biomes. Environ. Microbiol. 10:147-161.

Articles from Applied and Environmental Microbiology are provided here courtesy of American Society for Microbiology (ASM)
