Quantitative assessment of protein function prediction from metagenomics shotgun sequences -- Harrington et al. 104 (35): 13913 Data Supplement -- Proceedings of the National Academy of Sciences

Harrington et al. 10.1073/pnas.0702636104.

Supporting Information

Files in this Data Supplement:

SI Table 1
SI Table 2
SI Table 3
SI Table 4
SI Table 5
SI Figure 5
SI Figure 6
SI Figure 7
SI Table 6
SI Table 7
SI Text
SI Table 8
SI Figure 8
SI Figure 9
SI Figure 10
SI Figure 11
SI Figure 12
SI Figure 13
SI Figure 14

SI Figure 5

Fig. 5. Metagenomic ORFs with different functional characterizations have different length distributions. ORFs that cannot be characterized by similarity methods are significantly shorter than those that can.

SI Figure 6

Fig. 6. A comparison of the homology and neighborhood methods applied to the metagenomic data sets across three different bit score cutoffs. For a more detailed look at the effect of the bit score cutoff on homology-based methods, see SI Fig. 13; for neighborhood methods, see SI Figs. 8-11.

SI Figure 7

Fig. 7. Results of the homology and neighborhood methods applied to four representative prokaryotic species.

SI Figure 8

Fig. 8. Neighborhood method applied to Surface Sea Water data at three different bit score cutoffs. Each column shows the method applied at a different bit score cutoff, which affects both the detection of conserved neighborhoods and the stringency of the KEGG mapping used for the benchmark data set. Row A shows a 2-dimensional histogram of all of the codirectionally transcribed neighborhoods in the data set, binned on the x axis by intergenic distance and on the y axis by evolutionary distance (see SI Text for full description). Row B shows the benchmark data: for each intergenic and evolutionary distance, p (the proportion of neighborhoods where both genes are functionally related) is shown. Row C shows the interpolation of the data in row B. Row D shows the proportion of neighborhoods with p greater than the cutoff on the x axis, using the predictions from the interpolation in row C.

SI Figure 9

Fig. 9. Neighborhood method applied to Minnesota Soil data at three different bit score cutoffs. Each column shows the method applied at a different bit score cutoff, which affects both the detection of conserved neighborhoods and the stringency of the KEGG mapping used for the benchmark data set. Row A shows a 2-dimensional histogram of all of the codirectionally transcribed neighborhoods in the data set, binned on the x axis by intergenic distance and on the y axis by evolutionary distance (see SI Text for full description). Row B shows the benchmark data: for each intergenic and evolutionary distance, p (the proportion of neighborhoods where both genes are functionally related) is shown. Row C shows the interpolation of the data in row B. Row D shows the proportion of neighborhoods with p greater than the cutoff on the x axis, using the predictions from the interpolation in row C.

SI Figure 10

Fig. 10. Neighborhood method applied to Whale Fall data at three different bit score cutoffs. Each column shows the method applied at a different bit score cutoff, which affects both the detection of conserved neighborhoods and the stringency of the KEGG mapping used for the benchmark data set. Row A shows a 2-dimensional histogram of all of the codirectionally transcribed neighborhoods in the data set, binned on the x axis by intergenic distance and on the y axis by evolutionary distance (see SI Text for full description). Row B shows the benchmark data: for each intergenic and evolutionary distance, p (the proportion of neighborhoods where both genes are functionally related) is shown. Row C shows the interpolation of the data in row B. Row D shows the proportion of neighborhoods with p greater than the cutoff on the x axis, using the predictions from the interpolation in row C.

SI Figure 11

Fig. 11. Neighborhood method applied to Acid Mine data at three different bit score cutoffs. Each column shows the method applied at a different bit score cutoff, which affects both the detection of conserved neighborhoods and the stringency of the KEGG mapping used for the benchmark data set. Row A shows a 2-dimensional histogram of all of the codirectionally transcribed neighborhoods in the data set, binned on the x axis by intergenic distance and on the y axis by evolutionary distance (see SI Text for full description). Row B shows the benchmark data: for each intergenic and evolutionary distance, p (the proportion of neighborhoods where both genes are functionally related) is shown. Row C shows the interpolation of the data in row B. Row D shows the proportion of neighborhoods with p greater than the cutoff on the x axis, using the predictions from the interpolation in row C.

SI Figure 12

Fig. 12. Neighborhood method applied to four different prokaryotic species. Row A shows a 2-dimensional histogram of all of the codirectionally transcribed neighborhoods in the data set, binned on the x axis by intergenic distance and on the y axis by evolutionary distance (see SI Text for full description). Row B shows the benchmark data: for each intergenic and evolutionary distance, p (the proportion of neighborhoods where both genes are functionally related) is shown. Row C shows the interpolation of the data in row B. Row D shows the proportion of neighborhoods with p greater than the cutoff on the x axis, using the predictions from the interpolation in row C. Note that, for clarity, the axis limits are the same for all graphs. However, because genome architecture and the level of neighborhood conservation available differ between species, the benchmark data may not extend over the full range, which causes the blocked appearance of the interpolation in row C. The different genome architectures influence the relationship between intergenic and evolutionary distance and p.

SI Figure 13

Fig. 13. Similarity-based functional annotation of four metagenomic data sets at three different bit score cutoffs. The smaller pie charts show the amount of functional characterization possible by using each of the sources of functional annotation individually, whereas the large pie chart shows the combination of these according to the procedure described in Materials and Methods. Note that the bit score cutoff applies only to the COG, KEGG, and UniRef90 mappings, and remote homology is the same as the UniRef mapping with a 40-bit cutoff.

SI Figure 14

Fig. 14. Parameter exploration to decide threshold over which environmental ORFs can be considered characterized based on their hits against UniRef. Shown is the proportion of ORFs considered "characterized" based on the proportion of their hits in UniRef90 that are characterized. In theory, any metagenomic ORF that hits a characterized cluster could be considered characterized; however, because of false-positive and -negative rates of the classification method and error propagation in automatically annotated databases, we used a threshold to limit the effect of spurious annotations. ORFs were considered characterized if >20% of the UniRef90 clusters they hit are characterized. Other values of this parameter do not greatly affect the number of ORFs functionally characterized.

SI Text

Functional Classification Using Homology. BLAST parameters.

Each data set was subjected to BLAST analysis against itself and each of the other data sets. To functionally characterize the data, we subjected each data set to BLAST analysis against proteins from the STRING database (v6) and the UniRef90 database (downloaded 29 March 2006). The parameters used for each search are '-p blastp -M BLOSUM62 -G 11 -E 1 -z 10000000 -Y 10000000 -v 300 -b 300'. To assess the sensitivity of our method to different cutoffs, we carried out all analyses using 40, 60, and 80 bit score cutoffs (see SI Fig. 13), which correspond to E-values of approximately 10^-1, 10^-8, and 10^-14 in a BLAST against the UniRef90 database with the above alignment parameters (except -z and -Y).
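The link between a bit score cutoff and an E-value can be illustrated with the standard Karlin-Altschul relation E = m * n * 2^(-S'), where S' is the bit score and m * n is the effective search space. The sketch below is illustrative only: the search-space value is a hypothetical placeholder, not a figure from the paper, and the real value depends on query length and database size. Because every additional 20 bits lowers E by a factor of 2^20 (about 10^6), the E-values quoted in the text are best read as orders of magnitude.

```python
def evalue_from_bits(bit_score: float, search_space: float) -> float:
    """Karlin-Altschul relation: E = search_space * 2**(-bit_score)."""
    return search_space * 2.0 ** (-bit_score)


# Hypothetical effective search space (query length x database length).
# This is an illustrative assumption, not a value reported in the paper.
SEARCH_SPACE = 3e10

if __name__ == "__main__":
    for bits in (40, 60, 80):
        print(f"{bits} bits -> E = {evalue_from_bits(bits, SEARCH_SPACE):.1e}")
```

The function name `evalue_from_bits` and the constant `SEARCH_SPACE` are introduced here for illustration.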

Classifying UniRef clusters as functionally characterized or uncharacterized.

To be able to integrate functional information based on similarity to UniRef90 clusters, we first had to divide the UniRef90 database into characterized and uncharacterized clusters. Cluster names matching the regular expression

were classified as functionally uncharacterized, and the remaining clusters were considered characterized. On this basis, 55% (1,086,355) of the UniRef90 clusters were considered functionally characterized. It would be extremely difficult to develop a regular expression that can detect all functionally uninformative annotation. We therefore took a random sample of 200 clusters and manually checked our functional classification. From this we estimate that approximately 4% of clusters are incorrectly classified as characterized (false positives) versus 14% that are incorrectly classified as uncharacterized (false negatives). In theory, any ORF that hits a characterized cluster could be considered characterized; however, because of false-positive and -negative rates of the classification method and error propagation in automatically annotated databases (1), we used a threshold to limit the effect of spurious annotations. ORFs were considered characterized if >20% of the UniRef90 clusters they hit are characterized (see SI Fig. 14). To make the results comparable between the prokaryotic genomes and the environmental data sets, we removed self-hits from the results of the BLAST between the prokaryotic genomes and UniRef90 by excluding all 100%-identical hits, unless the target cluster was composed of sequences from more than one species.
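The two-step rule described above (name-based cluster classification, then a >20% threshold per ORF) can be sketched as follows. The regular expression in this sketch is a hypothetical stand-in, since the pattern actually used is not reproduced in this text; the function names are likewise illustrative.

```python
import re

# Hypothetical pattern for functionally uninformative cluster names.
# The paper's actual regular expression is not available in this text.
UNINFORMATIVE = re.compile(r"hypothetical|uncharacterized|unknown", re.IGNORECASE)


def cluster_is_characterized(cluster_name: str) -> bool:
    """A cluster is treated as characterized if its name does not match the pattern."""
    return UNINFORMATIVE.search(cluster_name) is None


def orf_is_characterized(hit_cluster_names, threshold: float = 0.20) -> bool:
    """An ORF is characterized if more than `threshold` of its hit clusters are."""
    names = list(hit_cluster_names)
    if not names:
        return False  # an ORF with no hits cannot be characterized this way
    n_characterized = sum(cluster_is_characterized(n) for n in names)
    return n_characterized / len(names) > threshold
```

Note the strict inequality: an ORF with exactly one characterized hit out of five (20%) does not pass the threshold.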

Benchmark of homology-based method.

Any attempt to automatically provide functional annotation for a large data set is prone to a range of potential errors (2). To test the sensitivity of our homology-based classification method to such errors, we took a random sample of 100 ORFs classified as having specific functional annotation and carried out a detailed manual analysis, based on which we estimate that the overall false-positive rate is 5%, and the false negative rate is 18%.

Functional Classification Using Neighborhood. Dealing with overlapping gene pairs.

The difficulty involved in predicting translation initiation sites has led to the prediction of a large number of overlapping genes (3) in both the fully sequenced genomes and the metagenomic data. Some of these genes are in the same phase and therefore likely to be artifacts of the gene prediction process; however, there are also many ORFs with long overlaps. Although some of these may represent real overlaps, manual inspection revealed that many are likely to be mispredictions. To reduce the effect that these might have on our analysis, where two genes overlapped by >100 nt or overlapped in the same phase, we removed the shorter gene from the analysis. The 124 prokaryotic genomes used in this analysis (SI Table 7) were chosen to have relatively few large overlaps.
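The overlap filter can be sketched as below. This is a minimal illustration under stated assumptions: genes are given as 1-based inclusive coordinates, and "same phase" is interpreted as same strand and same reading frame (coordinates congruent modulo 3), which is an interpretation rather than a definition given in the text. The `Gene` class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive, end >= start
    strand: str  # "+" or "-"

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def overlap_nt(a: Gene, b: Gene) -> int:
    """Number of nucleotides shared by the two genes (0 if disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def same_phase(a: Gene, b: Gene) -> bool:
    """Assumed meaning of 'same phase': same strand and same reading frame."""
    if a.strand != b.strand:
        return False
    if a.strand == "+":
        return (a.start - b.start) % 3 == 0
    return (a.end - b.end) % 3 == 0


def filter_overlaps(genes, max_overlap: int = 100):
    """Drop the shorter gene of any pair overlapping by >max_overlap nt or in phase."""
    ordered = sorted(genes, key=lambda g: (g.start, g.end))
    removed = set()
    for i, a in enumerate(ordered):
        if a in removed:
            continue
        for b in ordered[i + 1:]:
            if b.start > a.end:
                break  # later genes start even further right
            if b in removed:
                continue
            ov = overlap_nt(a, b)
            if ov > max_overlap or (ov > 0 and same_phase(a, b)):
                removed.add(min(a, b, key=lambda g: g.length))
                if a in removed:
                    break
    return [g for g in ordered if g not in removed]
```

Genes already removed are skipped, so a gene is not penalized for overlapping a neighbor that has itself been discarded.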

Measuring evolutionary distance.

To investigate the conservation of neighborhoods, we constructed a graph for each set of homologous neighborhoods for the metagenomic data sets at each of the three bit score cutoffs (40, 60, and 80) and for the 124 prokaryotic genomes at a single 60-bit cutoff. An edge was placed between two neighborhoods if there were BLAST hits above the cutoff between both pairs of genes. This graph was then used to construct clusters of neighborhoods representing a conserved gene pair. To measure the level of conservation of a given gene pair, we adapted a method developed to weight sequences for multiple sequence alignment (4). For each neighborhood cluster, a distance matrix was constructed where the distance between two neighborhoods was calculated as 1 minus the average identity between the genes in each neighborhood. This matrix was then used to construct a UPGMA tree using the biopython treecluster algorithm and then subjected to the algorithm described in Gerstein et al. (4) to produce a series of weights for each neighborhood in the cluster. The evolutionary distance for this cluster was taken to be the sum of the unnormalized weights. This score has the property that it will be low for small clusters of closely related sequences and large for clusters with distantly related sequences. These data are plotted on the y axis of rows A, B, and C of SI Figs. 8-12.
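The first two steps of this procedure (grouping neighborhoods into clusters and computing a pairwise distance) can be sketched as follows. The text does not state how clusters were extracted from the graph; taking connected components, as done here with a union-find structure, is an assumption. The tree building and the weighting scheme of Gerstein et al. (4) are not implemented in this sketch, and all names are hypothetical.

```python
from collections import defaultdict


def cluster_neighborhoods(n_items: int, edges):
    """Group neighborhoods 0..n_items-1 into connected components (assumed clustering).

    `edges` holds pairs (i, j) of neighborhoods whose two gene pairs both
    have BLAST hits above the bit score cutoff.
    """
    parent = list(range(n_items))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    groups = defaultdict(list)
    for item in range(n_items):
        groups[find(item)].append(item)
    return sorted(groups.values())


def neighborhood_distance(identity_first_genes: float, identity_second_genes: float) -> float:
    """Distance = 1 minus the average fractional identity of the two gene pairs."""
    return 1.0 - (identity_first_genes + identity_second_genes) / 2.0
```

Identities are expected as fractions between 0 and 1, so two neighborhoods whose gene pairs are 90% and 70% identical lie at a distance of about 0.2.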

Benchmarking the relationship between intergenic and evolutionary distance and functional relatedness.

For each of the metagenomic data sets at each bit score cutoff (40, 60, 80) and each individual prokaryotic genome (60-bit cutoff), we constructed a benchmark data set of the neighborhoods where both members have a KEGG mapping. Using these neighborhoods, we constructed a two-dimensional histogram, the first dimension being intergenic distance (nucleotides) and the second being evolutionary distance (conservation score described above). For each bin in this histogram, we measured the fraction of neighborhoods that map to the same KEGG pathway, which can be interpreted as P, the probability that a pair of genes are functionally related. It is possible that the difficulties in predicting genes in metagenomic data sets can lead to split genes that could cause our method to overestimate the value of P. Therefore, we removed neighborhoods where both genes map to the same COG. These data are shown in row B of SI Figs. 8-12.
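The benchmark calculation can be sketched as a binned fraction. This is an illustration only: the bin widths, the input tuple layout, and the function name are assumptions, since the text does not specify the binning used.

```python
from collections import defaultdict


def benchmark_p(neighborhoods, intergenic_bin: float = 50, evo_bin: float = 1.0):
    """Estimate P per (intergenic, evolutionary) bin from benchmark neighborhoods.

    Each neighborhood is a tuple:
    (intergenic_nt, evo_distance, kegg_pathways_gene1, kegg_pathways_gene2, cog_gene1, cog_gene2)
    Bin widths are hypothetical defaults. Returns {bin: (P, n_neighborhoods)}.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [same_pathway, total]
    for inter, evo, paths1, paths2, cog1, cog2 in neighborhoods:
        if cog1 is not None and cog1 == cog2:
            continue  # both genes in the same COG: possible split gene, excluded
        key = (int(inter // intergenic_bin), int(evo // evo_bin))
        counts[key][1] += 1
        if set(paths1) & set(paths2):
            counts[key][0] += 1  # the pair shares at least one KEGG pathway
    return {key: (same / total, total) for key, (same, total) in counts.items()}
```

The neighborhood count is returned alongside P so that each bin can later be weighted by the number of neighborhoods contributing to it.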

Predicting functional relationships.

Next, we used the relationship between intergenic and evolutionary distance and P determined for the benchmark set to predict functional relationships for all neighborhoods. Given the sparse nature of the data, it was necessary to first interpolate the relationship over the range of values for intergenic and evolutionary distance. Because we expect different evolutionary pressures to be acting on negatively overlapping genes, we interpolated positively and negatively overlapping neighborhoods separately. A weighted 2-dimensional loess interpolation was carried out by using the interp.loess function of the tgp package in R. Because of the sparseness of the data, we first log transformed both the evolutionary and intergenic distances before performing the interpolation. Each point was weighted by the number of neighborhoods contributing to that data point. Grid lengths of 1,000 and 500 were used for the positive and negative overlaps, respectively. A span parameter of 0.5 was chosen after considering a range of values. The vast majority of values of P exceed the random expectation (16%, the probability that a random pair of genes map to the same KEGG pathway). To ensure that we were dealing with high-quality predictions, however, we considered a pair of genes to be functionally linked only if P was >0.4 [in a previous study (5), this was found to have an accuracy approaching 70% at the level of functional modules].

Incorporating STRING neighborhood information.

In addition to utilizing the neighborhood data available within the metagenomic data sets, we also integrated information from the STRING database. Genes that map to orthologous groups with no or nonspecific functional annotation were upgraded if that orthologous group was linked to a functionally characterized orthologous group by a significant neighborhood score (greater than or equal to 2) in the STRING database.
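The upgrade rule can be sketched as a single pass over STRING neighborhood links. The input layout (a dictionary of orthologous groups and a list of scored links) and the function name are assumptions made for illustration; they do not reflect the STRING file format.

```python
def upgrade_with_string(characterized, links, min_score: float = 2.0):
    """Return orthologous groups (OGs) upgraded via STRING neighborhood links.

    characterized: dict mapping OG id -> True if functionally characterized
    links: iterable of (og_a, og_b, neighborhood_score)
    An uncharacterized OG is upgraded if it is linked to a characterized OG
    with a score >= min_score. Upgrades are not propagated further.
    """
    upgraded = set()
    for og_a, og_b, score in links:
        if score < min_score:
            continue
        for og, partner in ((og_a, og_b), (og_b, og_a)):
            if not characterized.get(og, False) and characterized.get(partner, False):
                upgraded.add(og)
    return upgraded
```

Only the original annotation status is consulted, so an upgraded group does not in turn upgrade its own uncharacterized neighbors.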

Functional Characterization of Environmental Data Sets. Identification of over-/underrepresented KEGG maps.

To identify biological processes that are significantly over- or underrepresented in the environmental samples relative to the fully sequenced prokaryotic genomes, we counted the number of proteins from each of these two sets that could be assigned to each KEGG map. For a given map, the statistical significance of over- or underrepresentation was assessed by using a two-sided Fisher's exact test, and the resulting P values were corrected for multiple testing by applying the Bonferroni correction. For the maps that display a statistically significant skew, the absolute difference was summarized by calculating the fraction of proteins from each set that was assigned to the KEGG map in question. The most significant maps are displayed in SI Table 5.
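The test for a single KEGG map can be sketched with the standard library alone. This is an illustrative implementation of a two-sided Fisher's exact test (summing the probabilities of all tables no more likely than the observed one) followed by a Bonferroni correction; the software actually used is not stated in the text, and the function names are hypothetical. For a given map, the 2x2 table holds the proteins assigned and not assigned to the map in each of the two sets.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a, b: set 1 proteins assigned / not assigned to the map
    c, d: set 2 proteins assigned / not assigned to the map
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def pmf(x: int) -> float:
        # Hypergeometric probability of x set-1 proteins among the col1 assigned
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = pmf(a)
    # Sum over all tables at least as extreme; small tolerance for rounding
    p = sum(pmf(x) for x in range(lo, hi + 1) if pmf(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)


def bonferroni(p_values):
    """Multiply each P value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

This direct enumeration is adequate for small counts; for tables with very large margins a dedicated statistics library would be a more practical choice.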

1. Brenner SE (1999) Trends Genet 15:132-133.

2. Iliopoulos I, Tsoka S, Andrade MA, Enright AJ, Carroll M, Poullet P, Promponas V, Liakopoulos T, Palaios G, Pasquier C, et al. (2003) Bioinformatics 19:717-726.

3. Suzek BE, Ermolaeva MD, Schreiber M, Salzberg SL (2001) Bioinformatics 17:1123-1130.

4. Gerstein ME, Sonnhammer EL, Chothia C (1994) J Mol Biol 236:1067-1078.

5. von Mering C, Huynen M, Jaeggi D, Schmidt S, Bork P, Snel B (2003) Nucleic Acids Res 31:258-261.