Significance
Marine viruses are abundant and have substantial ecosystem impacts, yet their study is hampered by the dominance of unannotated viral genes. Here, we use metaproteomics and metagenomics to examine virion-associated proteins in marine viral communities, providing tentative functions for 677,000 viral genomic sequences and the majority of previously unknown virion-associated proteins in these samples. The five most abundant protein groups comprised 67% of the metaproteomes and were tentatively identified as capsid proteins of predominantly unknown viruses, all of which putatively contain a protein fold that may be the most abundant biological structure on Earth. This methodological approach is thus shown to be a powerful way to increase our knowledge of the most numerous biological entities on the planet.
Keywords: viruses, marine, proteins
Abstract
Viruses are ecologically important, yet environmental virology is limited by dominance of unannotated genomic sequences representing taxonomic and functional “viral dark matter.” Although recent analytical advances are rapidly improving taxonomic annotations, identifying functional dark matter remains problematic. Here, we apply paired metaproteomics and dsDNA-targeted metagenomics to identify 1,875 virion-associated proteins from the ocean. Over one-half of these proteins were newly functionally annotated and represent abundant and widespread viral metagenome-derived protein clusters (PCs). One primarily unannotated PC dominated the dataset, but structural modeling and genomic context identified this PC as a previously unidentified capsid protein from multiple uncultivated tailed virus families. Furthermore, four of the five most abundant PCs in the metaproteome represent capsid proteins containing the HK97-like protein fold previously found in many viruses that infect all three domains of life. The dominance of these proteins within our dataset, as well as their global distribution throughout the world’s oceans and seas, supports prior hypotheses that this HK97-like protein fold is the most abundant biological structure on Earth. Together, these culture-independent analyses improve virion-associated protein annotations, facilitate the investigation of proteins within natural viral communities, and offer a high-throughput means of illuminating functional viral dark matter.
Microorganisms are central to the Earth’s ecosystem function (1), and it is becoming increasingly evident that viruses substantially influence microbially driven processes through mortality and manipulation of metabolism via viral-encoded metabolic genes (reviewed in ref. 2), including those involved in photosynthesis (3) and most of central carbon metabolism (4). However, holistic understanding of marine viruses has been limited in part by the dominance of “unknown” genomic sequences encountered when surveying viral communities in nature.
This “viral dark matter” in metagenomes manifests as an inability to obtain functional or taxonomic annotations for most (63–93%) of surveyed sequence space (5), as well as an inability to taxonomically annotate the vast majority (>99%) of viral populations observed in nature (6). Emerging approaches, such as comparison of metagenomes using shared k-mers (7), protein clusters (PCs) (8), and viral populations (6), enable ecological inferences without annotation (reviewed in ref. 9), but further conclusions are hindered by most viral PCs and populations remaining unknown. Taxonomic viral dark matter occurs due to limited representation of viruses in reference databases—86% of 1,531 sequenced genomes of bacterial and archaeal viruses were isolated from only 3 of 61 known host phyla (10). Some progress is being made using traditional isolation and genome-sequencing techniques to obtain reference genomes for both abundant (11, 12) and rare, but ubiquitous (13), marine viruses. However, identifying viral genomic information within microbial genomic datasets and using genome- and network-based analytics to classify these previously unidentified sequences is already rapidly increasing the number of available and classified viral reference genome sequences (10). With the emerging deluge of novel and diverse single-cell genomic datasets that contain viruses (14, 15), such methods are likely to uncover viruses for all known phyla in short order, which should presumably greatly illuminate taxonomic viral dark matter.
In contrast, high-throughput advances to resolve our understanding of functional viral dark matter are lagging. Examination of viral genomic sequence space organized into PCs based on similarity has revealed that the global virosphere (the catalog of genes encoded by viruses) is now well sampled in the upper oceans (6) and likely contains less than 3.9 million proteins (16). Although the abundance of viral PCs is becoming well understood, the functions of these PCs remain poorly characterized.
A promising approach to annotate portions of functional viral dark matter could be to elucidate which predicted proteins encode viral structural components. Computationally, artificial neural networks have been used to predict viral capsid and tail proteins from metagenomic data, which has been validated through in vivo expression and visualization of four putative viral structural genes (17). Experimentally, divergent structural proteins from cultivated viral isolates have been annotated using mass spectrometry (MS)-based proteomics (13, 18–20). Metaproteomics has now emerged as a powerful tool to investigate microbial communities (21, 22), and here we apply this approach to marine viral communities to identify virion-associated proteins and facilitate annotation of the structural components of viral dark matter, generating new insights regarding the structural proteins in natural viral communities.
Results and Discussion
Metaproteomic Datasets for Investigating Wild Marine Viruses.
High-throughput experimental MS-based proteomics was applied to four purified marine viral communities from the Mediterranean Sea, Indian Ocean, and Atlantic Ocean (Table S1) collected through the Tara Oceans Expedition (23). After using several experimental approaches to generate metaproteomes (Table S2; see experimental overview in Fig. S1), we selected the sample preparation method that minimized keratin contamination and autotryptic peptides [filter-aided sample preparation 2 (FASP2)] and the mass spectrometer that produced the most peptide spectra (LTQ Orbitrap Velos Pro). We then evaluated three analytical search pipelines to compare these MS-derived peptide spectra against assembled contigs from their paired dsDNA viral metagenomes included in the Tara Oceans Viromes (TOV) dataset (6) (Fig. S1). Among these pipelines, TPP with X! Tandem enabled the identification of the most spectra, nonredundant proteins (i.e., the distinct nonidentical proteins those spectra represent), and PCs (defined as groups of proteins with 60% similarity across 80% coverage; Table S3). Furthermore, 26% of the total spectra were only identified using the TPP with X! Tandem pipeline, and only 8% of total spectra were not identified using this pipeline (Fig. S2A). Finally, the distribution of annotated spectra within the viral functional and taxonomic categories was highly similar among all three pipelines (Fig. S2B; Morisita’s Index of 1.0 for each pairwise comparison). We thus generated the Quantitative Dataset consisting of the peptide spectral abundances and annotations obtained only from the FASP2 sample preparation method, the LTQ Orbitrap Velos Pro mass spectrometer, and the TPP with X! Tandem pipeline to quantitatively investigate viral protein abundances (Fig. S1).
Table S1.
Metadata category | 22SUR | 39SUR | 39DCM | 67SUR |
Latitude | 39.8386°N | 18.5918°N | 18.5839°N | 32.2401°S |
Longitude | 17.4155°E | 66.6220°E | 66.4727°E | 17.7103°E |
Oceanic region | Mediterranean Sea | Indian Ocean | Indian Ocean | S. Atlantic Ocean |
Depth, m | 5 | 5 | 25 | 5 |
Temperature, °C | 17.0 | 26.8 | 26.8 | 12.8 |
Salinity, psu | 37.8 | 36.3 | 36.3 | 34.8 |
Oxygen, µmol⋅kg−1 | 221 | 193 | 193 | 249 |
Chlorophyll, mg⋅m−3 | 0.18 | 0.10 | 0.18 | 1.55 |
Viral abundance in purified sample, total no. of viruses | 1.47 × 10^9 | 9.54 × 10^8 | 1.65 × 10^10 | 1.34 × 10^10 |
Sites are named by Tara Oceans station number (22, 39, and 67) and depth category (DCM, deep chlorophyll maximum; SUR, surface).
Table S2.
Sample preparation | Run time | Mass spectrometer | No. of cycles | Sample 22SUR | Sample 39SUR | Sample 67SUR | Sample 39DCM |
FASP1 | 8 h | OVP | 4 | X | | | |
FASP1 | 12 h | Orbi | 6 | X | X | X | X |
FASP2 | 4 h | QE | 2 | X | X | X | X |
FASP2 | 8 h | OVP | 4 | X | X | X | X |
Orbi, LTQ Orbitrap Classic; OVP, LTQ Orbitrap Velos Pro; QE, Q Exactive. X denotes that the sample was analyzed using these conditions.
Table S3.
Data category | SEQUEST with DTASelect (% of all search algorithms) | Proteome Discoverer with Percolator (% of all search algorithms) | TPP with X! Tandem (% of all search algorithms) | All search algorithms |
Total proteins | 264 (27%) | 543 (55%) | 697 (70%) | 990 |
PCs* | 118 (30%) | 237 (59%) | 296 (74%) | 399 |
Spectral counts | 4,909 (19%) | 5,607 (22%) | 15,270 (59%) | 25,686 |
Due to overlap in proteins identified by each pipeline, the “% of all search algorithms” amounts sum to >100% across the methods; e.g., the 697 proteins identified by TPP with X! Tandem represent 70% of the proteins identified by all three search algorithms together.
*The number of PCs identified within nonredundant proteins.
The Quantitative Dataset consisted of 15,270 spectra representing 697 nonredundant proteins in 296 PCs (Table S3; Dataset S1). The majority (74% of spectral counts) of proteins in this dataset facilitated annotation of previously unannotated virion-associated proteins (i.e., “newly annotated”; Fig. 1). Taxonomically, 24% of the proteins were annotated as belonging to tailed phages (myoviruses, podoviruses, and siphoviruses; Fig. 1). However, there were very few tail proteins in the dataset; among the proteins with previous functional annotations, the majority (23%) were identified as capsid proteins and <1% were identified as tail proteins (Fig. 1), resulting in ∼100-fold more capsid than tail proteins. Two prior proteomic studies of marine phage isolates show that, although all ORFs annotated as tail proteins were detected in the proteomes of myoviruses infecting Synechococcus and Prochlorococcus (24), five of the nine putative tail proteins were not detected in Cellulophaga siphoviruses (13). This suggests that, even in isolates, MS-based proteomic methods may miss tail proteins—presumably due to loss during phage isolation or deficiencies in sample preparation method (i.e., inefficient digestion with trypsin due to limited K/R residues in these specific proteins or excessive digestion due to having too many K/R residues). In this complex community case using metaproteomics, lower conservation of tail proteins relative to capsids may also hamper their identification through annotation using reference databases (see discussion regarding conservation of viral-associated proteins below).
Collectively, experimentation with two sample preparation methods, three mass spectrometers, and three analytical search pipelines, generated additional peptide spectra beyond the Quantitative Dataset (Fig. S1). Due to the methodological differences, these data could not be combined quantitatively; however, they did provide expanded identification of virion-associated proteins in the four marine viral communities because not all methods identified the same proteins. The resulting Inclusive Dataset (see overview in Fig. S1) contained 1,875 nonredundant proteins grouped into 574 PCs (Table S4), which is ∼2.7- and ∼1.9-fold more proteins and PCs, respectively, than the Quantitative Dataset. Of these proteins, most (991 nonredundant proteins; 53% of the Inclusive Dataset) were again newly identified as virion-associated proteins (Fig. 1), providing functional annotation to 677,376 previously unannotated viral metagenomic reads from these samples, identified here as “structural” based on similarity to peptide spectra using the three analytical search pipelines. The metaproteomes included 176 proteins (9% of the Inclusive Dataset) previously seen in viral isolate experimental proteomes and identified as “viral-associated” or structural (e.g., ref. 13) (Fig. 1). In addition, the metaproteomes provided annotation for 84 previously unannotated hypothetical proteins in viral isolate genomes (4% of the Inclusive Dataset; Fig. 1).
Table S4.
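Sample | SEQUEST with DTASelect: total proteins | SEQUEST with DTASelect: unique proteins* | SEQUEST with DTASelect: PCs† | Proteome Discoverer with Percolator: total proteins | Proteome Discoverer with Percolator: unique proteins* | Proteome Discoverer with Percolator: PCs† | TPP with X! Tandem: total proteins | TPP with X! Tandem: unique proteins* | TPP with X! Tandem: PCs† | All search algorithms: total proteins | All search algorithms: unique proteins* | All search algorithms: PCs† |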
22SUR | 256 | 89 | 100 | 110 | 71 | 49 | 266 | 182 | 132 | 446 | 223 | 195 |
39SUR | 353 | 123 | 126 | 221 | 171 | 123 | 193 | 156 | 99 | 519 | 278 | 212 |
39DCM | 236 | 120 | 84 | 336 | 293 | 173 | 293 | 285 | 141 | 527 | 428 | 240 |
67SUR | 223 | 113 | 85 | 150 | 118 | 96 | 236 | 202 | 141 | 383 | 254 | 181 |
All samples | 1,068 | 445 | 281 | 817 | 653 | 299 | 988 | 825 | 377 | 1,875 | 1,183 | 574 |
*Proteins that had at least one unique spectrum.
†The number of PCs identified within nonredundant proteins.
To further examine the utility of metaproteomic analyses in natural viral samples, we first investigated whether the metaproteomes included proteins within the dominant PCs from the paired viral metagenomes. Of the 200 most abundant PCs in the viral metagenomes of each sample, 9% (72 of 800 PCs total) were experimentally detected in the metaproteomic Inclusive Dataset, including 47 PCs that had no prior functional annotation (Fig. 2). We next examined TOV-generated viral populations (i.e., contigs grouped based on similarity of ≥80% of their genes at ≥95% nucleotide identity) (6) for the presence of PCs detected in the metaproteomes. This showed that the metaproteomic PCs in the Inclusive Dataset were detected in viral populations from the paired viral metagenomes that spanned a large range of population abundances—identifying proteins in the most abundant viral populations, as well as rare populations (Fig. 3A). Applying these same analyses to all 5,476 viral populations detected in the larger, globally distributed TOV dataset (6) revealed that metaproteome-detected PCs were found in populations spanning a large range of abundances across as many as 36 of the 43 samples (Fig. 3B). Together, this combined information (Figs. 1–3) suggests that metaproteomics is a powerful approach to inform annotation of previously unknown genomic content as structural genes in both isolates and variably abundant populations in natural viral communities.
Dominant Protein Clusters in Viral Metaproteomes.
Within the Quantitative Dataset, one PC (CAM_CRCL_773, previously identified in the Global Ocean Sampling expedition, Pacific Ocean Viromes, and TOV datasets) (5, 6, 25) was by far the most abundant, representing 57.5% of spectral counts (Fig. 4A). Given this PC’s dominance, we applied network analysis to the 400 protein members of this PC in the Inclusive Dataset, which showed two clearly separated groups divergent by ∼30% amino acid identity (Fig. 4B). Within this PC, only 10 of the 400 constituent proteins were previously annotated (as capsid proteins of siphoviruses JD024 and D3112 that infect Pseudomonas), which represented only 1.6% of the PC’s spectral counts derived from the Quantitative Dataset (Fig. 4B and Dataset S1). This PC thus included the majority (79%) of the previously unannotated spectra in the Quantitative Dataset (Fig. 1). In silico structural modeling of representative sequences from this PC suggested both groups represent major capsid proteins from phages similar to one another (the lambdoid phages HK97, ref. 26, and BPP-1, ref. 27; Fig. 4 C and D); however, these best fits were relatively weak (template modeling scores, TM scores, lower than the accepted cutoff of 0.5) (28). Thus, this dominant PC appears to be a major capsid protein of previously unexplored marine viruses.
The next four most abundant PCs in the Quantitative Dataset contained a total of 9.8% of the spectral counts (Fig. 4A) and were predominantly annotated as capsid proteins by sequence similarity (Dataset S1) and structural modeling (Fig. S3) of their total ORFs present within the Inclusive Dataset. The most abundant of these four PCs, CAM_CRCL_625, was a T4-like major capsid protein by consensus annotation of the PC’s component ORFs (29) and also by structural modeling (30). Moving in order of decreasing spectral abundance, PCs CAM_CRCL_14716 and TARA_183056 were both functionally and taxonomically unannotated by sequence similarity; however, by structural modeling, both had best fits to a capsid protein of cyanophage Syn5 (31), although the TM score for the latter PC was below the recommended cutoff of 0.5 (28). Finally, PC TARA_207964 was annotated as a capsid protein from phage HMO-2011 (which infects Ca. Puniceispirillum marinum of the SAR116 clade) (11) by similarity, but was annotated as the major capsid protein of cyanophage P-SSP7 (32) by structural modeling, likely because there is currently no reference structure available in the modeling database for phage HMO-2011. Collectively, this combination of ORF annotation and structural modeling thus suggested that, of the top five most abundant PCs (which comprised approximately two-thirds of the spectra in the Quantitative Dataset), at least four were capsid proteins. This is consistent with the dominance of capsids in the annotated portion of the metaproteomes (Fig. 1), and with our understanding of virion structural proteins usually being dominated by capsid proteins in proteomes of viral isolates (13, 24).
We next sought to examine the global-scale distribution of these five most abundant metaproteome-detected PCs, by examining their presence in previously-identified TOV viral populations (6). The dominant metaproteome-detected PC (CAM_CRCL_773) was present in a total of 93 viral populations collectively found in every TOV sample across seven oceans and seas (Fig. 5). In contrast, the four next most abundant PCs were present in substantially fewer populations and showed somewhat more restricted geographic distributions. One PC (TARA_183056) was found in 10 populations that were present in every oceanic region examined except the Southern Ocean. Two PCs (CAM_CRCL_625 and TARA_207964) were found in 5 and 11 viral populations, respectively, predominantly present only in the Indian and Atlantic Oceans, and the Mediterranean and Red Seas. Finally, one PC (CAM_CRCL_14716) was present in only one viral population that showed the most geographic restriction, with the highest abundance from the Indian Ocean, where two of the four metaproteomic samples were collected, but low or nonexistent abundance in the remaining locations. Thus, the five most abundant PCs in the four metaproteomes from three stations are present in viral populations with both widespread and regionally restricted distributions.
Conservation of Virion-Associated Proteins.
Conservation of structural similarity in viral capsid proteins, even in the absence of nucleotide sequence similarities, has long been recognized (33, 34). It is thus notable that the model-predicted structural similarities of the five most abundant PCs in the Quantitative Dataset (Fig. 4A) are to capsid proteins that all contain the HK97-like fold, including siphophage HK97, HK97-like phage BPP-1, myophage T4, podophage Syn5, and podophage P-SSP7 (Fig. 4 C and D and Fig. S3) (27, 30, 31, 34). This HK97-like capsid protein fold has been found in viruses infecting organisms from all three domains of life (35) and is suggested to be the most abundant biological structure on Earth, based on the high abundance of total viruses (e.g., refs. 30, 34, and 36). The data presented here support that assertion: not only do the most abundant PCs in the metaproteomes (representing 67% of the Quantitative Dataset; Fig. 4) seem to contain this protein fold, four out of five of these PCs also appear widely distributed in the upper oceans as shown in our analysis of the TOV viral populations (Fig. 5).
To further investigate conservation in virion-associated proteins, selective constraints of the PCs from the Inclusive Dataset were examined using the ratio of nonsynonymous to synonymous polymorphisms (pN/pS), which has proven powerful for analysis of microbial metagenomic datasets (37, 38). Average pN/pS ratios for PCs in the metaproteome were significantly lower than those determined for all viral metagenome-derived PCs (0.67 vs. 0.84; P < 0.001, Mann–Whitney U test; Fig. 6). For comparison, viral metagenome PCs previously annotated as capsids also had relatively low pN/pS ratios (average, 0.48), whereas ratios for annotated tail proteins were higher (average, 0.69). Together, this information suggests stronger overall negative selection for virion-associated proteins (i.e., increased maintenance of their gene sequences), especially capsid proteins, relative to other viral genome-encoded proteins. This is analogous to previous observations of conservation in housekeeping genes in microorganisms (e.g., ref. 38) and underscores the importance of capsid protein structure maintenance to virion fitness.
Genomic Context for Experimentally Detected Viral Proteins.
Genomic context frequently improves gene-specific functional and taxonomic interpretations. We thus examined the genomic context of the five most abundant metaproteome-detected PCs via their five longest associated contigs per PC in the TOV dataset (Fig. 7 and Dataset S2). The most abundant PC (CAM_CRCL_773) was present in contigs where few (24–29%) ORFs were annotated and also showed no taxonomic consensus, the latter of which is consistent with >99% of TOV viral populations (6). However, this genomic context did show that CAM_CRCL_773 was present within a genomic region containing ORFs encoding a tail fiber, a baseplate, and a terminase, as well as three additional unannotated PCs that were also detected in the metaproteome. Within these contigs, the presence of two tail genes and the significant similarities to tailed virus genes for the majority (90–100%) of annotated ORFs indicate that this dominant PC may belong to previously unidentified Caudovirales.
In contrast, the second most abundant PC (CAM_CRCL_625) was present in contigs that were predominantly taxonomically annotated (58–100% of their ORFs), mainly as genes of Myoviridae infecting highly abundant hosts such as Pelagibacter, Synechococcus, and Prochlorococcus (Fig. 7 and Dataset S2). This PC was again found within a genomic region containing multiple tail and capsid proteins and two terminase subunits. Collectively, this genomic context combined with the sequence-based and structural modeling-based annotations (above) provides strong evidence that CAM_CRCL_625 is a capsid protein of myoviruses.
The third and fourth most abundant PCs (CAM_CRCL_14716 and TARA_183056) were found in predominantly unannotated contigs (11–29% of ORFs annotated; Fig. 7; Dataset S2). The former (CAM_CRCL_14716) was present in only one TOV contig, consistent with its more restricted geographic distribution (Fig. 5). Although the annotations present in both of these PCs’ contigs did not allow taxonomic consensus to be reached, each PC occurred within genomic regions containing other metaproteome-detected PCs. Furthermore, the genomic context for TARA_183056 included a terminase gene as well as tail fiber genes, suggesting it may belong to another unidentified Caudovirales.
Finally, the fifth most abundant PC (TARA_207964) was present in predominantly annotated contigs (57–78% annotated ORFs) in which the consensus taxonomy (56–91%) was podophage HMO-2011, a phage infecting a SAR116 bacterium (11) (Fig. 7 and Dataset S2). This matches this PC’s annotation reported above via its component metagenomic ORFs. This PC was also present in a well-annotated genomic region that included a metaproteome-detected PC (TARA_40991) annotated as a portal protein, supporting the annotation of this PC (TARA_207964) as capsid protein of podophage HMO-2011.
Conclusions
In summary, this study establishes environmental metaproteomics as a high-throughput strategy for shedding light on viral dark matter in two ways: (i) defining formerly unannotated proteins as structural, and (ii) revealing which of these proteins are most abundant thereby focusing further inquiry (e.g., structural modeling). The 1,875 viral proteins observed in these metaproteomes allowed us to newly annotate 991 proteins as primarily structural. Surprisingly, the majority (67%) of the metaproteomic spectra were derived from just five environmentally dominant PCs. With a combination of sequence- and structural modeling-based annotation, these PCs are now predominantly identified as putative capsid proteins of tailed viruses containing the most abundant biological structure on Earth, the HK97-like protein fold. Furthermore, analysis of metaproteomic PCs facilitated understanding of increased selective pressures on genes encoding virion-associated proteins (e.g., capsids). Although this study focused on dsDNA viruses, the approach is generalizable to ssDNA and RNA viruses, which currently require generation of separate metagenomes. Thus, this large-scale annotation strategy and the findings presented here will help guide the experimentation needed to refine structural annotations and offer glimpses of the viral metagenomic dark matter that obfuscates our understanding of the most abundant biological entities on Earth: viruses.
Methods
A detailed description of all metaproteomic, metagenomic, and bioinformatic procedures is provided in SI Methods.
SI Methods
An overview of this study is presented in Fig. S1, describing the major steps outlined below.
Sample Collection and Processing.
Sample collection.
Four 20-L seawater samples were collected during the Tara Oceans Expedition (23) (Table S1). The sampling strategy and methodology for the Tara Oceans Expedition are fully described by Pesant et al. (39). Samples were immediately 0.22-µm filtered (Millipore Express Plus; Millipore) and stored at 4 °C in the dark in acid-washed polycarbonate bottles until further analysis.
Viral concentrates.
Viral particles were concentrated using the iron chloride precipitation method (40). Viruses were precipitated from the 0.22-µm filtrate using FeCl3, collected onto 1-µm polycarbonate filters (GE Water and Process Technologies), and stored at 4 °C until analysis. After resuspension of the viral precipitates in ascorbic-EDTA buffer (0.1 M EDTA, 0.2 M Mg, 0.2 M ascorbic acid, pH 6.0), viral particles were concentrated via Amicon Ultra 100-kDa centrifugal devices (Millipore) and treated with DNase I (100 U/mL) followed by the addition of 0.1 M EDTA and 0.1 M EGTA to halt enzyme activity.
Virus purification.
For metaproteomic analyses (below), 75% of the DNase I-treated virus suspension for each sample, equivalent to 15 L of seawater per sample, was purified on a cesium chloride gradient (41), and after a 4-h centrifugation at 102,000 × g, the 1.4–1.52 g⋅mL−1 density fraction was collected.
Virus enumeration.
Viruses were enumerated from the cesium chloride gradient-purified viral concentrates using epifluorescence microscopy (42) after staining with SYBR Gold (Life Technologies).
Viral Metagenome Construction and Analyses.
These four viral metagenomes are a subset of the Tara Oceans Viromes (TOV) dataset (6). Briefly, the viral metagenomes were constructed and analyzed as follows.
Viral DNA extraction.
Nucleic acids from the remaining 25% of the DNase I-treated virus suspensions, equivalent to 5 L of seawater per sample, were extracted as previously described (43). Viral particle suspensions were treated with Wizard PCR Preps DNA Purification Resin (Promega) at a ratio of 0.5-mL sample to 1-mL resin, and eluted with TE buffer (10 mM Tris, pH 7.5, 1 mM EDTA) using Wizard Minicolumns. DNA was fluorescently quantified with the Pico Green dsDNA kit (Life Technologies) according to the manufacturer’s directions.
Viral metagenome DNA sequencing.
Extracted DNA was Covaris-sheared and size selected to 160–180 bp, followed by amplification and ligation per the standard Illumina protocol. Sequencing was done on a HiSeq 2000 system at the Genoscope facilities (Paris, France).
Quality control of viral metagenome reads and assembly.
Individual reads in the four viral metagenomes were quality-controlled using a combination of trimming and filtering as previously described (38). Briefly, bases were trimmed at the 5′ end if the number of base calls at any base (A, T, G, C) was more than 2 SDs from the average across all cycles. Bases were trimmed at the 3′ end if the quality score was <20. Reads that were shorter than 95 bp or reads with a median quality score <20 were removed from further analyses. Reads were assembled using SOAPdenovo (44) where insert and k-mer size (43–47) are calculated at run time and are specific to each metagenome as implemented in the MoCAT pipeline (45). These assembled contigs included 43.96%, 29.01%, 24.17%, and 42.32% of the reads from each viral metagenome (Station 22 surface, Station 39 surface, Station 39 DCM, and Station 67 surface, respectively).
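The 3′ trimming and read-filtering rules described above can be illustrated with a short Python sketch. This is not the MoCAT implementation: the function names, the representation of a read as a (sequence, quality list) pair, the choice to trim before filtering, and the reading of 3′ trimming as removal of the trailing run of low-quality bases are all assumptions made for illustration. The 5′ base-composition rule is omitted.

```python
from statistics import median

MIN_QUALITY = 20   # quality threshold stated in the text
MIN_LENGTH = 95    # minimum read length (bp) stated in the text


def trim_3prime(seq, quals, min_quality=MIN_QUALITY):
    """Drop the trailing run of bases whose quality is below min_quality."""
    end = len(quals)
    while end > 0 and quals[end - 1] < min_quality:
        end -= 1
    return seq[:end], quals[:end]


def quality_control(reads, min_length=MIN_LENGTH, min_quality=MIN_QUALITY):
    """Trim each (seq, quals) read, then keep it only if it passes both filters."""
    kept = []
    for seq, quals in reads:
        seq, quals = trim_3prime(seq, quals, min_quality)
        if len(seq) < min_length:
            continue  # too short after trimming
        if median(quals) < min_quality:
            continue  # overall quality too low
        kept.append((seq, quals))
    return kept
```

Under these assumptions, a 100-bp read whose last 10 bases fall below the quality threshold is trimmed to 90 bp and then discarded by the length filter.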
Protein clustering and read mapping.
ORFs were predicted from all quality-controlled contigs using Prodigal, version 2.5 (46), with default settings. Predicted ORFs were clustered based on sequence similarity as described previously (5, 25). Briefly, ORFs were initially mapped to existing viral protein clusters (from POV, GOS, and phage genomes) using cd-hit-2d (“-g 1 -n 4 -d 0 -T 24 -M 45000”; 60% identity and 80% coverage). The remaining unmapped ORFs were then self-clustered using cd-hit with the same options. To develop viral metagenomic read counts per PC for statistical analyses, reads were mapped back to predicted ORFs on the contigs using Mosaik (version 1.1.0021; https://code.google.com/archive/p/mosaik-aligner/) (48) with the following settings: “-a all -m all -hs 15 -minp 0.95 -mmp 0.05 -mhp 100 -act 20”.
Read normalization.
To calculate the relative abundance A of each peptide, the number of hits identified in the read-mapping step (above) was normalized in the following way (12): the number of hits H was divided by the total number of sequences N and by the amino acid length of the hit peptide L. Finally, to avoid larger numbers of significant figures, the abundances were rescaled to the mean abundances across all samples (as denoted by the bar across the denominator in the following equation):

$$A = \frac{H / (N \cdot L)}{\overline{H / (N \cdot L)}}$$
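As a worked illustration of this normalization, the following Python sketch computes the rescaled abundance of a single hit peptide across samples. The function name and the list-based inputs are hypothetical, and taking the mean per peptide across samples is one interpretation of the rescaling step, not a confirmed detail of the original analysis.

```python
def relative_abundance(hits, total_sequences, length):
    """Return rescaled abundances of one hit peptide, one value per sample.

    hits[i]            -- H, number of reads mapped in sample i
    total_sequences[i] -- N, total number of sequences in sample i
    length             -- L, amino acid length of the hit peptide
    """
    # Normalize hits by sequencing depth and peptide length.
    raw = [h / (n * length) for h, n in zip(hits, total_sequences)]
    # Rescale to the mean across samples (assumed to be taken per peptide).
    mean_raw = sum(raw) / len(raw)
    if mean_raw == 0:
        return [0.0 for _ in raw]  # peptide not detected in any sample
    return [value / mean_raw for value in raw]
```

For example, a 100-residue peptide with 10 and 30 hits in two samples of 1,000 sequences each would receive rescaled abundances of 0.5 and 1.5.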
Annotation.
Assembled contigs were annotated as follows: ORFs were predicted using Prodigal (above) and annotated using the top BLASTP hit (e value < 0.001) against the viral National Center for Biotechnology Information RefSeq database (April 2014). ORFs were first annotated against the viral RefSeq database excluding genes identified as “hypothetical” to obtain the top hit to genes with functional annotations. Then, ORFs were annotated against the viral RefSeq database including genes identified as hypothetical to obtain the top hit to hypothetical genes with taxonomic annotation. Significant hits to ORFs that had no functional annotation when hypothetical genes were excluded, but did have taxonomic annotation when hypothetical genes were included, are annotated as having an “Unknown” functional annotation and a defined taxonomic annotation. ORFs that were functionally annotated as “head–tail connector,” “neck,” or “portal” were included in the “Capsids” category for functional annotation. ORFs that were functionally annotated as “scaffolding,” “prohead scaffolding,” “prohead core,” “heat shock protein,” “GroEL,” or enzymes (including proteases) were included in the “Other” category for functional annotation. All annotation information is presented in Dataset S1 and deposited with the rest of the data (see below for data deposition).
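The two-pass annotation rule can be sketched as follows. This is a simplified illustration only: the dictionary keys, the function names, the use of bit score to choose the top hit, and the decision to take taxonomy from the functional hit when one exists are assumptions, not details given in the text.

```python
E_VALUE_CUTOFF = 0.001  # significance threshold stated in the text


def top_hit(hits):
    """Return the significant hit with the highest bit score, or None."""
    significant = [h for h in hits if h["evalue"] < E_VALUE_CUTOFF]
    return max(significant, key=lambda h: h["bitscore"], default=None)


def annotate_orf(hits):
    """Apply the two-pass rule to the BLASTP hits of one ORF."""
    # Pass 1: ignore hypothetical genes to find a functional annotation.
    functional = top_hit([h for h in hits if not h["hypothetical"]])
    if functional is not None:
        return {"function": functional["function"], "taxonomy": functional["taxon"]}
    # Pass 2: allow hypothetical genes to recover a taxonomic annotation.
    any_hit = top_hit(hits)
    if any_hit is not None:
        # Only hypothetical genes matched: taxonomy is known, function is not.
        return {"function": "Unknown", "taxonomy": any_hit["taxon"]}
    return {"function": None, "taxonomy": None}
```

An ORF whose only significant hit is a hypothetical gene is thus reported with an "Unknown" function and the taxonomy of that hit.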
Network representation.
Predicted proteins were aligned using MUSCLE (47) with default settings. Ends were trimmed and sites with more than one-half of their sequences containing gaps were removed. Identities were calculated as the number of identical residues among all possible pairwise combinations divided by the length of the alignment. This identity table was loaded in Cytoscape (49), with weighted edges equivalent to the percent identity among sequences. Taxonomic assignations were done based on the most common top BLASTP hit (Dataset S1) of each protein for each contig.
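As a sketch of the identity calculation, the following Python functions remove gap-rich columns from an existing alignment and compute pairwise percent identities over the alignment length. The function names are hypothetical, end trimming is not implemented, and treating a shared gap as a non-match is an assumption.

```python
from itertools import combinations


def drop_gappy_columns(aligned, max_gap_fraction=0.5):
    """Remove columns in which more than max_gap_fraction of sequences have a gap."""
    n = len(aligned)
    keep = [i for i in range(len(aligned[0]))
            if sum(seq[i] == "-" for seq in aligned) / n <= max_gap_fraction]
    return ["".join(seq[i] for i in keep) for seq in aligned]


def pairwise_identity(aligned):
    """Return {(i, j): percent identity} for all pairs of aligned sequences.

    Identity is the number of identical residues divided by the alignment
    length; positions where both sequences have a gap are not counted.
    """
    length = len(aligned[0])
    identities = {}
    for i, j in combinations(range(len(aligned)), 2):
        same = sum(a == b and a != "-" for a, b in zip(aligned[i], aligned[j]))
        identities[(i, j)] = 100.0 * same / length
    return identities
```

The resulting dictionary maps directly onto a weighted edge list (node i, node j, percent identity) of the kind that network tools import.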
Global distribution.
Contigs within the 5,476 TOV viral populations (6) that contained one of the five most abundant metaproteome-detected PCs (CAM_CRCL_773, CAM_CRCL_625, CAM_CRCL_14716, TARA_183056, and TARA_207964) were identified. The sum of these populations’ relative abundances in each TOV metagenome (i.e., the number of base pairs mapped from the metagenome reads to the contig, normalized by the contig length and the total number of base pairs sequenced in the metagenome) was then used as a proxy for the PC abundance across the 43 TOV samples (Fig. 5).
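The abundance proxy can be written as a short calculation. The sketch below assumes one (mapped base pairs, contig length) pair per population for a given metagenome; the function names and input layout are hypothetical.

```python
def population_abundance(mapped_bp, contig_length, metagenome_bp):
    """Base pairs mapped to a contig, normalized by contig length and metagenome size."""
    return mapped_bp / (contig_length * metagenome_bp)


def pc_abundance_proxy(populations, metagenome_bp):
    """Sum the relative abundances of all populations that carry a given PC.

    populations   -- iterable of (mapped_bp, contig_length) tuples
    metagenome_bp -- total base pairs sequenced in the metagenome
    """
    return sum(population_abundance(mapped, length, metagenome_bp)
               for mapped, length in populations)
```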
Genomic context.
Annotation of ORFs within abundant TOV contigs containing the five most abundant metaproteome-detected PCs was based on similarity to the viral RefSeq database (April 2014; BLASTP e value < 0.001; score ≥ 50) and accompanies this manuscript in Dataset S2.
pN/pS calculation.
The pN/pS ratio was estimated for each of the 7,648 genes in the 1,870 contigs that contain the 1,875 total nonredundant proteins. To reliably estimate pN/pS, we required an average gene base pair coverage of three reads. Codons containing polymorphic sites were then extracted and the alleles categorized either as nonsynonymous or synonymous. To calculate the expected ratio, we assumed a uniform model for the occurrence of mutations across the genomic sequence (38).
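One way to express this calculation is shown below: observed polymorphisms are classified as nonsynonymous or synonymous, and their ratio is divided by the ratio expected if every single-base substitution were equally likely. This is a simplified sketch under the uniform mutation assumption, using the standard genetic code; the function names and the input format are hypothetical, and the coverage filter is not included.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def expected_counts(codons):
    """Count all possible nonsynonymous and synonymous single-base changes."""
    nonsyn = syn = 0
    for codon in codons:
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                    syn += 1
                else:
                    nonsyn += 1
    return nonsyn, syn


def pn_ps(observed, reference_codons):
    """Observed N/S ratio divided by the expected N/S ratio.

    observed         -- list of (reference_codon, variant_codon) pairs
    reference_codons -- all codons of the gene
    Returns None when a ratio is undefined (a zero denominator).
    """
    obs_n = sum(CODON_TABLE[ref] != CODON_TABLE[var] for ref, var in observed)
    obs_s = len(observed) - obs_n
    exp_n, exp_s = expected_counts(reference_codons)
    if obs_s == 0 or exp_s == 0 or exp_n == 0:
        return None
    return (obs_n / obs_s) / (exp_n / exp_s)
```

Values below 1 are conventionally read as a deficit of amino acid-changing polymorphisms relative to the uniform expectation.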
Data deposition.
Sequences, assemblies, annotation, processed data such as alignments, similarity matrixes and network files have been deposited in iVirus (mirrors.iplantcollaborative.org/browse/iplant/home/shared/iVirus/TOV_4_metaproteomes).
Metaproteomic Analyses.
Protein sample preparation.
Viral protein extracts were prepared for MS-based proteomics via a filter-aided sample preparation (FASP) method (50) using the FASP Protein Digestion Kit (Expedeon), with some modifications to the manufacturer’s instructions as previously described for marine viruses (13) yielding two methods: FASP1 and FASP2 (below).
FASP1: Briefly, purified viral concentrates were denatured by nutating for 30 min in a urea solution (8 M urea, 50 mM Tris⋅HCl, and 10 mM DTT; pH 7.6) at room temperature. Denatured concentrates were then applied to the 30-kDa spin filter at 14,000 × g to collect proteins. Fresh urea solution (without DTT) was centrifuged through the filter twice followed by a 20-min dark incubation with iodoacetamide in urea. Filters were then washed three times with urea solution, followed by a 50 mM ammonium bicarbonate solution wash. After transferring to a new collection tube, digestion solution (1 μg of sequencing grade trypsin; Promega) was applied to the filter and incubated at 37 °C overnight. The digested peptides were washed twice by adding 50 mM ammonium bicarbonate solution to the filter, centrifuging, and repeating. The peptides were then eluted by adding 0.5 M NaCl solution and centrifuging. The final volume of the digested protein filtrate was adjusted to 0.3 mL with 0.1% formic acid and divided into two aliquots. Aliquots were run on two instruments: LTQ Orbitrap Classic and LTQ Orbitrap Velos Pro (see below for details).
FASP2: The FASP2 protocol was the same as FASP1 except that the procedure was performed in a laminar flow hood to minimize keratin contamination. Also, 1 µg of sequencing-grade trypsin (New England Biolabs) was substituted for the Promega trypsin. The two final aliquots were run on two instruments: LTQ Orbitrap Velos Pro and Q Exactive (see below for details).
LC-MS/MS analysis.
After sample preparation, individual aliquots of the complex peptide mixture were loaded onto a split-phase 2D RP-SCX back column. The strong cation exchange (SCX) phase was 150 μm × ∼3–5 cm (Luna SCX; 5-μm particle size; 100-Å pore size; Phenomenex) and the reverse phase was 150 μm × ∼3–5 cm (Jupiter C18; 3-μm particle size; 300-Å pore size; Phenomenex). After loading, the RP-SCX column was connected to the high-performance liquid chromatograph (HPLC) (U3000; Dionex Thermo Fisher) and washed with 100% (vol/vol) aqueous solvent for 5 min and then ramped up to 100% (vol/vol) organic solvent [70% (vol/vol) acetonitrile, 0.1% formic acid] over 10 min. This step moves peptides from the RP phase onto the SCX phase, which effectively desalts the peptide samples and removes other nonpeptide contaminants that do not bind to the SCX. The back column was then connected to a 100 μm × 15 cm RP resolving front column with an integrated Nanospray tip (Jupiter C18; 3-μm particle size; 300-Å pore size; Phenomenex) resting on the Proxeon Nanospray source (Proxeon Biosystems). Samples were analyzed via a MudPIT strategy (51). A 2D separation was performed with a quaternary HPLC pump (U3000; Dionex Thermo Fisher) with a custom-built split before the home-packed columns to achieve an estimated flow rate of ∼300 nL/min at the Nanospray tip. During each of four cycles for 8-h runs (LTQ Orbitrap Velos Pro), six cycles for 12-h runs (LTQ Orbitrap Classic), and two cycles for 4-h runs (Q Exactive), an initial salt step gradient of ammonium acetate eluted peptides from the SCX column onto the RP column, and a subsequent 2-h reverse-phase gradient eluted peptides from the RP column. Eluting peptides were ionized via a nanospray source (Proxeon Biosystems) and introduced into one of three mass spectrometer systems: (i) LTQ Orbitrap Classic, (ii) LTQ Orbitrap Velos Pro, and (iii) Q Exactive (each from Thermo Fisher Scientific).
For a summary of analysis conditions for each sample, see Table S2.
For the entire length of the 2D separation, the mass spectrometer performed data-dependent tandem mass spectrometry (MS/MS). During the full chromatographic runs, the mass spectrometer alternated between full scans and data-dependent MS/MS scans under Xcalibur software control. Each instrument was operated with the following settings:
LTQ Orbitrap Classic: Full scans in Orbitrap at 30,000 resolution from 400 to 1,800 m/z and 5 CID data-dependent scans in the ion trap with a 3 m/z isolation width and 35% collision energy. Dynamic exclusion was enabled and set at 1, with a list size of 100 and exclusion time of 60 s; two centroided microscans were averaged for both full and MS/MS scans.
LTQ Orbitrap Velos Pro: Full scans in Orbitrap at 30,000 resolution from 400 to 1,800 m/z and 20 CID data-dependent scans in the ion trap with a 3 m/z isolation width and 35% collision energy. Dynamic exclusion was enabled and set at 1, with a list size of 100 and exclusion time of 60 s; one microscan was collected for both full and MS/MS scans. All MS and MS/MS data were acquired in centroid mode.
Q Exactive: Full scans in Orbitrap at 70,000 resolution from 400 to 1,600 m/z and 10 HCD data-dependent scans at 17,500 resolution in the Orbitrap with a 3 m/z isolation width and 28 NCE. Dynamic exclusion was enabled with an exclusion time of 15 s; one microscan was collected for both full and MS/MS scans. All MS and MS/MS data were acquired in profile mode.
Data and quantitation analyses.
The resultant MS/MS spectra were analyzed by three automated analytical proteome informatic pipelines. Each sample had multiple associated MS/MS spectra raw files, because multiple aliquots (3, 4) from each sample were introduced to the mass spectrometers, in multiple instrument runs. The raw files for each sample were combined and searched against a matched database of viral metagenome contigs from the same sample, consisting of 249,381 proteins (22SUR), 209,504 proteins (39SUR), 189,606 proteins (67SUR), and 155,234 proteins (39DCM) predicted and translated using Prodigal 2.50 (46), combined with common contaminant proteins (trypsin, common protein standards, and human keratins). The proteomic analytical pipelines used were as follows:
SEQUEST with DTASelect: SEQUEST, version .27 (52), was used with DTASelect, version 1.9 (53). Raw spectra were extracted via the readW program (ISB, version 4.3.1) for the input to a custom-built Linux-based SEQUEST/DTASelect pipeline running on an 800-node cluster. Searches using the metagenomic databases were performed with the following parameters: parent mass tolerance, 3.0; fragment ion tolerance, 0.5; up to four missed cleavages allowed, variable modification of carboxymethyl cysteine (+57.021 Da) and fully tryptic peptides only.
SEQUEST outputs were sorted and filtered via DTASelect with the following parameters: delCN, >0.08; Xcorr, >1.8 (+1), >2.5 (+2), >3.5 (+3); +1 minimum charge state, +3 maximum charge state, two peptides/protein; and a strict false-discovery rate (FDR) of 0.01 via reverse database search method.
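The filtering criteria listed above can be illustrated with a small Python sketch that applies the charge-specific Xcorr thresholds, the delCN cutoff, and the two-peptides-per-protein rule to a list of peptide-spectrum matches. This does not reproduce DTASelect itself; the dictionary keys and function names are hypothetical, and the FDR estimate from the reverse database search is not included.

```python
from collections import defaultdict

# Thresholds stated in the text: Xcorr by charge state, and minimum delCN
XCORR_MIN = {1: 1.8, 2: 2.5, 3: 3.5}
DELTA_CN_MIN = 0.08


def passes(psm):
    """True if a peptide-spectrum match meets the charge, Xcorr, and delCN criteria."""
    threshold = XCORR_MIN.get(psm["charge"])  # None for charges outside +1 to +3
    return (threshold is not None
            and psm["xcorr"] > threshold
            and psm["delta_cn"] > DELTA_CN_MIN)


def accepted_proteins(psms, min_peptides=2):
    """Proteins supported by at least min_peptides distinct passing peptides."""
    peptides = defaultdict(set)
    for psm in psms:
        if passes(psm):
            peptides[psm["protein"]].add(psm["peptide"])
    return {protein for protein, peps in peptides.items()
            if len(peps) >= min_peptides}
```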
Proteome Discoverer with Percolator: Proteome Discoverer, version 1.4, was used with the Percolator validation algorithm (54). The following workflow was used in the analyses of MS/MS spectra for each sample: Spectrum selector → Scan event filter → Sequest HT → Percolator. Spectrum selector settings: The precursor charge state (high/low), retention time, minimum peak count, and total intensity threshold were all set to default settings. The max and min precursor mass settings were 5,000 and 500 Da, respectively. Scan event filter settings: All settings were set to default. Sequest HT processing node settings: Fully tryptic peptides; max missed cleavage sites = 4; min and max peptide length, 6 and 144, respectively; max delta Cn = 0.05; precursor mass tolerance = 10 ppm; fragment mass tolerance = 0.5 Da; and modification of carboxymethyl cysteine (+57.021 Da).
The criteria used for acceptance of peptide assignments were as follows: two peptides/protein, and high-confidence Xcorr thresholds of 1.8, 2.5, 3.5, and 4.6 for charge states +1 to +4, respectively. Percolator settings: The target value for the decoy database search was a strict FDR of 0.01.
TPP with X! Tandem: Trans-Proteomic Pipeline (TPP), version 4.6, rev. 3 (55), was used with X! Tandem version Cyclone (2013.2.01) (56), a public-domain program (www.thegpm.org/tandem). Raw files were converted into mzXML format using msconvert, version 2, via ProteoWizard (57). The following parameters were set: fragment monoisotopic mass error = 0.5, max parent charge = 4, min parent M+H = 500, precursor mass tolerance = 10 ppm, max missed cleavages = 4, max valid e value = 0.1, and min ion count = 4. X! Tandem output was then converted to pepXML format via TPP. Data were then analyzed by the built-in software component PeptideProphet using default settings. Output data from PeptideProphet were then processed through ProteinProphet, a statistical probability component in TPP. Acceptable proteins possessed a 99% probability or higher.
To compare quantitation across the three search algorithms (Fig. S2), one subset of data (FASP2, 8 h LC-MS/MS run, on the LTQ Orbitrap Velos Pro) was individually analyzed across all three analytical search pipelines. Shared spectra were distributed according to the ratio of unique spectra for each protein (58) by the following equation:

$$T_i = U_i + S_i \, \frac{U_i}{\sum_j U_j}$$

where $U_i$ is the number of unique spectra for the ith protein, $S_i$ is the number of shared spectra for the ith protein, $T_i$ is the total number of spectra recruited to the ith protein, and the sum runs over the proteins in the group sharing those spectra. In cases where no unique spectra were present for any peptides in the group, spectra were not distributed and no quantitation was possible for the associated proteins (noted in Dataset S1).
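The distribution of shared spectra can be sketched in Python as a proportional allocation within one group of proteins that share spectra. This is one reading of the rule described above, not the authors' code; the function name and input layout are hypothetical.

```python
def distribute_shared(unique, shared):
    """Allocate shared spectra in proportion to each protein's unique spectra.

    unique -- dict: protein -> number of unique spectra, for one group of
              proteins that share spectra
    shared -- number of spectra shared by the group
    Returns dict: protein -> total spectra, or None when the group has no
    unique spectra (no quantitation is possible in that case).
    """
    total_unique = sum(unique.values())
    if total_unique == 0:
        return None
    return {protein: count + shared * count / total_unique
            for protein, count in unique.items()}
```

For example, a group with 3 and 1 unique spectra and 8 shared spectra yields totals of 9 and 3, so the shared spectra are split 3:1.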
The Quantitative Dataset is the subset of these data analyzed using TPP with X! Tandem.
Data output.
All viral metagenomic contigs, ORFs, and PCs, and experimentally detected proteins are listed with their associated annotations in Dataset S1. This list contains all identified proteins (1,875), including clarification of those with unique (1,183) and only nonunique (692) spectra. The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium (59) via the PRIDE partner repository with the dataset identifier PXD000938.
Structural prediction analyses.
To predict structures for full-length protein sequences, representative amino acid sequences from the five most abundant protein clusters were submitted to I-TASSER (60) run with default settings. The template modeling score (TM score) for each best-fit template was calculated from the C score using equation 17 from Yang et al. (28).
Acknowledgments
We thank Bonnie Poulos for preparing viral concentrates, Genoscope for viral metagenomic sequencing, members of Tucson Marine Phage Lab for comments on the manuscript, and University Information Technology Services Research Computing Group and the Arizona Research Laboratories Biotechnology Computing for High-Performance Computing Cluster access and support. We thank Kristen Corrier and Manesh Shah of University of Tennessee/Oak Ridge National Laboratory for efforts in filter-aided sample preparation (FASP) preparation of viral samples and MS analyses, and aspects of proteome informatics, respectively. The four viral concentrates were collected as part of exceptional commitment by scientists and sponsors who made the Tara Oceans expedition possible [full list in Brum et al. (6)]. Funding specific to this project was provided by a Ford Foundation Postdoctoral Fellowship (to E.-H.K.), the Gordon and Betty Moore Foundation through Grants GBMF2631 and GBMF3790 (to M.B.S.), and a grant to the UA Ecosystem Genomics Institute through the UA Technology and Research Initiative Fund and the Water, Environmental and Energy Solutions Initiative (to M.B.S. and V.I.R.). This article is contribution 35 of the Tara Oceans Expedition 2009–2012.
Footnotes
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
Data deposition: Sequences, assemblies, annotation, and processed data such as alignments, similarity matrixes, and network files have been deposited in iVirus, mirrors.iplantcollaborative.org/browse/iplant/home/shared/iVirus/TOV_4_metaproteomes. The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE Partner Repository, www.ebi.ac.uk/pride/archive/ (identifier PXD000938).
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1525139113/-/DCSupplemental.
References
- 1. Falkowski PG, Fenchel T, Delong EF. The microbial engines that drive Earth’s biogeochemical cycles. Science. 2008;320(5879):1034–1039. doi: 10.1126/science.1153213.
- 2. Suttle CA. Marine viruses--major players in the global ecosystem. Nat Rev Microbiol. 2007;5(10):801–812. doi: 10.1038/nrmicro1750.
- 3. Sullivan MB, et al. Prevalence and evolution of core photosystem II genes in marine cyanobacterial viruses and their hosts. PLoS Biol. 2006;4(8):e234. doi: 10.1371/journal.pbio.0040234.
- 4. Hurwitz BL, Hallam SJ, Sullivan MB. Metabolic reprogramming by viruses in the sunlit and dark ocean. Genome Biol. 2013;14(11):R123. doi: 10.1186/gb-2013-14-11-r123.
- 5. Hurwitz BL, Sullivan MB. The Pacific Ocean virome (POV): A marine viral metagenomic dataset and associated protein clusters for quantitative viral ecology. PLoS One. 2013;8(2):e57355. doi: 10.1371/journal.pone.0057355.
- 6. Brum JR, et al. Patterns and ecological drivers of ocean viral communities. Science. 2015;348(6237):1261498. doi: 10.1126/science.1261498.
- 7. Hurwitz BL, Westveld AH, Brum JR, Sullivan MB. Modeling ecological drivers in marine viral communities using comparative metagenomics and network analyses. Proc Natl Acad Sci USA. 2014;111(29):10714–10719. doi: 10.1073/pnas.1319778111.
- 8. Hurwitz BL, Brum JR, Sullivan MB. Depth-stratified functional and taxonomic niche specialization in the “core” and “flexible” Pacific Ocean Virome. ISME J. 2015;9(2):472–484. doi: 10.1038/ismej.2014.143.
- 9. Brum JR, Sullivan MB. Rising to the challenge: Accelerated pace of discovery transforms marine virology. Nat Rev Microbiol. 2015;13(3):147–159. doi: 10.1038/nrmicro3404.
- 10. Roux S, Hallam SJ, Woyke T, Sullivan MB. Viral dark matter and virus-host interactions resolved from publicly available microbial genomes. eLife. 2015;4:e08490. doi: 10.7554/eLife.08490.
- 11. Kang I, Oh H-M, Kang D, Cho J-C. Genome of a SAR116 bacteriophage shows the prevalence of this phage type in the oceans. Proc Natl Acad Sci USA. 2013;110(30):12343–12348. doi: 10.1073/pnas.1219930110.
- 12. Zhao Y, et al. Abundant SAR11 viruses in the ocean. Nature. 2013;494(7437):357–360. doi: 10.1038/nature11921.
- 13. Holmfeldt K, et al. Twelve previously unknown phage genera are ubiquitous in global oceans. Proc Natl Acad Sci USA. 2013;110(31):12798–12803. doi: 10.1073/pnas.1305956110.
- 14. Labonté JM, et al. Single-cell genomics-based analysis of virus-host interactions in marine surface bacterioplankton. ISME J. 2015;9(11):2386–2399. doi: 10.1038/ismej.2015.48.
- 15. Roux S, et al. Ecology and evolution of viruses infecting uncultivated SUP05 bacteria as revealed by single-cell- and meta-genomics. eLife. 2014;3:e03125. doi: 10.7554/eLife.03125.
- 16. Ignacio-Espinoza JC, Solonenko SA, Sullivan MB. The global virome: Not as big as we thought? Curr Opin Virol. 2013;3(5):566–571. doi: 10.1016/j.coviro.2013.07.004.
- 17. Seguritan V, et al. Artificial neural networks trained to detect viral and phage structural proteins. PLoS Comput Biol. 2012;8(8):e1002657. doi: 10.1371/journal.pcbi.1002657.
- 18. Lavigne R, Ceyssens P-J, Robben J. 2009. Phage proteomics: Applications of mass spectrometry. Bacteriophages: Methods and Protocols, Volume 2: Molecular and Applied Aspects, eds Clokie MRJ, Kropinski AM (Humana, New York), pp 239–251.
- 19. Allen MJ, Howard JA, Lilley KS, Wilson WH. Proteomic analysis of the EhV-86 virion. Proteome Sci. 2008;6:11. doi: 10.1186/1477-5956-6-11.
- 20. Sullivan MB, et al. The genome and structural proteome of an ocean siphovirus: A new window into the cyanobacterial “mobilome.” Environ Microbiol. 2009;11(11):2935–2951. doi: 10.1111/j.1462-2920.2009.02081.x.
- 21. Hettich RL, Sharma R, Chourey K, Giannone RJ. Microbial metaproteomics: Identifying the repertoire of proteins that microorganisms use to compete and cooperate in complex environmental communities. Curr Opin Microbiol. 2012;15(3):373–380. doi: 10.1016/j.mib.2012.04.008.
- 22. VerBerkmoes NC, Denef VJ, Hettich RL, Banfield JF. Systems biology: Functional analysis of natural microbial consortia using community proteomics. Nat Rev Microbiol. 2009;7(3):196–205. doi: 10.1038/nrmicro2080.
- 23. Karsenti E, et al.; Tara Oceans Consortium. A holistic approach to marine eco-systems biology. PLoS Biol. 2011;9(10):e1001177. doi: 10.1371/journal.pbio.1001177.
- 24. Sullivan MB, et al. Genomic analysis of oceanic cyanobacterial myoviruses compared with T4-like myoviruses from diverse hosts and environments. Environ Microbiol. 2010;12(11):3035–3056. doi: 10.1111/j.1462-2920.2010.02280.x.
- 25. Yooseph S, et al. The Sorcerer II Global Ocean Sampling expedition: Expanding the universe of protein families. PLoS Biol. 2007;5(3):e16. doi: 10.1371/journal.pbio.0050016.
- 26. Gan L, et al. Capsid conformational sampling in HK97 maturation visualized by X-ray crystallography and cryo-EM. Structure. 2006;14(11):1655–1665. doi: 10.1016/j.str.2006.09.006.
- 27. Zhang X, et al. A new topology of the HK97-like fold revealed in Bordetella bacteriophage by cryoEM at 3.5 A resolution. eLife. 2013;2:e01299. doi: 10.7554/eLife.01299.
- 28. Yang J, et al. The I-TASSER Suite: Protein structure and function prediction. Nat Methods. 2015;12(1):7–8. doi: 10.1038/nmeth.3213.
- 29. Tétart F, et al. Phylogeny of the major head and tail genes of the wide-ranging T4-type bacteriophages. J Bacteriol. 2001;183(1):358–366. doi: 10.1128/JB.183.1.358-366.2001.
- 30. Fokine A, et al. Structural and functional similarities between the capsid proteins of bacteriophages T4 and HK97 point to a common ancestry. Proc Natl Acad Sci USA. 2005;102(20):7163–7168. doi: 10.1073/pnas.0502164102.
- 31. Gipson P, et al. Protruding knob-like proteins violate local symmetries in an icosahedral marine virus. Nat Commun. 2014;5:4278. doi: 10.1038/ncomms5278.
- 32. Liu X, et al. Structural changes in a marine podovirus associated with release of its genome into Prochlorococcus. Nat Struct Mol Biol. 2010;17(7):830–836. doi: 10.1038/nsmb.1823.
- 33. Bamford DH, Grimes JM, Stuart DI. What does structure tell us about virus evolution? Curr Opin Struct Biol. 2005;15(6):655–663. doi: 10.1016/j.sbi.2005.10.012.
- 34. Veesler D, Cambillau C. A common evolutionary origin for tailed-bacteriophage functional modules and bacterial machineries. Microbiol Mol Biol Rev. 2011;75(3):423–433. doi: 10.1128/MMBR.00014-11.
- 35. Pietilä MK, et al. Structure of the archaeal head-tailed virus HSTV-1 completes the HK97 fold story. Proc Natl Acad Sci USA. 2013;110(26):10604–10609. doi: 10.1073/pnas.1303047110.
- 36. Morais MC, et al. Conservation of the capsid structure in tailed dsDNA bacteriophages: The pseudoatomic structure of ϕ29. Mol Cell. 2005;18(2):149–159. doi: 10.1016/j.molcel.2005.03.013.
- 37. Simmons SL, et al. Population genomic analysis of strain variation in Leptospirillum group II bacteria involved in acid mine drainage formation. PLoS Biol. 2008;6(7):e177. doi: 10.1371/journal.pbio.0060177.
- 38. Schloissnig S, et al. Genomic variation landscape of the human gut microbiome. Nature. 2013;493(7430):45–50. doi: 10.1038/nature11711.
- 39. Pesant S, et al.; Tara Oceans Consortium Coordinators. Open science resources for the discovery and analysis of Tara Oceans data. Sci Data. 2015;2:150023. doi: 10.1038/sdata.2015.23.
- 40. John SG, et al. A simple and efficient method for concentration of ocean viruses by chemical flocculation. Environ Microbiol Rep. 2011;3(2):195–202. doi: 10.1111/j.1758-2229.2010.00208.x.
- 41. Thurber RV, Haynes M, Breitbart M, Wegley L, Rohwer F. Laboratory procedures to generate viral metagenomes. Nat Protoc. 2009;4(4):470–483. doi: 10.1038/nprot.2009.10.
- 42. Patel A, et al. Virus and prokaryote enumeration from planktonic aquatic environments by epifluorescence microscopy with SYBR Green I. Nat Protoc. 2007;2(2):269–276. doi: 10.1038/nprot.2007.6.
- 43. Hurwitz BL, Deng L, Poulos BT, Sullivan MB. Evaluation of methods to concentrate and purify ocean virus communities through comparative, replicated metagenomics. Environ Microbiol. 2013;15(5):1428–1440. doi: 10.1111/j.1462-2920.2012.02836.x.
- 44. Luo R, et al. SOAPdenovo2: An empirically improved memory-efficient short-read de novo assembler. Gigascience. 2012;1(1):18. doi: 10.1186/2047-217X-1-18.
- 45. Kultima JR, et al. MOCAT: A metagenomics assembly and gene prediction toolkit. PLoS One. 2012;7(10):e47656. doi: 10.1371/journal.pone.0047656.
- 46. Hyatt D, et al. Prodigal: Prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010;11:119. doi: 10.1186/1471-2105-11-119.
- 47. Edgar RC. MUSCLE: A multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics. 2004;5:113. doi: 10.1186/1471-2105-5-113.
- 48. Lee WP, et al. MOSAIK: A hash-based algorithm for accurate next-generation sequencing short-read mapping. PLoS One. 2014;9(3):e90581. doi: 10.1371/journal.pone.0090581.
- 49. Shannon P, et al. Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13(11):2498–2504. doi: 10.1101/gr.1239303.
- 50. Wiśniewski JR, Zougman A, Nagaraj N, Mann M. Universal sample preparation method for proteome analysis. Nat Methods. 2009;6(5):359–362. doi: 10.1038/nmeth.1322.
- 51. Washburn MP, Wolters D, Yates JR 3rd. Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat Biotechnol. 2001;19(3):242–247. doi: 10.1038/85686.
- 52. Eng JK, McCormack AL, Yates JR. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom. 1994;5(11):976–989. doi: 10.1016/1044-0305(94)80016-2.
- 53. Tabb DL, McDonald WH, Yates JR 3rd. DTASelect and Contrast: Tools for assembling and comparing protein identifications from shotgun proteomics. J Proteome Res. 2002;1(1):21–26. doi: 10.1021/pr015504q.
- 54. Käll L, Canterbury JD, Weston J, Noble WS, MacCoss MJ. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat Methods. 2007;4(11):923–925. doi: 10.1038/nmeth1113.
- 55. Deutsch EW, et al. A guided tour of the Trans-Proteomic Pipeline. Proteomics. 2010;10(6):1150–1159. doi: 10.1002/pmic.200900375.
- 56. Craig R, Beavis RC. TANDEM: Matching proteins with tandem mass spectra. Bioinformatics. 2004;20(9):1466–1467. doi: 10.1093/bioinformatics/bth092.
- 57. Kessner D, Chambers M, Burke R, Agus D, Mallick P. ProteoWizard: Open source software for rapid proteomics tools development. Bioinformatics. 2008;24(21):2534–2536. doi: 10.1093/bioinformatics/btn323.
- 58. Mondav R, et al. Discovery of a novel methanogen prevalent in thawing permafrost. Nat Commun. 2014;5:3212. doi: 10.1038/ncomms4212.
- 59. Vizcaíno JA, et al. ProteomeXchange provides globally coordinated proteomics data submission and dissemination. Nat Biotechnol. 2014;32(3):223–226. doi: 10.1038/nbt.2839.
- 60. Zhang Y. I-TASSER server for protein 3D structure prediction. BMC Bioinformatics. 2008;9:40. doi: 10.1186/1471-2105-9-40.
- 61. Wilkinson L. Exact and approximate area-proportional circular Venn and Euler diagrams. IEEE Trans Vis Comput Graph. 2012;18(2):321–331. doi: 10.1109/TVCG.2011.56.
- 62. R Core Team. 2012. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, Vienna).
- 63. Schlitzer R. 2011. Ocean Data View. Version 4.4.4. Available at odv.awi.de.