Author manuscript; available in PMC: 2021 Apr 11.
Published in final edited form as: J Proteome Res. 2020 Dec 17;20(1):326–336. doi: 10.1021/acs.jproteome.0c00382

Development of an Ocean Protein Portal for Interactive Discovery and Education

Mak A Saito 1, Jaclyn K Saunders 2, Michael Chagnon 3, David A Gaylord 4, Adam Shepherd 5, Noelle A Held 6, Christopher Dupont 7, Nicholas Symmonds 8, Amber York 9, Matthew Charron 10, Danie B Kinkade 11
PMCID: PMC8036901  NIHMSID: NIHMS1683341  PMID: 32897077

Abstract

Proteins are critical in catalyzing chemical reactions, forming key cellular structures, and in regulating cellular processes. Investigation of marine microbial proteins by metaproteomics methods enables the discovery of numerous aspects of microbial biogeochemical processes. However, these datasets present big data challenges as they often involve many samples collected across broad geospatial and temporal scales, resulting in thousands of protein identifications, abundances, and corresponding annotation information. The Ocean Protein Portal (OPP) was created to enable data sharing and discovery among multiple scientific domains and serve both research and education functions. The portal focuses on three use case questions: “Where is my protein of interest?”, “Who makes it?”, and “How much is there?” and provides profile and section visualizations, real-time taxonomic analysis, and links to metadata, sequence analysis, and other external resources to enable connections to be made between biogeochemical and proteomics datasets.

Keywords: metaproteomics, ocean, biogeochemistry, data sharing, web portal

INTRODUCTION

For decades, environmental scientists have relied on standard measurements to assess ecosystem change and health, such as temperature, oxygen concentration, nutrient content, chlorophyll abundance, and so on.1-3 These approaches, while essential for ecosystem-level understanding, are limited in their ability to explore what groups of organisms within those ecosystems are experiencing and how they may be responding to environmental change. Recent improvements in “omics” capabilities (genomics, transcriptomics, proteomics, and metabolomics) now allow researchers to begin to open the “black box” of ecosystems to investigate each organism’s catalog of genes (genome), how they choose to deploy those genes in specific environmental settings (transcripts and proteins), and the resulting impact on metabolism and the chemical environment (metabolites).4-7 While these new capabilities are exciting, research is still in the relatively early stages of maximizing their utility. Moreover, because every individual biological sample can return thousands to millions of units of raw data (sequence or spectra), these datatypes are firmly in the realm of big data and bring unique informatic challenges.

We have developed a web portal called the “Ocean Protein Portal” that focuses on developing and improving the delivery of data products related to the measurement of proteins in the oceans, usually referred to as ocean metaproteomics. Oceans cover ~70% of the Earth’s surface and play a critical role in maintaining habitable conditions on the planet. Thus, the continued health of the oceans is an issue of sustainability. Moreover, the ocean and terrestrial microbial communities are responsible for most of the biogeochemical reactions that created and maintain habitable conditions on Earth.8 The direct measurement of proteins in the oceans has generated considerable excitement because proteins are the functional units of the cell. They represent where “the rubber meets the road”: enzymatic proteins are the biomolecules that interface with the environment and conduct biogeochemical reactions (Figure 1), rather than the blueprint of genetic potential that genomic data provides. Similarly, while RNA measurements provide information about the transcription of genes, the shorter timescales of RNA production and decay need to be considered in their interpretation. Protein measurements, with their longer timescales, can be applied as biomarkers of ecosystem health. Additionally, enzymatic proteins that are directly responsible for biogeochemical reactions can be measured and their activities estimated to validate global ecosystem models. Individual key proteins have been used to detect specific responses of microbial organisms to nutrients and environmental stressors (e.g., iron, nitrogen, phosphorus, and metabolite transporters)5,6,9-16 or important biogeochemical reactions (e.g., enzymes that catalyze carbon and nitrogen biogeochemical reactions).6,17-20 As a result, there is a growing interest among experimentalists, observationalists, and modelers to use metaproteomic data for contextual information about their research.

Figure 1.

Example three-dimensional (3D) structures of common proteins found in the marine environment with important functional roles and routinely found within the OPP. Left: viral protein capsid of a marine cyanophage.54 Center: TonB vitamin transporter spanning the cell membrane.55 Right: carboxysome shell protein (CsoS1D) from Prochlorococcus marinus MED4 (PDB codes 2XD8, 2GSK, 3FCH).56 While genomics shows the potential to make these proteins, protein measurements can show the response of each organism to environmental cues by biosynthesis of specific proteins.

The fields of environmental genomic and transcriptomic informatics are more mature than protein informatics, with millions of dollars invested to date on data access and analysis portals, including the defunct CAMERA project,21,22 the Department of Energy Joint Genome Institute’s Integrated Microbial Genomes and Metagenomes server (DOE JGI-IMG-M),23 the Ocean Gene Atlas that uses the Tara Ocean expedition dataset,24,25 and iMicrobe.26 In comparison, the Ocean Protein Portal is, to our knowledge, the first investment to date focused on environmental metaproteomic data that produced an operational product in active use across multiple science domains, including oceanography, geobiology, microbiology, and biochemistry communities. Here, we describe Version 1.1 of the Ocean Protein Portal as a means to promote the use of ocean metaproteomic data in research across multiple scientific domains and education.

METHODS

Technical Aspects

The OPP Version 1.1 is currently built using an Elasticsearch database for protein and peptide data, which is accessed by a UI built with Django, JavaScript, OceansMap, Bokeh, and Matplotlib code.27,28 METATRYP Version 2 Least Common Ancestor software uses PostgreSQL and Python.29 Data are ingested through a process in which data generators deposit data for the three file types described in Table 1 according to specified data templates while working with a BCO-DMO data manager. Complete research expedition metadata and colocated environmental datasets are discoverable through the BCO-DMO project pages (linked from the OPP). Both OPP and METATRYP are hosted on servers at the Woods Hole Oceanographic Institution (Table 2).

Table 1.

File Types Required by the OPP for Full Functionality

file description | file type
protein identifications and abundance | CSV template
peptide identifications and abundance | CSV template
amino acid sequences of identified proteins | FASTA text file
metadata expedition and dataset forms | rich text format (RTF)

Table 2.

Ocean Protein Portal Websites and Submission Resources

description | web address
The Ocean Protein Portal | www.oceanproteinportal.org
METATRYP Version 2 Least Common Ancestor Analysis tool | https://metatryp.whoi.edu
data submission instructions and protein and peptide data templates | https://github.com/oceanproteinportal/data-file-templates
metadata form | https://www.bco-dmo.org/files/bcodmo/DATASET.rtf

An ingestion pipeline has been developed through the application of metaproteomic domain-specific data templates into Elasticsearch using custom scripts and Minio file storage and has been tested within the BCO-DMO informatics ingestion and data management pipeline. This ingestion pipeline approach utilizing specified templates eases the database relationship connections in Elasticsearch among the data field names in accordance with the specified OPP ontology. We also used the Frictionlessdata data package to link the three files together, which can be expanded upon for further development of the OPP. The ontology design for processing these datatypes follows the Research Data Alliance output and recommendations from their Data Type Registry Working Group.
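As a rough illustration of how a data package can link the three deposited files, the following minimal sketch builds a Data Package-style descriptor as a plain Python dictionary. The dataset name, file names, and resource names are hypothetical placeholders, and the OPP's actual descriptor and template fields may well differ.

```python
import json


def build_descriptor(dataset_name, protein_csv, peptide_csv, fasta):
    """Return a minimal Data Package-style descriptor linking three files.

    All names passed in are illustrative; this is not the OPP's own descriptor.
    """
    return {
        "name": dataset_name,
        "resources": [
            {"name": "proteins", "path": protein_csv, "format": "csv"},
            {"name": "peptides", "path": peptide_csv, "format": "csv"},
            {"name": "sequences", "path": fasta, "format": "fasta"},
        ],
    }


# Hypothetical file names for one deposited dataset.
descriptor = build_descriptor(
    "example-dataset", "proteins.csv", "peptides.csv", "proteins.fasta"
)
print(json.dumps(descriptor, indent=2))
```

Keeping the three files under one descriptor means an ingestion script can locate all of them from a single entry point, and further resources could be appended later without changing the existing entries.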

RESULTS AND DISCUSSION

Ocean Protein Portal as a Resource to Study Protein Biogeography and Function in the Oceans

The portal arose from community interest and use case development from the EarthCube ECO-GEO Research Coordination Network focused on environmental omics data. The OPP team represents a collaboration between domain scientists, informaticists, data managers, and computer programmers. The OPP use cases were designed to allow a broad range of scientists and students to discover answers to the questions: (1) “Where is my protein of interest in the oceans?”, (2) “Who makes the protein?”, and (3) “How much is there?”. The OPP is primarily a mechanism to study a single protein query at a time rather than a tool for a comprehensive analysis of a metaproteomic dataset. We previously published a metaproteomic viewer that facilitates some metaproteomic data visualization and analysis.16 Thus far, the OPP has achieved milestones in two major categories: the launch of a functioning web user interface (UI) with the essential backend infrastructure it requires, and educational and outreach activities to promote the study of proteins in environmental settings utilizing the OPP web UI (Table 3).

Table 3.

Accomplishments To Date for the Ocean Protein Portal Project

activity | reference
Best Practices Workshop and Publication for Data Sharing and Metadata | Saito et al.30
Development of Ocean Metaproteomic Viewer python software as test bed for OPP visualizations. Presented and published at Scientific Python conference. | Held et al.16
Launch of METATRYP Least Common Ancestor Software and API (used by portal) February 2018 | Saunders et al.29
Launch of Ocean Protein Portal Version 1 capable of answering use case questions: (1) Where is protein of interest? (2) who makes it? (3) how much is there? February 2019 |
First year metrics over 1000 uses of portal as of March 2020 |
Ingestion of protein datasets from Arctic, Antarctic, Pacific, and Atlantic Ocean. Future large datasets expected from Atlantic, Pacific, and Antarctic regions, including from BATS and HOT time series stations and the CICLOPS Ross Sea expedition. | ongoing
Use of Ocean Protein Portal in undergraduate and graduate education | Mt Alison College, MIT-WHOI, others
Use of OPP for the discovery of novel functional protein distributions and publication of data | Mazzotta et al.52

The OPP UI enables users to answer the three use case questions above for their protein of interest in the oceans via multiple search strategies (Figure 2). The simplest is by entering its common name, for example, the carbon fixing enzyme “RUBISCO”, into the “Search Value” text box with the “Search Term” Protein Name selected. Wildcard searches (using “*”, for example “carboxy*”) are also allowed since protein names are not standardized and multiple names can be used to describe the same protein in the literature. Alternatively, users can search using accession numbers of various standardized bioinformatic identifiers, such as KEGG (Kyoto Encyclopedia of Genes and Genomes), UniProt, PFam (Protein Family), or EC (Enzyme Commission number), that allow cross-platform connectivity. Finally, peptide and full protein sequence searches are possible. For full protein sequences, the user enters the protein amino acid sequence, and the OPP breaks the sequence into smaller tryptic peptides—the tryptic peptides being the measured components of the deposited proteomics data—then searches for exact matches of those component peptides in the OPP database. All searches can be narrowed by various parameters (concentration, depth, filter size, dataset, and date range) using the sidebar widgets. Queries return a table of all matches, listing their protein and KEGG names, the dataset and expedition they were identified within, and the quantitative abundance within that dataset (in spectral count units currently). A map of station locations where the queried protein was identified is shown (stations where the protein is found become highlighted; Figure 2), and a map hover over capability provides expedition metadata.
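The wildcard name search and the full-sequence search described above can be sketched as follows. This is a minimal illustration rather than the OPP's implementation: it assumes the common trypsin rule (cleave after K or R unless the next residue is P) with no missed cleavages, an arbitrary minimum peptide length of six residues, and a hypothetical in-memory peptide index in place of the Elasticsearch database.

```python
import fnmatch


def tryptic_digest(sequence, min_length=6):
    """Split a protein into tryptic peptides (cleave after K/R, not before P)."""
    seq = sequence.upper()
    peptides, start = [], 0
    for i, residue in enumerate(seq):
        next_residue = seq[i + 1] if i + 1 < len(seq) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])  # C-terminal peptide without K/R
    return [p for p in peptides if len(p) >= min_length]


def search_by_name(pattern, protein_names):
    """Case-insensitive wildcard match, e.g. 'carboxy*'."""
    pat = pattern.lower()
    return [n for n in protein_names if fnmatch.fnmatchcase(n.lower(), pat)]


def search_by_sequence(sequence, peptide_index):
    """Return the digested peptides that exactly match an indexed peptide."""
    return {
        pep: peptide_index[pep]
        for pep in tryptic_digest(sequence)
        if pep in peptide_index
    }


# Hypothetical index: peptide -> proteins in which it was measured.
index = {"TAYIAK": {"protein_1"}, "AAALVNR": {"protein_1", "protein_2"}}
hits = search_by_sequence("MKTAYIAKQRPLSDEGKAAALVNR", index)
```

Because matching is exact at the peptide level, a query protein that differs from a deposited sequence by a single residue can still return hits for every tryptic peptide that does not contain that residue.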

Figure 2.

Operational Ocean Protein Portal. A product name search (“major capsid protein”) showing capsid proteins from marine viruses (inset Table), vertical profile of capsid proteins (left inset window), protein sequence (center inset), and sectional distribution (right inset) of a major capsid protein from cyanophage, overlaid on the background map of stations (e.g., pink stations). This protein is used to make the physical body of the virus capsid sphere shown in Figure 1 (left), and its distribution across several thousand kilometers of ocean space in the Central Pacific Ocean can be determined with a simple search in the OPP. This protein is one of over 100,000 proteins ingested to date that can be searched for and visualized in the OPP.

After the initial query, users have three options at their disposal for further investigation of their protein of interest. First, users can visualize protein abundance in a vertical profile (one-dimensional (1D) by depth, “Profile Plot” button) or ocean section (two-dimensional (2D) by depth with interpolation across transect distance, “View Section” button) mode as pop-up windows (Figure 2). These visualizations use the open-source Python tools Bokeh and Matplotlib and were prototyped by Held et al.16 Next, users can utilize a suite of links to other bioinformatic resources specific to their protein of interest, leveraging the capabilities of other pre-existing tools. These include BLAST sequence searches (“View Sequence”) that automatically insert the protein amino acid sequence into the NIH National Center for Biotechnology Information’s (NCBI) blastp search box, facilitating a search of the NIH sequence database, as well as a hyperlink to the European Bioinformatics Institute’s UniProt sequence database page for the closest related UniProt protein match, when available. The “Expedition” hyperlink routes the user to the full metadata and environmental datasets associated with the sample’s expedition hosted on the ocean environmental data repository at the Biological and Chemical Oceanography Data Management Office (BCO-DMO). Information about datasets and expeditions is also available on the “About Datasets” tab, including contact information of the data generators for each dataset.

Finally, the OPP has a compute capability that enables users to answer “who” is making their protein of interest (Figure 3). This is a key question within the field of metaproteomics because of the possibility that peptide constituents of proteins could be found in multiple organisms present within an individual sample. To address this, the OPP utilizes the software tool METATRYP we previously created that searches a database of all tryptic peptides among a group of organisms specified or within meta-omic assemblies from the environment.5,29 METATRYP then identifies peptides that are shared among multiple organisms and reports which organisms share the peptides and calculates the “Least Common Ancestor” (LCA) of the identified taxa possessing this peptide. The METATRYP databases use the NCBI Taxonomy database to identify the ancestral phylogeny of the taxa identified that possess the peptide in question. This analysis happens in real time using an API call to the “metatryp.whoi.edu” resource. By clicking on the “peptides” link, the user progresses to the “Peptide Found” table, where each peptide component can then be examined for its presence across numerous genomes and metagenome resources. The results can be visualized in heatmap and tree formats to allow the user to gain an immediate understanding of who the Least Common Ancestor of the protein constituent is and their associated taxonomic lineages (Figure 3). The OPP is currently using METATRYP Version 2 that has improved performance, can calculate the Least Common Ancestor, and has the capacity to separate its database into genomes, metagenomes, and metagenomic products that are described in a separate manuscript.29
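The Least Common Ancestor idea can be illustrated with a short sketch: given the taxonomic lineage of every organism that contains a peptide, the LCA is the deepest taxon shared by all of them. The code below is not METATRYP's implementation, and the lineages are simplified, hand-written examples rather than records pulled from the NCBI Taxonomy database.

```python
def least_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-to-leaf lineages.

    Returns None when the lineages share no taxon (or none are given).
    """
    if not lineages:
        return None
    shared = None
    for ranks in zip(*lineages):  # walk down from the root, rank by rank
        if len(set(ranks)) != 1:
            break
        shared = ranks[0]
    return shared


# Simplified, illustrative lineages (assumptions, not NCBI records).
prochlorococcus = ["Bacteria", "Cyanobacteria", "Synechococcales", "Prochlorococcus"]
synechococcus = ["Bacteria", "Cyanobacteria", "Synechococcales", "Synechococcus"]
proteobacterium = ["Bacteria", "Proteobacteria", "Halothiobacillus"]

narrow = least_common_ancestor([prochlorococcus, synechococcus])
broad = least_common_ancestor([prochlorococcus, synechococcus, proteobacterium])
```

A peptide found only in the two cyanobacterial lineages resolves to a relatively specific taxon, whereas adding a third, distantly related organism pushes the LCA up to the domain level, which is the kind of broad result a conserved protein would be expected to return.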

Figure 3.

Example of least common ancestor (LCA) analysis representing the taxonomic groups that a queried peptide is found within using the METATRYP API within the OPP UI. This carboxysome shell protein is conserved across multiple bacterial phyla, resulting in a similar broad bacteria LCA level returned using both genomes (left) and metagenome (right) databases within METATRYP.

Data Ingestion Templates and Data/Metadata Management

Ocean metaproteomic data is not currently standardized in terms of processed output fields and metadata. As a result, the process of ingesting data from a diverse data generator community can be challenging from a data management perspective. Efficient data ingestion is key to sustainability both with regards to the recruitment of voluntary data submissions to the OPP by data generators, and in terms of the effort needed by data managers and computing staff to successfully ingest data to allow it to function properly within the OPP system.

Through our collaboration with BCO-DMO, we have developed a data ingestion template to facilitate the incorporation of complex metaproteomic datasets into the OPP from a variety of data generators with their diverse informatic pipelines (Table 1). This effort leveraged community-driven best practices that arose from the EarthCube-supported Ocean Metaproteomics Data Sharing workshop.30 For every spectral count datapoint, there are 10 associated metadata fields and 13 annotation fields that can be captured by the current OPP schema. Example metadata reported for each sample include sampling location (latitude and longitude), depth, date, time, expedition identifier, station number, and filter pore size. Some of these parameters are required, such as the geospatial metadata, while other parameters are optional, such as various annotation fields dependent upon the resolution of the data generator’s annotation informatics pipeline. We currently do not reannotate deposited datasets but hope to add that capability in the future, which will allow standardized searching across datasets for proteins of interest. To facilitate this, the database structure is built to allow updated versions with additional supplementary annotation fields that could capture new microbiological and protein function discoveries in previously deposited datasets while maintaining the data generators’ initial annotations, which may link with published research.
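A check of the required geospatial metadata could look like the sketch below. The field names ("latitude", "longitude", "depth_m") and the decision to treat only latitude and longitude as required are assumptions made for illustration; the authoritative field list is the one in the OPP data templates.

```python
def validate_sample(record):
    """Return a list of problems found in one sample's geospatial metadata."""
    problems = []
    # Assumed field names, each with its maximum absolute value in degrees.
    limits = {"latitude": 90.0, "longitude": 180.0}
    for field, limit in limits.items():
        value = record.get(field)
        if value in (None, ""):
            problems.append(f"missing required field: {field}")
            continue
        try:
            number = float(value)
        except (TypeError, ValueError):
            problems.append(f"{field} is not numeric: {value!r}")
            continue
        if abs(number) > limit:
            problems.append(f"{field} out of range: {number}")
    return problems


# Optional fields such as depth are simply passed through unchecked here.
ok = validate_sample({"latitude": "31.67", "longitude": "-64.17", "depth_m": 80})
bad = validate_sample({"latitude": 95, "longitude": ""})
```

Running such a check at submission time, before ingestion, would let a data generator correct a record while the original sampling logs are still at hand.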

Challenges of Comparisons Across Datasets, Units, and Normalization in the OPP

The current design of the OPP allows users to examine where a protein of interest is in the ocean microbial community, if that protein occurs in at least one of the ingested datasets. One challenge currently is that most ocean metaproteomic data is collected in relative abundance units of spectral counts or precursor/fragment ion intensities (e.g., peptide MS1 peak areas or MS2 peak intensities from DIA datasets), making quantitative comparisons between datasets difficult because of varying instrumentation detection limits and informatic pipelines. The best solution to this is to shift eventually to absolute quantitation of copies of protein per volume of seawater (e.g., fmol/L), which can be compared across space and time with confidence. While absolute quantitation has been used in the ocean, using a technique called targeted metaproteomics,5,6,20 this datatype is currently scarcer than relative abundance “global” metaproteomics data. Moreover, intercomparison and intercalibration of the analytical method is needed to validate quantitative values across different data generating laboratories and periods of analysis within laboratories.

Despite these challenges, users will be tempted to compare abundances of their protein of interest across different datasets within the global ocean, comparing different expeditions. While such comparisons may be useful with a binary approach (presence/absence) or relative quantitation approach, we have cautioned users against meta-analyses. Instead, we encourage users to contact data generators and, if appropriate, to collaborate with them on interpretation of results to avoid misinterpreting data, as explained in the OPP data use policy found below and on the “About OPP” page in the UI.

Normalization of data for relative quantitative comparisons is also a factor to consider in the interpretation and comparison of results. While the OPP does not currently have any stipulation as to the type of spectral count units being used, we currently encourage the use of nonexclusive total (unnormalized) spectral counts to avoid poor search query performance and/or limit distortion of marine vertical community structure. This rationale, which is specific to metaproteomics and ocean vertical profile sampling, respectively, is explained here in more detail. For background, a spectral count is an easy-to-calculate unit defined as the count of mass spectra in which a peptide is identified. Within each sample analysis, tens of thousands of spectra are typically collected, and spectra that match to a peptide from proteins predicted by a specified sequence database are tabulated by peptide-to-spectrum mapping (PSM) algorithms. Software that calculates spectral counts often has the ability to calculate normalized spectral counts; for example, in one normalization strategy, the spectral count of each protein is divided by the total number of spectral counts within the sample and multiplied by the average spectral counts across all samples. These normalizations can be problematic in metaproteomics samples because the number of PSMs and resultant total spectral counts can vary greatly between sampling depths, sites, and times as large changes in biological community structure occur. Decreases in total spectral counts may be due to limitations of the database being used, with fewer peptide identifications with depth, or to increased interferences by organic molecules and degraded peptides that are known to be prevalent with depth.17,18 An example of this problem is shown in Figure 4, which presents data-dependent global proteome analyses of microbial biomass samples from 20 to 800 m depth in the North Atlantic Ocean (see Breier et al., submitted for complete dataset description). For this dataset, samples were all injected with a uniform amount of protein (2 μg) onto the LC-MS (Thermo Q-Exactive) using an identical chromatographic gradient. Despite this constant amount of injected material, the profile resulted in more PSMs observed in the shallower waters (shown by the greater sum of total spectral counts, at 0.7 and 0.26% false discovery rates on the protein and peptide level, respectively), where microbial biomass is more abundant and better characterized by metagenomic databases. Four representative microbial proteins that have maxima at different depths show how normalization can cause considerable biases in their vertical structure. After normalization, surface proteins (UrtA1 and TufA) appear less abundant, and deep proteins (GroEL and OpuAC) more abundant, than in the total spectral counts at each depth. Based on these biases, it is not clear that this type of normalization provides benefit to the analysis. Alternatively, normalization to total protein extracted with depth may be more useful to realistically portray protein distributions (Saunders et al., in prep).

Figure 4.

Normalization biases in metaproteomic data across depth in the ocean at the Bermuda Atlantic Time Series Station (31°40′N 64°10′W) in the North Atlantic Ocean on April 14, 2018 collected on 0.2 μm filters by Clio AUV. Left: Sum of total spectral counts (SC) for all proteins at each depth (red circles) and sum of spectral counts after normalizing to the average of all samples (blue crosses). Profiles for four microbial proteins that are abundant at the surface (urea transporter UrtA1), chlorophyll maximum (elongation factor TufA 80 m), mid-depth (chaperonin GroEL at 175 m), and deep (ligand-binding protein OpuAC family at 800 m). Changes in the biological community result in greater numbers of peptide-to-spectrum matches in the upper water column. This creates biases when normalization is conducted across depths by treating them as “similar” biosamples, with decreased shallow and increased deep normalized counts compared to the total counts. Data from Breier et al., submitted.

The NSAF normalization (normalized spectral abundance factor) and similar approaches (APEX; emPAI) that take into account protein length are also often used to prevent bias toward the identification of large proteins with many tryptic peptides over shorter protein sequences with fewer tryptic peptides.31-33 These corrections seem logical in laboratory experiments, but the metagenomic and metatranscriptomic databases that spectra are mapped to are often replete with incomplete open reading frames, resulting in incorrect molecular-weight estimations and, in turn, incorrect length corrections. Hence, we currently caution against the use of NSAF units within the OPP, at least until the use of newer metagenomic assembly techniques becomes more widespread, such as when peptide-to-spectrum matching solely to metagenome-assembled genomes (MAGs) and single amplified genomes (SAGs) is possible, reducing concern for the presence of incomplete open reading frames.

Finally, there can be calculations of exclusive spectral counts, where each spectrum is only allowed to map to one sequence within the database, even if that peptide sequence is found within multiple proteins from the PSM search database. The occurrence of a peptide within multiple metagenomic or metatranscriptomic reads is a common occurrence within metaproteomics as the natural diversity found within the environment can be captured with sequencing, resulting in multiple sequence assemblies that have both high sequence identities and share identical tryptic peptides. Software such as Scaffold by Proteome Software allows output of “exclusive” spectral counts where spectra of peptides are restricted to map to only one protein sequence through the use of a straightforward parsimony algorithm where the protein that has the most peptide matches captures those spectral counts, or alternative “total” spectral counts where those peptide spectra are allowed to map to multiple proteins simultaneously. In cases where a meta-analysis of an entire dataset is being conducted for overall protein taxonomic diversity or function, the use of exclusive spectral counts is important to avoid double counting peptides. In contrast, in the single protein-query use case that the OPP is built for, allowing sharing of those peptides can actually be important in allowing exploration of the diversity of protein sequences that exist because exclusive spectral counts can “rob” the peptides from alternate near-identical protein sequences that may also be present, potentially suppressing the identification of rarer proteins in these communities. While a future update of the OPP could facilitate switching between multiple unit types (e.g., total, exclusive, normalized to total protein spectral counts, or total protein abundance), it is nonetheless important to articulate the implications and pitfalls of each approach in dealing with complex metaproteomic datasets. 
We encourage those doing high-resolution relative quantitation analyses to contact data submitters who can advise in the generation and interpretation of normalized spectral data in consideration of these technical challenges described. While emerging targeted metaproteomic data in absolute abundance units (fmoles per liter of seawater) will avoid many of these normalization and attribution problems, the ease with which relative abundance datasets containing thousands of proteins (in spectral counts or peak intensities) are generated makes them attractive to broad audiences for hypothesis generation and discovery, and hence the OPP is designed to serve this datatype.
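The difference between total and exclusive spectral counts discussed above can be sketched in a few lines. The exclusive assignment here is a deliberately simplified stand-in for a parsimony rule (each peptide's spectra go to the candidate protein with the most observed peptides, with ties broken alphabetically, which is an assumption); it is not the algorithm used by Scaffold or any other package. Peptide and protein names are hypothetical.

```python
def total_counts(peptide_counts, protein_peptides):
    """Shared peptides contribute their spectra to every protein containing them."""
    return {
        protein: sum(peptide_counts.get(pep, 0) for pep in peptides)
        for protein, peptides in protein_peptides.items()
    }


def exclusive_counts(peptide_counts, protein_peptides):
    """Each peptide's spectra go only to the protein with the most observed peptides."""
    n_observed = {
        protein: sum(1 for pep in peptides if pep in peptide_counts)
        for protein, peptides in protein_peptides.items()
    }
    result = {protein: 0 for protein in protein_peptides}
    for pep, count in peptide_counts.items():
        candidates = sorted(p for p, peps in protein_peptides.items() if pep in peps)
        if candidates:
            # max() keeps the first of equal candidates, so ties go alphabetically.
            winner = max(candidates, key=lambda p: n_observed[p])
            result[winner] += count
    return result


# Two near-identical variants sharing PEP_A; PEP_D was not observed.
observed = {"PEP_A": 10, "PEP_B": 5, "PEP_C": 3}
proteins = {
    "variant_1": {"PEP_A", "PEP_B", "PEP_C"},
    "variant_2": {"PEP_A", "PEP_D"},
}
total = total_counts(observed, proteins)
exclusive = exclusive_counts(observed, proteins)
```

Under total counting, variant_2 is reported with 10 spectral counts and would be returned by a single-protein query. Under exclusive counting it drops to zero because variant_1 captures the shared peptide, which is the "robbing" effect described above.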

OPP Schema

An initial data description (schema) for the OPP was generated along with the OPP prototype using a Resource Description Framework (RDF) format as an extension from the BCO-DMO schema.34 This Ocean Protein Portal Data Type Schema (OPP-DT)35 defines the different observational entities (e.g., peptide spectral counts, protein spectral counts, FASTA sequence), the associated metadata entities (cruise, sampling date, depth of sample, etc.), and the basic relationships between these entities currently in the portal. Figure 5 illustrates the OPP-DT subclass Total Spectral Counts, the observational entities within this class, the associated metadata entities, the relationship requirements between these entities, and an example of where a specified metadata entity can be linked out to other scientific data catalogs. This database schema allows for the functioning of the OPP web application UI. Additionally, this schema facilitates the submission of data into the OPP and helps users of the OPP interact with the data through a clear understanding of the relationships between the data fields.

Figure 5.

Identifier relationships from the Ocean Protein Portal Total Spectral Count Data Type (OPP-DT).35 It illustrates the various relationship requirements between the three aggregate datatypes that comprise an OPP Total Spectral Count dataset.

Data Use Policy

The OPP is adopting data use policies similar to those of the GEOTRACES program, where correct attribution and citation is viewed as an important aspect of the data policy. Moreover, the 2017 Workshop participants for Best Practices in Data Sharing30 recommended that users interested in using metaproteomic datasets in publications contact data generators and consider discussing collaboration if using their metaproteomic data. This serves two important purposes. First, there is a danger that nonexpert users may misinterpret or misuse data, resulting in incorrect interpretations, given the youth of the metaproteomic datatype, especially when considering issues of cross-dataset comparisons and normalizations. Publication of interpretations made from incorrect data use could damage broader community confidence in the metaproteomic datatype. Second, attribution to, and collaboration with, the data generators will create a valuable incentive for data generators to share future datasets in the OPP’s data search and visualization environment, versus solely depositing data in raw spectra repositories, where the data will not be accessible to broader communities outside of proteomics. Hence, the data policy outlined here is useful to the sustainability of the OPP. We anticipate that the use of visualizations generated from the OPP in publications could become commonplace and, once the original datasets have been published, could occur with simple citation and/or permission of the data generators.

OPP Scoping Decisions

The OPP was scoped to allow it to be launched within a short time window, to avoid becoming obsolete by tying itself directly to specific proteomic informatic pipelines, and to be lightweight computationally and in terms of code maintenance to control upkeep costs for long-term sustainability. A key decision made thus far was for the OPP to accept processed protein and peptide data from depositors, rather than raw mass spectral data. The OPP does not conduct computationally expensive spectral-level reanalyses. These scoping decisions are also important in allowing the domain expert data generators to select and develop their preferred informatic pipelines. There are many upstream proteomic pipelines used by data generators that produce comparable results, including the peptide-to-spectrum mapping search engines Sequest, Comet, X!Tandem, Mascot, MSFragger, OMSSA, etc.; data independent acquisition (DIA) and targeted search tools including Skyline, DIA-Umpire, Scaffold-DIA, EncyclopeDIA, etc.; and multiple validating and integrating proteomic data systems such as Scaffold and the Trans-Proteomic Pipeline.36-48 The OPP aims to leverage these packages by accepting the processed data produced by whichever package the data generator utilizes. The OPP was designed to accommodate versioning of submissions and associated metadata to enable data producers to make improvements to their pipelines and update datasets through the OPP data management in collaboration with BCO-DMO. Raw spectra repositories are available through ProteomeXchange; datasets deposited to the OPP can be linked to these repositories, allowing expert users to conduct their own reanalyses if they choose to. Finally, the OPP is not a metagenomic or metatranscriptomic portal, given the large amount of resources previously dedicated to those datatypes described above, but it can connect with such portals through hyperlinks currently, and perhaps directly in the future using APIs.

Metrics to Date

The OPP is an online tool launched in 2019 and is in active use. Since its launch, the OPP has ingested and is serving eight large metaproteomic datasets from multiple data generator laboratories, and each dataset can have multiple stations and depths within it (Table 4). Data are from the Atlantic,10 Pacific,6 Arctic,17,18,49,50 and Antarctic (Ross Sea)14 Oceans, totaling 220 samples and containing 108 549 proteins and 1 581 602 peptides altogether. Note this is roughly equivalent to the number of samples within the well-known Tara Metagenome project.51 In parallel, the least common ancestor software METATRYP (Version 2) is operational as a standalone tool and is also connected to the OPP via an API; its database contains a total of 182 354 079 unique peptides from 142 genomes, 3 metagenomes, and 4782 specialized genome assembly products (MAGs and SAGs) to date. Use metrics from Google Analytics show over 1300 website use instances of the OPP to date by 700 unique users (Figure 6, left), and protein distribution patterns and visualizations from the OPP have appeared in publication.52

Table 4.

Ocean Metaproteomic Data Sets Currently within the Ocean Protein Portal

| dataset name | expedition number; location | filter size (μm) | samples | publication status |
|---|---|---|---|---|
| Metzyme 0.2 | KM1128; Central Pacific Ocean | 0.2–3.0 | 37 | Saito et al.6 |
| Metzyme 3.0 | KM1128; Central Pacific Ocean | 3.0–51 | 40 | in preparation |
| Nunn Arctic-Bering Sea | HLY1301; Arctic Ocean and Bering Sea | 0.003–0.8 | 2 | May et al.50; Mikan et al.57 |
| Morris CoFeMUG | KN192; South Atlantic Ocean | 0.03–3 | 16 | Morris et al.10 |
| Walsh Canada Basin | JOIS 2015; Arctic Ocean | 0.2–3.0 | 9 | in preparation |
| Walsh Baffin Bay | ArcticNet2013_CCSG_Amundsen; Arctic Ocean | 0.2–3.0 | 12 | in preparation |
| ProteOMZ | FK160115; Central Pacific Ocean | 0.2–3.0 | 103 | in preparation |
| Ross Sea Net Tow (Bender) | NBP0601; Ross Sea, Southern Ocean/Coastal Antarctica | >20 | 2 | Bender et al.14 |

Figure 6.

Left: Users and average session duration metrics for the Ocean Protein Portal to date, with unique users totaling ~700 since the launch in Spring of 2019. Right: Social media feedback from a graduate student at Oxford University, UK.

Sustainability

As with all data portals, the OPP faces challenges in operational sustainability and in the development of improvements to increase functionality. It was designed with sustainability in mind: by minimizing expensive real-time computing capabilities, leveraging open-access software, limiting the scope of datatypes accepted into the OPP, and not attempting to conduct real-time spectral analysis. The current funding model is to use grants for feature development, and “Broader Outreach” funding within core oceanography grants for operational costs (virtual machine, storage, data ingestion, code maintenance). Critical to this effort is streamlining ingestion through the data templates and ingestion pipeline described above, so that the data management conducted in collaboration with BCO-DMO remains sufficiently lightweight.

Educational Use

In addition to its use in research, we hope that the OPP will be a useful tool in education. The OPP can provide students a means to understand how the otherwise invisible molecules they learn about in biology and chemistry classes are deployed by life in the natural environment. For example, students can observe how enzymes involved in carbon fixation and photosynthesis are concentrated in the upper layers of the ocean where light penetrates. There has already been interest in the educational use of the OPP. For example, the portal is being used in undergraduate teaching and thesis research projects at Mount Allison University (Amanda Cockshutt, pers. comm.) and within graduate microbiology, marine bioinorganic chemistry, and marine microbial biogeochemistry courses. Finally, there is an active social media account that has helped to generate interest in and traffic to the OPP, as well as facilitate communication between users and the development team (Figure 6, right). Future curriculum development could help teachers and professors use the OPP.

Future Improvements

A number of future improvements are planned. The current sequence-based search capability of the OPP allows the user to interrogate the dataset independently of annotation information, and hence is useful in situations where the protein function is not yet known or well characterized, as is the case for many nutrient transporters. Currently, sequence search sends full-length sequences to the METATRYP API, which digests the sequence into predicted tryptic peptides, then searches them against the OPP peptide database. While this search avenue is operational, it often does not produce any search results because the OPP requires identical string matches of the query peptide against peptides in the OPP database for identification, and hence does not provide flexibility for sequence variability associated with natural biological diversity that users are accustomed to from standard sequence alignment tools (e.g., BLAST: Basic Local Alignment Search Tool53). In the future, we hope to incorporate a BLAST-like search of query sequences against peptides in the database, allowing for some sequence variability to exist between the user’s query sequence and the OPP database peptides.
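The digest-then-match behavior described above can be illustrated with a minimal sketch. It is not the METATRYP or OPP implementation: the cleavage rule (cut after K or R unless the next residue is P, with no missed cleavages), the `min_length` cutoff, the function names, and the toy sequences are all assumptions chosen for illustration.

```python
import re


def tryptic_digest(sequence, min_length=6):
    """Split a protein sequence into predicted tryptic peptides.

    Assumed rule: cleave after K or R unless followed by P,
    with no missed cleavages. min_length is a hypothetical cutoff.
    """
    seq = sequence.strip().upper()
    # Zero-width split points (requires Python 3.7+).
    pieces = re.split(r"(?<=[KR])(?!P)", seq)
    return [p for p in pieces if len(p) >= min_length]


def exact_matches(query_peptides, peptide_database):
    """Return query peptides that are identical to a database peptide."""
    return [p for p in query_peptides if p in peptide_database]


# Toy stand-in for the portal's peptide database (hypothetical sequences).
database = {"AAAGGGKPLLLVVVR", "TTTSSSEEEK"}

# A query identical to the source protein recovers both peptides.
reference = "AAAGGGKPLLLVVVRTTTSSSEEEK"
reference_hits = exact_matches(tryptic_digest(reference), database)

# A single V-to-I substitution removes the first peptide from the hits,
# because exact string matching tolerates no sequence variability.
variant = "AAAGGGKPLLLVIVRTTTSSSEEEK"
variant_hits = exact_matches(tryptic_digest(variant), database)

print(reference_hits)
print(variant_hits)
```

In this toy case one substitution costs one of two peptides. For a divergent homolog from a different organism, most or all predicted peptides can differ by at least one residue, which is consistent with the empty result sets noted above and with the motivation for a BLAST-like search.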

CONCLUSIONS

The Ocean Protein Portal was developed to facilitate research and education by allowing users to search for a protein of interest and examine its distribution in nature. Moreover, taxonomic assessment of the protein is enabled through the use of least common ancestor analysis. With growing interest in ocean health, the OPP will be a valuable resource in connecting a broad audience to ocean metaproteomic datasets, enabling a greater understanding of ocean biochemistry and how global and regional environmental change is influencing these critical environments.

ACKNOWLEDGMENTS

The development of the OPP was supported by an NSF EarthCube grant: “Laying the Foundation for an Ocean Protein Portal” (NSF 1639714), and as part of the broader impacts of NSF-OCE grants 1850719 and 1658030 and NIH grant GM135709-01A1. The underlying METATRYP peptide taxonomic software was developed under a grant from the Gordon and Betty Moore Foundation Marine Microbiology Initiative program (GBMF #8453). J.K.S. was supported by a NASA Postdoctoral Fellowship. The OPP team is a collaboration between the Saito laboratory, the Information Services Application group, and the Biological and Chemical Oceanography Data Management Office, all at the Woods Hole Oceanographic Institution. Consulting services were provided by the RPS group. The efforts of the participants of the Data Sharing Workshop for Ocean Metaproteomics (May 2017) were also instrumental in developing best practices for ocean metaproteomics data sharing. We thank two anonymous reviewers for useful suggestions. Finally, we are indebted to current (Table 4) and future data contributors, without whom the Ocean Protein Portal would not exist.

Footnotes

The authors declare no competing financial interest.

Contributor Information

Mak A. Saito, Woods Hole Oceanographic Institution, Falmouth, Massachusetts 02543, United States.

Jaclyn K. Saunders, Woods Hole Oceanographic Institution, Falmouth, Massachusetts 02543, United States.

Michael Chagnon, RPS Group, South Kingstown, Rhode Island 02879, United States; Kaimika Technology, Cumberland, Rhode Island 02864, United States.

David A. Gaylord, Woods Hole Oceanographic Institution, Falmouth, Massachusetts 02543, United States.

Adam Shepherd, Woods Hole Oceanographic Institution, Falmouth, Massachusetts 02543, United States.

Noelle A. Held, Woods Hole Oceanographic Institution, Falmouth, Massachusetts 02543, United States.

Christopher Dupont, Woods Hole Oceanographic Institution, Falmouth, Massachusetts 02543, United States.

Nicholas Symmonds, Woods Hole Oceanographic Institution, Falmouth, Massachusetts 02543, United States.

Amber York, Woods Hole Oceanographic Institution, Falmouth, Massachusetts 02543, United States.

Matthew Charron, Kaimika Technology, Cumberland, Rhode Island 02864, United States.

Danie B. Kinkade, Woods Hole Oceanographic Institution, Falmouth, Massachusetts 02543, United States.

REFERENCES

  • (1) Gebbie G; Huybers P The little ice age and 20th-century deep Pacific cooling. Science 2019, 363, 70–74.
  • (2) Steinberg DK; Carlson CA; Bates NR; Johnson RJ; Michaels AF; Knap AH Overview of the US JGOFS Bermuda Atlantic Time-series Study (BATS): a decade-scale look at ocean biology and biogeochemistry. Deep Sea Res., Part II 2001, 48, 1405–1447.
  • (3) Church MJ; Lomas MW; Muller-Karger F Sea change: Charting the course for biogeochemical ocean time-series research in a new millennium. Deep Sea Res., Part II 2013, 93, 2–15.
  • (4) Rusch DB; Halpern AL; Sutton G; Heidelberg KB; Williamson S; Yooseph S; Wu D; Eisen JA; Hoffman JM; Remington K; Beeson K; Tran B; Smith H; Baden-Tillson H; Stewart C; Thorpe J; Freeman J; Andrews-Pfannkoch C; Venter JE; Li K; Kravitz S; Heidelberg JF; Utterback T; Rogers Y-H; Falcón LI; Souza V; Bonilla-Rosso G; Eguiarte LE; Karl DM; Sathyendranath S; Platt T; Bermingham E; Gallardo V; Tamayo-Castillo G; Ferrari MR; Strausberg RL; Nealson K; Friedman R; Frazier M; Venter JC The Sorcerer II Global Ocean Sampling Expedition: Northwest Atlantic through Eastern Tropical Pacific. PLoS Biol. 2007, 5, e77.
  • (5) Saito MA; Dorsk A; Post AF; McIlvin MR; Rappé MS; DiTullio GR; Moran DM Needles in the blue sea: Sub-species specificity in targeted protein biomarker analyses within the vast oceanic microbial metaproteome. Proteomics 2015, 15, 3521–3531.
  • (6) Saito MA; McIlvin MR; Moran DM; Goepfert TJ; DiTullio GR; Post AF; Lamborg CH Multiple nutrient stresses at intersecting Pacific Ocean biomes detected by protein biomarkers. Science 2014, 345, 1173–1177.
  • (7) Soule MCK; Longnecker K; Johnson WM; Kujawinski EB Environmental metabolomics: Analytical strategies. Mar. Chem 2015, 177, 374–387.
  • (8) Falkowski PG; Fenchel T; Delong EF The Microbial Engines That Drive Earth’s Biogeochemical Cycles. Science 2008, 320, 1034–1039.
  • (9) Sowell SM; Wilhelm LJ; Norbeck AD; Lipton MS; Nicora CD; Barofsky DF; Carlson CA; Smith RD; Giovannoni SJ Transport functions dominate the SAR11 metaproteome at low-nutrient extremes in the Sargasso Sea. ISME J. 2009, 3, 93–105.
  • (10) Morris RM; Nunn BL; Frazar C; Goodlett DR; Ting YS; Rocap G Comparative metaproteomics reveals ocean-scale shifts in microbial nutrient utilization and energy transduction. ISME J. 2010, 4, 673–685.
  • (11) Bertrand EM; Moran DM; McIlvin MR; Hoffman JM; Allen AE; Saito MA Methionine synthase interreplacement in diatom cultures and communities: Implications for the persistence of B12 use by eukaryotic phytoplankton. Limnol. Oceanogr. 2013, 58, 1431–1450.
  • (12) Bridoux MC; Neibauer J; Ingalls AE; Nunn BL; Keil RG Suspended marine particulate proteins in coastal and oligotrophic waters. J. Mar. Syst 2015, 143, 39–48.
  • (13) Colatriano D; Ramachandran A; Yergeau E; Maranger R; Gélinas Y; Walsh DA Metaproteomics of aquatic microbial communities in a deep and stratified estuary. Proteomics 2015, 15, 3566–3579.
  • (14) Bender SJ; Moran DM; McIlvin MR; Zheng H; McCrow JP; Badger J; DiTullio GR; Allen AE; Saito MA Colony formation in Phaeocystis antarctica: connecting molecular mechanisms with iron biogeochemistry. Biogeosciences 2018, 15, 4923–4942.
  • (15) Bergauer K; Fernandez-Guerra A; Garcia JA; Sprenger RR; Stepanauskas R; Pachiadaki MG; Jensen ON; Herndl GJ Organic matter processing by microbial communities throughout the Atlantic water column as revealed by metaproteomics. Proc. Natl. Acad. Sci. U.S.A 2018, 115, E400–E408.
  • (16) Held N; Saunders J; Futrelle J; Saito M In Harnessing the Power of Scientific Python to Investigate Biogeochemistry and Metaproteomes of the Central Pacific Ocean. Proceedings of the Python in Science Conference, 2018.
  • (17) Moore EK; Nunn BL; Goodlett DR; Harvey HR Identifying and tracking proteins through the marine water column: Insights into the inputs and preservation mechanisms of protein in sediments. Geochim. Cosmochim. Acta 2012, 83, 324–359.
  • (18) Moore EK; Harvey HR; Faux JF; Goodlett DR; Nunn BL Protein recycling in Bering Sea algal incubations. Mar. Ecol.: Prog. Ser 2014, 515, 45–59.
  • (19) Hawley AK; Brewer HM; Norbeck AD; Pasa-Tolic L; Hallam SJ Metaproteomics reveals differential modes of metabolic coupling among ubiquitous oxygen minimum zone microbes. Proc. Natl. Acad. Sci. U.S.A 2014, 111, 11395–11400.
  • (20) Saito MA; McIlvin MR; Moran DM; Santoro AE; Dupont CL; Rafter PA; Saunders JK; Kaul D; Lamborg CG; Westley M; Valois F; Waterbury JB Abundant nitrite-oxidizing metalloenzymes in the mesopelagic zone of the tropical Pacific Ocean. Nat. Geosci 2020, 13, 355–362.
  • (21) Sun S; Chen J; Li W; Altintas I; Lin A; Peltier S; Stocks K; Allen EE; Ellisman M; Grethe J; Wooley J Community cyberinfrastructure for Advanced Microbial Ecology Research and Analysis: the CAMERA resource. Nucleic Acids Res. 2011, 39, D546–D551.
  • (22) Seshadri R; Kravitz SA; Smarr L; Gilna P; Frazier M CAMERA: a community resource for metagenomics. PLoS Biol. 2007, 5, No. e75.
  • (23) Markowitz VM; Ivanova NN; Szeto E; Palaniappan K; Chu K; Dalevi D; Chen I-MA; Grechkin Y; Dubchak I; Anderson I IMG/M: a data management and analysis system for metagenomes. Nucleic Acids Res. 2007, 36, D534–D538.
  • (24) Carradec Q; Pelletier E; Da Silva C; Alberti A; Seeleuthner Y; Blanc-Mathieu R; Lima-Mendez G; Rocha F; Tirichine L; Labadie K A global ocean atlas of eukaryotic genes. Nat. Commun 2018, 9, No. 373.
  • (25) Villar E; Vannier T; Vernette C; Lescot M; Cuenca M; Alexandre A; Bachelerie P; Rosnet T; Pelletier E; Sunagawa S; Hingamp P The Ocean Gene Atlas: exploring the biogeography of plankton genes online. Nucleic Acids Res. 2018, 46, W289–W295.
  • (26) Youens-Clark K; Bomhoff M; Ponsero AJ; Wood-Charlson EM; Lynch J; Choi I; Hartman JH; Hurwitz BL iMicrobe: Tools and data-driven discovery platform for the microbiome sciences. GigaScience 2019, 8, No. giz083.
  • (27) Bokeh Development Team. Bokeh: Python library for interactive visualization. 2020, https://bokeh.org.
  • (28) Hunter JD Matplotlib: A 2D Graphics Environment. Comput. Sci. Eng 2007, 9 (3), 90–95.
  • (29) Saunders JK; Gaylord DA; Held NA; Symmonds N; Dupont CL; Shepherd A; Kinkade D; Saito MA METATRYP v 2.0: Metaproteomic Least Common Ancestor Analysis for Taxonomic Inference Using Specialized Sequence Assemblies-Standalone Software and Web Servers for Marine Microorganisms and Coronaviruses. J. Proteome Res 2020, DOI: 10.1021/acs.jproteome.0c00385.
  • (30) Saito MA; Bertrand EM; Duffy ME; Gaylord DA; Held NA; Hervey WJ IV; Hettich RL; Jagtap PD; Janech MG; Kinkade DB; et al. Progress and Challenges in Ocean Metaproteomics and Proposed Best Practices for Data Sharing. J. Proteome Res 2019, 18, 1461–1476.
  • (31) Braisted JC; Kuntumalla S; Vogel C; Marcotte EM; Rodrigues AR; Wang R; Huang S-T; Ferlanti ES; Saeed AI; Fleischmann RD; et al. The APEX Quantitative Proteomics Tool: generating protein quantitation estimates from LC-MS/MS proteomics results. BMC Bioinf. 2008, 9, No. 529.
  • (32) Lu P; Vogel C; Wang R; Yao X; Marcotte EM Absolute protein expression profiling estimates the relative contributions of transcriptional and translational regulation. Nat. Biotechnol 2007, 25, 117–124.
  • (33) McIlwain S; Mathews M; Bereman MS; Rubel EW; MacCoss MJ; Noble WS Estimating relative abundances of proteins from shotgun proteomics data. BMC Bioinf. 2012, 13, 308.
  • (34) Shepherd A BCO-DMO Ocean Data Ontology. Zenodo, 2019.
  • (35) Shepherd A; Saito M; Saunders J; Held N; Kinkade D Ocean Protein Portal - Data Type Ontology. Zenodo, 2019.
  • (36) Azencott R; Hawke DH; Kong A Improvement of OMSSA for High Accuracy MS/MS Data. J. Biomol. Tech 2014, 25, No. S32.
  • (37) Craig R; Beavis RC A method for reducing the time required to match protein sequences with tandem mass spectra. Rapid Commun. Mass Spectrom 2003, 17, 2310–2316.
  • (38) Deutsch EW; Mendoza L; Shteynberg D; Slagel J; Sun Z; Moritz RL Trans-Proteomic Pipeline, a standardized data processing pipeline for large-scale reproducible proteomics informatics. Proteomics: Clin. Appl 2015, 9, 745–754.
  • (39) Egertson JD; MacLean B; Johnson R; Xuan Y; MacCoss MJ Multiplexed peptide analysis using data-independent acquisition and Skyline. Nat. Protoc 2015, 10, 887.
  • (40) Eng JK; Jahan TA; Hoopmann MR Comet: an open-source MS/MS sequence database search tool. Proteomics 2013, 13, 22–24.
  • (41) Eng JK; McCormack AL; Yates JR An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom 1994, 5, 976–989.
  • (42) Gillet LC; Navarro P; Tate S; Rüst H; Selevsek N; Reiter L; Bonner R; Aebersold R Targeted data extraction of the MS/MS spectra generated by data-independent acquisition: a new concept for consistent and accurate proteome analysis. Mol. Cell. Proteomics 2012, 11, No. O111.016717.
  • (43) Kong AT; Leprevost FV; Avtonomov DM; Mellacheruvu D; Nesvizhskii AI MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry–based proteomics. Nat. Methods 2017, 14, 513.
  • (44) Perkins DN; Pappin DJ; Creasy DM; Cottrell JS Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20, 3551–3567.
  • (45) Pedrioli PG In Trans-Proteomic Pipeline: A Pipeline for Proteomic Analysis. Proteome Bioinformatics; Springer, 2010; pp 213–238.
  • (46) Pino LK; Searle BC; Bollinger JG; Nunn B; MacLean B; MacCoss MJ The Skyline ecosystem: Informatics for quantitative mass spectrometry proteomics. Mass Spectrom. Rev 2020, 39, 229–244.
  • (47) Searle BC Scaffold: a bioinformatic tool for validating MS/MS-based proteomic studies. Proteomics 2010, 10, 1265–1269.
  • (48) Searle BC; Pino LK; Egertson JD; Ting YS; Lawrence RT; MacLean BX; Villén J; MacCoss MJ Chromatogram libraries improve peptide detection and quantification by data independent acquisition mass spectrometry. Nat. Commun 2018, 9, 1–12.
  • (49) Mikan MP; Harvey HR; Timmins-Schiffman E; Riffle M; May DH; Salter I; Noble WS; Nunn BL Metaproteomics reveal that rapid perturbations in organic matter prioritize functional restructuring over taxonomy in western Arctic Ocean microbiomes. ISME J. 2020, 14, 39–52.
  • (50) May DH; Timmins-Schiffman E; Mikan MP; Harvey HR; Borenstein E; Nunn BL; Noble WS An Alignment-Free “Metapeptide” Strategy for Metaproteomic Characterization of Microbiome Samples Using Shotgun Metagenomic Sequencing. J. Proteome Res 2016, 15, 2697–2705.
  • (51) Pesant S; Not F; Picheral M; Kandels-Lewis S; Le Bescot N; Gorsky G; Iudicone D; Karsenti E; Speich S; Troublé R Open science resources for the discovery and analysis of Tara Oceans data. Sci. Data 2015, 2, 1–16.
  • (52) Mazzotta MG; McIlvin MR; Saito MA Characterization of the Fe metalloproteome of a ubiquitous marine heterotroph, Pseudoalteromonas (BB2-AT2): multiple bacterioferritin copies enable significant Fe storage. Metallomics 2020, 12, 654–667.
  • (53) Altschul SF; Madden TL; Schaffer AA; Zhang J; Zhang Z; Miller W; Lipman DJ Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25, 3389–3402.
  • (54) Liu X; Zhang Q; Murata K; Baker ML; Sullivan MB; Fu C; Dougherty MT; Schmid MF; Osburne MS; Chisholm SW Structural changes in a marine podovirus associated with release of its genome into Prochlorococcus. Nat. Struct. Mol. Biol 2010, 17, 830–836.
  • (55) Shultis DD; Purdy MD; Banchs CN; Wiener MC Outer membrane active transport: structure of the BtuB: TonB complex. Science 2006, 312, 1396–1399.
  • (56) Klein MG; Zwart P; Bagby SC; Cai F; Chisholm SW; Heinhorst S; Cannon GC; Kerfeld CA Identification and structural analysis of a novel carboxysome shell protein with implications for metabolite transport. J. Mol. Biol 2009, 392, 319–333.
  • (57) Mikan MP; Harvey HR; Timmins-Schiffman E; Riffle M; May DH; Salter I; Noble WS; Nunn BL Metaproteomics reveal that rapid perturbations in organic matter prioritize functional restructuring over taxonomy in western Arctic Ocean microbiomes. ISME J. 2020, 14 (1), 39–52.
