Author manuscript; available in PMC: 2023 Mar 28.
Published in final edited form as: Nat Methods. 2023 Mar;20(3):339–346. doi: 10.1038/s41592-023-01802-5

Sampling the proteome by emerging single-molecule and mass spectrometry methods

Michael J MacCoss 1,#,*, Javier Antonio Alfaro 2,3,4,#,*, Danielle A Faivre 1, Christine C Wu 1, Meni Wanunu 5, Nikolai Slavov 6,7,#,*
PMCID: PMC10044470  NIHMSID: NIHMS1883706  PMID: 36899164

Summary

Mammalian cells have about 30,000-fold more protein molecules than mRNA molecules, which has major implications in the development of proteomics technologies. We review strategies that have been helpful for counting billions of protein molecules by liquid chromatography-tandem mass spectrometry (LC-MS/MS) and suggest that these strategies can benefit single-molecule methods, especially in mitigating the challenges of the wide dynamic range of the proteome.

Introduction

The ubiquitous roles of proteins in biomedicine are well appreciated and have motivated technologies seeking to advance the sensitivity and throughput of quantitative protein analysis. While proteomic technologies may use different approaches, they face similar challenges, such as quantifying proteins of vastly different abundances, some present in only a few copies and some present in tens of millions of copies per typical mammalian cell. This wide dynamic range poses a substantial challenge for investigating proteome biology.

Mass spectrometry (MS) has powered proteomics from the first demonstration of peptide sequencing using MS in the 1970s1,2. Since then, milestones in MS-based proteomics have included de novo sequencing entire proteins in the late 1980s3,4, soft ionization by electrospray5, automated spectral interpretation6, multiplexing the acquisition of spectra on different peptides using data independent acquisition7, multiplexing the acquisition of different samples using tandem mass tags8, and quantifying thousands of proteins in single human cells9,10. Together, the steady growth in the rate of protein identification using MS has been reminiscent of Moore's law, resulting in about 1,250-fold higher throughput: from about 20 protein data points per hour in 200111 to about 25,000 protein data points per hour achieved by plexDIA12. This increased throughput has been critical for addressing challenges in biomedical research13. It also highlights the power of experimental strategies and technological progress to tackle the immense demands of proteomics in terms of quantity and dynamic range that is required for thorough analysis, given the large number of proteins of widely varying concentrations in a cell.

More recently, non-MS methods have made exciting steps towards identifying and potentially sequencing single polypeptide molecules14-16. Conceptually, these methods aim to adapt flow-cell and nanopore methods developed for nucleic acid analysis for protein analysis. Flow-cell based methods include highly parallel single-molecule N-terminal peptide sequencing methods based on either Edman degradation17 or amino peptidases18. Another approach aims to use degenerate affinity reagents to recognize individual protein molecules separated spatially in a flow cell19,20. Other groups are working to adapt nanopore sequencing to peptides and proteins21,22. Most of these methods aim to detect a subset of the amino acids within a polypeptide sequence, which provides a fingerprint, or a constraint, on choosing a sequence among the known protein coding gene products from the genome. While these methods have yet to be applied to biologically derived protein mixtures, they have generated significant enthusiasm within the scientific community as a complement to MS analysis14.

These developments have motivated renewed interest and investment in advancing proteomics technologies, as reflected in private funding14 and in recent National Human Genome Research Institute (NHGRI) funding opportunities aimed at accelerating the development of technologies for single-molecule sequencing and single-cell proteome analysis. Because there is excitement for emerging single-molecule counting methods for proteomics, we felt it was timely to compare them to strategies used by the current state-of-the-art proteomics methods based on liquid chromatography-tandem mass spectrometry (LC-MS/MS). We hope our opinion will provide benchmarks and directions for the technological breakthroughs that need to be achieved for single molecule protein/peptide counting to achieve parity and complement the capabilities of LC-MS/MS based proteomics methods. How many molecules need to be counted? How extreme is the dynamic range problem? Do current solutions for handling the dynamic range problem limit the sequence coverage of the proteome? What will new technologies need to accomplish to reach parity with LC-MS/MS, and how will these technologies complement one another? What can these emerging technologies learn from LC-MS/MS based proteomics? These are the questions we aim to address.

How many molecules need to be counted?

Many of the challenges for accurate and sensitive protein quantification are shared by all proteomics methods, such as quantification over a wide dynamic range. Indeed, a typical mammalian cell contains billions of protein molecules but less than half a million RNA molecules23, Figure 1a. Some proteins are present at hundreds of copies per cell while others (e.g., histones) are present at tens of millions of copies per cell, resulting in about 10^6 dynamic range24. The range of protein abundances is even larger for body fluids, such as plasma, where protein abundances may differ by 10^10, e.g., between albumin and IL-625. This presents a fundamental challenge because the presence of abundant proteins makes it rare to count molecules from low-abundance proteins, such as having to count billions of albumin molecules before having a chance to detect a single IL-6 molecule. This means that the single-molecule approaches that have been used successfully to quantify the transcriptome, which spans about 10^3 dynamic range, face major challenges in scaling to quantify the proteome22.
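The oversampling burden described above can be made concrete with a small calculation. The sketch below is illustrative only: it assumes unbiased, independent sampling of molecules (an idealization no real method achieves), and the function name and the 95% confidence threshold are our own choices, not values from the cited work.

```python
import math

def reads_for_detection(abundance_ratio: float, confidence: float = 0.95) -> float:
    """Reads needed to see at least one molecule of a rare species.

    abundance_ratio: copies of the abundant species per copy of the rare one
    (e.g., 1e10 for albumin vs. IL-6 in plasma, 1e3 for the transcriptome).
    Assumes each read is an independent random draw from the mixture.
    """
    p_rare = 1.0 / (1.0 + abundance_ratio)  # chance that a single read is the rare species
    # P(at least one rare read in n reads) = 1 - (1 - p)^n >= confidence
    return math.log(1.0 - confidence) / math.log1p(-p_rare)

for label, ratio in [("transcriptome, ~10^3", 1e3),
                     ("cellular proteome, ~10^6", 1e6),
                     ("plasma proteome, ~10^10", 1e10)]:
    print(f"{label}: ~{reads_for_detection(ratio):.1e} reads")
```

Under these assumptions the required number of reads grows in proportion to the dynamic range, so a 10^10 range calls for tens of billions of reads where a 10^3 range calls for a few thousand.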

Figure 1 ∣. Overview of RNA and protein statistics.


a, A representative human cell, such as a fibroblast, has billions of protein molecules compared to merely hundreds of thousands of RNA molecules26. Accordingly, MS analysis samples more protein molecules per sample than the RNA molecules sampled by RNA-seq. b, Estimated cost per molecule for MS and RNA-seq. The single-cell estimates are based on published numbers for unique molecular identifiers per single cell analyzed by Smart-seq367 and number of protein molecules counted by plexDIA12.

A typical mammalian cell, e.g., a HeLa cell with a volume of ~3,000 μm^3, contains about 300,000 mRNA molecules26 and about 10,000,000,000 protein molecules23, Figure 1a. The cell is a crowded mesh of proteins, with a typical density of 3 million protein molecules per cubic micrometer. Even a yeast cell with a volume of ~30 μm^3 contains ~100 million protein molecules. This protein density estimate has been supported independently using molecular measurements based on MS, as well as fluorescence microscopy using green fluorescent protein. Given these different independent measurements, it is estimated that the typical HeLa cell contains at least ~3–5 billion proteins per cell, and others like macrophages (5,000 μm^3) and cardiomyocytes (15,000 μm^3) will contain substantially more. Because of this range in volume, we used ~10 billion proteins per cell in our calculations.

Given these estimates of the relative abundance ratio of mRNA to protein molecules, we calculate that about 30,000-fold more counts are required to characterize the protein molecules at an analogous coverage of what has been achieved with the transcriptome, Figure 1. Given the potential need to count a large number of protein molecules, we next explore the feasibility of achieving the required scale at affordable cost using estimates for cost per molecule. This factor is important, but it must be considered in the context of many other factors, such as the ability to sample large numbers of diverse sequences and to multiplex efficiently.
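The ratio quoted above follows directly from the per-cell estimates in the text. The short sketch below reproduces the arithmetic; the constant and function names are ours, and the numbers are the round estimates used in this article rather than measured values.

```python
MRNA_PER_CELL = 3e5          # ~300,000 mRNA molecules per HeLa cell (from the text)
PROTEINS_PER_CELL = 1e10     # ~10 billion protein molecules per cell (from the text)
PROTEIN_DENSITY_PER_UM3 = 3e6  # ~3 million protein molecules per cubic micrometer

def proteins_from_volume(volume_um3: float,
                         density_per_um3: float = PROTEIN_DENSITY_PER_UM3) -> float:
    """Estimate protein copies in a cell from its volume and a typical density."""
    return volume_um3 * density_per_um3

def fold_more_counts(proteins: float = PROTEINS_PER_CELL,
                     mrnas: float = MRNA_PER_CELL) -> float:
    """How many more molecules must be counted for proteins than for mRNA."""
    return proteins / mrnas

print(f"HeLa (~3,000 um^3): ~{proteins_from_volume(3000):.1e} proteins")
print(f"Yeast (~30 um^3):   ~{proteins_from_volume(30):.1e} proteins")
print(f"Protein/mRNA ratio: ~{fold_more_counts():,.0f}-fold")
```

The density-based estimate for a HeLa cell (about 9 billion) is close to the ~10 billion figure used in the calculations, and the protein-to-mRNA ratio comes out at roughly 33,000, which is rounded to 30,000 in the text.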

How much do single molecule counting methods cost?

While single-molecule protein counting approaches are yet to report the analysis of complex protein mixtures, we believe that with time and resources the efforts of reading peptide sequences in a spatially parallelized format will be successful14. Without knowing what the capabilities and limitations are for these emerging protein and peptide sequencing methods, we make the optimistic assumption that these methods will be able to achieve sequencing counts of polypeptides on par with what state-of-the-art Illumina sequencing can achieve currently with oligonucleotides. Thus, we use single-molecule RNA sequencing by Illumina as a proxy to represent single-molecule protein counting approaches, Fig. 1b. To estimate the cost for current advanced technologies, we use the estimate of 10,000 USD for sequencing 4 billion reads by Illumina NovaSeq over ~2 days and 500 USD for performing a 2-hour quantitative LC-MS/MS analysis. These costs were chosen as conservative estimates based on inquiries from several academic core facilities, and the rates include personnel, sample prep, and basic computational analysis as part of the service. While academic research laboratories may achieve lower costs, these prices represent objective estimates for widely accessible services. Fig. 1b shows that the cost per protein molecule analyzed by LC-MS/MS is lower than the cost per DNA molecule sequenced by Illumina. This indicates that single-molecule DNA sequencing has not yet achieved a cost that would enable counting of sufficient numbers of molecules to achieve affordable and comprehensive quantification of mammalian proteomes.

Counting ions by LC-MS/MS

Traditionally, the MS proteomics field reports lists of peptides detected and the proteins they are derived from. As peptides elute off the HPLC column, the instrument counts large numbers of peptide ions based on their mass-to-charge (m/z) ratio, independently of their sequence identification (Figure 2a). The abundance of each analyte is often determined from a background-subtracted peak area of the extracted ion chromatogram(s). Depending on the method used, the peak area can be obtained from the unfragmented MS1 spectra or from tandem mass spectra (MS/MS or MS2) collected using methods like data independent acquisition. The peak area is derived from the detector ion current, either from the flow of ions to an electron multiplier27 or the generation of an image current in a Fourier transform mass analyzer28. The current is a measure of the number of ions (charged molecules) counted, normalized by the amount of time spent sampling the signal. The measured signal is proportional to ions/second, and thus it can be converted into a number of counted ions and used for direct comparison with single molecule counting methods9,29,30.

Figure 2 ∣. A liquid chromatography-mass spectrometry experiment can count billions of peptide ions within 90 min.


Signal from the MS1 spectra of an LC-MS run of enriched extracellular vesicles from human plasma using data independent acquisition on a ThermoFisher Eclipse Tribrid. a, A 2-dimensional ion map of the MS1 peptide signal separated in both retention time (RT) and m/z dimensions. The red dashed line indicates the location of the spectrum in (b). b, Selection of a single MS1 spectrum collected at 57.67 min where blue m/z values have been assigned a peptide sequence and red m/z values are unassigned in the analysis. c, A total ion current (TIC) plot of the signal intensity from (a) at all time points. The TIC signal is plotted in black, and the blue represents the fraction of the MS1 signal (e.g., in b) that has been confidently assigned to peptide sequences. The y-axis represents an approximation of counts (ions per second). The insert is a histogram counting distinct molecular entities (features) for different measured intensities. The gray bars of the insert represent all molecular features and the blue represents those assigned a peptide label. The data were only analyzed for unmodified and fully tryptic peptides in the canonical human fasta. d, The same data plotted in (c) but with the y-axis of each spectrum adjusted to an estimate of ions by multiplying the counts by the Orbitrap fill time. The variable fill times allow peptides with relatively low abundance near 20-30 min to be measured with a similar number of ions as the most abundant peptides in the analysis. The result is billions of peptide ions counted within just 90 minutes. Data available at: https://panoramaweb.org/Single_Molecule_Counting.url under PXD035637. Code available at: https://github.com/uw-maccosslab/single_molecule_counting.

LC-MS/MS methods can improve the sensitivity to low-abundance analytes by changing the time spent sampling the signal (aka dwell time, integration time, or injection time). In some MS instruments, such as ion traps, the time spent sampling ions changes dynamically depending on the signal at that time31. This dynamic adjustment of the injection time, known as automatic gain control (AGC), provides an ideal ion population for the MS measurement (Figure 2b). An added benefit of AGC is that it enables the instrument to spend less time on abundant molecular species but scale the current into a larger quantity while maintaining quantitative linearity. Likewise, it enables spending more time on less abundant peptides to enable the measurement of the weaker signal. This increases the dynamic range and the total number of ions identified (Figure 2c). Dividing each spectrum intensity by the time taken to acquire the spectrum gives a normalized signal for each spectrum that is analogous to normalizing the counts obtained between flow cells in a single-molecule counting experiment32.
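The relationship between ion flux, injection time, and counted ions can be sketched in a few lines. This is a simplified illustration of the AGC idea, not the control logic of any instrument: the function names, the ion target of 10^6, and the 50 ms maximum injection time are hypothetical values chosen for the example.

```python
def agc_injection_time_ms(flux_ions_per_s: float,
                          target_ions: float = 1e6,        # hypothetical AGC target
                          max_injection_ms: float = 50.0   # hypothetical time cap
                          ) -> float:
    """Injection time needed to collect the target ion population, capped at a maximum."""
    needed_ms = target_ions / flux_ions_per_s * 1000.0
    return min(needed_ms, max_injection_ms)

def ions_counted(flux_ions_per_s: float, injection_time_ms: float) -> float:
    """Ions accumulated in one spectrum: flux (ions/s) times sampling time (s)."""
    return flux_ions_per_s * injection_time_ms / 1000.0

def normalized_signal(ions: float, injection_time_ms: float) -> float:
    """Divide by acquisition time so spectra with different fill times are comparable."""
    return ions / (injection_time_ms / 1000.0)

for flux in (1e9, 1e8, 1e6):  # strong, medium, and weak ion beams (ions per second)
    t_ms = agc_injection_time_ms(flux)
    n = ions_counted(flux, t_ms)
    print(f"flux={flux:.0e}/s  fill={t_ms:6.2f} ms  ions={n:.1e}  "
          f"normalized={normalized_signal(n, t_ms):.1e}/s")
```

In this toy model, a 1,000-fold difference in ion flux translates into only a 20-fold difference in ions per spectrum, while the time-normalized signal still recovers the original flux.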

LC-MS has a much greater dynamic range than would be expected from simply counting the billions of ions and assigning the counts to peptides. This increase in dynamic range arises because LC-MS first chromatographically separates peptides based on their physical properties so that peptides of the same sequence are measured together (Figure 3). This strategy of counting the same peptide sequences together to provide a quantity is effectively a compression scheme for counting molecules. Additionally, using gas-phase methods, MS can further improve the dynamic range by measuring the m/z of all peptides and fragments with the same values together. Thus, the effect of highly abundant peptides on the measurement of low-abundance peptides is minimized because they are measured separately and, in some experiment types, in separate trap fills (i.e., analogous to measuring abundant transcripts in different flow cells from low-abundance transcripts). The mass spectrometry community has capitalized on this strategy to improve the detection and precision of low-abundance molecules7,33-35 in the presence of analytes with much greater abundance. Because the timescale of this measurement is fast (sub-second), MS can analyze such compressed groups of ions (~10 to 1x10^6 ion copies at a time) tens of thousands of times per hour.

Figure 3 ∣. Fractionation prior to counting molecules improves the dynamic range in proteomics.


a, The dynamic range problem of the human proteome is far more extreme than that of transcriptomics. The enormous dynamic range of peptide abundances requires massive oversampling of the most abundant peptide (solid blue) to obtain counts for the least abundant peptide (vertical stripe red). b, LC-MS separates peptides biochemically, ionizes them, and samples the peptides at different times and with different spectra. While a mass spectrometer works in the gas phase, it is analogous to separating peptides/proteins prior to counting and then applying normalization to make the quantities comparable between spectra or flow cells. This strategy significantly improves the counting statistics of low-abundance molecules in the presence of high-abundance molecules. In ion trap mass spectrometry, the normalization approach to optimize ions in each spectrum and adjust the signal by the variable fill time is known as automatic gain control (AGC).

For example, a 90 min LC-MS/MS analysis of peptides in plasma frequently measures 3x10^9 ions from just the unfragmented MS1 signal. Yet, this frequently represents only peptides from ~350-450 proteins because the dynamic range of the plasma proteome is notoriously large36. Thus, if plasma is analyzed using a single flow cell with 1 million single molecule “reads”, ~950,000 of those reads would be of the 12 most abundant proteins,25 leaving only the remaining 50,000 reads (5%) to quantify the rest of the proteins in the sample.

The dynamic range of plasma can be mitigated by depleting the most abundant proteins by immunoaffinity subtraction chromatography37. Such chromatography frequently removes 14 of the most abundant proteins in human plasma (e.g., albumin, IgG, antitrypsin, IgA, transferrin, haptoglobin, fibrinogen, alpha2-macroglobulin, alpha1-acid glycoprotein, IgM, apolipoprotein A1, apolipoprotein A2, complement C3 and transthyretin). Depletion increases the number of detected proteins, but unfortunately these affinity columns are species specific and thus are largely limited to use with human samples. These columns also capture the entire complex and binding proteins of the target antigens, removing unintended proteins. For example, patients with cancer make autoantibodies to known cancer biomarkers38 (e.g., thyroglobulin, MUC16 (CA125), and PSA), which complicates their analysis using immunoaffinity methods39, and depletion of IgGs can remove these biomarkers. Depletion of apolipoprotein A1 will also deplete HDL particles40, a promising plasma subproteome for the diagnosis of coronary artery disease41. Such unintentional depletions contribute to biases and complicate the interpretation of the proteomic results.

Figure 2 illustrates the analysis of an extracellular vesicle (EV) fraction enriched from plasma, digested using trypsin and measured by data independent acquisition with an Orbitrap Eclipse. This sample has a reduced dynamic range compared to the whole plasma proteome, making it an interesting avenue for biomarker discovery. The plasma vesicle fraction represents about 1-2% of the plasma proteome, is enriched in tissue-derived proteins, and is depleted in abundant plasma proteins. The total ion current (ions per second) from just the MS1 signal was >10^12, of which 46.4% could be assigned to a peptide sequence using the fragment ion data. This current represented >5 billion ions, of which 1.2 billion ions (24.1%) were assigned to peptide sequences, not counting the ions measured in the MS/MS spectra. To perform similarly, single-molecule methods like Illumina would analogously need to collect billions of reads from a mixture biochemically separated into 1000s of individual samples (~1 million reads per sample; see Figure 3b). The signal would be normalized between flow cells to achieve counts that are comparable between flow cells, with ~24% of the reads being mapped back to the reference genome. This plasma EV analysis was not sample limited and, thus, represents an analysis near the upper end of what can be achieved for the analysis of ions per analysis time.

Assuming that emerging polypeptide counting methods can achieve the current throughput of Illumina NovaSeq for DNA (4 billion reads for $10,000), their cost for analyzing a mammalian proteome would be much higher than the cost for MS analysis. This also suggests that single-molecule counting approaches must be at least 20x cheaper than Illumina sequencing to be cost effective when compared with $500 per LC-MS/MS analysis. Stated another way, LC-MS/MS is currently more efficient at counting peptides than next generation sequencing is at counting oligonucleotides.
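The cost comparison can be reproduced with the prices quoted earlier in this article. The sketch below is a rough estimate under one assumption of ours: the LC-MS/MS denominator is taken as the >5 billion MS1 ions counted in the plasma EV example, which is one of several possible choices. All names are illustrative.

```python
ILLUMINA_RUN_USD = 10_000   # NovaSeq run, price quoted in the text
ILLUMINA_READS = 4e9        # reads per run, quoted in the text
LCMS_RUN_USD = 500          # one quantitative LC-MS/MS analysis, quoted in the text
LCMS_IONS = 5e9             # assumption: MS1 ions counted in the plasma EV example

def cost_per_molecule(run_cost_usd: float, molecules_counted: float) -> float:
    """Cost in USD for each molecule (read or ion) counted in one run."""
    return run_cost_usd / molecules_counted

usd_per_read = cost_per_molecule(ILLUMINA_RUN_USD, ILLUMINA_READS)
usd_per_ion = cost_per_molecule(LCMS_RUN_USD, LCMS_IONS)

print(f"Illumina: {usd_per_read:.1e} USD per read")
print(f"LC-MS/MS: {usd_per_ion:.1e} USD per ion")
print(f"Cost ratio (read / ion): ~{usd_per_read / usd_per_ion:.0f}x")
```

With these inputs the per-read cost is about 25 times the per-ion cost, in line with the statement that single-molecule counting would need to be at least 20x cheaper than current sequencing. Using a different ion count as the denominator would change the ratio accordingly.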

Scalability: the elephant in the room with single-molecule methods

The sheer volume of protein molecules in a cell prompts a reality check - will single-molecule methods alone reach the required throughput to sufficiently sample the proteome? For single molecule counting methods to have the same coverage and breadth of the proteome as they do the transcriptome, they will need to have 10,000-30,000x more reads of similar quality as currently generated by RNA-seq. Thus, protein single molecule counting based methods will require technological advancements that greatly exceed the capabilities of nucleotide single molecule counting methods.

A major factor that limits imaging-based single-molecule sequencing is the density at which the molecules can be spaced, and the imaging strategies used to count the spatially resolved “reads” (we are assuming a 2D imaging plane in this discussion). The limit for the spatial density is constrained by the wavelength of light. Using fluorescence detection, the emission spectrum is in the 250-700 nm range (the actual theoretical resolution limit is about half the wavelength emitted). This provides a practical upper limit on planar molecular density of ~1 molecule per μm^2. Thus, assuming perfect measurement of reads and ideal spatial placement, we can estimate the best-case scenario for the minimal flow cell area vs. number of reads: 1 million reads: 1 mm^2, 100 million reads: 1 cm^2, 10 billion reads: 100 cm^2 (10 cm × 10 cm), 1 trillion reads: 1 m^2. The area that needs to be imaged is limited by microscopy. These limits can be relaxed by super resolution imaging, but at the expense of decreased imaging speeds. Even with advances in widefield microscopy, there is a compromise between the field of view and the measured pixel size using a given charge coupled device (CCD) detector.
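The best-case flow cell areas follow from dividing the number of reads by the molecular density. The sketch below works through that conversion, assuming the idealized density of one resolvable molecule per square micrometer discussed above; the function names are ours.

```python
import math

UM2_PER_CM2 = 1e8  # 1 cm = 10^4 um, so 1 cm^2 = 10^8 um^2

def min_area_cm2(reads: float, molecules_per_um2: float = 1.0) -> float:
    """Smallest imaged area (cm^2) that can hold the requested number of reads."""
    return reads / molecules_per_um2 / UM2_PER_CM2

def square_side_cm(reads: float, molecules_per_um2: float = 1.0) -> float:
    """Side length (cm) of a square flow cell with the minimal area."""
    return math.sqrt(min_area_cm2(reads, molecules_per_um2))

for reads in (1e6, 1e8, 1e10, 1e12):
    print(f"{reads:.0e} reads: {min_area_cm2(reads):10.2f} cm^2 "
          f"(square of side {square_side_cm(reads):6.1f} cm)")
```

Under this assumption, 10 billion reads correspond to a square of about 10 cm on each side (100 cm^2), and 1 trillion reads to about 1 m on each side.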

These estimates explain why 10 billion nucleotide reads are time consuming and expensive for scRNAseq analysis. Thus, the throughput of current nucleotide sequencing methods falls short of achieving the 400 billion reads needed for a full proteome analysis of a bulk sample at a similar coverage to that currently achieved on the transcriptome by RNA-seq.

Methods analyzing intact protein molecules, such as top-down MS42 or single molecule methods that aim to count proteins20, may be able to sample the proteome with fewer total counts. This is in contrast to peptide approaches that usually count multiple unique peptide sequences per proteoform. The difference between measuring intact proteoforms and peptides from the digestion of complex mixtures is analogous to the differences between short read RNA-seq and long read isoform sequencing. Intact protein analysis is further aided by recent methods for charge detection mass spectrometry (CDMS) where individual ion events can be measured43,44.

A look at some alternative advanced single-molecule methods suggests a huge gap in throughput. The Pacific Biosciences Sequel II platform for genome sequencing can handle, at best, 10^7 molecules in each sequencing run, which takes a couple of days to complete. The highest-throughput Oxford Nanopore Technologies (ONT) platform, the PromethION, can run up to 48 flow cells at a time, providing an approximate maximum throughput of 5x10^7 molecules per run, which takes 1-2 days for data acquisition (signal processing time not included). These two examples are the most sophisticated single-molecule analyzers, and yet the throughput offered is significantly short of the required throughput for analyzing protein mixtures on par with LC-MS. The high limit of 5x10^7 for single molecule technologies is no coincidence: these limitations are governed by physical limitations in scaling up device architecture for single-molecule interrogation, limitations in molecular turnover in the devices, as well as limitations in data acquisition and transfer rates.

Taking the ONT pore sequencer and direct RNA sequencing as an example, 500 ng of input RNA contains about 10^12 mRNA molecules and only 10^6 of these are sampled in a MinION nanopore-based flow cell. The vast discrepancy between input requirements and actual molecules analyzed (only 1 part per million is sampled!) is a testament to the intertwined limitations of single-molecule technologies: 500 ng ensures that molecules arrive at a nanoscale detector with minimal off-times; otherwise the sensor will be mostly vacated, and throughput will be compromised. In addition, the speed at which molecules pass through the pores cannot be too fast (typically 100 nm of polymer contour length per second), because the maximum measurement bandwidth of the electrical signal recording cannot exceed a few kHz due to data transfer speed and signal-to-noise limitations.

These multiple constraints have set natural limits for single molecule processing, but there is no inherent reason for these to be hard limits. As flow cells are improved to enable analyses from smaller sample volumes, and/or strategies are developed to deliver molecules more efficiently to the pores rather than rely on diffusion, one can imagine over 100-fold reductions in input requirements from >100 ng to <1 ng, at similar throughputs. Similarly, if one were to assume that data transfer and bandwidths would increase by ~100 fold over the next 5-7 years, one can expect transitioning from 10^3 pores in a flowcell to 10^5, which would boost the throughput 100-fold to about 5x10^9 molecules per run (1-2 days). We estimate that these limitations will have to be overcome before single-molecule proteomics can be approached at scale.
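To put the projected throughput in context, the sketch below estimates how many runs would be needed to reach read counts mentioned in this article. It assumes throughput scales linearly with the number of pores, which is our simplification; the names are illustrative, and the per-run figures are the approximate values quoted in the text.

```python
import math

CURRENT_MOLECULES_PER_RUN = 5e7   # approximate maximum per run quoted in the text
CURRENT_PORES = 1e3               # pores per flow cell today (from the text)
PROJECTED_PORES = 1e5             # pores per flow cell after ~100-fold scaling

def projected_molecules_per_run(current: float = CURRENT_MOLECULES_PER_RUN,
                                pores_now: float = CURRENT_PORES,
                                pores_future: float = PROJECTED_PORES) -> float:
    """Throughput after scaling the pore count, assuming linear scaling."""
    return current * pores_future / pores_now

def runs_needed(target_molecules: float, molecules_per_run: float) -> int:
    """Number of 1-2 day runs required to count the target number of molecules."""
    return math.ceil(target_molecules / molecules_per_run)

future = projected_molecules_per_run()
for label, target in [("proteins in one cell (~10^10)", 1e10),
                      ("bulk proteome estimate (4x10^11)", 4e11)]:
    print(f"{label}: {runs_needed(target, CURRENT_MOLECULES_PER_RUN):,} runs now, "
          f"{runs_needed(target, future):,} runs after 100-fold scaling")
```

Even with the projected 100-fold gain, counting as many molecules as a single mammalian cell contains would take more than one multi-day run under these assumptions.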

What limits LC-MS/MS and can the technology improve to sample the proteome?

Most MS proteomics methods use a bottom-up strategy of digesting proteins to peptides to overcome the enormous physicochemical diversity of proteins in the cell. Overwhelmingly, these methods make use of trypsin, which has good cleavage specificity and produces peptides that are well suited for reversed phase separations, are mostly doubly and triply charged, and fragment well because of the localization of a basic C-terminal residue and the presence of a mobile proton. That said, not all tryptic peptides are well suited for LC-MS/MS, and because of this, proteins in complex mixtures are mainly identified through partial sequences. The sequence coverage of an identified protein varies between 10-100% (on average 30-50%) depending on the protein and the experiment. One approach to mitigate this limitation and maximize protein sequence coverage is to combine the results from different proteases with different specificity45,46. However, the increased sampling of ions derived from redundant peptides from the same proteins, while useful for improving coverage, comes at the expense of dynamic range as more ions must be sampled from additional peptides from abundant proteins before sampling ions from rare molecular species. To overcome the dynamic range problem, alternative methods have been developed to minimize peptide coverage, capturing or depleting a subset of the peptides, while maximizing the different proteins sampled. This is analogous to exon capture47, ChIP48, or similar methods used in genomics prior to single molecule sequencing. Thus, there is a balance between maximizing coverage of individual proteins and the dynamic range of the proteins measured.

The major limiting factor in the sensitivity of LC-MS/MS methods is the electrospray process, which turns peptide molecules in solution into gas-phase ions5. If a molecule isn’t converted to a gas-phase ion, it cannot be quantified with a mass spectrometer. Using electrospray, MS methods can quantify proteins present at 5,000 - 20,000 copies in the context of complex mammalian proteomes9,49.

The number of ions sampled may be increased by using methods like multidimensional chromatography11 or making multiple analyses using different portions of the mass range50. These approaches can significantly improve the depth of proteome coverage, but at the expense of increased analysis time and reduced throughput. A 6x increase in time may only increase the number of peptides that can be measured by 2x, because the peptides measured in the additional time are at least partially redundant with those measured in prior fractions. Ultimately this comes at the expense of protein input material and significantly reduces the number of samples that can be measured. Thus, a primary challenge is to achieve deep proteome coverage with smaller samples, such as single cells, and in less time, thus enabling higher throughput51,52.

Another way to improve LC-MS/MS is in the more efficient use of the ions that are generated. Currently, in most data independent acquisition methods, a single wide m/z range is isolated at once and the rest of the ion beam that isn’t isolated is lost. Data dependent acquisition methods sample an even smaller fraction of the ion beam. With bulk samples, this means that only ~1/50th of the ion beam is currently being used as only one of 50 precursor windows is measured at once53. With single-cell samples, 3-4 windows are used and thus about ⅓ of all ions available to the MS instrument are analyzed12 at the expense of limiting within spectrum selectivity. Methods like diaPASEF (parallel accumulation-serial fragmentation combined with data independent acquisition) offer potential to significantly increase the sampling of the peptide ion beam.

Another important way to advance LC-MS/MS is to improve the computational methods that are used to assign peptide sequences to the ion current that is measured. Currently only ~15-50% of the measured ion current is assigned to peptide sequences54. Thus, an improvement in both the physical instrumentation for enhancing the sampling of the ion beam and computational methods for enhanced data interpretation could see a 50-75x improvement in the number of ions counted before LC-MS/MS becomes limited by the electrospray process. This improvement in ion counts will improve the relative measurement precision of the peptides measured, elevate low abundance species within the limit of detection, and enable measurements to be made in shorter time and with less material. We expect innovations in data acquisition and interpretation to enable quantification and sequence identification for a large fraction of the tens of thousands of peptide-like features detected in single cells, and thus substantially increase the depth of proteome coverage54.

What can emerging single molecule counting methods adopt from LC-MS/MS?

Peptide quantification using LC-MS has evolved over the last several decades in ways that have improved our analyses of complex protein mixtures. Peptide ions are not counted one at a time but are aggregated, effectively compressing the signal from many peptide ions into a single measurement. This compression reduces time and minimizes the effect of abundant peptides on the counting precision of low-abundance peptides, improving the dynamic range (Figure 3). However, the emphasis on generating and sorting 'like' ions constrains the choice of enzymes to produce peptides ideally suited for the respective method. That tryptic peptides are ideally suited for LC-MS/MS does not mean they will be ideally suited for other methods. The conundrum is that reducing the bias by adding more distinct enzymes or nonspecific enzymes leads to more peptides with different sequences for each protein, making it even harder to sample low-abundance proteins in the presence of abundant proteins. Put simply, approaches to reduce these biases and increase sequence coverage in proteomics could push the field towards counting more ions from different peptide species, exacerbating the counting problem. Understanding the strengths and weaknesses of LC-MS as it has approached complex proteomes can perhaps constructively guide the emerging field of single-molecule proteomics. As advice to this budding field, consider the following.

Fractionate:

It is better to run many smaller counting experiments on fractionated samples than one very large counting experiment (Figure 2). If peptides or proteins are separated using an analytical method like liquid chromatography, electrophoresis, or affinity capture, the less abundant molecules will be enriched in certain fractions, resulting in a better representation of these peptides in the downstream detection/quantification processes. To make optimal use of this separation, methods equivalent to automatic gain control (AGC), as done with ion trap instruments55, will need to be developed so that uniform fractions are fed into the flow cell for single-molecule readout. For example, each biochemical fraction can be diluted to the same concentration and equal quantities of the fractions loaded into many flow cells.
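The benefit of fractionation with equal loading can be illustrated with a toy calculation of expected read counts. This is a minimal sketch under stated assumptions: molecules are read at random in proportion to their relative abundance, the sample contains far more material than is read (so sampling does not deplete any species), and the abundances, the two-fraction split, and the read budget are all hypothetical values chosen for illustration.

```python
from typing import Dict, List


def expected_reads(fraction: Dict[str, float], reads: float) -> Dict[str, float]:
    """Expected reads per species when `reads` molecules are sampled at
    random from one fraction, in proportion to relative abundance."""
    total = sum(fraction.values())
    return {name: reads * amount / total for name, amount in fraction.items()}


# Hypothetical relative abundances in the unfractionated sample.
unfractionated = {"abundant": 1_000_000.0, "rare": 10.0}

# Hypothetical, imperfect separation: most of the abundant species elutes in
# fraction 1, while the rare species co-elutes with a small remainder.
fractions: List[Dict[str, float]] = [
    {"abundant": 999_000.0, "rare": 0.0},
    {"abundant": 1_000.0, "rare": 10.0},
]

total_reads = 100_000.0  # hypothetical read budget across all experiments

# One large counting experiment on the unfractionated sample.
single_run = expected_reads(unfractionated, total_reads)

# Equal loading per fraction (the AGC-like step): split the budget evenly.
reads_per_fraction = total_reads / len(fractions)
fractionated_rare = sum(
    expected_reads(f, reads_per_fraction)["rare"] for f in fractions
)

print(f"expected rare reads, single run:   {single_run['rare']:.2f}")
print(f"expected rare reads, fractionated: {fractionated_rare:.2f}")
```

With the same total read budget, the rare species receives far more expected reads once the abundant species is largely confined to its own fraction. The size of the gain depends entirely on how well the separation resolves the two species.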

In addition to improving the dynamic range of the measurement, a separation based on a physicochemical property can improve the sequence determination of the peptide or protein. In LC-MS/MS, either predicted or previously measured retention time is a powerful feature for discriminating between correct and incorrect peptide detections56-58. This lowers the false discovery rate (FDR) and improves sensitivity. Indeed, nanopore proteomics methods are making first steps in this direction59.
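As a simple illustration of how a separation coordinate can help discriminate between correct and incorrect detections, the sketch below keeps only candidate identifications whose observed retention time agrees with a predicted value. The peptide sequences, retention times, and the 1-minute tolerance are hypothetical, and a hard cutoff is a deliberate simplification; retention-time evidence can instead be folded into a statistical score.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """A candidate peptide identification with its retention times (minutes)."""
    peptide: str
    observed_rt_min: float
    predicted_rt_min: float


def within_rt_tolerance(candidate: Candidate, tolerance_min: float) -> bool:
    """True if observed and predicted retention times agree within tolerance."""
    deviation = abs(candidate.observed_rt_min - candidate.predicted_rt_min)
    return deviation <= tolerance_min


# Hypothetical candidates: the second elutes far from its predicted time.
candidates: List[Candidate] = [
    Candidate("PEPTIDEK", observed_rt_min=31.2, predicted_rt_min=30.9),
    Candidate("SAMPLER", observed_rt_min=12.4, predicted_rt_min=45.0),
]

kept = [c.peptide for c in candidates if within_rt_tolerance(c, tolerance_min=1.0)]
print(kept)
```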

The measurement of a signal across many points during a chromatographic separation also enables the integration of a chromatographic peak. Despite the unparalleled selectivity of LC-MS/MS measurements, there is often a background signal that complicates the quantitative linearity of the measurements. By integrating the peak along the separation, it is possible to perform a background subtraction, which improves quantitative accuracy.
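Peak integration with background subtraction can be sketched as follows. This is a minimal example that assumes a straight-line baseline drawn between the first and last points of the integration window and uses trapezoidal integration; the function names and the synthetic signal are illustrative, and practical software typically uses more elaborate baseline models.

```python
from typing import Sequence


def trapezoid_area(x: Sequence[float], y: Sequence[float]) -> float:
    """Area under y(x) by the trapezoidal rule."""
    return sum(
        (x[i + 1] - x[i]) * (y[i] + y[i + 1]) / 2.0 for i in range(len(x) - 1)
    )


def background_subtracted_area(
    times: Sequence[float], intensities: Sequence[float]
) -> float:
    """Peak area above a linear baseline joining the window's end points."""
    if len(times) != len(intensities) or len(times) < 2:
        raise ValueError("need at least two matched (time, intensity) points")
    total = trapezoid_area(times, intensities)
    # Area of the trapezoid under the straight baseline.
    baseline = (times[-1] - times[0]) * (intensities[0] + intensities[-1]) / 2.0
    return total - baseline


# Synthetic example: constant background of 10 plus a triangular peak
# of height 20 spanning 2 time units.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
signal = [10.0, 10.0, 30.0, 10.0, 10.0]

area = background_subtracted_area(times, signal)
print(f"background-subtracted peak area: {area}")
```

In this synthetic case the subtraction recovers the area of the triangular peak alone (height 20 over a base of 2), whereas the raw integral would also include the constant background.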

When there are many molecules to count, you will need to count many at a time.

As mentioned above, measuring peptides from many billions of ions by mass spectrometry made it impractical to count ions one at a time on a realistic timescale. When done in a flow cell, single-molecule counting methods will have to count so many molecules that they will likely either 1) exceed the density of the flow cell or 2) require one or more flow cells with impractical physical dimensions. We hope to inspire new methods that are analogous to the switch in mass spectrometry from pulse counting (single molecule) to ion current measurement, in which each "read" contains a variable quantity of many counts.

Overcoming biases.

Arguably the most challenging aspect of proteomics is the massive physicochemical diversity of proteins in the cell. To overcome this vast diversity in solubility, size, post-translational modifications, ionization and fragmentation by mass spectrometry, presence of autoantibodies, embedding of domains in membranes, or protein-protein interactions, most proteomics experiments take a bottom-up strategy for the analysis of complex mixtures by digesting proteins to peptides prior to analysis. Performing analyses on the peptide level greatly simplifies the physicochemical diversity of the analytes. In general, tryptic peptides are well matched for reversed phase chromatography, electrospray ionization, and tandem mass spectrometry. Methods for top-down proteomics have advanced enormously and have opened the door to characterizing proteoforms that are often ignored in understanding the function of the cell, but these methods have greater constraints in their ability to analyze proteins with extremes in physicochemical properties60.

Over the last two decades there have been massive improvements in nanoflow separations, electrospray ionization, transmission of ions from atmospheric pressure to vacuum, tandem mass spectrometry, and pipelined data acquisition that have resulted in sensitivities now approaching 10-50 zmol for peptides. However, one of the greatest challenges for single cell and low-input proteomics is the adsorption of proteins and peptides to surfaces. In general, the sensitivity limits of proteomics samples have been set not by LC-MS/MS itself but by the loss of sample to surfaces prior to entering the system. To solve these problems, methods have been developed specifically to improve the recovery of protein from small numbers of cells using many strategies, including one-pot digestion61,62, massively parallel sample preparation in surface droplets63, addition of carrier proteins64, and barcoding and combining samples using mass tags to spread losses between many samples12.

Despite the potential sensitivity of emerging single molecule counting methods, these will need to overcome the same biochemical challenges of analyzing intact proteins, adsorptive losses to surfaces, variable enzyme digestion kinetics, and biases against certain peptide properties. While biases for sequencing peptides and proteins in flow cells and nanopores will almost certainly differ from those of LC-MS/MS, the strategies for improving the recovery of peptides for entry into the instrument will largely be the same.

Sample multiplexing:

Peptides from multiple samples can be barcoded (e.g., by covalent chemical labels), subsequently mixed, and analyzed simultaneously. Sample multiplexing has helped increase the throughput of MS proteomics8,65. Analogous multiplexing methods are likely to be implemented by single-molecule methods to increase the number of samples analyzed, as multiplexing is a powerful feature of single molecule DNA sequencing. Yet, multiplexing with single-molecule approaches spreads the counted molecules between many samples and thus reduces the number of molecules counted per sample, which results in shallower proteome coverage and lower sequence completeness.
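The trade-off between the number of multiplexed samples and the depth per sample can be illustrated with a back-of-the-envelope calculation. The sketch assumes a fixed counting capacity per run that is divided evenly among the barcoded samples; the capacity of one billion reads and the rare protein's share of all molecules are hypothetical values.

```python
def reads_per_sample(total_reads: int, n_samples: int) -> float:
    """Reads available to each sample when a run is shared evenly."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    return total_reads / n_samples


total_reads = 1_000_000_000  # hypothetical counting capacity of one run
rare_fraction = 1e-8         # hypothetical share of molecules from a rare protein

for plex in (1, 4, 16):
    per_sample = reads_per_sample(total_reads, plex)
    expected_rare = per_sample * rare_fraction
    print(
        f"{plex:>2}-plex: {per_sample:,.0f} reads per sample, "
        f"about {expected_rare:.2f} expected reads of the rare protein"
    )
```

Under these assumptions, raising the plex increases the number of samples per run while lowering the expected count for each rare protein in every sample.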

Instrument companies historically focus on the bottom line before science.

It is also important for new methods to have a clear fiscal return on investment. A couple of the new single molecule protein sequencing methods hope to convert peptide or protein sequences into DNA barcodes that can then be analyzed with traditional next generation sequencing technology66. However, as discussed above, the large number of protein molecules will require sequencing billions of molecules to obtain coverage of the proteome that can be obtained by LC-MS/MS22. Because this coverage can be obtained for ~500 USD per analysis by LC-MS/MS and sequencing billions of DNA reads can cost ~10,000 USD, it would require next generation sequencing companies to reduce their costs to ~5% of their current rates. Without separation, a proteomic technology needs to count with high specificity about 1 billion intact protein molecules (or 20 billion peptides) for 500 USD (including personnel, sample handling, and analysis) to disrupt current LC-MS technologies. This price reduction would be a game changer for DNA sequencing and would further revolutionize genomics. However, it would require DNA sequencing companies to reduce their income from genomics applications to be financially competitive in the proteomics market. If they do this, then they will have done something that is rarely done in the proteomics field: minimizing the financial return of existing products to be competitive in new high-risk areas.
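The cost comparison above can be reproduced with a few lines of arithmetic. The dollar figures and molecule counts are the approximate values quoted in the text; the variable names and the per-million normalization are illustrative choices.

```python
# Approximate figures quoted in the text.
lcms_cost_usd = 500.0     # per LC-MS/MS analysis
ngs_cost_usd = 10_000.0   # for sequencing billions of DNA reads

proteins_to_count = 1e9   # intact protein molecules, without separation
peptides_to_count = 20e9  # equivalent number of peptides

# Share of current sequencing cost needed to match LC-MS/MS.
required_fraction = lcms_cost_usd / ngs_cost_usd
fold_reduction = ngs_cost_usd / lcms_cost_usd

# Target price per million molecules counted at the 500 USD budget.
usd_per_million_proteins = lcms_cost_usd / (proteins_to_count / 1e6)
usd_per_million_peptides = lcms_cost_usd / (peptides_to_count / 1e6)

print(f"required cost: {required_fraction:.0%} of current ({fold_reduction:.0f}-fold lower)")
print(f"target: {usd_per_million_proteins:.3f} USD per million proteins counted")
print(f"target: {usd_per_million_peptides:.3f} USD per million peptides counted")
```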

Summary

Here we provide a perspective on the potential and challenges of scaling the use of single molecule counting methods to the analysis of the proteome. We use LC-MS based proteomics as a comparison by illustrating how many peptide molecules are counted in the gas phase using standard mass spectrometry methods. This comparison can serve as a benchmark for single molecule counting methods seeking parity with LC-MS data. Analyzing the proteome by counting single peptide or protein molecules in a spatially resolved flow cell presents significant challenges beyond those of counting nucleotides, because of both the physicochemical complexity of proteins and the far greater number of protein molecules in the cell. To support innovation around these emerging methods, we offer suggestions drawn from lessons learned by the LC-MS/MS-based proteomics community.

Acknowledgements

This work was supported in part by National Institutes of Health grants U19 AG065156, R24 GM141156, F31 AG066318, an Allen Distinguished Investigator award through The Paul G. Allen Frontiers Group to N.S., a Seed Networks Award from CZI CZF2019-002424 to N.S., an R01 award from NIGMS R01GM144967, an R01 award from NHGRI R01HG10087 to M.W., the project ‘International Centre for Cancer Vaccine Science’ that is carried out within the International Agendas Programme of the Foundation for Polish Science co-financed by the European Union under the European Regional Development Fund. We thank the PL-Grid and CI-TASK Infrastructure, Poland, for providing their hardware and software resources. This work is supported by “Knowledge At the Tip of Your fingers: Clinical Knowledge for Humanity” (KATY) project funded from the European Union’s Horizon 2020 research and innovation program under grant agreement No. 101017453. The authors would like to acknowledge the helpful discussions with Edward Marcotte and members of the Alfaro, MacCoss, and Slavov labs. MJM appreciates the constructive feedback provided by UW Genome Sciences faculty.

Footnotes

Competing interests statement

The MacCoss Lab at the University of Washington has a sponsored research agreement with Thermo Fisher Scientific, a manufacturer of mass spectrometry instrumentation. M.J.M. is a paid consultant for Thermo Fisher Scientific. The Slavov laboratory at Northeastern University has a research agreement with Bruker, a manufacturer of mass spectrometry instrumentation.

References

  • 1. Nau H & Biemann K Amino acid sequencing by gas chromatography--mass spectrometry using perfluoro-dideuteroalkylated peptide derivatives. A. Gas chromatographic retention indices. Anal. Biochem 73, 139–153 (1976).
  • 2. Hass GM et al. The amino acid sequence of a carboxypeptidase inhibitor from potatoes. Biochemistry 14, 1334–1342 (1975).
  • 3. Hunt DF, Yates JR 3rd, Shabanowitz J, Winston S & Hauer CR Protein sequencing by tandem mass spectrometry. Proc. Natl. Acad. Sci. U. S. A 83, 6233–6237 (1986).
  • 4. Johnson RS & Biemann K The primary structure of thioredoxin from Chromatium vinosum determined by high-performance tandem mass spectrometry. Biochemistry 26, 1209–1214 (1987).
  • 5. Yamashita M & Fenn JB Electrospray ion source. Another variation on the free-jet theme. J. Phys. Chem 88, 4451–4459 (1984).
  • 6. Eng JK, McCormack AL & Yates JR An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom 5, 976–989 (1994).
  • 7. Venable JD, Dong M-Q, Wohlschlegel J, Dillin A & Yates JR Automated approach for quantitative analysis of complex peptide mixtures from tandem mass spectra. Nat. Methods 1, 39–45 (2004).
  • 8. Ross PL et al. Multiplexed protein quantitation in Saccharomyces cerevisiae using amine-reactive isobaric tagging reagents. Mol. Cell. Proteomics 3, 1154–1169 (2004).
  • 9. Specht H et al. Single-cell proteomic and transcriptomic analysis of macrophage heterogeneity using SCoPE2. Genome Biol. 22, 50 (2021).
  • 10. Petelski AA et al. Multiplexed single-cell proteomics using SCoPE2. Nat. Protoc 16, 5398–5425 (2021).
  • 11. Washburn MP, Wolters D & Yates JR 3rd. Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat. Biotechnol 19, 242–247 (2001).
  • 12. Derks J et al. Increasing the throughput of sensitive proteomics by plexDIA. Nat. Biotechnol (2022) doi: 10.1038/s41587-022-01389-w.
  • 13. Messner CB et al. Ultra-High-Throughput Clinical Proteomics Reveals Classifiers of COVID-19 Infection. Cell Syst 11, 11–24.e4 (2020).
  • 14. Alfaro JA et al. The emerging landscape of single-molecule protein sequencing technologies. Nat. Methods 18, 604–617 (2021).
  • 15. Swaminathan J, Boulgakov AA & Marcotte EM A theoretical justification for single molecule peptide sequencing. PLoS Comput. Biol 11, e1004080 (2015).
  • 16. Palmblad M Theoretical Considerations for Next-Generation Proteomics. J. Proteome Res 20, 3395–3399 (2021).
  • 17. Swaminathan J et al. Highly parallel single-molecule identification of proteins in zeptomole-scale mixtures. Nat. Biotechnol (2018) doi: 10.1038/nbt.4278.
  • 18. Reed BD et al. Real-time dynamic single-molecule protein sequencing on an integrated semiconductor device. bioRxiv 2022.01.04.475002 (2022) doi: 10.1101/2022.01.04.475002.
  • 19. Mallick P Methods of assaying proteins. US Patent (2021).
  • 20. Egertson JD et al. A theoretical framework for proteome-scale single-molecule protein identification using multi-affinity protein binding reagents. bioRxiv 2021.10.11.463967 (2021) doi: 10.1101/2021.10.11.463967.
  • 21. Brinkerhoff H, Kang ASW, Liu J, Aksimentiev A & Dekker C Multiple rereads of single proteins at single-amino acid resolution using nanopores. Science 374, 1509–1513 (2021).
  • 22. Slavov N Counting protein molecules for single-cell proteomics. Cell 185, 232–234 (2022).
  • 23. Milo R What is the total number of protein molecules per cell volume? A call to rethink some published values. Bioessays 35, 1050–1055 (2013).
  • 24. Bekker-Jensen DB et al. An Optimized Shotgun Strategy for the Rapid Generation of Comprehensive Human Proteomes. Cell Syst 4, 587–599.e4 (2017).
  • 25. Anderson NL & Anderson NG The human plasma proteome: history, character, and diagnostic prospects. Mol. Cell. Proteomics 1, 845–867 (2002).
  • 26. Marinov GK et al. From single-cell to cell-pool transcriptomes: stochasticity in gene expression and RNA splicing. Genome Res. 24, 496–510 (2014).
  • 27. Peterson DW & Hayes JM Signal-to-Noise Ratios in Mass Spectroscopic Ion-Current-Measurement Systems, in Contemporary Topics in Analytical and Clinical Chemistry: Volume 3 (eds. Hercules DM, Hieftje GM, Snyder LR & Evenson MA) 217–252 (Springer US, 1978).
  • 28. Scigelova M, Hornshaw M, Giannakopulos A & Makarov A Fourier transform mass spectrometry. Mol. Cell. Proteomics 10, M111.009431 (2011).
  • 29. Makarov A & Denisov E Dynamics of ions of intact proteins in the Orbitrap mass analyzer. J. Am. Soc. Mass Spectrom 20, 1486–1495 (2009).
  • 30. MacCoss MJ, Toth MJ & Matthews DE Evaluation and optimization of ion-current ratio measurements by selected-ion-monitoring mass spectrometry. Anal. Chem 73, 2976–2984 (2001).
  • 31. Schwartz JC, Zhou X-G & Bier ME Method and apparatus of increasing dynamic range and sensitivity of a mass spectrometer. US Patent (1996).
  • 32. Zhao S, Ye Z & Stanton R Misuse of RPKM or TPM normalization when comparing across samples and sequencing protocols. RNA 26, 903–909 (2020).
  • 33. Belov ME et al. Dynamic range expansion applied to mass spectrometry based on data-dependent selective ion ejection in capillary liquid chromatography fourier transform ion cyclotron resonance for enhanced proteome characterization. Anal. Chem 73, 5052–5060 (2001).
  • 34. Meier F, Geyer PE, Virreira Winter S, Cox J & Mann M BoxCar acquisition method enables single-shot proteomics at a depth of 10,000 proteins in 100 minutes. Nat. Methods 15, 440–448 (2018).
  • 35. Egertson JD et al. Multiplexed MS/MS for improved data-independent acquisition. Nat. Methods 10, 744–746 (2013).
  • 36. Anderson NL et al. The human plasma proteome: a nonredundant list developed by combination of four separate sources. Mol. Cell. Proteomics 3, 311–326 (2004).
  • 37. Pieper R et al. Multi-component immunoaffinity subtraction chromatography: an innovative step towards a comprehensive survey of the human plasma proteome. Proteomics 3, 422–432 (2003).
  • 38. Macdonald IK, Parsy-Kowalska CB & Chapman CJ Autoantibodies: Opportunities for Early Cancer Detection. Trends Cancer Res. 3, 198–213 (2017).
  • 39. Hoofnagle AN & Wener MH The fundamental flaws of immunoassays and potential solutions using tandem mass spectrometry. J. Immunol. Methods 347, 3–11 (2009).
  • 40. McVicar JP, Kunitake ST, Hamilton RL & Kane JP Characteristics of human lipoproteins isolated by selected-affinity immunosorption of apolipoprotein A-I. Proc. Natl. Acad. Sci. U. S. A 81, 1356–1360 (1984).
  • 41. Heinecke JW The HDL proteome: a marker--and perhaps mediator--of coronary artery disease. J. Lipid Res 50 Suppl, S167–71 (2009).
  • 42. Siuti N & Kelleher NL Decoding protein modifications using top-down mass spectrometry. Nat. Methods 4, 817–821 (2007).
  • 43. Kafader JO et al. Multiplexed mass spectrometry of individual ions improves measurement of proteoforms and their complexes. Nat. Methods 17, 391–394 (2020).
  • 44. Wörner TP et al. Resolving heterogeneous macromolecular assemblies by Orbitrap-based single-particle charge detection mass spectrometry. Nat. Methods 17, 395–398 (2020).
  • 45. Gatlin CL, Eng JK, Cross ST, Detter JC & Yates JR 3rd. Automated identification of amino acid sequence variations in proteins by HPLC/microspray tandem mass spectrometry. Anal. Chem 72, 757–763 (2000).
  • 46. MacCoss MJ et al. Shotgun identification of protein modifications from protein complexes and lens tissue. Proc. Natl. Acad. Sci. U. S. A 99, 7900–7905 (2002).
  • 47. Turner EH, Lee C, Ng SB, Nickerson DA & Shendure J Massively parallel exon capture and library-free resequencing across 16 genomes. Nat. Methods 6, 315–316 (2009).
  • 48. Johnson DS, Mortazavi A, Myers RM & Wold B Genome-wide mapping of in vivo protein-DNA interactions. Science 316, 1497–1502 (2007).
  • 49. Gray Huffman R et al. Prioritized single-cell proteomics reveals molecular and functional polarization across primary macrophages. bioRxiv 2022.03.16.484655 (2022) doi: 10.1101/2022.03.16.484655.
  • 50. Panchaud A et al. Precursor acquisition independent from ion count: how to dive deeper into the proteomics ocean. Anal. Chem 81, 6481–6488 (2009).
  • 51. Slavov N Increasing proteomics throughput. Nat. Biotechnol 39, 809–810 (2021).
  • 52. Derks J & Slavov N Strategies for increasing the depth and throughput of protein analysis by plexDIA. J. Proteome Res (in press) (2023).
  • 53. Pino LK, Just SC, MacCoss MJ & Searle BC Acquiring and Analyzing Data Independent Acquisition Proteomics Experiments without Spectrum Libraries. Mol. Cell. Proteomics 19, 1088–1103 (2020).
  • 54. Slavov N Driving Single Cell Proteomics Forward with Innovation. J. Proteome Res 20, 4915–4918 (2021).
  • 55. Schwartz JC & Kovtoun VV Automatic gain control (AGC) method for an ion trap and a temporally non-uniform ion beam. US Patent (2011).
  • 56. Klammer AA, Yi X, MacCoss MJ & Noble WS Improving tandem mass spectrum identification using peptide retention time prediction across diverse chromatography conditions. Anal. Chem 79, 6111–6118 (2007).
  • 57. Searle BC et al. Chromatogram libraries improve peptide detection and quantification by data independent acquisition mass spectrometry. Nat. Commun 9, 5128 (2018).
  • 58. Chen AT, Franks A & Slavov N DART-ID increases single-cell proteome coverage. PLoS Comput. Biol 15, e1007082 (2019).
  • 59. Zrehen A, Ohayon S, Huttner D & Meller A On-chip protein separation with single-molecule resolution. Sci. Rep 10, 15313 (2020).
  • 60. Donnelly DP et al. Best practices and benchmarks for intact protein analysis for top-down mass spectrometry. Nat. Methods 16, 587–594 (2019).
  • 61. Zhu Y et al. Nanodroplet processing platform for deep and quantitative proteome profiling of 10–100 mammalian cells. Nat. Commun 9, 1–10 (2018).
  • 62. Specht H, Harmange G, Perlman DH & Emmott E Automated sample preparation for high-throughput single-cell proteomics. bioRxiv (2018) doi: 10.1101/399774.
  • 63. Leduc A, Huffman RG, Cantlon J, Khan S & Slavov N Exploring functional protein covariation across single cells using nPOP. Genome Biol. 23, 261 (2022).
  • 64. Budnik B, Levy E, Harmange G & Slavov N SCoPE-MS: mass spectrometry of single mammalian cells quantifies proteome heterogeneity during cell differentiation. Genome Biol. 19, 161 (2018).
  • 65. Framework for multiplicative scaling of single-cell proteomics. Nat. Biotechnol (2022) doi: 10.1038/s41587-022-01411-1.
  • 66. Hong JM et al. ProtSeq: Toward high-throughput, single-molecule protein sequencing via amino acid conversion into DNA barcodes. iScience 25, 103586 (2022).
  • 67. Hagemann-Jensen M et al. Single-cell RNA counting at allele and isoform resolution using Smart-seq3. Nat. Biotechnol 38, 708–714 (2020).
