2024 Mar 13;23(4):1263–1271. doi: 10.1021/acs.jproteome.3c00730

Fit for Purpose Approach To Evaluate Detection of Amino Acid Substitutions in Shotgun Proteomics

Taylor J Lundgren, Patricia L Clark †,‡,*, Matthew M Champion †,*
PMCID: PMC11003417  PMID: 38478054

Abstract

Amino acid substitutions (AASs) alter proteins from their genome-expected sequences. Accumulation of substitutions in proteins underlies numerous diseases and antibiotic mechanisms. Accurate global detection of AASs and their frequencies is crucial for understanding these mechanisms. Shotgun proteomics provides an untargeted method for measuring AASs but introduces biases when extrapolating from the genome to identify AASs. To characterize these biases, we created a “ground-truth” approach using the similarities between Escherichia coli and Salmonella typhimurium to model the complexity of AAS detection. Shotgun proteomics on mixed lysates generated libraries representing ∼100,000 peptide-spectra and 4161 peptide sequences with a single AAS and defined stoichiometry. Identifying S. typhimurium peptide-spectra with only the E. coli genome resulted in 64.1% correctly identified library peptides. Specific AASs exhibit variable identification efficiencies. There was no inherent bias from the stoichiometry of the substitutions. Short peptides and AASs localized near peptide termini had poor identification efficiency. We identify a new class of “scissor substitutions” that gain or lose protease cleavage sites. Scissor substitutions also had poor identification efficiency. This ground-truth AAS library reveals various sources of bias, which will guide the application of shotgun proteomics to validate AAS hypotheses.

Keywords: amino acid substitution, bottom-up proteomics, bioinformatics, spectral library, ground truth

Introduction

Incorporation of an incorrect amino acid during translation can negatively impact protein stability and function, leading to increased misfolding and proteotoxic stress.1 Several oncogenes increase mistranslation through modification of mRNA and tRNA.2–5 Similarly, several antibiotics create mistranslation-driven proteotoxic stress, which compromises the cellular membrane of bacteria.6 However, in both cases, the distribution and sequence biases of the mistranslated proteins are largely unknown, due to the difficulty of identifying diverse, rare substitutions in the background of a genome-defined proteome. This uncertainty restricts the informed improvement of therapeutics that target translational fidelity. Defining proteotoxic products of unfaithful translation requires the untargeted discovery of nongenomic modified proteins. Global detection of amino acid substitutions (AASs) in proteins is of particular interest to evaluate translation fidelity and the consequences of mistranslation on protein homeostasis.1,7–9

Shotgun proteomics is a powerful technique for untargeted protein identification.10,11 However, peptide-spectra are predominantly identified by matching to a genome-defined database and subsequent protein inference.12 Identification and inference beyond genome-anticipated peptides introduce a significant bioinformatic challenge, which complicates AAS identification. Classical permutation-based searching for stochastically modified amino acid residues in peptide-spectra is only effective when few modifications are considered.13,14 The full suite of single AAS permutations dramatically expands the search space, making it intractable for a permutation-based search. For example, an average tryptic peptide of 14 aa with 18 mass-unique substitutions per position yields 14 × 18 = 252 possible canonical single AAS permutations, before considering other common post-translational peptide modifications that may also occur. Alternatively, de novo peptide sequencing is not constrained by the genome, but this approach lacks the identification power of a database-driven search.15–17
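
The search-space arithmetic in this paragraph can be sketched in a few lines of Python. The function name is hypothetical, and the sketch assumes that Ile and Leu count as a single mass class, which leaves 19 mass-distinct residues and 18 alternatives per position.

```python
# Illustrative sketch: count single-AAS candidate sequences for one peptide.
# Assumption: Ile and Leu are treated as one mass class, leaving 19
# mass-distinct residues and therefore 18 alternatives per position.

MASS_DISTINCT_RESIDUES = 19  # 20 canonical amino acids with I/L merged


def single_aas_candidates(peptide_length: int) -> int:
    """Number of single-substitution variants of a peptide of this length."""
    alternatives_per_position = MASS_DISTINCT_RESIDUES - 1
    return peptide_length * alternatives_per_position


if __name__ == "__main__":
    # 14 positions x 18 alternatives per position
    print(single_aas_candidates(14))
```

Because the count grows linearly with peptide length and multiplies with every additional variable modification considered, even this simple estimate shows why exhaustive permutation becomes impractical across a whole proteome.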

Other proteomic search approaches have successfully identified substituted peptides. Previously, many hundreds of substitutions were identified using targeted databases informed by mRNA sequencing or oncological single nucleotide polymorphisms.18–20 Novel search algorithms have enabled the detection of many peptide modifications, including AASs, by using only the reference genome. Recently, Mordret et al. adapted dependent-peptide search to identify 1679 unique substitutions.7 There are also reports of successfully identifying substitutions from reference genomes using proprietary commercial software, such as the SPIDER algorithm in PEAKS.6,17,21,22

The current paradigm for evaluating software identification of peptides is limited to the range and yield of peptide-spectra matches (PSMs) at controlled false discovery rates (FDRs).23 Yet the yield of detected AAS PSMs is an information-poor metric because it can neither characterize nor count unidentified or mis-identified AAS peptide-spectra. A comprehensive description of un- and mis-identified spectra can only be accomplished by a priori knowledge in a ground-truth data set, which is necessary to describe the limitations introduced by the identification of AAS-peptide-spectra. This description could also evaluate logical assumptions made about AAS identification, such as inferred lower limits of detection, the impact of peptide stoichiometry, and the integrity of FDR estimation.

Previously used positive controls for AAS detection include incorporating synthetic peptides and customized search databases.6,19,21 These have been effective for targeted validation of successfully identified AAS PSMs; however, both are limited in the number of representative AAS peptide sequences that can be reasonably included in an experiment due to price and expansion of the search space, respectively.24 This makes it challenging to represent each substitution type across diverse peptide or protein localization and across unique competition from the sample matrix, other peptides, and artificial modifications. To reflect the majority of AASs, a positive control for global identification of AASs would be diverse in substitution type, location within the peptide, and absolute and relative abundance.

To recreate the broad range of substitution and peptide contexts, we leveraged naturally occurring diversity to create a useful positive control for global AAS detection. Many proteins have homologues in closely related species, with naturally occurring amino acid polymorphisms between species defined by their respective genomes.25 We hypothesized that combining two closely related species would result in a subset of anticipated AAS peptides of sufficient complexity and diversity to represent the physicochemical characteristics observable in a shotgun proteomics experiment (Figure 1A). Peptides from both organisms could be identified in a single standard database search where both genomes are provided, resulting in a set of positively identified spectra that represent substitutions (Figure 1B). The same data could then be searched with only one genome and an AAS peptide-spectrum identification strategy of choice (Figure 1C). Pairwise comparison of searches by spectrum allows evaluation of the AAS search strategy by global and categorical efficiency at controlled stoichiometry.

Figure 1.

Creating the AAS peptide-spectra library. (A) Peptide sequences that mimic an AAS are predicted from in silico digested E. coli and S. typhimurium protein sequences. (B) Raw data from a mixed proteome sample are identified by a standard database search with both genomes provided. Identified spectra that match any sequence mimicking an AAS are used to create a peptide-spectra library, used as a positive control for the AAS discovery search. (C) Same data are searched with only one genome and a method to discover AAS peptides. We used MSFragger’s mass-offset search to identify S. typhimurium peptides when provided with only an E. coli genome, or vice versa, and evaluated identification performance against the ground-truth library.

We selected Escherichia coli and Salmonella typhimurium for this approach, as these bacteria have complete genome sequences and sufficient evolutionary distance for unambiguous protein inference.26,27 This resulted in a ground-truth library of 52,756 AAS-representing spectra from Salmonella that we used to discover biases in AAS identification. We adapted the mass-offset approach to identify AASs in the same raw data used to create the library and found moderate success, with 64.1% peptide identification efficiency.28 We found that successful PSMs had similar absolute and relative abundance sensitivities but scored lower than their library counterparts. Furthermore, we identified both substitution type and location as driving biases in the PSM identification efficiency. We also defined “scissor substitutions”, a unique subclass of substitutions that remove or introduce a protease cleavage site, which result in fewer successful identifications. We demonstrate the utility of a spectral library and the power of efficiency as criteria for evaluating the limitations of AAS identification.

Materials and Methods

Peptide Preparation

Overnight cultures of S. typhimurium LT2 cells or E. coli K12 MG1655 were diluted in LB media, grown to log phase, and pelleted by centrifugation. Next, the pellet was suspended in lysis buffer and lysed using a bead beater (Biospec). Cell lysate was clarified by centrifugation and quantified using a Pierce BCA assay (Thermo Fisher) per the manufacturer’s protocols. Samples were reduced, alkylated, and subsequently loaded onto S-Trap mini columns (ProtiFi) per the manufacturer’s protocol. Proteins were digested with 1 μg of trypsin in 160 μL of 100 mM tetraethylammonium bicarbonate (TEAB) pH 8.5 for 2 h at 47 °C. Peptides were eluted per the manufacturer’s protocol and dried down to ∼20 μL in a speedvac. S. typhimurium peptides were desalted using an Oasis HLB desalting column (Waters) and E. coli peptides were desalted using C18 ZipTip (EMD Millipore), following the respective manufacturer’s protocol. Desalted peptides were dried down in a vacuum concentrator and then suspended in 0.1% formic acid in water to a final concentration of 300 ng/μL.

Serial Dilution of S. typhimurium

Desalted S. typhimurium peptides were serially diluted 2-fold by the addition of 60 μL of peptides to 60 μL of 0.1% formic acid. Desalted E. coli peptides (30 μL) were added to each S. typhimurium dilution to create a constant background of E. coli peptides.

Liquid Chromatography–Mass Spectrometry

Technical duplicate injections of 1.33 μL per sample were separated with a PepSep TEN C18 10 cm × 100 μm column (Bruker) and eluted with a 90 min segmented linear gradient. Mass spectra were collected on a Bruker TIMS-TOF Pro operating with a modified DDA-PASEF 1.1 s cycle time method: the CaptiveSpray source was set to 1700 V, and the collision energy maximum was set to 70 eV.

Identification of Mass Spectra in FragPipe

Raw data were searched using FragPipe (v.17.1) GUI with MSFragger (3.4) and filtered with Philosopher (v4.2.2-RC).11 Software parameters for each search are included in the GitHub repository (see the fragpipe.config file). E. coli MG1655 (UP000000625) and S. typhimurium LT2 (UP000001014) genomes were downloaded from UniProt (2022.03.25) with common contaminants and decoy sequences added in FragPipe.11 The two-genome search included up to one missed cleavage, carbamidomethylation of cysteine as a fixed modification, oxidation of methionine as a variable modification, and no mass-offsets. The single-genome search in MSFragger was set to use the mass-offset algorithm with a corresponding offset to discover each AAS and the top 26 PTMs found in a default set open search. The option to report mass-offset as a variable modification was set to 1 (“Yes—and remove delta mass”).

Genomic Analysis To Identify Single Amino Acid Variant Peptides

To identify a target list of tryptic peptides that differ by one aa between organisms, each genome was digested in silico using Protease Guru.29 The resultant list of peptides was passed to a Python script (FindSSP.py) that outputs a target list of all of the peptide sequences that represent a single amino acid variant (SAAV) between the two organisms. The script excludes the mass-ambiguous substitutions I/L → L/I and R/K → X!R/K at the peptide C-terminus.
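
The idea behind this step can be illustrated with a minimal sketch. This is not the authors' FindSSP.py; the function names are hypothetical, only equal-length peptides are compared (so substitutions that change the cleavage pattern are not captured), and only the I/L exclusion is modeled.

```python
from collections import defaultdict


def single_mismatch(a: str, b: str):
    """Return (index, residue_a, residue_b) if two equal-length peptides
    differ at exactly one position, otherwise None."""
    if len(a) != len(b):
        return None
    diffs = [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]
    return diffs[0] if len(diffs) == 1 else None


def find_saav_pairs(reference_peptides, query_peptides):
    """List (reference, query, 1-based position, 'X->Y') for every query
    peptide that is exactly one residue away from a reference peptide."""
    # Index reference peptides by length so only comparable pairs are tested.
    by_length = defaultdict(list)
    for pep in sorted(set(reference_peptides)):
        by_length[len(pep)].append(pep)

    pairs = []
    for query in sorted(set(query_peptides)):
        for ref in by_length.get(len(query), []):
            hit = single_mismatch(ref, query)
            if hit is None:
                continue
            index, ref_aa, query_aa = hit
            if {ref_aa, query_aa} == {"I", "L"}:
                continue  # Ile/Leu swaps are mass-ambiguous and skipped
            pairs.append((ref, query, index + 1, f"{ref_aa}->{query_aa}"))
    return pairs


if __name__ == "__main__":
    reference = ["AGEPTIDEK", "LLLK"]
    query = ["AAEPTIDEK", "LILK"]
    for row in find_saav_pairs(reference, query):
        print(row)
```

The pairwise comparison is quadratic within each length class, which is acceptable for a sketch; a production script would more likely hash each peptide with one position masked to find neighbors in linear time.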

Annotation and Filtering of AASs

Spectra representing AASs between E. coli and S. typhimurium were parsed and filtered using a Python script (MSFraggerFindSubs.py). To summarize, all PSMs were imported from MSFragger’s psm.tsv outputs. Modified sequences were matched from the indicated mass-offset to all considered modifications within 25 ppm of peptide mass error. For example, APEPT[-18]IDEK would be annotated as “T → A or dehydration”. All substituted sequences were then filtered to include only target peptide sequences identified by FindSSP.py. PSMs were filtered to remove unquantified peptides with 0 intensity precursors, ambiguous, or non-AAS modifications.
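
A minimal sketch of tolerance-based annotation follows. It is not the authors' MSFraggerFindSubs.py. The function name, the short list of competing modifications, and the choice to convert 25 ppm into daltons relative to the peptide mass are all assumptions made for illustration; residue masses are standard monoisotopic values for unmodified residues.

```python
# Monoisotopic residue masses (Da) for the 20 canonical amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

# Hypothetical short list of competing modifications (name -> mass shift, Da).
OTHER_MODS = {
    "oxidation": 15.99491,
    "methylation": 14.01565,
    "dehydration": -18.01056,
}


def annotate_offset(residue, observed_offset, peptide_mass, ppm=25.0):
    """Return every substitution of `residue`, and every listed modification,
    whose mass shift matches the observed offset within the ppm tolerance."""
    # Assumption: the ppm tolerance is expressed relative to the peptide mass.
    tolerance_da = peptide_mass * ppm * 1e-6
    labels = []
    for new_residue, mass in RESIDUE_MASS.items():
        if new_residue == residue:
            continue
        delta = mass - RESIDUE_MASS[residue]
        if abs(delta - observed_offset) <= tolerance_da:
            labels.append(f"{residue}->{new_residue}")
    for name, delta in OTHER_MODS.items():
        if abs(delta - observed_offset) <= tolerance_da:
            labels.append(name)
    return labels


if __name__ == "__main__":
    # A +14.0157 Da offset localized on Gly in a 1000 Da peptide
    print(annotate_offset("G", 14.0157, 1000.0))
```

Returning all matching labels, rather than the single closest one, keeps the ambiguity visible so that a later filtering step can decide whether to discard the PSM.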

Defining the AAS Spectral Library

The AAS spectral library was defined as all spectra identified in the two-genome search with a S. typhimurium sequence exactly 1 aa different than any E. coli peptide sequence, or vice versa. Each peptide-spectra match was annotated with the peptide characteristics (intensity, retention time, ion mobility, length); characteristics relative to the E. coli cognate sequence (substitution type, delta retention time, delta ion mobility); and the sequence determined in the one-genome search for categorization.

Data Availability

Data used to generate figures are available at https://github.com/ChampionLab/substitutionannotation. Raw spectra outputs are available at the MassIVE repository (doi:10.25345/C5416T88R).

Code Availability

Python scripts used are available at https://github.com/ChampionLab/substitutionannotation.

Results

Constructing the Ground-Truth Substitution Library

To evaluate the identification of substitutions in shotgun proteomics, we needed a set of peptide-spectra that met three criteria. First, each spectrum needs sufficient evidence to be confidently assigned an amino acid sequence. Second, each spectrum must represent a peptide sequence that differs by one amino acid from a reference genome. Third, the collection of spectra should represent substitutions diverse in substitution type, location, and sequence context. We found that by mixing two closely related bacteria, E. coli and S. typhimurium, evolutionary homology results in a ground-truth spectral library that broadly satisfies these three constraints. To evaluate the similarity between the genome-defined peptides of these organisms, both genomes were digested to tryptic peptides in silico using Protease Guru.29 Each E. coli peptide was categorized by the number of amino acids that differ from any S. typhimurium peptide, and vice versa (see Materials and Methods). Of the 200,537 in silico S. typhimurium peptides considered, we found that 24.5% were homologous sequences; 14.0% of sequences differed from at least one E. coli peptide by exactly one amino acid; 13.2% differed from at least one E. coli peptide by exactly two amino acids; and 52.2% had more than 2 different aa than any E. coli peptide (Supporting Information, Figure S1, Table S1).

However, we expect that many of the 28,077 single-substitution-representing peptides will not be observable under one experimental condition for reasons including low gene expression, incomplete digestion enzyme efficiency, and peptide characteristics incompatible with efficient ionization.30,31 To collect spectra representing these peptide sequences, we prepared, quantified, and digested E. coli and S. typhimurium whole cell lysates individually using standard shotgun proteomic preparation methods. Digested lysates were combined, serially diluted, and measured via nUHPLC-MS-MS/MS, and a standard database search was performed with both the E. coli and S. typhimurium genomes. We filtered all peptide-spectrum matches for sequences found to represent a single AAS. This resulted in 52,756 spectra and 2568 unique peptide sequences that comprise our ground-truth spectral library. This library shows considerable diversity in physicochemical properties (Supporting Information, Figure S2) and represents 241 of the 342 possible AASs detectable by mass spectrometry (Supporting Information, Figure S3). This ground-truth library meets our three criteria and is representative of AASs observable in typical shotgun proteomic experiments.

Determining the Efficiency of Mass-Offset Discovery of Amino Acid Substituted Spectra

We applied our ground-truth substitution library to determine the efficiency of AAS identification in shotgun proteomics. We used mass-offset search to identify S. typhimurium peptide-spectra using only an E. coli search database (Figure 1C) in the same raw data used to create the spectral library. Similar to dependent peptide search in Mordret et al.,7 we adapted mass-offset PTM search functionality in MSFragger to identify AAS peptides (see Materials and Methods).11 We created a Python script (Materials and Methods) to annotate identified mass-offsets with specific changes in mass and aa localization as substitutions while filtering out mass-ambiguous modifications and unquantified peptides. We tracked the fate of each individual spectrum by comparing its sequence in the library to that determined in the single-genome search and evaluated identification efficiency globally and categorically. Most library peptide sequences (64.1%) were identified. Only 38.3% of the library spectra were correctly identified (Figure 2A, in green). Many spectra (34.0%, “no identification”) did not score well enough for sequence assignment after standard FDR control. This demonstrates a challenge to obtaining confidence in sequence assignment without a priori genomic knowledge. The remaining spectra (27.7%, in gold) were confidently assigned an incorrect sequence, whether matched to the unmodified genomic cognate sequence (14.9%), assigned an incorrectly modified sequence (5.7%), or filtered out due to mass ambiguity with other PTMs (2.2%) or lack of intensity (2.4%). These categories represent specific targets for iterative improvement of the identification software and are broadly applicable to any search software. The fates of library spectra suggest that most spectra are un- or mis-identified because spectral evidence of a substitution generates less confidence in spectrum identity than alignment with the genomic database.
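
The spectrum-by-spectrum comparison described here can be sketched as a small classifier. The category names, the tuple layout, and both function names are illustrative assumptions, and the sketch collapses the finer categories (mass ambiguity, missing intensity) into a single "other sequence" bin.

```python
from collections import Counter


def classify_fate(library_seq, cognate_seq, search_seq):
    """Assign one fate to a library spectrum from its one-genome result.

    library_seq: sequence assigned in the two-genome (library) search
    cognate_seq: unmodified reference-genome counterpart
    search_seq:  sequence from the one-genome search, or None if the
                 spectrum did not pass FDR control
    """
    if search_seq is None:
        return "no identification"
    if search_seq == library_seq:
        return "correct"
    if search_seq == cognate_seq:
        return "unmodified cognate"
    return "other sequence"


def fate_fractions(records):
    """Fraction of spectra in each fate; records are
    (library_seq, cognate_seq, search_seq) tuples."""
    counts = Counter(classify_fate(*record) for record in records)
    total = sum(counts.values())
    return {fate: n / total for fate, n in counts.items()}


if __name__ == "__main__":
    demo = [
        ("AAEK", "AGEK", "AAEK"),  # substitution recovered
        ("AAEK", "AGEK", None),    # failed FDR control
        ("AAEK", "AGEK", "AGEK"),  # matched to the genomic cognate
        ("AAEK", "AGEK", "ASEK"),  # assigned a different substitution
    ]
    print(fate_fractions(demo))
```

Keying the comparison on the spectrum rather than the peptide is what allows the efficiency to be reported both per spectrum and per peptide sequence.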

Figure 2.

Tracing the final fate of library spectra in the one-genome search demonstrates intensity-independent challenges for spectrum identification. (A) Final fate of each library spectrum identification in the one-genome search. Spectra were first categorized as representing a library peptide sequence (green) or other sequences (gold). Spectra were further categorized as a correct identification (green) or an incorrect identification (gold). The subcategories of incorrectly identified spectra represent either failure to identify the spectra or specific competing identification hypotheses in sequence assignment. These categories can be used to guide search software improvements. (B) PeptideProphet score difference (one-genome search minus library) for each correctly identified spectrum. There is a tail of spectra that scored more poorly in the one-genome search. (C) Minimum observed intensity for each peptide, grouped by the final sample of the dilution series that led to successful identification. The distributions of minimum observed peptide intensities were similar throughout the dilution series despite decreasing Salmonella stoichiometry. (D) Dynamic range observed for each correctly identified AAS peptide (hatched white bars) and all S. typhimurium peptides (black bars) through the dilution series. Both peptide groups have similar observed dynamic range distributions.

Successfully Identified Spectra Met Higher Score Thresholds with Lower Individual Scores

We hypothesized that the inability to identify many spectra in the single-genome search was due to an increased burden of proof without a priori sequence knowledge. Spectra sequence assignment uniquely depends on a PSM score relative to a cutoff value. The score cutoff thresholds are universally based on the target-decoy strategy.23 To investigate if identifying substitutions with mass-offset affected the target-decoy resolution, we compared the distribution of decoy PSM scores between the one-genome and library search (Supporting Information, Figure S4). There are more decoys with higher scores in the one-genome search, indicating that higher score thresholds are necessary to maintain confidence with a controlled false discovery rate. Next, we asked if the same spectral evidence was uniquely weighted by the mass-offset expansion of search space. To do this, we parsed the library spectra correctly identified in the one-genome search and took the difference of the library search PeptideProphet Probability score from the one-genome search score (Figure 2B).32 Although the plurality of correctly identified spectra received the same score in both searches, many spectra received a lower score in the single-genome search. This indicates that mass-offset peptides are disadvantaged in the scoring algorithm. Together, these results demonstrate that AAS peptide-spectra score more poorly and require higher thresholds for successful sequence identification via mass-offset.
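
The link between high-scoring decoys and the score cutoff can be illustrated with a generic target-decoy estimate. This is a textbook-style sketch, not the procedure implemented in Philosopher or PeptideProphet; the function name and the toy scores are hypothetical, and tied scores are not given special handling.

```python
def score_threshold(psms, max_fdr=0.01):
    """Lowest score cutoff at which decoys/targets stays within max_fdr.

    psms: iterable of (score, is_decoy) pairs.
    Returns None if no cutoff satisfies the requested FDR.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    cutoff = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR among all PSMs scoring at least `score`.
        if targets > 0 and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff


if __name__ == "__main__":
    narrow = [(10, False), (9, False), (8, False), (7, True), (6, False)]
    # Same targets, plus one decoy that scores well in a larger search space.
    wide = narrow + [(9.5, True)]
    print(score_threshold(narrow, max_fdr=0.25))
    print(score_threshold(wide, max_fdr=0.25))
```

In this toy case a single additional high-scoring decoy moves the cutoff from the lowest target score to the highest one, which mirrors the qualitative effect described above when the search space is expanded by mass-offsets.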

We next sought to determine the driving factors that distinguish neutral- and score-disadvantaged substitution identifications. Because a secondary factor of scoring is MS2 fragment ion signal intensity, we suspected peptide abundance may distinguish AASs that were not successfully identified. Co-eluting peptides provide competitive ions that may mask the identification of low-abundance peptides. To determine if AAS identification efficiency was uniquely affected by stoichiometry, we diluted S. typhimurium lysate 2-fold against a constant background of E. coli lysate (Figure 2C,D). We calculated the dynamic range for all S. typhimurium peptides as the log difference between the maximum and minimum observed precursor intensity per peptide sequence across all dilutions. We observed a similar dynamic range between S. typhimurium AAS-representing peptides discovered in the single-genome search and all S. typhimurium peptides identified in the library search (Figure 2D). We next asked if substitutions had a unique lower limit of identification or a stoichiometry-dependent limit of identification. We determined the lower limit of abundance for peptide identification by plotting the intensity distribution of peptides at their last identification in the dilution series (Figure 2C). The minimum detected S. typhimurium peptide intensity distribution was similar across decreasing stoichiometry. The similar dynamic range and lower limit of identification imply that identification of substituted peptide-spectra is limited by conditions that reduce identification of all spectra and is not uniquely disadvantaged by abundance or stoichiometry.
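
The per-peptide dynamic range defined in this paragraph can be sketched as follows. The base of the logarithm is not stated in the text, so base 10 is an assumption here, as are the function name and the input layout.

```python
import math
from collections import defaultdict


def dynamic_range(observations):
    """Per-peptide log10(max intensity) - log10(min intensity).

    observations: iterable of (peptide_sequence, precursor_intensity)
    pooled across all dilutions. Zero or negative intensities are
    skipped because they represent unquantified precursors.
    """
    by_peptide = defaultdict(list)
    for peptide, intensity in observations:
        if intensity > 0:
            by_peptide[peptide].append(intensity)
    return {
        peptide: math.log10(max(values)) - math.log10(min(values))
        for peptide, values in by_peptide.items()
    }


if __name__ == "__main__":
    demo = [
        ("PEPTIDEK", 1e6),  # undiluted sample
        ("PEPTIDEK", 1e4),  # later dilution
        ("AAK", 500.0),     # seen once, so the range is zero
        ("AAK", 0.0),       # unquantified, ignored
    ]
    print(dynamic_range(demo))
```

A peptide observed at only one dilution has a range of zero under this definition, so comparing distributions between AAS peptides and all peptides implicitly also compares how many dilution steps each group survives.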

Not All Substitution Types are Detected Efficiently

We investigated if simply enumerating the mass-offset was sufficient to identify each substitution type. The one-genome search did not identify any spectra of 45 substitution types, though most (32 substitutions) had a low number of sample spectra (n < 50) in our library (Figures 3A, S3). For example, no spectra representing a substitution from Cys were successfully identified. Substitution types with no PSMs in the one-genome search but many in the library include substitutions that are isobaric with common modifications, such as T → A (1000 PSMs), D → Q (621 PSMs), and Q → D (529 PSMs). Identified substitution types had a broad range of efficiencies from 3.6% (R → I/L) to 100% (H → M) (Figure 3A). Likewise, many library spectra representing a substitution involving Arg or Lys had below-average identification efficiency in the one-genome search. We conclude that mass-offset identification is generally suitable for global substitution identification but that special care should be taken for the identification of specific substitution types.

Figure 3.

Efficiency of library spectra identification demonstrates categorical bias in AAS identification. (A) AAS library spectra identification efficiency by substitution type, comparing the genome-anticipated E. coli aa to the aa present in the S. typhimurium peptide. Substitution types absent in the library (gray) include aa highly conserved in bacteria such as Cys, Phe, and Trp. Substitutions involving Arg, Cys, or Lys had a below-average identification efficiency. (B) AAS library spectra identification efficiency by substitution position and peptide length. Substitution position and peptide length combinations not represented in our library are shown in gray. Substitution position 1 represents the N-terminal aa; the y = x position represents the C-terminal aa. Substitutions in the middle of moderately sized peptides had the highest identification efficiency. In contrast, small peptides (<9 aa), large peptides (>25 aa), and substitutions at the b2, y1, or y2 positions had poor identification efficiency.

Specific Fragment Ions are Required for Confident Assignment of Some Substituted Peptide Sequences

We next looked for unique characteristics of the library spectra not correctly identified in a single-genome search. The distributions of peptide intensity, retention time, ion mobility, shift in retention, or shift in ion mobility from the cognate peptide were similar between successfully identified library spectra (Supporting Information, Figure S2) and incorrectly or unidentified library spectra (Supporting Information, Figure S5). The remaining characteristics of a spectrum, fragment ions, and their intensities are uniquely weighted during AAS discovery. Lacking a priori sequence knowledge imposes a burden of proof on specific fragment ions diagnostic of peptide modification. We identified two scenarios where this burden of proof decreased AAS identification efficiency: substitutions near the peptide terminus and ambiguous localization of isobaric mass-offsets.

Each amino acid in a peptide contributes to the mass of multiple fragment ions. For example, the N-terminal residue is part of the mass balance of every b ion generated in MS2 but does not contribute to any y ions. AASs near either terminus would generate few complementary and modified b/y ion pairs that provide strong evidence for sequence assignment. Additionally, the number of ions potentially diagnostic for a substitution increases with the peptide length. We hypothesized that these constraints would introduce position and length biases in single-genome search AAS peptide identification. To determine if substitution location influenced the identification of AAS spectra, we plotted the efficiency of library spectra identification by length and substitution position (Figure 3B; the number of representative spectra is presented in Supporting Information, Figure S6, and individual distributions of identification efficiency by peptide length or substitution position are presented in Supporting Information, Figure S7). We found that short (<9 aa) or long (>23 aa) peptides, as well as substitutions at the b2, y1, or y2 positions, were identified at below-average efficiency. N-terminal substitutions were robustly identified compared to the average success rate. This may be caused by the software’s arbitrary assignment of the substitution position to the aa closest to the N-terminus when the precise residue cannot be determined. The C-terminal substitutions that we identified were limited to the swap of lysine and arginine. Other C-terminal peptide substitutions would lack a protease cleavage site. Substitutions at these positions in the protein sequence were identified in the longer peptide with a missed protease cleavage and a substitution at the missed cleavage residue (see the section on Scissor Substitutions below).
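
The positional argument can be made concrete by counting which backbone fragments carry a given residue. The sketch below assumes the usual b1 to b(n-1) and y1 to y(n-1) series for a peptide of n residues; the function name is hypothetical.

```python
def fragments_containing(position: int, length: int):
    """Count the b and y ions whose mass includes the residue at `position`.

    position is 1-based from the N-terminus; the b and y series are
    assumed to run from 1 to length - 1.
    Returns (number_of_b_ions, number_of_y_ions).
    """
    if not 1 <= position <= length:
        raise ValueError("position must lie within the peptide")
    n_b = length - position  # b_position ... b_(length-1)
    n_y = position - 1       # y_(length-position+1) ... y_(length-1)
    return n_b, n_y


if __name__ == "__main__":
    for pos in (1, 2, 7, 13, 14):
        print(pos, fragments_containing(pos, 14))
```

For a 14-residue peptide, a substitution at position 1 shifts all 13 b ions and no y ions, whereas a substitution at position 7 shifts ions in both series, so unshifted and shifted fragments bracket the modified residue from either side.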

Isobaric substitutions also require specific diagnostic ions for proper localization and identification. For instance, a G → A substitution may be confused with a S → T, D → E, N → Q, or V → I/L substitution, as each represents a mass-offset of 14.016 Da. If both G and S are in the peptide sequence, unambiguous identification requires intensity from the y or b ions between these residues (Supporting Information, Figure S8A). These ambiguous localizations of isobaric mass-offsets account for many of the 3054 spectra assigned an incorrectly substituted sequence. To better understand the mass degeneracy within AASs, we plotted the number of substitution mass shifts within ±0.02 Da for each substitution type (Supporting Information, Figure S8B). We found that 190 out of 342 substitution types were isobaric with at least one other substitution, while some had up to five unique substitutions representing the same change in mass. Surprisingly, many multiply degenerate substitutions had average or better library spectra identification efficiency (compare Figure 3A and Supporting Information, Figure S8B), likely due to strong localization of the mass-offset. Additionally, the mass shift of each single substitution is within ±0.02 Da of the mass shift from multiple combinations of two substitutions. These also require specific fragment ions to unambiguously identify when both single- and double-substitution peptides of similar mass are known to be present. In our library, only 72 spectra corresponding to five peptides fall into this category and were assigned the sequence representing the single substitution. These were identified with near-average efficiency (23/72 or 31.9%) in the one-genome search. Mass-ambiguous substitutions suggest the application of other dimensions of data, such as retention time or ion mobility, for confident identification.
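
The mass degeneracy discussed here can be explored with a short sketch. It uses standard monoisotopic masses of unmodified residues and merges Ile/Leu into a single entry, which gives 19 × 18 = 342 ordered substitution types. The function names are hypothetical, and because the result depends on the mass table (for example, whether Cys is carbamidomethylated) and on the tolerance, the sketch is not expected to reproduce the reported counts exactly.

```python
from itertools import permutations

# Monoisotopic residue masses (Da); "L" stands for the isobaric pair I/L.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def substitution_deltas():
    """Mass shift (Da) for every ordered pair (original, replacement)."""
    return {
        (a, b): RESIDUE_MASS[b] - RESIDUE_MASS[a]
        for a, b in permutations(RESIDUE_MASS, 2)
    }


def isobaric_partners(tolerance=0.02):
    """For each substitution, list the other substitutions whose mass
    shift lies within `tolerance` Da."""
    deltas = substitution_deltas()
    return {
        sub: sorted(
            other for other, d in deltas.items()
            if other != sub and abs(d - delta) <= tolerance
        )
        for sub, delta in deltas.items()
    }


if __name__ == "__main__":
    partners = isobaric_partners()
    print(len(partners), "substitution types")
    print("G->A shares its mass shift with:", partners[("G", "A")])
```

Running the script lists S → T, D → E, N → Q, and V → I/L among the partners of G → A, matching the 14.016 Da example in the text.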

Scissor Substitutions Conflict with How Peptides are Predicted from Protein Sequences

Substitutions that add or remove a protease cleavage site, such as a substitution to or from lysine in a tryptic digest, result in peptides of different lengths and sequences than their genomic cognates. We call these “scissor substitutions” and hypothesize that their detection is disadvantaged by how search database peptides are predicted from proteins. Genome-defined protease cleavage motifs are identified before identifying spectra, and peptide modifications are determined without reconsideration of peptide cleavage. For scissor substitutions that lead to the loss of a cutsite, spectrum identification requires the consideration of a missed cleavage in silico for an anticipated, modified peptide to match the physical peptide (Figure 4A). An example annotated PSM and extracted ion current for this substitution type is provided (Supporting Information, Figure S9). In contrast, scissor substitutions that introduce a new cutsite cause a discrepancy between the shorter cleaved peptides in situ and the longer in silico anticipated sequence, with no additional consideration currently available to adjust software expectations (Figure 4B). In agreement with this logic, we found that the library spectra identification efficiency for spectra representing the gain or loss of a cutsite was below average (Figure 4C). There is a marked distinction between the identification of substitutions resulting in the loss of a cutsite (24.6%) and the gain of a cutsite (16.7%, Figure 4C). To ensure that poor identification of library spectra representing the gain of a protease cleavage motif was independent of the representative S. typhimurium peptide contexts, we performed a reciprocal search to discover E. coli PSMs using only the S. typhimurium genome. As expected, we again observed below-average identification efficiency of substitutions that remove a cutsite (27.5%) and dramatically low efficiency for substitutions that introduce a cutsite (11.1%, Figure 4C).
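
The two scissor cases can be illustrated with a simplified in silico digest. The sketch assumes trypsin cleaves after every K or R, ignores the proline rule and missed cleavages, and uses a made-up protein sequence; all function names are hypothetical.

```python
def tryptic_digest(protein: str):
    """Split a protein after every K or R (simplified trypsin rule)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def classify_scissor(original_aa: str, new_aa: str) -> str:
    """Label a substitution by its effect on tryptic cleavage sites."""
    was_site = original_aa in "KR"
    is_site = new_aa in "KR"
    if was_site and not is_site:
        return "loss of cut site"
    if is_site and not was_site:
        return "gain of cut site"
    return "not a scissor substitution"


def substitute(protein: str, position: int, new_aa: str) -> str:
    """Replace the residue at a 1-based position."""
    return protein[:position - 1] + new_aa + protein[position:]


if __name__ == "__main__":
    reference = "AGSTLKMNPQER"  # hypothetical sequence
    print(tryptic_digest(reference))
    # K6 -> T removes a site: the physical peptide spans two reference peptides.
    print(tryptic_digest(substitute(reference, 6, "T")))
    # S3 -> K adds a site: the physical peptides are shorter than predicted.
    print(tryptic_digest(substitute(reference, 3, "K")))
```

In the loss case, the physical peptide equals a reference peptide with one missed cleavage plus a mass-offset, which a search engine can consider. In the gain case, none of the physical peptides corresponds to a full-length reference peptide, so a digest performed once, before modifications are considered, never proposes them.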

Figure 4.


Identification of scissor substitution peptide-spectra is disadvantaged by peptide search engine logic. Scissor substitutions, which add or remove a cutsite for the protease used to generate peptides, are disadvantaged by singular digestion of proteins in silico. (A) Scissor substitutions that remove a protease cleavage site result in a physical peptide that matches an in silico missed cleavage peptide (red), a scenario commonly considered in search software. (B) In contrast, scissor substitutions that add a cutsite result in a physically cleaved peptide. Peptide prediction from protein sequences typically occurs before the annotation of identified substitutions, resulting in an undigested in silico peptide and no detection of the expected physical peptides. (C) Identification efficiency of AAS PSMs for each class of scissor substitutions. Very few scissor-substitution-representing S. typhimurium peptides (left) that gain a new tryptic site (white hatched bars) were identified in our search using only the E. coli genome. As expected, this pattern was reciprocated when E. coli peptides representing a scissor substitution (right) were identified using only the S. typhimurium genome, implicating the peptide identification search logic.

Discussion

Applying our mixed organism ground-truth library to evaluate the identification of substituted peptide-spectra using one-genome and the mass-offset search strategy revealed the global and categorical efficiency of software analysis, which should inform both software and experimental adjustments for improved sensitivity. In our work, we set an initial benchmark of 64% substituted sequence coverage for a complex and unfractionated data set. Achieving increased proteome depth via in-solution or gas-phase fractionation is likely to improve library peptide identification efficiency. Our library of substituted peptides behaved like other peptides with no unique limits of abundance and stoichiometry on their identification. The mass spectrometer is unaware of ion identity when isolating and fragmenting peptides. Furthermore, substituted peptides occupy a distinct analytical space of mass, retention time, and ion mobility from those of their genomic cognate peptides. Thus, the problem of abundance and matrix competition is not solely cognate driven but rather based on all proteomic interference. Based on the observed frequency of substitutions in other works, we expect many substitutions to have low abundance and signal intensity relative to the proteome.7,21 Therefore, approaches such as fractionation, known to improve dynamic range and sensitivity, should be employed.33–36

From these data, the biggest limitation of AAS identification was the PSM score. The modal fates of incorrectly identified library spectra were those not confidently assigned a sequence (34%), followed by spectra with enough evidence for the cognate sequence but not for a modification indicative of a substitution (15%). Selective score boosting of bona fide AAS spectra could be accomplished by application of other dimensions of the data. There are existing algorithms for prediction of retention time and ion mobility based on peptide sequence.37,38 Likewise, there are scoring models that can account for changes to retention time, such as Percolator.39 In addition to boosting the score of bona fide AAS spectra, these approaches would also decrease confidence in decoy sequences matching to spectra based on mass alone. Combining these data and existing or novel analysis tools may provide better score-based resolution of true AAS PSMs, other peptide sequence assignment hypotheses, and decoy AAS PSMs.

The categories of peptide-spectra misassignment described here delineate other competing hypotheses of spectra-peptide sequence assignment. While not designed to be an exhaustive list, these categories captured all of the misassigned spectra in our experiment. The two modal categories, unmodified and incorrect substitution assignments, likely reflect the specific fragment-ion bias found in our negative results. Both the length of peptide and the relative position of the substitution can have a dramatic effect on the number of potential ions that unambiguously identify the aa sequence, as opposed to other ions that support all of the similar sequence hypotheses. The additional dimensions of mass spectrometry data may alleviate the additional burden of proof on specific fragment ions. While the observed ion mobility shifts between AAS peptides in our work were small, it has been demonstrated that these shifts are larger near the peptide termini and may boost the identification efficiency of substitutions at the b2 or y2 positions.38 Alternatively, substituted peptide sequence coverage can be improved by splitting a sample and digesting each portion with complementary proteases. The new set of peptides in the parallel digest will have unique lengths and relative substitution positions, providing additional independent evidence of a substitution.
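The dependence on peptide length and substitution position can be made concrete with a simple counting sketch. It assumes an idealized, complete b- and y-ion series and is not the analysis used in this work; the function names are hypothetical. It lists the fragment ions whose mass changes when one residue is substituted, and the subset that brackets the site and so distinguishes it from the same mass shift on a neighboring residue.

```python
def shifted_fragments(length, position):
    """b- and y-ions containing the residue at `position` (1-based from the N-terminus).

    b_k spans residues 1..k and y_k spans residues (length-k+1)..length,
    with k running from 1 to length-1.
    """
    b_ions = [f"b{k}" for k in range(position, length)]
    y_ions = [f"y{k}" for k in range(length - position + 1, length)]
    return b_ions, y_ions


def localizing_fragments(length, position):
    """Ions that separate a shift at `position` from a shift on an adjacent residue."""
    candidates = [
        ("b", position - 1),           # excludes the site, includes the left neighbor
        ("b", position),               # includes the site, excludes the right neighbor
        ("y", length - position),      # excludes the site, includes the right neighbor
        ("y", length - position + 1),  # includes the site, excludes the left neighbor
    ]
    return [f"{series}{k}" for series, k in candidates if 1 <= k <= length - 1]


for length, position in [(10, 5), (10, 1), (6, 6)]:
    b_ions, y_ions = shifted_fragments(length, position)
    print(length, position, len(b_ions) + len(y_ions),
          localizing_fragments(length, position))
```

Under these assumptions, the number of mass-shifted ions is always the peptide length minus one, so short peptides offer less evidence overall. A substitution at either terminus has only two bracketing ions instead of four, which is consistent with the poor identification efficiency reported for terminal positions.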

We found that the protease used to generate peptides also defines a subset of substitutions that we term scissor substitutions, which remove or introduce a protease cleavage motif. Scissor substitutions present a significant order-of-operations challenge that is easy to identify in hindsight but difficult to address a priori. Aligning software expectations to physical peptides that lose a cutsite is already possible in virtually every search engine by in silico missed cleavage. Despite this logical alignment and despite not being located near the peptide termini, these library spectra were still identified with below-average efficiency. We do not have a solution to align software expectations for substituted peptides that introduce a cleavage site. The addition of modification annotation and second cleavage would recover spectra identifications only for peptide sequences already identified by other spectra. Some of these are explained by stochastic missed cleavages in vitro. It is much easier to compensate for scissor substitutions during peptide preparation by splitting a sample to be digested by two complementary proteases. Poor substitution coverage at the cleavage motifs of one enzyme would be bolstered by improved coverage in the other digest.

Our ground-truth positive control provides a template for the evaluation of diverse global identification strategies for AASs against the current gold standard of a database-driven search. This approach demonstrates for the first time fundamental factors that uniquely limit the identification of substituted peptides, namely, PSM score, substitution type, and a specific fragment-ion burden. These limitations suggest maximizing peptide sensitivity and splitting a sample for digestion with complementary proteases for improved substituted peptide sequence coverage. Significant work remains for confidence in missing values, where a targeted approach is advisable. Shotgun proteomics is thus a promising tool for positive hypothesis testing regarding the global identification of AASs.

Acknowledgments

We thank Dr. Aleksi Nesvizhskii and Dr. Fengchao Yu for assistance with using FragPipe, Dr. Jim Slauch for the kind gift of Salmonella, Daniel Hu for advice, Dr. Patricia Champion for critical reading of the manuscript, and Dr. Bill Boggess in the Notre Dame Mass Spectrometry and Proteomics Facility for technical advice and assistance. P.L.C. acknowledges support from NIH award DP1 GM146256. M.M.C. acknowledges support from R01GM139277. T.J.L. acknowledges support from NIH training grant T32GM075762.

Supporting Information Available

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jproteome.3c00730.

  • Similarity between the tryptic peptides of S. typhimurium and E. coli; physiochemical properties of the positive library; representation of substitution types in the positive library; expansion of the search space increases score threshold for confident sequence assignment; physiochemical properties of library spectra un- or in-correctly identified in the one-genome search; representation of substitution positions in the positive library; individual distributions of identification efficiency of substitutions by peptide length or substitution position; mass-ambiguity of substitutions; evidence of a scissor substitution; number of in silico tryptic peptides representing 0, 1, 2, or 2+ different amino acids from any peptide in the other organism; and common modifications included in the mass-offset list (PDF)

Author Contributions

M.M.C., P.L.C., and T.J.L. designed the research; T.J.L. performed the research; M.M.C. and T.J.L. performed data analysis; M.M.C., P.L.C., and T.J.L. performed data interpretation; M.M.C. and T.J.L. contributed new analytical tools; and T.J.L., M.M.C., and P.L.C. designed the figures and wrote the paper.

The authors declare no competing financial interest.


References

  1. Drummond D. A.; Wilke C. O. Mistranslation-Induced Protein Misfolding as a Dominant Constraint on Coding-Sequence Evolution. Cell 2008, 134 (2), 341–352. 10.1016/j.cell.2008.05.042.
  2. Dai D.; Wang H.; Zhu L.; Jin H.; Wang X. N6-Methyladenosine Links RNA Metabolism to Cancer Progression. Cell Death Dis. 2018, 9 (2), 124–213. 10.1038/s41419-017-0129-x.
  3. Liu L.; Wang Y.; Wu J.; Liu J.; Qin Z.; Fan H. N6-Methyladenosine: A Potential Breakthrough for Human Cancer. Mol. Ther. Nucleic Acids 2020, 19, 804–813. 10.1016/j.omtn.2019.12.013.
  4. Lamichhane T. N.; Mattijssen S.; Maraia R. J. Human Cells Have a Limited Set of tRNA Anticodon Loop Substrates of the tRNA Isopentenyltransferase TRIT1 Tumor Suppressor. Mol. Cell. Biol. 2013, 33 (24), 4900–4908. 10.1128/MCB.01041-13.
  5. Delaunay S.; Rapino F.; Tharun L.; Zhou Z.; Heukamp L.; Termathe M.; Shostak K.; Klevernic I.; Florin A.; Desmecht H.; Desmet C. J.; Nguyen L.; Leidel S. A.; Willis A. E.; Büttner R.; Chariot A.; Close P. Elp3 Links tRNA Modification to IRES-Dependent Translation of LEF1 to Sustain Metastasis in Breast Cancer. J. Exp. Med. 2016, 213 (11), 2503–2523. 10.1084/jem.20160397.
  6. Wohlgemuth I.; Garofalo R.; Samatova E.; Günenç A. N.; Lenz C.; Urlaub H.; Rodnina M. V. Translation Error Clusters Induced by Aminoglycoside Antibiotics. Nat. Commun. 2021, 12 (1), 1830. 10.1038/s41467-021-21942-6.
  7. Mordret E.; Dahan O.; Asraf O.; Rak R.; Yehonadav A.; Barnabas G. D.; Cox J.; Geiger T.; Lindner A. B.; Pilpel Y. Systematic Detection of Amino Acid Substitutions in Proteomes Reveals Mechanistic Basis of Ribosome Errors and Selection for Translation Fidelity. Mol. Cell 2019, 75 (3), 427.e5–441.e5. 10.1016/j.molcel.2019.06.041.
  8. Zaher H. S.; Green R. Fidelity at the Molecular Level: Lessons from Protein Synthesis. Cell 2009, 136 (4), 746–762. 10.1016/j.cell.2009.01.036.
  9. Wilson K. A.; Bar S.; Kapahi P. Haste Makes Waste: The Significance of Translation Fidelity for Development and Longevity. Mol. Cell 2021, 81 (18), 3675–3676. 10.1016/j.molcel.2021.08.036.
  10. Lesur A.; Schmit P.-O.; Bernardin F.; Letellier E.; Brehmer S.; Decker J.; Dittmar G. Highly Multiplexed Targeted Proteomics Acquisition on a TIMS-QTOF. Anal. Chem. 2021, 93 (3), 1383–1392. 10.1021/acs.analchem.0c03180.
  11. Yu F.; Haynes S. E.; Teo G. C.; Avtonomov D. M.; Polasky D. A.; Nesvizhskii A. I. Fast Quantitative Analysis of timsTOF PASEF Data with MSFragger and IonQuant. Mol. Cell. Proteomics 2020, 19 (9), 1575–1585. 10.1074/mcp.TIR120.002048.
  12. Miller R. M.; Smith L. M. Overview and Considerations in Bottom-up Proteomics. Analyst 2023, 148 (3), 475–486. 10.1039/D2AN01246D.
  13. Tang W. H.; Halpern B. R.; Shilov I. V.; Seymour S. L.; Keating S. P.; Loboda A.; Patel A. A.; Schaeffer D. A.; Nuwaysir L. M. Discovering Known and Unanticipated Protein Modifications Using MS/MS Database Searching. Anal. Chem. 2005, 77 (13), 3931–3946. 10.1021/ac0481046.
  14. Hansen B. T.; Davey S. W.; Ham A.-J. L.; Liebler D. C. P-Mod: An Algorithm and Software to Map Modifications to Peptide Sequences Using Tandem MS Data. J. Proteome Res. 2005, 4 (2), 358–368. 10.1021/pr0498234.
  15. Tran N. H.; Qiao R.; Xin L.; Chen X.; Liu C.; Zhang X.; Shan B.; Ghodsi A.; Li M. Deep Learning Enables de Novo Peptide Sequencing from Data-Independent-Acquisition Mass Spectrometry. Nat. Methods 2019, 16 (1), 63–66. 10.1038/s41592-018-0260-3.
  16. Muth T.; Hartkopf F.; Vaudel M.; Renard B. Y. A Potential Golden Age to Come—Current Tools, Recent Use Cases, and Future Avenues for De Novo Sequencing in Proteomics. Proteomics. 2018, 18 (18), 1700150. 10.1002/pmic.201700150.
  17. Ma B.; Johnson R. De Novo Sequencing and Homology Searching. Mol. Cell. Proteomics 2012, 11 (2), O111.014902. 10.1074/mcp.O111.014902.
  18. Tan Z.; Zhu J.; Stemmer P. M.; Sun L.; Yang Z.; Schultz K.; Gaffrey M. J.; Cesnik A. J.; Yi X.; Hao X.; Shortreed M. R.; Shi T.; Lubman D. M. Comprehensive Detection of Single Amino Acid Variants and Evaluation of Their Deleterious Potential in a PANC-1 Cell Line. J. Proteome Res. 2020, 19 (4), 1635–1646. 10.1021/acs.jproteome.9b00840.
  19. Moshkovskii S. A.; Ivanov M. V.; Kuznetsova K. G.; Gorshkov M. V. Identification of Single Amino Acid Substitutions in Proteogenomics. Biochemistry 2018, 83 (3), 250–258. 10.1134/S0006297918030057.
  20. Zhang B.; Wang J.; Wang X.; Zhu J.; Liu Q.; Shi Z.; Chambers M. C.; Zimmerman L. J.; Shaddox K. F.; Kim S.; Davies S. R.; Wang S.; Wang P.; Kinsinger C. R.; Rivers R. C.; Rodriguez H.; Townsend R. R.; Ellis M. J. C.; Carr S. A.; Tabb D. L.; Coffey R. J.; Slebos R. J. C.; Liebler D. C. Proteogenomic Characterization of Human Colon and Rectal Cancer. Nature 2014, 513 (7518), 382–387. 10.1038/nature13438.
  21. Garofalo R.; Wohlgemuth I.; Pearson M.; Lenz C.; Urlaub H.; Rodnina M. V. Broad Range of Missense Error Frequencies in Cellular Proteins. Nucleic Acids Res. 2019, 47 (6), 2932–2945. 10.1093/nar/gky1319.
  22. Han Y.; Ma B.; Zhang K. SPIDER: Software for Protein Identification from Sequence Tags with de Novo Sequencing Error. J. Bioinf. Comput. Biol. 2005, 03 (03), 697–716. 10.1142/s0219720005001247.
  23. Elias J. E.; Gygi S. P. Target-Decoy Search Strategy for Mass Spectrometry-Based Proteomics. Methods Mol. Biol. 2010, 604, 55–71. 10.1007/978-1-60761-444-9_5.
  24. Silva A. S. C.; Bouwmeester R.; Martens L.; Degroeve S. Accurate Peptide Fragmentation Predictions Allow Data Driven Approaches to Replace and Improve upon Proteomics Search Engine Scoring Functions. Bioinformatics 2019, 35 (24), 5243–5248. 10.1093/bioinformatics/btz383.
  25. Gray T. A.; Clark R. R.; Boucher N.; Lapierre P.; Smith C.; Derbyshire K. M. Intercellular Communication and Conjugation Are Mediated by ESX Secretion Systems in Mycobacteria. Science 2016, 354 (6310), 347–350. 10.1126/science.aag0828.
  26. Blattner F. R.; Plunkett G.; Bloch C. A.; Perna N. T.; Burland V.; Riley M.; Collado-Vides J.; Glasner J. D.; Rode C. K.; Mayhew G. F.; Gregor J.; Davis N. W.; Kirkpatrick H. A.; Goeden M. A.; Rose D. J.; Mau B.; Shao Y. The Complete Genome Sequence of Escherichia Coli K-12. Science 1997, 277 (5331), 1453–1462. 10.1126/science.277.5331.1453.
  27. McClelland M.; Sanderson K. E.; Spieth J.; Clifton S. W.; Latreille P.; Courtney L.; Porwollik S.; Ali J.; Dante M.; Du F.; Hou S.; Layman D.; Leonard S.; Nguyen C.; Scott K.; Holmes A.; Grewal N.; Mulvaney E.; Ryan E.; Sun H.; Florea L.; Miller W.; Stoneking T.; Nhan M.; Waterston R.; Wilson R. K. Complete Genome Sequence of Salmonella Enterica Serovar Typhimurium LT2. Nature 2001, 413 (6858), 852–856. 10.1038/35101614.
  28. Yu F.; Teo G. C.; Kong A. T.; Haynes S. E.; Avtonomov D. M.; Geiszler D. J.; Nesvizhskii A. I. Identification of Modified Peptides Using Localization-Aware Open Search. Nat. Commun. 2020, 11 (1), 4065. 10.1038/s41467-020-17921-y.
  29. Miller R. M.; Ibrahim K.; Smith L. M. ProteaseGuru: A Tool for Protease Selection in Bottom-Up Proteomics. J. Proteome Res. 2021, 20 (4), 1936–1942. 10.1021/acs.jproteome.0c00954.
  30. Deutsch E. W.; Lam H.; Aebersold R. PeptideAtlas: A Resource for Target Selection for Emerging Targeted Proteomics Workflows. EMBO Rep. 2008, 9 (5), 429–434. 10.1038/embor.2008.56.
  31. Kusebauch U.; Deutsch E. W.; Campbell D. S.; Sun Z.; Farrah T.; Moritz R. L. Using PeptideAtlas, SRMAtlas, and PASSEL: Comprehensive Resources for Discovery and Targeted Proteomics. Curr. Protoc. Bioinf. 2014, 46 (1), 13.25.1–13.25.28. 10.1002/0471250953.bi1325s46.
  32. Ma K.; Vitek O.; Nesvizhskii A. I. A Statistical Model-Building Perspective to Identification of MS/MS Spectra with PeptideProphet. BMC Bioinf. 2012, 13 (S16), S1. 10.1186/1471-2105-13-S16-S1.
  33. Guergues J.; Wohlfahrt J.; Stevens S. M. Enhancement of Proteome Coverage by Ion Mobility Fractionation Coupled to PASEF on a TIMS-QTOF Instrument. J. Proteome Res. 2022, 21 (8), 2036–2044. 10.1021/acs.jproteome.2c00336.
  34. Ye X.; Tang J.; Mao Y.; Lu X.; Yang Y.; Chen W.; Zhang X.; Xu R.; Tian R. Integrated Proteomics Sample Preparation and Fractionation: Method Development and Applications. Trac. Trends Anal. Chem. 2019, 120, 115667. 10.1016/j.trac.2019.115667.
  35. Weaver S. D.; Schuster-Little N.; Whelan R. J. Preparative Capillary Electrophoresis (CE) Fractionation of Protein Digests Improves Protein and Peptide Identification in Bottom-up Proteomics. Anal. Methods 2022, 14 (11), 1103–1110. 10.1039/D1AY02145A.
  36. Yang F.; Shen Y.; Camp D. G.; Smith R. D. High-pH Reversed-Phase Chromatography with Fraction Concatenation for 2D Proteomic Analysis. Expert Rev. Proteomics 2012, 9 (2), 129–134. 10.1586/epr.12.15.
  37. Bouwmeester R.; Gabriels R.; Hulstaert N.; Martens L.; Degroeve S. DeepLC Can Predict Retention Times for Peptides That Carry As-yet Unseen Modifications. Nat. Methods 2021, 18 (11), 1363–1369. 10.1038/s41592-021-01301-5.
  38. Meier F.; Köhler N. D.; Brunner A.-D.; Wanka J.-M. H.; Voytik E.; Strauss M. T.; Theis F. J.; Mann M. Deep Learning the Collisional Cross Sections of the Peptide Universe from a Million Experimental Values. Nat. Commun. 2021, 12 (1), 1185. 10.1038/s41467-021-21352-8.
  39. The M.; MacCoss M. J.; Noble W. S.; Käll L. Fast and Accurate Protein False Discovery Rates on Large-Scale Proteomics Data Sets with Percolator 3.0. J. Am. Soc. Mass Spectrom. 2016, 27 (11), 1719–1727. 10.1007/s13361-016-1460-7.


Data Availability Statement

Data used to generate figures are available at https://github.com/ChampionLab/substitutionannotation. Raw spectra outputs are available at the MassIVE repository (doi:10.25345/C5416T88R).


Articles from Journal of Proteome Research are provided here courtesy of American Chemical Society
