Bioinformatics. 2025 Aug 11;41(8):btaf450. doi: 10.1093/bioinformatics/btaf450

IsoBayes: a Bayesian approach for single-isoform proteomics inference

Jordy Bollon 1,2, Michael R Shortreed 3, Erin Jeffery 4, Ben T Jordan 5, Rachel Miller 6, Andrea Cavalli 7,8, Lloyd M Smith 9, Colin N Dewey 10, Gloria M Sheynkman 11, Simone Tiberi 12
Editor: Janet Kelso
PMCID: PMC12377908  PMID: 40796134

Abstract

Motivation

Studying protein isoforms is an essential step in biomedical research; at present, the main approach for analyzing proteins is bottom-up mass spectrometry proteomics, which returns peptide identifications that are indirectly used to infer the presence of protein isoforms. However, the detection and quantification processes are noisy; in particular, peptides may be erroneously detected, and most peptides, known as shared peptides, are associated to multiple protein isoforms. As a consequence, studying individual protein isoforms is challenging, and inferred protein results are often abstracted to the gene-level or to groups of protein isoforms.

Results

Here, we introduce IsoBayes, a novel statistical method to perform inference at the isoform level. Our method enhances the information available by integrating mass spectrometry proteomics and transcriptomics data in a Bayesian probabilistic framework. To account for the uncertainty in the measurement process, we propose a two-layer latent variable approach: first, we sample if a peptide has been correctly detected (or, alternatively, filter peptides); second, we allocate the abundance of such selected peptides across the protein(s) they are compatible with. This enables us, starting from peptide-level data, to recover protein-level data; in particular, we: (i) infer the presence/absence of each protein isoform (via a posterior probability), (ii) estimate its abundance (and credible interval), and (iii) target isoforms where transcript and protein relative abundances significantly differ. We benchmarked our approach in simulations, and in two multi-protease real datasets: our method displays good sensitivity and specificity when detecting protein isoforms, its estimated abundances correlate highly with the ground truth, and it can detect changes between protein and transcript relative abundances.

Availability and implementation

IsoBayes is freely distributed as a Bioconductor R package, and is accompanied by an example usage vignette.

1 Introduction

Post-transcriptional regulatory mechanisms, such as alternative splicing or alternative promoter usage, allow single genes to code for multiple isoforms. For example, in humans, it was estimated that approximately 20 000 genes may give rise to over 300 000 protein isoforms (Deveson et al. 2018, Willyard 2018). Characterization of protein isoform diversity, which includes the identification of physiologically relevant protein isoforms, as well as disease-associated aberrant splicing, is a crucial step in biomedical research. At present, the main strategy to infer proteins is via bottom-up mass spectrometry (MS) proteomics, where proteins are indirectly measured via peptides, which act as surrogate markers for their protein(s) of origin. Due to the high degree of sequence similarity between isoform sequences, most peptides, called shared peptides, are compatible with multiple protein isoforms; furthermore, the identification of peptides is a noisy process which can result in erroneous detections (Nesvizhskii and Aebersold 2005). Therefore, inference at the isoform level is challenging, and results are often abstracted to the gene level, or to groups of multiple protein isoforms.

A few methods have been developed to perform inference at the isoform level; notably: ProteinProphet (Nesvizhskii et al. 2003), one of the first probabilistic methods for protein identification, which infers protein isoforms via an expectation-maximization algorithm based on a bipartite graph of peptides and associated proteins; Fido (Serang et al. 2010), a Bayesian network model that, via graph-transforming algorithms, groups and scores proteins based on peptide-spectrum matches; PIA (Uszkoreit et al. 2015), which ranks protein isoforms based on the scores of the peptides they are compatible with; and EPIFANY (Pfeuffer et al. 2020), which models the conditional dependencies between proteins and peptides via a Bayesian network representation, and estimates protein isoform posterior probabilities by exploiting a loopy belief propagation algorithm combined with convolution trees. However, due to the prevalence of shared peptides, protein identification is affected by low statistical power; furthermore, inference only focuses on identifying protein isoforms (presence versus absence), and not on further measures such as their abundance. In order to obtain more stable results, HIquant (Bryan et al. 2016) infers ratios of protein-isoform abundances. However, HIquant does not estimate the presence/absence or the abundance of individual isoforms, and requires multiple samples (i.e. biological replicates); because of these limitations, it is not applicable in the context of this work, which focuses on the inference of protein isoform presence and abundance from individual MS samples.

Since transcriptomics protocols have better isoform-level resolution, compared to MS protocols, in recent years, some approaches have been proposed to enhance MS data with mRNA expression levels (Ramakrishnan et al. 2009, Liu et al. 2017, Ma et al. 2017, 2019, Carlyle et al. 2018, Salovska et al. 2020). This approach is motivated by the fact that mRNA is a prerequisite of protein, and their abundances are in general positively correlated (Lu et al. 2007, de Sousa Abreu et al. 2009, Maier et al. 2009, Edfors et al. 2016, Liu et al. 2016, Wang et al. 2019). Among these methods, Miller et al. (2022) used long-read RNA-seq data to enhance the reference isoform database, and recover protein groups with high mRNA abundance. Nonetheless, none of these frameworks achieves isoform resolution. Given the critical role of splicing in biology, more accurate methods for the detection of protein isoforms would be highly beneficial for life scientists.

Here, we present IsoBayes, a novel Bayesian approach to study protein isoforms from MS data, which also integrates transcriptomics data, when available.

2 Materials and methods

2.1 IsoBayes overview

Isoform-level data is characterized by two major sources of variability: biological noise, which is of interest, and technical noise, which is a nuisance due to the measurement process. Our goal is to disentangle the two sources of uncertainty, in order to perform inference on the biological process; to this aim, we explicitly model the noise arising from both shared and erroneously detected peptides. In particular, isoform level abundance (and its presence/absence) is treated as a latent variable (i.e. an unknown parameter), and is sampled via a latent variable model.

Before analyzing the data, peptides are usually filtered based on a false discovery rate (FDR) threshold (typically 0.01); while this allows the removal of unreliable peptides, it also represents a crude cutoff. Here, we propose two distinct frameworks: one based on classical FDR filtering (that we call FDR mode), and a more complex one that uses the probability that peptides are correctly detected (called PEP mode). The FDR, peptide-error probability (PEP) and abundance of each peptide are taken as inputs here, and can be estimated by various proteomics tools, such as MetaMorpheus (Solntsev et al. 2018), Percolator (Käll et al. 2007, The et al. 2016) or MaxQuant (Cox and Mann 2008). In our view, the PEP mode is a more accurate way of modeling peptide uncertainty, because it allows weighting peptides based on their probability of being correctly identified; furthermore, since more peptides are analyzed, inferential results are provided for a larger number of protein isoforms. Our latent variable approach works in two steps. In the first step, we either filter peptides based on their FDR (FDR mode), or sample if they are erroneously identified based on their error probability (PEP mode). In the second step, only for correct detections, we allocate the abundance of each peptide across the protein isoform(s) it is compatible with. This procedure, starting from peptide-level information, allows us to recover protein isoform-level abundance and presence.

When available, IsoBayes also allows for the integration of transcriptomics data. In particular, the relative abundance of transcript isoforms, estimated from (short or long-read) RNA-sequencing (RNA-seq) data, is used to formulate an informative prior for the relative abundance of the corresponding protein isoforms. Therefore, given a peptide associated to two isoforms, one with high and one with low mRNA abundance, we assume a priori that the peptide abundance primarily comes from the isoform with high mRNA abundance. Clearly, this assumes a positive mRNA-protein correlation, with greater correlation leading to higher benefits. Overall, this integration enhances the data available to the model, and hence improves the accuracy of the inferential results.

For each protein isoform, our approach estimates both its presence and abundance, and provides a measure of the uncertainty of both estimates, via the posterior probability of presence, and a credible interval of its abundance. Abundance estimates, and respective credible intervals, are also aggregated at the gene-level. Additionally, when RNA-seq data is available, we study changes between isoform mRNA and protein relative abundances; in particular, for each isoform, we compute the log2-fold change (log2-FC) between protein and mRNA relative abundances, and estimate the probability that the relative abundance is higher at the protein-level than at the transcript-level. This feature allows scientists to identify candidate isoforms where protein and mRNA abundance levels may differ.

Furthermore, our tool is flexible and general: it requires peptide-level information (i.e. FDR, PSM counts or intensities, and optionally PEP), which can be obtained from both data dependent acquisition and data independent acquisition approaches, and from both label-free and labeled pipelines. Our method is also compatible with the output from any proteomics pipeline (e.g. MetaMorpheus, Percolator, and MaxQuant) and, as a measure of abundance, users could use either peptide intensities, or peptide spectral match (PSM) counts. Note that intensities or PSM counts are only proxies for the actual abundance of peptides, which is not exactly measured in MS data. In this manuscript, we use the term “abundance” to refer to those noisy measurements.

2.2 Mathematical modeling

Initially, we perform a trivial filtering step, where we remove protein isoforms which are not associated to any detected peptide, because they cannot be identified in the given dataset. All the inference is performed on protein isoforms which could potentially be detected, i.e. those associated to at least one (unique or shared) detected peptide, and results refer to them.

Given $P$ such protein isoforms, we assume that the overall protein abundance, denoted by $n \in \mathbb{N}$, is distributed across the $P$ isoforms according to a multinomial distribution:

$$X \mid \pi \sim \mathcal{MN}\left(\pi = (\pi_1, \ldots, \pi_P),\; n = \sum_{p=1}^{P} x_p\right), \tag{1}$$

where $X = (X_1, \ldots, X_P)$, with $X_p$ representing the random variable indicating the overall abundance originating from the $p$th protein (and $x_p$ its realization), and $\pi_p$ is the probability that a unit of abundance comes from the $p$th protein, with $\sum_{p=1}^{P} \pi_p = 1$. Our method is designed to work with integers: this is a convenience choice that simplifies the inference, particularly in the latent variable sampling. However, protein abundance in $X$ can refer to either PSM counts, which are already discrete, or intensities, which are continuous. In the latter case, intensities are rounded to the closest integer, which introduces a minimal approximation. All intensities are automatically rescaled in IsoBayes before being rounded: we divide each value by the total intensity in the sample (across all protein isoforms) and multiply by $10^5$. This ensures that rounding has a minimal impact, particularly when intensities are pre-normalized before being provided to IsoBayes, and may have low values. In our normalized data, the overall approximation introduced by rounding normalized intensities (across all protein isoforms) is, on average, $4 \times 10^{-8}$ compared to the overall normalized intensity value in the sample. Furthermore, in our benchmarks (Section 3.5), we show that rounded intensities and PSM counts lead to similar inferential results.
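
The rescaling and rounding step described above can be illustrated with a minimal Python sketch. This is not the IsoBayes implementation; the function name `rescale_and_round` and the example intensities are hypothetical, and only the recipe (divide by the sample total, multiply by $10^5$, round) comes from the text.

```python
def rescale_and_round(intensities, scale=1e5):
    """Divide each intensity by the sample total, multiply by `scale`, and round."""
    total = sum(intensities)
    if total <= 0:
        raise ValueError("total intensity must be positive")
    # Python's round() returns an int and resolves exact .5 ties to the even integer.
    return [round(value / total * scale) for value in intensities]


# Hypothetical pre-normalized intensities with low values.
raw = [0.002, 0.010, 0.0005, 0.0375]
counts = rescale_and_round(raw)
# counts == [4000, 20000, 1000, 75000]
```

Without the rescaling, rounding these raw values directly would map all of them to 0, which is the situation the rescaling is meant to avoid.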

If $X$ were observed, the likelihood of the model could be easily computed as the density of the multinomial distribution in (1); however, measurements refer to peptides, and $X$ is treated as a latent state. A graphical model of our method is displayed in Fig. 1, and Section S1.1, available as supplementary data at Bioinformatics online, reports how the abundances of the $N$ peptides, $Y = (Y_1, \ldots, Y_N)$, are obtained from the abundances of the $P$ proteins, $X = (X_1, \ldots, X_P)$. Therefore, the likelihood is defined with respect to the actual observations (i.e. the peptide measurements), and can be written as an integral over the latent data: $L(\pi \mid Y) = \int_{\mathcal{X}} f(Y, X = x \mid \pi)\, dx$, where $f(Y, X = x \mid \pi)$ is the joint density of observations $Y$ and latent states $X$, given parameters $\pi$. Here, instead of working with this integral, we use a Bayesian data augmentation approach (Tanner and Wong 1987, Gelfand and Smith 1990) where parameters and latent states are alternately sampled from their conditional distributions (see Section 2.4.2).

Figure 1.

IsoBayes graphical model, connecting prior parameters $\delta$ to parameters $\pi$, to latent protein abundances $X$, then to the abundance of each protein associated to its compatible peptides [$X_{pi}$ as in (3)], and finally to peptide-level observations $Y$. In the example above, protein 1 is connected to unique peptide 1 and to shared peptide 2, protein 2 is associated to shared peptide 2, while protein $P$ is connected to unique peptide $N$; $X_{11}$ and $X_{12}$ indicate the abundance of protein 1 associated to peptides 1 and 2, respectively. IsoBayes, starting from peptide-level data $Y$, aims to recover the unobserved protein-level total and relative abundances $X$ and $\pi$.

Below, we describe the two approaches we propose for sampling X and dealing with peptide uncertainty, based on PEP and on FDR filtering, respectively; both PEP and FDR are estimated by proteomics tools such as MetaMorpheus, Percolator or MaxQuant, and are taken as input by IsoBayes.

2.3 PEP 2-layer latent variable approach

Assume that $N$ peptides are detected in total, and that $\mathrm{PEP}_i$ is the estimated probability that the $i$th peptide, albeit absent, is mistakenly detected. First, we sample if a peptide has been erroneously detected, via a Bernoulli distribution:

$$\varepsilon_i \mid \mathrm{PEP}_i \sim \mathrm{Bern}(\mathrm{PEP}_i), \quad \text{for } i = 1, \ldots, N, \tag{2}$$

where $\varepsilon_i = 1$ if the $i$th peptide has been mistakenly detected, and 0 if it has been correctly detected. Second, for peptides which are sampled as correctly detected, we spread their abundance to the protein(s) they are compatible with. In particular, we define $Y_i$ as the abundance of the $i$th peptide, and $\psi_i$ as the list of protein(s) the $i$th peptide maps to; we further denote by $X_{pi}$ the (unknown) abundance of peptide $i$ that is associated to protein $p$. The $i$th peptide can be redistributed to the proteins in $\psi_i$ according to the following multinomial distribution:

$$(X_{1i}, \ldots, X_{Pi}) \mid \pi, \psi_i, Y_i, \varepsilon_i \sim \mathcal{MN}\left(\tilde{\pi}^{(i)},\; Y_i (1 - \varepsilon_i)\right), \tag{3}$$

where $\tilde{\pi}^{(i)} = (\tilde{\pi}_1^{(i)}, \ldots, \tilde{\pi}_P^{(i)})$, with

$$\tilde{\pi}_p^{(i)} = \frac{(\pi_p / M_p)\, \mathbf{1}(p \in \psi_i)}{\sum_{p'=1}^{P} (\pi_{p'} / M_{p'})\, \mathbf{1}(p' \in \psi_i)}, \tag{4}$$

where $M_p$ indicates the number of overall (unique and shared) detected peptides associated to protein isoform $p$, and $\mathbf{1}(A)$ is 1 if condition $A$ is true, and 0 if condition $A$ is false. In other words, $\tilde{\pi}_p^{(i)}$ is proportional to $\pi_p / M_p$ if the $i$th peptide maps to the $p$th protein, and is 0 otherwise. Dividing by $M_p$ ensures that we normalize for the number of peptides contributing to each protein’s abundance. Note that $M_p \geq 1$, for $p = 1, \ldots, P$, because only isoforms with at least 1 detected peptide are analyzed. The denominator in (4) ensures that $\tilde{\pi}^{(i)}$ is a probability vector adding to 1; i.e. $\sum_{p=1}^{P} \tilde{\pi}_p^{(i)} = 1$. In (3), the peptide abundance, $Y_i (1 - \varepsilon_i)$, is 0 if the peptide has been sampled as mistakenly detected in (2) (i.e. when $\varepsilon_i = 1$). The protein isoform abundances are then recovered by adding the abundances obtained from the $N$ peptide allocations: $X_p = X_{p1} + \cdots + X_{pN}$, for $p = 1, \ldots, P$.
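
To make the two layers in (2) to (4) concrete, the following Python sketch allocates the abundance of a single peptide. It is an illustrative toy, not the IsoBayes code (which is written in C++): the function names, the 0-based protein indices and the example values are all assumptions introduced here.

```python
import random


def allocation_weights(pi, n_peptides, compatible):
    """Equation (4): weight pi_p / M_p for proteins in psi_i, 0 otherwise, normalized."""
    raw = [pi[p] / n_peptides[p] if p in compatible else 0.0 for p in range(len(pi))]
    total = sum(raw)
    return [w / total for w in raw]


def sample_multinomial(n, probs, rng):
    """Multinomial draw built from n categorical draws (adequate for a small sketch)."""
    counts = [0] * len(probs)
    for idx in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[idx] += 1
    return counts


def allocate_peptide(y_i, pep_i, pi, n_peptides, compatible, rng):
    """Equations (2)-(3): sample the error indicator, then allocate y_i * (1 - eps)."""
    eps = 1 if rng.random() < pep_i else 0  # eps = 1 means "mistakenly detected"
    weights = allocation_weights(pi, n_peptides, compatible)
    return eps, sample_multinomial(y_i * (1 - eps), weights, rng)


# Hypothetical example: 3 proteins; the peptide maps to proteins 0 and 1 only.
pi = [0.5, 0.3, 0.2]       # current relative abundances
n_peptides = [2, 1, 1]     # M_p: detected peptides per protein
rng = random.Random(0)
eps, allocation = allocate_peptide(10, 0.05, pi, n_peptides, {0, 1}, rng)
```

In this example the normalized weights are 0.25/0.55 and 0.30/0.55 for proteins 0 and 1, and exactly 0 for protein 2, so protein 2 never receives abundance from this peptide.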

2.4 FDR 1-layer latent variable approach

Alternatively, IsoBayes can retain only the peptides with an FDR below a user-defined threshold (usually 0.01). The abundance of each retained peptide is then allocated to the protein(s) it is compatible with, as in (3), with $\varepsilon_i$ set to 0 for all peptides $i = 1, \ldots, N$. This approach results in a faster runtime, at the cost of a small loss of performance, due to a less accurate propagation of the uncertainty of peptide detections (see Section 3).

2.4.1 Informative prior

We use a conjugate Dirichlet prior for π:

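$$\pi \mid \delta \sim \mathrm{Dir}\left(\delta = (\delta_1, \ldots, \delta_P)\right), \tag{5}$$

which results in a convenient Dirichlet posterior distribution:

$$\pi \mid X = x, \delta \sim \mathrm{Dir}\left((x_1 + \delta_1, \ldots, x_P + \delta_P)\right). \tag{6}$$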

The hyper-parameters δ are set proportional to the corresponding relative transcript isoform abundances (informative prior), when available, or to 1 (weakly informative prior), if mRNA data is absent (for more details see Section S1.1, available as supplementary data at Bioinformatics online).
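
As a rough illustration of the conjugate update in (5) and (6), the sketch below draws from the Dirichlet posterior using only the Python standard library. The helper names and the `prior_scale` argument are hypothetical: the text only states that the hyper-parameters are proportional to the transcript relative abundances, and the actual constant is described in the supplementary material, so the value used here is an arbitrary placeholder.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_p, 1) variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]  # requires every alpha_p > 0
    total = sum(draws)
    return [d / total for d in draws]


def prior_from_transcripts(transcript_rel_abundance, prior_scale):
    """Informative prior: delta proportional to transcript relative abundances.

    `prior_scale` is a hypothetical proportionality constant, not taken from the paper.
    """
    return [prior_scale * t for t in transcript_rel_abundance]


def posterior_draw(x, delta, rng):
    """Equation (6): pi | X = x, delta ~ Dir(x_1 + delta_1, ..., x_P + delta_P)."""
    return sample_dirichlet([x_p + d_p for x_p, d_p in zip(x, delta)], rng)


rng = random.Random(0)
x = [30, 10, 0]                                              # latent protein abundances
weak_prior = [1.0, 1.0, 1.0]                                 # no mRNA data available
informative_prior = prior_from_transcripts([0.6, 0.1, 0.3], prior_scale=10.0)
pi_weak = posterior_draw(x, weak_prior, rng)
pi_informative = posterior_draw(x, informative_prior, rng)
```

With the weakly informative prior, the posterior mean of the third isoform is 1/43; with the (hypothetical) informative prior it rises to 3/50, showing how transcript evidence shifts mass towards an isoform that received no peptide abundance.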

2.4.2 Inference

Parameters and latent states are alternately sampled from their conditional distributions, via a Markov chain Monte Carlo (MCMC) scheme, according to two Gibbs samplers (Geman and Geman 1984, Gelfand and Smith 1990): $\pi \mid X$ as in (6), and $X \mid Y, \pi$ as in (2) and (3). Although our scheme involves several parameters and latent states, they are all updated via Gibbs samplers, which results in good mixing and convergence. By default, the MCMC is run for 2000 iterations, with a burn-in of 1000 iterations (both values can be increased by users).
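
The alternating scheme can be sketched as a toy Gibbs sampler in Python. This is a simplified illustration under stated assumptions, not the package's C++ implementation: the function name, the argument layout (`psi` as lists of 0-based protein indices) and the initialization of $\pi$ are choices made here for brevity.

```python
import random


def gibbs_sampler(y, pep, psi, delta, n_iter=2000, burn_in=1000, seed=1):
    """Alternate X | Y, pi (Equations 2-3) and pi | X (Equation 6).

    y: peptide abundances (integers); pep: peptide error probabilities;
    psi: for each peptide, the list of compatible protein indices;
    delta: Dirichlet hyper-parameters (all positive).
    """
    rng = random.Random(seed)
    n_proteins = len(delta)
    # M_p: number of detected peptides associated with protein p.
    m = [sum(1 for proteins in psi if p in proteins) for p in range(n_proteins)]
    pi = [1.0 / n_proteins] * n_proteins  # arbitrary starting value
    x_chain, pi_chain = [], []
    for iteration in range(n_iter):
        x = [0] * n_proteins
        for y_i, pep_i, proteins in zip(y, pep, psi):
            if rng.random() < pep_i:  # epsilon_i = 1: treated as a false detection
                continue
            weights = [pi[p] / m[p] for p in proteins]  # unnormalized Equation (4)
            for p in rng.choices(proteins, weights=weights, k=y_i):
                x[p] += 1
        gammas = [rng.gammavariate(x_p + d_p, 1.0) for x_p, d_p in zip(x, delta)]
        total = sum(gammas)
        pi = [g / total for g in gammas]
        if iteration >= burn_in:  # keep post burn-in samples only
            x_chain.append(x)
            pi_chain.append(pi)
    return x_chain, pi_chain


# Hypothetical data: peptide 0 is unique to protein 0, peptide 1 is shared by
# proteins 0 and 1, and peptide 2 (unique to protein 2) has a high error probability.
x_chain, pi_chain = gibbs_sampler(
    y=[20, 10, 5], pep=[0.01, 0.01, 0.90], psi=[[0], [0, 1], [2]],
    delta=[1.0, 1.0, 1.0], n_iter=400, burn_in=200,
)
```

Setting every entry of `pep` to 0 after filtering peptides by FDR reproduces the one-layer (FDR mode) variant, since no peptide is ever discarded during sampling.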

The posterior chains of $X$ and $\pi$ are then used to compute the output of IsoBayes. First, the probability that the $p$th protein isoform is present, which is given by $\Pr(X_p > 0)$, is estimated as the fraction of iterations in which the posterior chain of $X_p$ is positive. Second, estimates of the overall and relative abundances are obtained as the posterior means of $X_p$ and $\pi_p$, and are accompanied by the respective 0.95 level highest posterior density credible intervals (CIs). Finally, when available, RNA-seq data are used to estimate the relative abundances of transcripts, denoted by $\pi_1^T, \ldots, \pi_P^T$. These quantities allow us to compute the log2-FC between protein isoform and transcript relative abundances, i.e. $\log_2(\pi_p / \pi_p^T)$, and the probability that the relative abundance is higher in proteins than in transcripts, i.e. $\Pr(\pi_p > \pi_p^T)$, which is computed as the fraction of iterations in which the posterior chain of $\pi_p$ is greater than $\pi_p^T$.
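
The posterior summaries described above can be computed from the sampled chains roughly as follows. This is a hedged sketch: the function names are hypothetical, and the highest posterior density interval is approximated by the shortest window of sorted samples, which is one common empirical estimator and not necessarily the one used in IsoBayes.

```python
import math


def presence_probability(x_chain):
    """Fraction of posterior samples in which the isoform abundance is positive."""
    return sum(1 for x in x_chain if x > 0) / len(x_chain)


def hpd_interval(chain, level=0.95):
    """Shortest interval covering at least a fraction `level` of the samples."""
    ordered = sorted(chain)
    n = len(ordered)
    window = min(n, max(1, math.ceil(level * n)))
    start = min(range(n - window + 1),
                key=lambda s: ordered[s + window - 1] - ordered[s])
    return ordered[start], ordered[start + window - 1]


def prob_higher_in_protein(pi_chain, pi_transcript):
    """Fraction of samples in which pi_p exceeds the transcript relative abundance."""
    return sum(1 for v in pi_chain if v > pi_transcript) / len(pi_chain)


def log2_fold_change(pi_chain, pi_transcript):
    """log2 ratio between the posterior mean of pi_p and pi_p^T (unstabilized)."""
    posterior_mean = sum(pi_chain) / len(pi_chain)
    return math.log2(posterior_mean / pi_transcript)


# Hypothetical chains for a single isoform (10 posterior samples).
x_chain = [0, 0, 3, 5, 4, 6, 0, 5, 4, 5]
pi_chain = [0.02, 0.01, 0.20, 0.30, 0.25, 0.35, 0.01, 0.30, 0.25, 0.31]
p_present = presence_probability(x_chain)           # 7 of 10 samples are positive
p_higher = prob_higher_in_protein(pi_chain, 0.10)   # 7 of 10 samples exceed 0.10
low, high = hpd_interval(pi_chain)
```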

Although full MCMC approaches can be computationally intensive, our algorithm is coded in C++, which is a compiled programming language that allows for massive computational gains compared to base R; furthermore, where multiple cores are available, we separate protein isoforms in blocks that have no peptides in common with other blocks, and analyze them in parallel. This allows our algorithm to run within a few minutes (see Section 3).
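
The partition into independent blocks amounts to finding connected components of the peptide-protein graph. The sketch below does this with a small union-find structure; it illustrates the idea only, with hypothetical names and toy identifiers, and makes no claim about how the package performs the split internally.

```python
def independent_blocks(peptide_to_proteins):
    """Group proteins into blocks that share no peptides with other blocks.

    peptide_to_proteins: one non-empty list of protein identifiers per detected peptide.
    Returns a sorted list of sorted blocks.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for proteins in peptide_to_proteins:
        root = find(proteins[0])
        for other in proteins[1:]:
            parent[find(other)] = root  # merge proteins that share this peptide

    blocks = {}
    for protein in parent:
        blocks.setdefault(find(protein), set()).add(protein)
    return sorted(sorted(block) for block in blocks.values())


# Hypothetical peptide-to-protein map: A1-A2-A3 are chained by shared peptides.
peptides = [["A1", "A2"], ["A2", "A3"], ["B1"], ["C1", "C2"], ["A1"]]
blocks = independent_blocks(peptides)
# blocks == [["A1", "A2", "A3"], ["B1"], ["C1", "C2"]]
```

Each block can then be passed to a separate worker, because the allocation of a peptide in (3) only involves proteins within its own block.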

3 Results

3.1 Real data

We collected data dependent acquisition, label-free, liquid chromatography MS data from the jurkat and WTC-11 cell lines, which for simplicity, from now on, we will call jurkat and WTC-11. The jurkat data was collected via Thermo Scientific LTQ Orbitrap Velos mass spectrometer, while the WTC-11 data was analyzed via Thermo Scientific Orbitrap Eclipse Tribrid mass spectrometer. The jurkat and WTC-11 datasets were collected via 6 (ArgC, AspN, Chym, GluC, LysC, and Trypsin) and 4 (AspN, Chym, LysC, and Trypsin) distinct proteases, respectively, hence leading to 10 datasets overall; each protease consisted of multiple fractions (Supplementary Table S1, available as supplementary data at Bioinformatics online). The former and latter cell lines were also associated to transcriptomics data, in the form of paired-end (2 × 200 bp) short-read RNA-seq (Illumina HiSeq 2000), and long-read RNA-seq (PacBio Iso-Seq), respectively; we used kallisto (Bray et al. 2016) to align short-reads to a reference transcriptome and quantify transcript abundances. Both MS datasets, and respective RNA-seq data, were made publicly available (see Availability). Complete experimental details for the jurkat and WTC-11 MS sample preparation, digestion and MS analysis can be found in Miller et al. (2019), and in Section S1.2, available as supplementary data at Bioinformatics online, respectively; the corresponding RNA-seq datasets are fully described in Sheynkman et al. (2013), and de Souza et al. (2023), respectively.

3.2 Simulation study

In order to assess the performance of our method, we initially benchmarked it in a simulation study, where we: (i) generated artificial data, (ii) fit IsoBayes to it, and (iii) compared our estimates with the original ground truth used to simulate the data (i.e. presence/absence and abundance of protein isoforms). In order to generate realistic simulations, we used abundance estimates (X), and peptide-protein connections (ψ) obtained from real data. In particular, we fit IsoBayes to each dataset (in the FDR mode, filtering out peptides with FDR > 0.01), using PSM counts and mRNA abundances, and inferred protein isoform abundances; we then used these estimates to generate 10 simulated datasets (1 per protease). In order to simulate peptide-level abundances, we rounded each protein isoform abundance to the closest integer, and randomly allocated it to the peptide(s) compatible with it via a multinomial distribution, with equal probability for each peptide (for more details, see Section S1.3, available as supplementary data at Bioinformatics online).
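
The peptide-level simulation step can be sketched in Python as follows. This is a minimal illustration of the equal-probability multinomial allocation described above; the function name and the toy identifiers are hypothetical, and further details of the actual procedure are in the supplementary material.

```python
import random


def simulate_peptide_abundances(protein_abundance, protein_to_peptides, rng):
    """Round each protein abundance and spread it uniformly over its peptides."""
    peptide_abundance = {}
    for protein, abundance in protein_abundance.items():
        peptides = protein_to_peptides[protein]
        units = round(abundance)
        # Equal-probability multinomial: each unit picks one compatible peptide.
        for peptide in rng.choices(peptides, k=units):
            peptide_abundance[peptide] = peptide_abundance.get(peptide, 0) + 1
    return peptide_abundance


# Hypothetical estimated isoform abundances and peptide-protein connections.
protein_abundance = {"iso1": 12.4, "iso2": 3.0, "iso3": 0.2}
protein_to_peptides = {"iso1": ["pep1", "pep2"], "iso2": ["pep2"], "iso3": ["pep3"]}
simulated = simulate_peptide_abundances(
    protein_abundance, protein_to_peptides, random.Random(42)
)
```

Here `iso3` rounds to 0 and is therefore absent in the simulated ground truth, while the shared peptide `pep2` accumulates abundance from both `iso1` and `iso2`.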

We then ran IsoBayes on the simulated data, with and without mRNA abundances. Note that transcript information was not explicitly used to generate the simulated data: RNA-seq data is only indirectly associated with the protein isoform abundances we simulated, because it was used to obtain the estimated abundances on real data. In particular, the average log10-correlation, across proteases, between mRNA and protein isoform abundances is equal to 0.65. We then calculated several metrics of performance of our results: (i) the area under the curve (AUC) for the protein isoform presence/absence, via the estimated probability of presence; (ii) the Pearson correlation, on the logarithm with base 10 (log10) scale, between the real and estimated protein isoform overall abundances; (iii) the coverage of the 0.95 level CI for the overall abundance (i.e. the fraction of protein isoforms whose real abundance is contained in the estimated CI); and (iv) the average abundance, separately for protein isoforms that are present and absent in the ground truth. Note that, in (ii), to avoid computing the logarithm of 0, a unit is added to all abundances before computing the logarithm; i.e. log10(abundance + 1). Results for all 10 datasets are reported in Supplementary Tables S2 and S3, available as supplementary data at Bioinformatics online, while average results are shown in Table 1. In all simulations, IsoBayes has high AUCs and log10 correlations, and a CI coverage of at least 0.95; as expected, results further improve when incorporating mRNA abundances (i.e. IsoBayes_mRNA). This shows that even noisy mRNA prior information (0.65 log10 correlation at the isoform level) leads to an enhancement of the inferential results. Furthermore, the estimated abundances are significantly higher for isoforms which are actually present in the simulation, than for those that are absent (i.e. 12 and 54 times larger for IsoBayes and IsoBayes_mRNA, respectively).

Table 1.

Average results, across the 10 proteases, from the simulation study; full results are available in Supplementary Tables S2 and S3, available as supplementary data at Bioinformatics online. “Abundance present iso” and “Abundance absent iso” indicate the estimated average abundance for protein isoforms which were actually simulated to be present and absent, respectively.

| Method | AUC | log10-corr | 0.95 CI coverage | Abundance absent iso | Abundance present iso |
|---|---|---|---|---|---|
| IsoBayes | 0.92 | 0.87 | 0.97 | 0.67 | 8.31 |
| IsoBayes_mRNA | 0.97 | 0.97 | 0.99 | 0.16 | 8.60 |
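
For readers who wish to reproduce metrics of the kind reported in Table 1, the following Python sketch computes the AUC (via pairwise comparisons), the log10(abundance + 1) Pearson correlation and the CI coverage. The function names and the toy inputs are hypothetical; the benchmark scripts used in the study may compute these quantities differently.

```python
import math


def auc(scores, present):
    """AUC as the probability that a present isoform outscores an absent one."""
    pos = [s for s, t in zip(scores, present) if t]
    neg = [s for s, t in zip(scores, present) if not t]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def log10_correlation(estimated, truth):
    """Pearson correlation between log10(abundance + 1) values."""
    a = [math.log10(v + 1) for v in estimated]
    b = [math.log10(v + 1) for v in truth]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)  # undefined if either input is constant


def ci_coverage(truth, intervals):
    """Fraction of isoforms whose true abundance lies inside its credible interval."""
    inside = sum(1 for t, (lo, hi) in zip(truth, intervals) if lo <= t <= hi)
    return inside / len(truth)


# Hypothetical toy inputs for four isoforms.
prob_present = [0.9, 0.8, 0.3, 0.1]
truly_present = [True, True, False, True]
toy_auc = auc(prob_present, truly_present)  # 2 of the 3 present isoforms outscore the absent one
```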

3.3 Real data applications

While simulations enable assessing the performance of a method based on a ground truth, they fail to capture the full complexity of real data; this is particularly true with proteomics data, which is characterized by a large degree of technical and biological noise. However, method validation is challenging on real data, because a ground truth is missing. Here, we benchmarked our approach on the jurkat and WTC-11 real datasets; to overcome the absence of a ground truth, we used an approach which is conceptually similar to the leave-one-out cross validation: for each dataset, we analyzed one protease at a time, and used all the remaining ones (from the same cell line) to validate results. This is possible because all proteases refer to the same cell line, and are expected to lead to highly coherent results. For the validation, we estimated the presence/absence and abundance of protein isoforms via a subset of reliable peptide identifications: peptides with an FDR below 0.01, and mapping to a single protein isoform (also called unique peptides); these peptides have a small probability of being erroneously detected, and no mapping ambiguity. As a consequence of this choice, in the validation step, we removed protein isoforms which are not associated to any (detected or undetected) unique peptide in the theoretical search database, because it is not possible to validate them with our approach.

MS data were pre-processed via two popular proteomics tools, MetaMorpheus (Solntsev et al. 2018) and Percolator (Käll et al. 2007, The et al. 2016) [from the OpenMS (Röst et al. 2016) toolkit], to obtain peptide-level information (i.e. PSM counts, intensities, FDR, PEP, and the peptide-protein compatibility map). We benchmarked our approach against other popular tools for isoform-level inference; namely, Fido (Serang et al. 2010), PIA (Uszkoreit et al. 2015), and EPIFANY (Pfeuffer et al. 2020). Another notable method, ProteinProphet (Nesvizhskii et al. 2003), was not considered here, because of the elevated runtime and low performance it exhibited in recent benchmarks (Pfeuffer et al. 2020). All our competitors filter peptides and PSM counts based on their FDR, and two of them (Fido and EPIFANY) are integrated within the OpenMS toolkit, and cannot be used with the output of other proteomics tools, such as MetaMorpheus. Therefore, for a fair comparison, all methods were benchmarked on the output of OpenMS Percolator, and only peptides with an FDR below 0.01 were retained. We computed three metrics to quantify the complexity induced by shared peptides (i.e. peptides that map to multiple protein isoforms); among filtered peptides: (i) 79% of the PSM counts are associated to shared peptides, (ii) 81% of the peptides are shared, and (iii) shared peptides are (on average) compatible with 4.9 distinct isoforms (Supplementary Table S4, available as supplementary data at Bioinformatics online).

Figure 2 shows the receiver operating characteristic (ROC) curve of each method, while Supplementary Table S5, available as supplementary data at Bioinformatics online reports the corresponding area under the curve (AUC). In all datasets, IsoBayes displays higher sensitivity and specificity than other methods, even when using MS data only; as expected, when incorporating transcriptomics data, performance further improves.

Figure 2.

ROC curves, comparing true positive rate (TPR) and false positive rate (FPR), for the identification of protein isoforms in each real dataset.

In addition, to study the quality of our isoform-level abundances, we compared, on the log10 scale [i.e. log10(abundance + 1)], our estimates with those detected in our multi-protease validation (Fig. 3 and Supplementary Fig. S1, available as supplementary data at Bioinformatics online). To simplify the visual representation, we aggregated isoforms across the proteases from the same cell line. Note that none of our competitors provides abundance measures, therefore here we only evaluated the estimates from our approach. When using MS data only, the log10 correlation is 0.64 and 0.53, for the jurkat and WTC-11 datasets, respectively; these values increase to 0.71 and 0.69, when embedding mRNA data (Supplementary Table S6, available as supplementary data at Bioinformatics online). Notably, the mRNA prior leads to a bigger improvement of both AUC and log10 correlation in the WTC-11 dataset (which uses long-read RNA-seq), compared to the jurkat dataset (that uses short-read RNA-seq) (Supplementary Tables S5 and S6, available as supplementary data at Bioinformatics online). Although results are not fully comparable, because short and long read protocols are used on distinct datasets, it appears that long-read transcriptomics data can enhance our inference more than short-read data. In the figures one can see that the scales of our estimates and of the validated abundances do not align; this is because the two are not fully comparable: while our estimates refer to all peptides from a single protease, the validated abundances refer only to unique peptides, but from all proteases (except the one used to compute the estimates). In our simulations, we see that IsoBayes provides unbiased abundance estimates: the average difference between estimated and real abundances is $2 \times 10^{-18}$ and $2 \times 10^{-17}$, with and without mRNA abundance, respectively.

Figure 3.

Hexbin plot for the log10 protein isoform abundances [i.e. log10(abundance + 1)], estimated from IsoBayes_mRNA (x axis), and found in the validation set (y axis). In each cell line, we considered results from all proteases. Left: jurkat dataset; right: WTC-11 dataset.

We also investigated aggregated results at the gene-level: correlations are higher than at the isoform-level (between 0.87 and 0.89), which is expected given the lower level of (biological and technical) noise at the gene level (Supplementary Table S7, available as supplementary data at Bioinformatics online, and Supplementary Figs S2 and S3, available as supplementary data at Bioinformatics online). Interestingly, while mRNA abundance significantly improves results at the isoform-level, it only marginally impacts inference at the gene-level. This is because, while mRNA data aims at improving the allocation of shared peptides, most peptides are shared across isoforms from the same gene, and there is little mapping ambiguity between distinct genes. In particular, after FDR filtering (0.01 threshold), while on average (across proteases) 71% of PSM counts are shared across multiple isoforms, only 31% of PSM counts are associated to distinct genes (Supplementary Table S4, available as supplementary data at Bioinformatics online).

We additionally studied our comparison between isoform protein and transcript abundances. Note that, to avoid infinite values, log2-FCs are stabilized by adding a small constant, $\kappa = 1.5 \times 10^{-6}$, to both relative abundances; i.e. for each protein isoform $p$, we compute the log2-FC as $\log_2\left(\frac{\pi_p + \kappa}{\pi_p^T + \kappa}\right)$, where $\pi_p$ and $\pi_p^T$ denote the protein and transcript relative abundances of isoform $p$. For each protein isoform, we compared the estimated log2-FC with the one obtained in the validation set (i.e. the remaining proteases): results are highly coherent, with correlation values between 0.80 and 0.95 (Supplementary Table S6, available as supplementary data at Bioinformatics online). In general, mRNA and protein isoform abundances are positively correlated: the correlation of log10 abundances is between 0.45 and 0.66 in our benchmarks (Supplementary Table S8, available as supplementary data at Bioinformatics online); this supports the usage of mRNA abundance as an informative prior in our approach. Nonetheless, there are isoforms, which we aim to detect, where mRNA and protein abundances differ; this is possible because the posterior distribution of $\pi$, although informed by the mRNA data via the prior, is dominated by the peptide-level data. Indeed, when considering the extreme estimated probabilities $\Pr(\pi_p > \pi_p^T)$ (i.e. below 0.01 and above 0.99), we observe a clear separation of the log2-FCs obtained in the validation set, which we refer to as validated log2-FCs (Fig. 4, and Supplementary Figs S4–S6, available as supplementary data at Bioinformatics online). In particular, probabilities near 0 are associated with small values of the validated log2-FC, while probabilities close to 1 lead to large validated log2-FCs; interestingly, probabilities near 0.5 are associated with validated log2-FCs around 0 (i.e. similar relative abundance between proteins and transcripts).
Our findings suggest that IsoBayes could be helpful in identifying isoforms where protein and transcript relative abundances significantly differ, possibly indicating post-translational changes, or different translation rates between isoforms.
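As a minimal sketch of the stabilized log2-FC described above, the following Python snippet adds the same small constant to both relative abundances before taking the ratio. The function names are hypothetical, and the default value of the constant is an assumption (our reading of the "small constant" mentioned in the text); it is exposed as a parameter so that another value can be used. The second helper mirrors the stratification into extreme probabilities (below 0.01 and above 0.99) used in Fig. 4.

```python
import math

# Assumed value of the stabilizing constant; pass another value if needed.
KAPPA = 1.5e-6


def stabilized_log2_fc(pi_protein, pi_transcript, kappa=KAPPA):
    """log2 fold change between protein and transcript relative abundances,
    with kappa added to both terms so that zero abundances stay finite."""
    return math.log2((pi_protein + kappa) / (pi_transcript + kappa))


def probability_stratum(prob, low=0.01, high=0.99):
    """Label an estimated probability that the protein relative abundance
    exceeds the transcript relative abundance."""
    if prob < low:
        return "protein lower"
    if prob > high:
        return "protein higher"
    return "intermediate"


fc_equal = stabilized_log2_fc(0.0, 0.0)       # both absent: ratio is 1
fc_double = stabilized_log2_fc(0.2, 0.1)      # protein roughly twice the transcript
fc_missing = stabilized_log2_fc(0.0, 0.1)     # finite and negative thanks to kappa
print(fc_equal, fc_double, fc_missing)
print(probability_stratum(0.005), probability_stratum(0.5), probability_stratum(0.995))
```

Without the constant, an isoform with zero estimated protein abundance would give a log2-FC of minus infinity; with it, the value is large and negative but finite, so it can be plotted and correlated with the validation set.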

Figure 4.


Boxplot of the stabilized log2-FCs between protein and transcript relative abundances, obtained in the validation set, stratified based on the probability, estimated by IsoBayes_mRNA, that isoform relative abundances are higher at the protein-level than at the transcript-level. Small estimated probabilities (below 0.01) are mainly associated with negative log2-FCs in the validation set; conversely, large estimated probabilities (above 0.99) typically lead to positive log2-FCs in the validation set. In each cell line, we considered results from all proteases. Left: jurkat dataset; right: WTC-11 dataset.

From a computational perspective, the runtimes of all methods are comparable, except for Fido, which required significantly more time (Fig. 5); conversely, in terms of memory, EPIFANY and Fido are the most efficient tools, followed by IsoBayes, while PIA is the most expensive approach (Supplementary Fig. S7, available as supplementary data at Bioinformatics online).

Figure 5.


Average runtime of each method, expressed in minutes, across the proteases of a cell-line. Each method was run on eight cores. Left: jurkat dataset; right: WTC-11 dataset.

Finally, we considered the subset of multi-isoform genes (i.e. genes with 2 or more protein isoforms associated with at least one detected peptide), and found that, on average across our datasets, 73% of the abundance estimated by IsoBayes is associated with the dominant protein isoform (i.e. the most abundant protein isoform within a gene), while the remaining 27% is associated with non-dominant isoforms. This indicates how much information can be gained when performing inference on isoform-level data, compared to aggregating data at the gene-level.
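The dominant-isoform share can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the function name and toy values are hypothetical, the input is assumed to contain only isoforms associated with at least one detected peptide, and the share is pooled over all multi-isoform genes (the text does not specify whether abundances are pooled or averaged gene by gene).

```python
from collections import defaultdict


def dominant_isoform_share(isoform_abundance, isoform_to_gene):
    """Fraction of the total abundance of multi-isoform genes that belongs
    to the most abundant isoform of each gene (pooled over genes)."""
    abundances_by_gene = defaultdict(list)
    for isoform, abundance in isoform_abundance.items():
        abundances_by_gene[isoform_to_gene[isoform]].append(abundance)

    dominant_total = 0.0
    overall_total = 0.0
    for abundances in abundances_by_gene.values():
        if len(abundances) < 2:
            continue  # single-isoform genes are excluded
        dominant_total += max(abundances)
        overall_total += sum(abundances)
    if overall_total == 0:
        raise ValueError("no multi-isoform gene with positive abundance")
    return dominant_total / overall_total


# Toy input (hypothetical values): G3 has a single isoform and is ignored.
isoform_abundance = {
    "G1-iso1": 6.0, "G1-iso2": 2.0,
    "G2-iso1": 3.0, "G2-iso2": 1.0,
    "G3-iso1": 5.0,
}
isoform_to_gene = {
    "G1-iso1": "G1", "G1-iso2": "G1",
    "G2-iso1": "G2", "G2-iso2": "G2",
    "G3-iso1": "G3",
}
share = dominant_isoform_share(isoform_abundance, isoform_to_gene)
print(share)
```

The complement of this share is the portion of abundance that a gene-level summary would attribute to the gene without revealing that it comes from non-dominant isoforms.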

3.4 Isoforms without unique peptides

In order to investigate the extent to which unique peptides influence results, we also studied the subset of protein isoforms solely associated with shared peptides; these isoforms are the hardest ones to infer, because they are not supported by any detected unique peptide. Although all methods display lower performance, the relative ranking of methods is unchanged, and our estimated abundances (at both isoform- and gene-level) correlate well with our multi-protease validation values (Supplementary Figs S8–S10 and Supplementary Table S9, available as supplementary data at Bioinformatics online). Importantly, the gap between IsoBayes_mRNA and the other approaches (including IsoBayes) significantly increases; this indicates that incorporating mRNA abundance is particularly beneficial for studying protein isoforms not associated with detected unique peptides. Indeed, it is expected that, in the absence of unique peptides, the mRNA-based informative prior plays a crucial role in enhancing the allocation of the abundance from shared peptides. Furthermore, our estimated log2-FCs, between protein and mRNA relative abundances, highly correlate with those estimated in the validation set (Supplementary Table S6, available as supplementary data at Bioinformatics online), and again our estimates of $\Pr(\pi_p > \pi_p^T)$ allow a good separation of the log2-FCs in the validation set (Supplementary Figs S11–S14, available as supplementary data at Bioinformatics online), even when mRNA abundance is absent. We also considered isoforms from multi-isoform genes, and found similar results (Supplementary Table S10, available as supplementary data at Bioinformatics online).
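The subset studied in this section can be identified with a short sketch. The function name and the peptide and isoform labels are hypothetical; the sketch assumes a mapping from each detected peptide to the isoforms it is compatible with, and treats a peptide as unique when it maps to exactly one isoform.

```python
def isoforms_without_unique_peptides(peptide_to_isoforms):
    """Return the isoforms supported only by shared peptides, i.e. isoforms
    for which no detected peptide maps to them alone."""
    all_isoforms = set()
    isoforms_with_unique = set()
    for isoforms in peptide_to_isoforms.values():
        distinct = set(isoforms)
        all_isoforms |= distinct
        if len(distinct) == 1:
            isoforms_with_unique |= distinct
    return all_isoforms - isoforms_with_unique


# Toy input (hypothetical labels): only iso1 has a unique peptide.
peptide_to_isoforms = {
    "PEP_A": ["iso1"],
    "PEP_B": ["iso1", "iso2"],
    "PEP_C": ["iso2", "iso3"],
}
hard_isoforms = sorted(isoforms_without_unique_peptides(peptide_to_isoforms))
print(hard_isoforms)
```

For isoforms in this subset, the peptide data alone cannot separate the candidates, which is why an informative prior on how shared abundance should be split matters most here.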

Overall, these results show how IsoBayes can be used to infer presence and abundance of protein isoforms, even when unique peptides are absent or undetected; this is particularly beneficial when studying processes such as alternative splicing, where multiple isoforms within a gene are expressed, and peptides are typically shared across the isoforms of a gene.

3.5 Robustness to input data

We also tested how robust our results are when varying the input data; in particular, we used the MetaMorpheus proteomics tool to obtain peptide-level abundances. We then fit our approach, with and without mRNA abundances, to both PSM counts and intensities, and compared results to those obtained using Percolator. All results are highly coherent, both across proteomics tools and between PSM counts and intensities (Supplementary Figs S15–S17 and Supplementary Table S11, available as supplementary data at Bioinformatics online), which shows that IsoBayes outputs are robust with respect to the input data provided.

3.6 PEP versus FDR mode

Finally, we tested our PEP mode and compared it to the FDR mode. For this purpose, we used both PSM counts and intensities obtained via MetaMorpheus. In the FDR mode, we used the classical 0.01 threshold (as in the analyses shown above). Instead, in the PEP mode we used a weak FDR filter. In particular, we removed peptides with an FDR above 0.1, because these peptides are associated with an average error probability (i.e. PEP) above 0.95 in all proteases (0.98, on average), and therefore were considered too unreliable to be analyzed; conversely, peptides with an FDR below 0.1 are associated with an average PEP below 0.2 in all proteases (0.13, on average). In all datasets, the PEP mode leads to a small but consistent increase in performance, for both the presence/absence of protein isoforms, and their abundance (Supplementary Table S12, available as supplementary data at Bioinformatics online). These results are expected, given that the PEP mode allows a better propagation of the peptide detection uncertainty; furthermore, since it considers more peptides (on average, 20% more in our data), it also enables studying a larger number of protein isoforms. At the same time though, because more peptides are analyzed and the latent variable model has an additional layer that needs to be sampled, the PEP mode also leads to a significant increase in runtime, which in our case was between 2- and 10-fold, while memory usage only marginally increases (Supplementary Table S13, available as supplementary data at Bioinformatics online).
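The difference between the two modes can be sketched in a few lines. This is a simplified illustration and not the IsoBayes implementation: the function names and the peptide records are hypothetical, and the detection indicator is drawn here from a plain Bernoulli with success probability 1 - PEP, which is an assumption; in a full latent variable model the draw would also depend on the rest of the model.

```python
import random


def select_fdr_mode(peptides, fdr_threshold=0.01):
    """FDR mode: hard filter, keep peptides at or below the FDR threshold."""
    return [p for p in peptides if p["fdr"] <= fdr_threshold]


def sample_pep_mode(peptides, weak_fdr_threshold=0.1, rng=None):
    """PEP mode (sketch): apply a weak FDR filter, then sample whether each
    remaining peptide is a correct detection with probability 1 - PEP."""
    if rng is None:
        rng = random.Random()
    kept = []
    for p in peptides:
        if p["fdr"] > weak_fdr_threshold:
            continue  # considered too unreliable to be analyzed
        if rng.random() < 1.0 - p["pep"]:
            kept.append(p)
    return kept


# Toy peptides (hypothetical); PEPs of 0 and 1 make the draws deterministic.
peptides = [
    {"id": "PEP_A", "fdr": 0.001, "pep": 0.0},
    {"id": "PEP_B", "fdr": 0.05, "pep": 0.0},
    {"id": "PEP_C", "fdr": 0.05, "pep": 1.0},
    {"id": "PEP_D", "fdr": 0.5, "pep": 0.0},
]
fdr_ids = [p["id"] for p in select_fdr_mode(peptides)]
pep_ids = [p["id"] for p in sample_pep_mode(peptides, rng=random.Random(0))]
print(fdr_ids, pep_ids)
```

In the toy input, PEP_B is discarded by the hard 0.01 filter but retained in the PEP mode, which illustrates why the PEP mode analyzes more peptides; repeating the draw at every iteration of a sampler is also what adds to the runtime.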

4 Discussion

In this manuscript, we have presented IsoBayes, a novel method for performing isoform-level inference from MS proteomics data. Our approach accounts for the uncertainty in peptide-to-protein mappings, and in peptide detections (in the PEP mode), via a latent variable approach. When available, we allow integrating mRNA data, from short- or long-read RNA-seq technologies, which enhances the accuracy of the allocation of the abundance from shared peptides and, therefore, improves our inferential results. Our method can infer the presence/absence and abundance of individual isoforms, and allows studying changes between protein and transcript relative abundances.

We designed a simulation study, and two real data analyses, based on a multi-protease approach, where we benchmarked our method against state-of-the-art competitors. Our tool displays good sensitivity and specificity when detecting protein isoforms, its abundance estimates highly correlate with those found in the other proteases, and it can detect changes between protein and transcript relative abundances. As expected, the availability of mRNA data improves the accuracy of all our results.

Our model performs well even when only considering the subset of protein isoforms without any detected unique peptide. This indicates that IsoBayes can be used to investigate processes such as alternative splicing, where peptides are typically shared across multiple isoforms from the same gene.

We also found that our results are robust and consistent across proteomics tools (MetaMorpheus, and Percolator), and abundance estimates (PSM counts, and intensities). Finally, we showed that considering the error probability of each peptide (i.e. PEP) allows for a better propagation of the peptide detection uncertainty, compared to using a crude FDR cutoff, and leads to a marginal increase in performance, at the cost of an increased runtime.

Overall, we believe that our tool may be of great utility to life scientists and computational biologists, who aim to investigate protein usage at the isoform level. IsoBayes is freely distributed as an R Bioconductor package, and is accompanied by an example vignette and a plotting function; this simplifies its usage, distribution and integration with other bioinformatics and proteomics pipelines.

We would also like to acknowledge some of the limitations of our work. First, while the IsoBayes FDR mode is computationally on par with competitors, the PEP mode is more computationally intensive, which may be problematic in large datasets. Second, while we account for two major sources of measurement noise (i.e. shared peptides mapping to multiple isoforms, and erroneously detected peptides), other sources of technical uncertainty are neglected, such as missing peptides, peptide detectability levels (i.e. distinct peptides have a different likelihood of being detected), and inaccuracies in the protein reference database. Third, our validation relies on a subset of reliable peptides (uniquely associated with an isoform, and with an FDR below 0.01); while we believe this is a reasonable and reliable approach, we are aware that a proper ground truth is missing.

Finally, we conclude with a look at future perspectives. In our view, this work is not a stand-alone project, but rather lays the foundations for the development of future methods for proteomics/proteogenomics inference at the isoform level. In particular, we aim to extend IsoBayes to embed multiple samples (i.e. biological replicates) within a Bayesian hierarchical framework. This will enable performing differential testing between experimental conditions (e.g. healthy versus diseased, or treated versus untreated) at the isoform-level, hence identifying changes, across groups, in protein isoform abundance, and alternative splicing patterns. A second extension concerns the application to single-cell proteomics data. In this case, we aim to investigate how protein isoform abundances vary across cell types, and study the fraction of cells each isoform is detected in.


Acknowledgements

We acknowledge the HPC at the data center of Engineering D.HUB in Pont-Saint-Martin.

Contributor Information

Jordy Bollon, Computational and Chemical Biology, Italian Institute of Technology, Genova 16163, Italy; Astronomical Observatory of the Autonomous Region of the Aosta Valley (OAVdA), Nus 11020, Italy.

Michael R Shortreed, Department of Chemistry, University of Wisconsin-Madison, Madison, WI 53706, United States.

Erin Jeffery, Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, VA 22903, United States.

Ben T Jordan, Frederick National Laboratory for Cancer Research, Frederick, MD 21701, United States.

Rachel Miller, Department of Chemistry, University of Wisconsin-Madison, Madison, WI 53706, United States.

Andrea Cavalli, Computational and Chemical Biology, Italian Institute of Technology, Genova 16163, Italy; Centre Européen de Calcul Atomique et Moléculaire, École Polytechnique Fédérale de Lausanne, Lausanne 1015, Switzerland.

Lloyd M Smith, Department of Chemistry, University of Wisconsin-Madison, Madison, WI 53706, United States.

Colin N Dewey, Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53726, United States.

Gloria M Sheynkman, Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, VA 22903, United States.

Simone Tiberi, Department of Statistical Sciences, University of Bologna, Bologna 40126, Italy.

Author contributions

Jordy Bollon (Formal analysis [lead], Software [supporting], Validation [lead], Writing—original draft [supporting], Writing—review & editing [supporting]), Michael R. Shortreed (Data curation [equal], Formal analysis [supporting]), Erin Jeffery (Data curation [equal], Writing—original draft [supporting], Writing—review & editing [supporting]), Ben Jordan (Data curation [equal]), Rachel Miller (Data curation [equal]), Andrea Cavalli (Writing—original draft [supporting]), Lloyd Smith (Writing—original draft [supporting]), Colin Dewey (Conceptualization [supporting], Writing—original draft [supporting], Writing—review & editing [supporting]), Gloria Sheynkman (Conceptualization [lead], Data curation [supporting], Funding acquisition [lead], Project administration [equal], Supervision [supporting], Writing—original draft [supporting], Writing—review & editing [supporting]), and Simone Tiberi (Conceptualization [lead], Formal analysis [lead], Methodology [lead], Project administration [equal], Software [lead], Supervision [lead], Validation [supporting], Writing—original draft [lead], Writing—review & editing [lead])

Supplementary data

Supplementary data is available at Bioinformatics online.

Conflict of interest: None declared.

Funding

This work was supported by National Institute of General Medical Sciences (NIGMS) grant R01GM147653 to M.R.S. and L.M.S., and by the European Union (EU) European Social Fund (ESF) to J.B.

Data availability

IsoBayes is freely available as a Bioconductor R package at: https://bioconductor.org/packages/IsoBayes. All scripts used for our analyses are available on GitHub (https://github.com/SimoneTiberi/IsoBayes_manuscript, version v2) and Zenodo (DOI: 10.5281/zenodo.10203419). The processed jurkat and WTC-11 MS and RNA-seq datasets are available in figshare at https://figshare.com/projects/IsoBayes_paper_data/183988. The raw files are also available as follows: (i) jurkat short-read RNA-seq data (Sheynkman et al. 2013) via GEO, with accession GSE45428 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi); (ii) WTC-11 long-read RNA-seq data (de Souza et al. 2023) via ENCODE, with id ENCFF961HLO (https://www.encodeproject.org/files/ENCFF961HLO/); (iii) LC-MS jurkat data (Miller et al. 2019) via MASSIVE, with id MSV000083304 (https://massive.ucsd.edu/ProteoSAFe/dataset.jsp?accession=MSV000083304); (iv) LC-MS WTC-11 data via ProteomeXchange Consortium, through the PRIDE partner repository, with id PXD064794 (https://www.ebi.ac.uk/pride/archive/projects/PXD064794).

References

  1. Bray NL, Pimentel H, Melsted P et al. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol 2016;34:525–7.
  2. Bryan K, Jarboui M-A, Raso C et al. HiQuant: rapid postquantification analysis of large-scale MS-generated proteomics data. J Proteome Res 2016;15:2072–9.
  3. Carlyle BC, Kitchen RR, Zhang J et al. Isoform-level interpretation of high-throughput proteomics data enabled by deep integration with RNA-seq. J Proteome Res 2018;17:3431–44.
  4. Cox J, Mann M. MaxQuant enables high peptide identification rates, individualized ppb-range mass accuracies and proteome-wide protein quantification. Nat Biotechnol 2008;26:1367–72.
  5. de Sousa Abreu R, Penalva LO, Marcotte EM et al. Global signatures of protein and mRNA expression levels. Mol Biosyst 2009;5:1512–26.
  6. de Souza VB, Jordan BT, Tseng E et al. Transformation of alignment files improves performance of variant callers for long-read RNA sequencing data. Genome Biol 2023;24:91.
  7. Deveson IW, Brunck ME, Blackburn J et al. Universal alternative splicing of noncoding exons. Cell Syst 2018;6:245–55.e5.
  8. Edfors F, Danielsson F, Hallström BM et al. Gene-specific correlation of RNA and protein levels in human cells and tissues. Mol Syst Biol 2016;12:883.
  9. Gelfand AE, Smith AF. Sampling-based approaches to calculating marginal densities. J Am Stat Assoc 1990;85:398–409.
  10. Geman S, Geman D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans Pattern Anal Mach Intell 1984;6:721–41.
  11. Käll L, Canterbury JD, Weston J et al. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat Methods 2007;4:923–5.
  12. Liu Y, Beyer A, Aebersold R. On the dependency of cellular protein levels on mRNA abundance. Cell 2016;165:535–50.
  13. Liu Y, Gonzàlez-Porta M, Santos S et al. Impact of alternative splicing on the human proteome. Cell Rep 2017;20:1229–41.
  14. Lu P, Vogel C, Wang R et al. Absolute protein expression profiling estimates the relative contributions of transcriptional and translational regulation. Nat Biotechnol 2007;25:117–24.
  15. Ma C, Xu S, Liu G et al. Improvement of peptide identification with considering the abundance of mRNA and peptide. BMC Bioinformatics 2017;18:109.
  16. Ma W-T, Liu Z-Y, Chen X-Z et al. A protein identification algorithm for tandem mass spectrometry by incorporating the abundance of mRNA into a binomial probability scoring model. J Proteomics 2019;197:53–9.
  17. Maier T, Güell M, Serrano L. Correlation of mRNA and protein in complex biological samples. FEBS Lett 2009;583:3966–73.
  18. Miller RM, Millikin RJ, Hoffmann CV et al. Improved protein inference from multiple protease bottom-up mass spectrometry data. J Proteome Res 2019;18:3429–38.
  19. Miller RM, Jordan BT, Mehlferber MM et al. Enhanced protein isoform characterization through long-read proteogenomics. Genome Biol 2022;23:69.
  20. Nesvizhskii AI, Keller A, Kolker E et al. A statistical model for identifying proteins by tandem mass spectrometry. Anal Chem 2003;75:4646–58.
  21. Pfeuffer J, Sachsenberg T, Dijkstra TM et al. EPIFANY: a method for efficient high-confidence protein inference. J Proteome Res 2020;19:1060–72.
  22. Ramakrishnan SR, Vogel C, Prince JT et al. Integrating shotgun proteomics and mRNA expression data to improve protein identification. Bioinformatics 2009;25:1397–403.
  23. Röst HL, Sachsenberg T, Aiche S et al. OpenMS: a flexible open-source software platform for mass spectrometry data analysis. Nat Methods 2016;13:741–8.
  24. Salovska B, Zhu H, Gandhi T et al. Isoform-resolved correlation analysis between mRNA abundance regulation and protein level degradation. Mol Syst Biol 2020;16:e9170.
  25. Serang O, MacCoss MJ, Noble WS. Efficient marginalization to compute protein posterior probabilities from shotgun mass spectrometry data. J Proteome Res 2010;9:5346–57.
  26. Sheynkman GM, Shortreed MR, Frey BL et al. Discovery and mass spectrometric analysis of novel splice-junction peptides using RNA-Seq. Mol Cell Proteomics 2013;12:2341–53.
  27. Solntsev SK, Shortreed MR, Frey BL et al. Enhanced global post-translational modification discovery with MetaMorpheus. J Proteome Res 2018;17:1844–51.
  28. Tanner MA, Wong WH. The calculation of posterior distributions by data augmentation. J Am Stat Assoc 1987;82:528–40.
  29. The M, MacCoss MJ, Noble WS et al. Fast and accurate protein false discovery rates on large-scale proteomics data sets with Percolator 3.0. J Am Soc Mass Spectrom 2016;27:1719–27.
  30. Uszkoreit J, Maerkens A, Perez-Riverol Y et al. PIA: an intuitive protein inference engine with a web-based user interface. J Proteome Res 2015;14:2988–97.
  31. Nesvizhskii AI, Aebersold R. Interpretation of shotgun proteomic data: the protein inference problem. Mol Cell Proteomics 2005;4:1419–40.
  32. Wang D, Eraslan B, Wieland T et al. A deep proteome and transcriptome abundance atlas of 29 healthy human tissues. Mol Syst Biol 2019;15:e8503.
  33. Willyard C. Expanded human gene tally reignites debate. Nature 2018;558:354–5.



Articles from Bioinformatics are provided here courtesy of Oxford University Press
