. Author manuscript; available in PMC: 2022 Dec 9.
Published in final edited form as: Ann Appl Stat. 2022 Dec 1;16(4):22-aoas1603. doi: 10.1214/22-AOAS1603

Semi-Supervised Non-Parametric Bayesian Modelling of Spatial Proteomics

Oliver M Crook, Kathryn S Lilley, Laurent Gatto, Paul DW Kirk
PMCID: PMC7613899  EMSID: EMS143956  PMID: 36507469

Abstract

Understanding sub-cellular protein localisation is an essential component in the analysis of context specific protein function. Recent advances in quantitative mass-spectrometry (MS) have led to high resolution mapping of thousands of proteins to sub-cellular locations within the cell. Novel modelling considerations to capture the complex nature of these data are thus necessary. We approach analysis of spatial proteomics data in a non-parametric Bayesian framework, using K-component mixtures of Gaussian process regression models. The Gaussian process regression model accounts for correlation structure within a sub-cellular niche, with each mixture component capturing the distinct correlation structure observed within each niche. The availability of marker proteins (i.e. proteins with a priori known labelled locations) motivates a semi-supervised learning approach to inform the Gaussian process hyperparameters. We moreover provide an efficient Hamiltonian-within-Gibbs sampler for our model. Furthermore, we reduce the computational burden associated with inversion of covariance matrices by exploiting the structure in the covariance matrix. A tensor decomposition of our covariance matrices allows extended Trench and Durbin algorithms to be applied to reduce the computational complexity of inversion and hence accelerate computation. We provide detailed case-studies on Drosophila embryos and mouse pluripotent embryonic stem cells to illustrate the benefit of semi-supervised functional Bayesian modelling of the data.

Keywords and phrases: proteomics, Bayesian mixture models, semi-supervised learning

1. Introduction

Proteins are biomolecules that have a diverse set of functional roles within a cell enabling proliferation and survival. For a protein to be able to perform its function(s), it must interact with other binding partners and substrates, which requires it to localise to the correct sub-cellular compartment (Gibson, 2009). There is mounting evidence implicating aberrant protein localisation in disease, including cancer and obesity (Olkkonen and Ikonen, 2006; Laurila and Vihinen, 2009; Luheshi, Crowther and Dobson, 2008; De Matteis and Luini, 2011; Cody, Iampietro and Lécuyer, 2013; Kau, Way and Silver, 2004; Rodriguez, Au and Henderson, 2004; Latorre et al., 2005; Shin et al., 2013; Siljee et al., 2018). Mapping the location of proteins within the cell using high-resolution spatial proteomic approaches is thus of high utility in the characterisation of therapeutic targets and in determining pathobiological mechanisms (Cook and Cristea, 2019). To interrogate the sub-cellular locations of thousands of proteins per experiment, recent advances in high-throughput spatial proteomics (Christoforou et al., 2016; Mulvey et al., 2017; Geladaki et al., 2019), followed by rigorous data analysis (Gatto et al., 2010) can be applied. As we elaborate in our exposition below, the methodology relies on the observation that each organelle (or, more generally, each sub-cellular niche) can be characterised by a subcellular fractionation profile that is shared by the proteins that localise to that organelle (De Duve and Beaufay, 1981). Applications of spatial proteomics experiments and analyses have enabled organelle-specific localisation to be determined for many proteins in many systems (Dunkley et al., 2006; Tan et al., 2009; Hall et al., 2009; Breckels et al., 2013), including mouse pluripotent stem cells (mESCs) (Christoforou et al., 2016) and cancer cell lines (Thul et al., 2017). 
Mass spectrometry (MS) based spatial proteomics, which is what we consider here, has gained in popularity in recent years with several recent applications across many different organisms (Christoforou et al., 2016; Beltran, Mathias and Cristea, 2016; Jadot et al., 2017; Itzhak et al., 2017; Mendes et al., 2017; Hirst et al., 2018; Davies et al., 2018; Orre et al., 2019; Nightingale et al., 2019; Shin et al., 2019; Barylyuk et al., 2020).

An overview of a typical spatial proteomics experiment is provided in Figure 1A. First, cells are gently lysed to expose the cellular content while preserving the integrity of the organelles. The cellular content is then separated using, for example, differential centrifugation (Itzhak et al., 2016; Geladaki et al., 2019; Orre et al., 2019) or equilibrium density centrifugation (Dunkley et al., 2004, 2006; Christoforou et al., 2016), among others (Parsons, Fernández-Niño and Heazlewood, 2014; Heard et al., 2015). After centrifugation, the proteins present in the fractions generated by this process are then extracted. The protein abundance of each protein in each fraction is then determined experimentally using high accuracy mass-spectrometry. This gives, for each protein, an abundance profile across the sub-cellular fractions.

Fig 1. An overview of the experimental design of a spatial proteomics experiment using density-gradient centrifugation.


(A) Cellular content is loaded onto a preformed iodixanol density gradient. The tube is then subject to centrifugation, typically at 10^6 g for 8 hours. After centrifugation, organelles have migrated to their buoyant densities and proteins localised to these organelles will be more abundant in that part of the density gradient. (B) Discrete fractions are collected along the density gradient. Proteins localised to the same organelle share characteristic distributions across the fractions. (C) Organelles are assumed to be characterised by a smooth latent probability density function p(x). An example characteristic probability density is shown for organelle B, with fractions a, b and c indicated with assumed fixed depth Δ. (D) Observed abundance profile for a protein belonging to organelle B, after high-accuracy mass-spectrometry. (E) Proteins with a priori known localisation are annotated. Proteins from the same sub-cellular niche share the same (median-centered) abundance profiles.

In the LOPIT (Localisation of Organelle Proteins by Isotope Tagging; Dunkley et al., 2004, 2006; Sadowski et al., 2006) and hyperLOPIT (Christoforou et al., 2016; Mulvey et al., 2017) approaches, cell lysis is followed by the separation of sub-cellular components along a continuous density gradient based on their buoyant density. Discrete fractions along this gradient are then collected, multiplexed using tandem mass tags (TMT) (Thompson et al., 2003), and protein distributions revealing organelle-specific correlation profiles within the fractions are achieved using synchronous precursor selection mass-spectrometry (SPS-MS3). The resultant data are annotated using marker proteins; that is, proteins with unambiguous single localisations from the literature or appropriate databases such as the Human Protein Atlas (HPA) (Thul et al., 2017) or Gene Ontology (Ashburner et al., 2000); see Gatto et al. (2014a) for discussion. We therefore know a priori the unique sub-cellular niche to which each marker protein localises, and hence these proteins define a labelled training dataset comprising proteins for which we know the corresponding sub-cellular niche localisations (class labels). We denote by K the number of distinct sub-cellular niches that appear in this training dataset; i.e. K is the number of classes. Typical spatial proteomics experiments can now provide information on several thousands of proteins; for example, 5,032 were quantified for the mESC application (Section 3.2). Modern experiments are expected to resolve all major sub-cellular niches, but the precise number depends on experimental design. Indeed, in the Drosophila application (Section 3.1) no cytosolic component is observed because the supernatant, enriched in cytosolic proteins, was discarded; i.e. all proteins belonging to the “cytosol” class were removed from the experiment.
Furthermore, eukaryote cells with more complex subcellular organisation are likely to have more subcellular niches observed, if the data are sufficiently resolved (Barylyuk et al., 2020). The experimental design (and thus the organelle separation) may be validated prior to quantitative analysis using western blotting (Mulvey et al., 2017).

In work that contributed to the discovery of previously unknown organelles and the award of a Nobel prize, de Duve and colleagues (De Duve, 1969; De Duve and Beaufay, 1981; Blobel, 2013) observed that proteins belonging to the same organelle possessed very similar abundance profiles (Figure 1B). This motivates the following data analysis problem: given the abundance profiles of the marker proteins that are already known to localise to a particular sub-cellular niche (e.g. organelle), can we determine which other proteins might also localise to that niche? In many previous analyses, this problem has been addressed as a black-box classification problem, with partial least squares discriminant analysis (Dunkley et al., 2006; Tan et al., 2009) and the support vector machine (SVM)(Christoforou et al., 2016; Itzhak et al., 2016; Orre et al., 2019) being the most popular approaches. However, other approaches are also used, such as nearest neighbour classifiers (Groen et al., 2014), random forests (Ohta et al., 2010), naive Bayes (Nikolovski et al., 2012) and neural networks (Tardif et al., 2012; Beltran, Mathias and Cristea, 2016). We refer to Gatto et al. (2014a) for a review. Other advances include the use of transfer learning to incorporate additional sources of localisation information (Breckels et al., 2016), and the development and application of outlier detection techniques (Breckels et al., 2013). A recent review of the improvements in resolution of spatial proteomics experiments over the last decade is provided by Gatto, Breckels and Lilley (2019).

The classification approaches listed above have a number of major limitations. For example, they implicitly assume that all proteins can be robustly assigned to a primary location, which will often not be the case, since many proteins function in multiple cellular compartments. Other sources of uncertainty include the inherent stochastic processes involved in MS-based quantitation, as well as each protein’s physical properties, which influence how well it is quantified. Post-translational modifications and the presence of different protein isoforms also add to the challenge of protein quantification. Furthermore, many elements of the experimental procedure are variable and context specific, such as cell lysis, formation of the density gradients and protein extraction. In addition, organelle integrity may be disrupted during many of the downstream processing steps. Hence, there are many factors that contribute to the challenge of making protein-niche associations.

Crook et al. (2018) demonstrated the importance of uncertainty quantification in spatial proteomics analysis. This study developed a generative mixture model of MS spatial proteomics data and, using this model, computed posterior distributions of protein localisation probabilities. However, this model made a number of assumptions that simplified the analysis, but which do not accurately reflect the data generating process. In the present manuscript, we develop a generative model for the data that is more clearly motivated by the data generating process.

1.1. Model development

Let x be the spatial axis along which density gradient separation occurs (see Figure 1A), and let x1 < x2 be two distinct points along x. We assume that the k-th organelle may be characterised by a smooth latent probability density function, pk(x) (Figure 1C), such that, for any protein i that uniquely localises to the k-th organelle, the (unobserved) absolute quantity of protein i in the region [x1, x2] after separation is given by:

q_k(x_1, x_2) = \int_{x=x_1}^{x_2} p_k(x) \, dx. (1.1)

In a spatial proteomics experiment, quantification occurs in discrete fractions, which we assume to be of approximately the same depth, Δ. Thus, an idealised spatial proteomics experiment would provide us with the quantities qk(xj, xj + Δ), where {x1, … , xD} is a grid of spatial coordinates. To simplify notation, we write qk(xj) to mean qk(xj, xj + Δ), i.e. for any protein that uniquely localises to the k-th organelle, qk(xj) is the absolute quantity of that protein in the fraction spanning the region from xj to xj + Δ.
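The relationship between the latent density and the idealised fraction quantities can be made concrete numerically. The sketch below (Python, purely illustrative; the Gaussian density standing in for p_k and the fraction grid are hypothetical) integrates a toy density over fractions of fixed depth Δ, as in equation (1.1):

```python
import numpy as np

def fraction_quantities(pk, grid, delta, n_sub=200):
    """Numerically integrate a latent density pk over each fraction
    [x_j, x_j + delta] to obtain the idealised quantities q_k(x_j)."""
    q = []
    for xj in grid:
        xs = np.linspace(xj, xj + delta, n_sub)
        q.append(np.trapz(pk(xs), xs))  # q_k(x_j) = integral over [x_j, x_j + delta]
    return np.array(q)

# Toy example: a Gaussian density standing in for organelle k's profile p_k(x)
pk = lambda x: np.exp(-0.5 * ((x - 0.5) / 0.1) ** 2) / (0.1 * np.sqrt(2 * np.pi))
grid = np.array([0.0, 0.25, 0.5, 0.75])   # D = 4 fraction start points
q = fraction_quantities(pk, grid, delta=0.25)
print(q, q.sum())  # fractions covering [0, 1] capture almost all of the density
```

A real experiment observes only a transformed, noisy version of these quantities, which is what motivates the functional modelling below.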

In practice, current spatial proteomics experiments are unable to determine absolute quantities. We assume that the abundances provided by current spatial proteomics experiments can be expressed as a continuous deterministic function, h, of the absolute quantities, such that the measured abundance, μk(xj) of protein i in the interval from xj to xj + Δ can be expressed as μk(xj) = h(qk(xj)); see Figure 1D. Since both h and qk are unknown, we adopt a functional data analysis approach and treat μk as an unknown function to be inferred. We learn μk using data from proteins whose localisation to organelle k is already known (see Figure 1E), and use a semi-supervised approach to further improve the inference of μk using data from proteins whose allocations to organelles are unknown a priori (see Section 2.4.4). The number of tandem mass tags available limits the number, D, of discrete observations we can observe and hence the resolution of the experiment. For example, in the case studies that we consider later, D = 4 for the Drosophila example (Tan et al., 2009), whereas for the mouse embryonic stem cell (mESCs) example D = 10 (Christoforou et al., 2016). As TMT chemistry improves, it is expected that more complex designs will become available.

Functional data analysis concerns the analysis of data in which the observation for each subject is a function (Ramsay, 2004). Wang, Chiou and Müller (2016) recently reviewed the current major approaches in functional data analysis, including functional principal component analysis (Jones and Rice, 1992), functional linear regression (Morris, 2015), functional clustering (James and Sugar, 2003) and functional classification (Preda, Saporta and Lévéder, 2007). For classification, the linear discriminant analysis method was extended to the functional setting using splines (James and Hastie, 2001). Mixture discriminant analysis in the functional setting, applied to model bike sharing data, was considered by Bouveyron et al. (2015), using a functional EM algorithm. Bayesian approaches to functional classification have also been considered, such as the wavelet-based functional mixed model approach (Zhu, Brown and Morris, 2012), and Bayesian variable selection has been extended to the functional setting (Zhu, Vannucci and Cox, 2010). Rodríguez, Dunson and Gelfand (2009) use dependent Dirichlet processes in the non-parametric Bayesian setting to cluster functional data. The Gaussian process approach to analysing functional data has been applied extensively in biomedical settings (Honkela et al., 2010; Liu et al., 2010; Stegle et al., 2010; Kalaitzis and Lawrence, 2011a; Heinonen et al., 2014; Topa et al., 2015).

We assume each quantitative protein profile can be described by some unknown function, with the uncertainty in this function captured using a Gaussian process (GP) prior. Each sub-cellular niche is described by distinct density-gradient profiles, which display a non-linear structure with no particular parametric assumption being suitable. The contrasting density-gradient profiles are captured as components in a mixture of Gaussian process regression models. Gaussian process regression models have been applied extensively and we refer to Rasmussen (2004) and Rasmussen and Williams (2006) for the general theory. In molecular biology and functional genomics the focus of many applications has been on expression time-series data, where sophisticated models have been developed (Kirk and Stumpf, 2009; Cooke et al., 2011; Kalaitzis and Lawrence, 2011b; Kirk et al., 2012; Hensman, Lawrence and Rattray, 2013). We remark that many of these applications consider unsupervised clustering problems. In contrast, here we have (partially) labelled data (since the localisations of the marker proteins are known prior to our experiments) and so we may consider semi-supervised approaches. We explore inference of GP hyperparameters in two ways: firstly, an empirical Bayes approach in which the hyperparameters are optimised by maximising a marginal likelihood; secondly, by placing priors over these GP hyperparameters and performing fully Bayesian inference using labelled and unlabelled data.

A number of computational aspects need to be considered if inference is to be applied to spatial proteomics data. The first is that correlation in the GP hyperparameters can lead to slow exploration of the posterior; we therefore use Hamiltonian evolutions to propose global moves through our probability space (Duane et al., 1987), avoiding the random-walk behaviour evident in traditional symmetric random-walk proposals (Metropolis et al., 1953; Beskos et al., 2013). Hamiltonian Monte-Carlo (HMC) has been explored previously for hyperparameter inference in GP regression (Williams and Rasmussen, 1996), and here we show that HMC can be up to an order of magnitude more efficient than a Metropolis-Hastings approach. Furthermore, a particularly costly computation in our model is the computation of the marginal likelihood (and its gradient) associated with each mixture component, which involves the inversion of a large covariance matrix; even storage of such a matrix can be challenging. We demonstrate that a simple tensor decomposition of the covariance matrix allows application of fast matrix algorithms for covariance inversion and low memory storage (Zhang, Leithead and Leith, 2005).
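To illustrate why Hamiltonian proposals help on correlated targets, the sketch below runs a generic HMC sampler on a toy, strongly correlated bivariate Gaussian standing in for a correlated hyperparameter posterior. This is a minimal textbook implementation in Python, not the sampler of this paper; the target, step size and trajectory length are all hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy correlated Gaussian target; U is the negative log density
Sigma = np.array([[1.0, 0.95], [0.95, 1.0]])
P = np.linalg.inv(Sigma)
U = lambda q: 0.5 * q @ P @ q
grad_U = lambda q: P @ q

def hmc_step(q, eps=0.1, L=25):
    """One HMC transition: sample momentum, run leapfrog, accept/reject."""
    p = rng.standard_normal(q.size)
    q_new, p_new = q.copy(), p.copy()
    p_new -= 0.5 * eps * grad_U(q_new)        # initial half momentum step
    for _ in range(L):
        q_new += eps * p_new                  # full position step
        p_new -= eps * grad_U(q_new)          # full momentum step
    p_new += 0.5 * eps * grad_U(q_new)        # undo the extra half step
    dH = (U(q_new) + 0.5 * p_new @ p_new) - (U(q) + 0.5 * p @ p)
    return q_new if np.log(rng.uniform()) < -dH else q

q = np.zeros(2)
samples = []
for _ in range(2000):
    q = hmc_step(q)
    samples.append(q)
samples = np.asarray(samples)
print(samples.mean(axis=0), np.corrcoef(samples.T)[0, 1])
```

The long deterministic trajectories move across the ridge of the target in one proposal, which is the efficiency gain over symmetric random-walk moves referred to above.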

2. Methods

We provide an overview of the key modelling choices and considerations below. A comprehensive mathematical summary of the model and inference procedure is provided in the supplementary article (Crook et al., 2021).

2.1. Modelling protein abundances along the density gradient

In our experiment, we make discrete observations along a continuous density gradient yi = [yi(x1), …, yi(xD)], where yi(xj) indicates the measurement of protein i in the fraction spanning the spatial region from xj to xj + Δ along the density gradient. We assume that protein intensity yi varies smoothly with the distance along the density-gradient. We then define the following regression model for the measured abundance of protein i as a function of the spatial coordinate x:

y_i(x) = \mu_i(x) + \epsilon_i, (2.1)

where μ_i is an unknown deterministic function and ϵ_i a noise variable. For simplicity, we assume that \epsilon_i \overset{\text{iid}}{\sim} N(0, \sigma_i^2), and remark that more elaborate noise models could be chosen, but at additional computational cost and greater model complexity. Proteins are grouped together according to their sub-cellular localisation, with all proteins associated with sub-cellular niche k = 1, …, K sharing the same regression model; that is, μ_i = μ_k and σ_i = σ_k for all proteins in the k-th sub-cellular niche. For clarity, we refer to sub-cellular structures, whether organelles, vesicles or large multi-protein complexes, as components. Thus proteins associated with component k can be modelled as i.i.d. draws from a multivariate Gaussian random variable with mean vector μ_k = [μ_k(x_1), …, μ_k(x_D)] and covariance matrix \sigma_k^2 I_D. To perform inference for the unknown function μ_k, as is typical for spatially correlated data (Gelfand, Kottas and MacEachern, 2005; Steel and Fuentes, 2010), we specify a Gaussian Process (GP) prior for each μ_k:

\mu_k(x) \sim \mathcal{GP}(m_k(x), C_k(x, x')), (2.2)

where m_k(x) is the mean function, and C_k(x, x′) is the covariance function (sometimes also known as the kernel function) of the GP prior; see Rasmussen and Williams (2006). Each component is thus captured by a Gaussian process regression model. The full complement of proteins is then modelled as a K-component mixture of Gaussian process regression models, plus an “outlier component” to model proteins that are not captured well by any of the K known sub-cellular components. We provide a brief overview of Bayesian K-component mixtures in the next section, describe the modelling of outliers in Section 2.3, and further discuss the specification of the GP prior, including hyperparameter inference, in Section 2.4.

2.2. Finite mixture models

This section provides a brief review of finite mixture models (see, for example (Lavine and West, 1992; Fraley and Raftery, 2007) for more details). Finite mixture models are of the form,

p(y \mid \pi, \theta) = \sum_{k=1}^{K} \pi_k f(y \mid \theta_k), (2.3)

where K is the number of mixture components, π_k are the mixture proportions, and f(y ∣ θ_k) are the component densities. In our application, each mixture component corresponds to a distinct sub-cellular niche, and θ_k is shorthand for μ_k and σ_k, as described in Section 2.1 above. As described in the introduction, our data include a set of marker proteins whose sub-cellular niche localisations are known a priori. Thus, K is known from the outset, since we assume that we have at least one marker protein localising to each sub-cellular niche; i.e. we assume that all classes are represented among our labelled data (see Crook et al., 2020, for a relaxation of this assumption).

We assume each component density to have the same parametric form, but with component specific parameters, θk. We denote the prior for these unknown component parameters by g0(θ). We suppose that we have a collection of n data points, Y = {y1, … , yn} that we seek to model using equation (2.3). We associate with each of these data points a component indicator variable, zi ∈ {1, … , K}, which indicates which component generated observation yi. In our initial exposition, we consider the unsupervised case where all zi are unknown, and then describe how we take into account the marker proteins, which are (labelled) proteins for which the zi are known. As is common for mixture models, we perform Gibbs sampling for the zi and πk, by sampling from the conditionals described below.

Conditional for π: If we assign the mixture proportions a symmetric Dirichlet prior with concentration parameter α/K, then we may marginalise the πk (Murphy, 2012), or sample them. Although sampling these parameters can lead to increased posterior variance (Gelfand and Smith, 1990; Casella and Robert, 1996), it can be computationally advantageous. Conjugacy of the Dirichlet prior and multinomial likelihood means that the conditional posterior distribution of the mixing proportions given the component indicator variables is also Dirichlet,

\pi \mid z_1, \dots, z_n, \alpha \sim \text{Dir}(\alpha/K + n_1, \dots, \alpha/K + n_K), (2.4)

where nk is the number of data points yi for which zi = k. Selecting an appropriate value for α can be challenging. A sensitivity analysis for the choice of α is provided in the supplement and we show that α = 1 is a good default choice in practice.
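This Gibbs update is straightforward to implement. The sketch below (an illustrative Python fragment, not the paper's R/C++ code; the allocation vector is hypothetical) counts the current allocations and draws π from the Dirichlet conditional of equation (2.4):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture_weights(z, K, alpha=1.0):
    """Gibbs update for pi: draw from Dir(alpha/K + n_1, ..., alpha/K + n_K),
    where n_k counts the current allocations z_i = k (equation 2.4)."""
    n = np.bincount(z, minlength=K)
    return rng.dirichlet(alpha / K + n)

# Hypothetical allocations of 100 proteins across K = 3 components
z = np.array([0] * 70 + [1] * 20 + [2] * 10)
pi = sample_mixture_weights(z, K=3)
print(pi)  # concentrates near the empirical proportions (0.7, 0.2, 0.1)
```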

Conditional for zi: Given the mixing proportions, the prior distribution of zi is categorical with parameter vector π = [π1, … , πK],

P(z_i = k \mid \pi) = \pi_k, (2.5)

and the conditional posterior for z_i is

P(z_i = k \mid \pi, y_i, \theta_k) \propto \pi_k f(y_i \mid \theta_k). (2.6)

In the present application, we have a number of labelled marker proteins for which the component labels are known. If protein j is a marker protein, it is unnecessary to perform inference for zj by sampling from Equation (2.6), and instead zj is fixed from the outset (see supplementary article for details; Crook et al., 2021).
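The allocation update of equation (2.6) is typically computed on the log scale for numerical stability. The sketch below (illustrative Python; the component log-likelihoods and weights are hypothetical values) normalises the conditional probabilities and samples one allocation:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_allocation(log_lik_i, pi):
    """Sample z_i from its conditional (equation 2.6):
    P(z_i = k) proportional to pi_k f(y_i | theta_k), on the log scale."""
    logp = np.log(pi) + log_lik_i
    logp -= logp.max()              # stabilise before exponentiating
    p = np.exp(logp)
    p /= p.sum()
    return rng.choice(len(pi), p=p), p

# Hypothetical log-likelihoods of one protein under K = 3 components
log_lik_i = np.array([-3.0, -1.0, -10.0])
pi = np.array([0.5, 0.4, 0.1])
z_i, p = sample_allocation(log_lik_i, pi)
print(z_i, p)  # component 1 dominates despite its smaller prior weight
```

For marker proteins, this step is simply skipped and z_i is held fixed at its known value.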

2.3. Modelling outliers

Crook et al. (2018) demonstrated that many proteins are not captured well by any known sub-cellular component. This could be because of yet undiscovered biological novelty, technical variation or a manifestation of some proteins residing in multiple localisations. Modelling outliers in mixture models can be challenging (Hennig, 2004; Cooke et al., 2011; Coretto and Hennig, 2016; Murphy and Murphy, 2019). Here, we take the approach of Crook et al. (2018). Briefly, we introduce a binary latent variable φ so that for each protein yi we have a φi ∈ {0, 1} indicating whether yi is modelled by one of the known sub-cellular components or an outlier component. The augmented model becomes the following

p(y_i \mid \pi, \theta, \phi_i) = \sum_{k=1}^{K} \pi_k f(y_i \mid \theta_k)^{\phi_i} g(y_i \mid \Phi)^{1 - \phi_i} = \sum_{k=1}^{K} \pi_k \left( \phi_i f(y_i \mid \theta_k) + (1 - \phi_i) g(y_i \mid \Phi) \right), (2.7)

where g is the density of the outlier component. In our case, we specify g as the density of a multivariate t distribution with degrees of freedom κ = 4, mean M and scale matrix V. M is taken as the empirical global mean of the data, and the scale matrix V as half the empirical covariance of the data. These choices are motivated by considering a Gaussian component with the same mean and covariance but with heavier tails, to better capture dispersed proteins. We remark that other choices of g and its parameters may be suitable and can be tailored to the application at hand. In typical Bayesian fashion, we specify a prior for φ as p_0(φ_i = 0) = ϵ, where ϵ ∼ B(u, v) is Beta-distributed. Marginalising φ in equation 2.7 leads to the following mixture of mixtures (Malsiner-Walli, Frühwirth-Schnatter and Grün, 2017)

p(y_i \mid \pi, \theta) = \sum_{k=1}^{K} \pi_k \left( (1 - \epsilon) f(y_i \mid \theta_k) + \epsilon \, g(y_i \mid \Phi) \right). (2.8)

We can also rewrite the above equation in the following way

p(y_i \mid \pi, \theta) = \sum_{k=1}^{K} \tilde{\pi}_k f(y_i \mid \theta_k) + \tilde{\pi}_0 \, g(y_i \mid \Phi), (2.9)

where \tilde{\pi}_k = \pi_k (1 - \epsilon) for k = 1, …, K and \tilde{\pi}_0 = \epsilon, and evidently \sum_{k=0}^{K} \tilde{\pi}_k = 1. Thus, ϵ can be interpreted as the prior proportion of outliers. The Jeffreys prior in this scenario would set the parameters u and v of the B(u, v) prior to u = v = 1/2 (Jeffreys, 1946), while a uniform prior corresponds to u = v = 1. We prefer to specify a weakly informative prior based on prior data. From independent microscopy data, up to 50% of proteins do not have robust single localisations (Thul et al., 2017), and it is unlikely that there are extremely few outliers (Christoforou et al., 2016). These considerations lead us to place small probability mass on the upper tail ϵ > 0.5 and small probability mass close to 0. Since our prior information comes from different experiments, we only consider a weakly informative Beta prior: ϵ ∼ B(2, 10). A sensitivity analysis is performed in the supplement. All hyperparameter choices are stated in the appendix.
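The marginalised density of equation (2.8) can be evaluated directly. The sketch below (illustrative Python; the two-component means, variances and mixing weights are hypothetical) mixes Gaussian component densities with a heavy-tailed multivariate t outlier density:

```python
import numpy as np
from scipy.stats import multivariate_normal, multivariate_t

def augmented_density(y, pi, mus, sigmas, M, V, eps, kappa=4):
    """Mixture-of-mixtures density of equation (2.8): each component mixes a
    (1 - eps) Gaussian part with an eps-weighted heavy-tailed t outlier part."""
    g = multivariate_t(loc=M, shape=V, df=kappa).pdf(y)   # outlier density
    dens = 0.0
    for pi_k, mu_k, s_k in zip(pi, mus, sigmas):
        f = multivariate_normal(mean=mu_k, cov=s_k**2 * np.eye(len(mu_k))).pdf(y)
        dens += pi_k * ((1 - eps) * f + eps * g)
    return dens

# Hypothetical two-component, D = 2 example
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
pi, sigmas, eps = np.array([0.6, 0.4]), [0.5, 0.5], 0.1
M, V = np.mean(mus, axis=0), 0.5 * np.eye(2)   # global mean, half covariance
d_in = augmented_density(np.array([0.0, 0.0]), pi, mus, sigmas, M, V, eps)
d_out = augmented_density(np.array([10.0, -10.0]), pi, mus, sigmas, M, V, eps)
print(d_in, d_out)  # the far point is carried almost entirely by the t tails
```

The heavy t tails keep the density of dispersed points strictly positive, which is what prevents outliers from being forced into one of the K niches.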

2.4. Gaussian Process prior specification

A Gaussian Process (GP) is a continuous stochastic process such that any finite collection of these random variables is jointly Gaussian. A Gaussian Process prior is uniquely specified by a mean function m and covariance function C, which determine the mean vectors and covariance matrices of the associated multivariate Gaussian distributions. To elaborate, assuming a GP prior for the function μk(x) means that at spatial coordinates x1, …, xD, the joint prior of μk = [μk(x1), …, μk(xD)]T, is multivariate Gaussian with mean vector mk = [mk(x1), …, mk(xD)] and covariance matrix Ck(i, j) = Ck(xi, xj). Given no prior belief about symmetry or periodicity in our deterministic function, we assume our GP is centred with squared exponential covariance function

C_k(x_i, x_j) = a_k^2 \exp\left( -\frac{\| x_i - x_j \|^2}{2 l_k} \right). (2.10)
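Evaluating this covariance function on a grid of fraction coordinates yields the matrix A_k used throughout the derivations below. A minimal Python sketch (the grid and hyperparameter values are hypothetical):

```python
import numpy as np

def sq_exp_cov(x, a2, l):
    """Squared exponential covariance (equation 2.10):
    C(x_i, x_j) = a^2 exp(-||x_i - x_j||^2 / (2 l))."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return a2 * np.exp(-d2 / (2.0 * l))

tau = np.linspace(0.0, 1.0, 4)       # D = 4 fraction coordinates
A = sq_exp_cov(tau, a2=1.0, l=0.1)
print(np.round(A, 3))                # symmetric, positive semi-definite
```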

2.4.1. Marginalising the unknown function

Having adopted a GP prior with component-specific parameters a_k and l_k for each unknown function μ_k, we let the observations associated with component k be denoted by Y_k = {y_{i_1}, …, y_{i_{n_k}}}, where i_1, …, i_{n_k} ∈ {1, …, n} are the indices for which z_{i_1} = ⋯ = z_{i_{n_k}} = k. Our model tells us that

Y_k \mid \mu_k, \sigma_k \sim N(\mu_k, \sigma_k^2 I_D). (2.11)

Then, we can write this as

Y_k(x_1), \dots, Y_k(x_D) \mid \mu_k, \sigma_k \sim N\left( [\mu_k(x_1), \dots, \mu_k(x_D), \dots, \mu_k(x_1), \dots, \mu_k(x_D)], \sigma_k^2 I_{n_k D} \right), (2.12)

where μ_k(x_1), …, μ_k(x_D) is repeated n_k times. Our GP prior tells us

[\mu_k(x_1), \dots, \mu_k(x_D), \dots, \mu_k(x_1), \dots, \mu_k(x_D)] \mid a_k, l_k \sim N(0, C_k), (2.13)

where C_k is an n_k D × n_k D matrix, organised into n_k × n_k square blocks, each of size D × D. The (i, j)-th block of C_k is A_k, the matrix obtained by evaluating the covariance function of the k-th component at the coordinates τ = {x_1, …, x_D}:

C_k = \begin{bmatrix} A_k & A_k & \cdots & A_k \\ A_k & A_k & \cdots & A_k \\ \vdots & \vdots & \ddots & \vdots \\ A_k & A_k & \cdots & A_k \end{bmatrix}. (2.14)
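This block structure is exactly a Kronecker (tensor) product of an all-ones matrix with A_k, which is how the full C_k can be represented without ever materialising it. An illustrative Python sketch (grid and hyperparameters hypothetical):

```python
import numpy as np

def sq_exp_cov(x, a2, l):
    d2 = (x[:, None] - x[None, :]) ** 2
    return a2 * np.exp(-d2 / (2.0 * l))

tau = np.linspace(0.0, 1.0, 4)          # D = 4 fraction coordinates
A = sq_exp_cov(tau, a2=1.0, l=0.2)      # D x D block A_k
nk = 3                                   # proteins currently in component k

# C_k of equation (2.14): an nk x nk grid of identical A_k blocks,
# i.e. the Kronecker product (1 1^T) kron A_k
C = np.kron(np.ones((nk, nk)), A)
print(C.shape)
```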

Letting \rho_k = \{a_k^2, l_k\}, we can then marginalise μ_k to obtain,

Y_k(x_1), \dots, Y_k(x_D) \mid \rho_k, \sigma_k^2 \sim N(0, C_k + \sigma_k^2 I_{n_k D}), (2.15)

thus avoiding inference of μk. Let Yk(τ) denote the vector of length nk × D equal to [y1(x1), …, y1(xD), … , ynk(x1), …, ynk(xD)]. Then we may rewrite equation 2.6 by marginalising μk to obtain:

P(z_i = k \mid z_{-i}) \propto \pi_k \int p(y_i \mid \mu_k) \, p(\mu_k \mid \rho_k, Y_{-i,k}(\tau)) \, d\mu_k, (2.16)

where Y−i,k(τ) is equal to Yk(τ) with observation i removed.

2.4.2. Tensor decomposition of the covariance matrix for fast inference

Our covariance matrix has a particularly simple structure allowing us to exploit extended Trench and Durbin algorithms for fast matrix computations (Zhang, Leithead and Leith, 2005). Full derivations and step by step algorithms for computing this inverse and determinant can be found in the supplementary article (Crook et al., 2021).
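The extended Trench and Durbin algorithms themselves are derived in the supplement. As a simpler, self-contained illustration of why the tensor structure matters, note that C_k = (1 1^T) ⊗ A_k is low rank, so the Woodbury identity reduces the n_kD × n_kD inversion of C_k + σ²I to a single D × D solve. The Python sketch below (toy sizes, hypothetical hyperparameters; not the authors' algorithm) checks the fast inverse against direct inversion:

```python
import numpy as np

D, nk, s2 = 4, 5, 0.3                    # fractions, proteins, noise variance
tau = np.linspace(0.0, 1.0, D)
A = np.exp(-(tau[:, None] - tau[None, :]) ** 2 / (2 * 0.2))  # D x D block A_k

# C_k + s2 I = U A U^T + s2 I with U = 1_{nk} kron I_D, so by Woodbury:
# (s2 I + U A U^T)^{-1} = I/s2 - U (A^{-1} + (nk/s2) I)^{-1} U^T / s2^2
U = np.kron(np.ones((nk, 1)), np.eye(D))
inner = np.linalg.inv(np.linalg.inv(A) + (nk / s2) * np.eye(D))  # D x D only
fast_inv = np.eye(nk * D) / s2 - (U @ inner @ U.T) / s2**2

direct_inv = np.linalg.inv(np.kron(np.ones((nk, nk)), A) + s2 * np.eye(nk * D))
print(np.max(np.abs(fast_inv - direct_inv)))  # agreement to numerical precision
```

Only D × D matrices are ever inverted or stored densely, which is the source of both the speed-up and the low memory footprint.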

2.4.3. Sampling the underlying function

Whilst it is often mathematically convenient to marginalise the unknown function μ_k, from a computational perspective it is not always advantageous to do so. To be precise, marginalising μ_k induces dependencies among the observations; that is, we can no longer exploit the conditional independence structure given the underlying function μ_k. After marginalising, Gibbs moves must be made sequentially for each protein in turn, which can slow down computation.

The alternative approach is to sample the underlying function and exploit conditional independence. Once a sample is obtained from the GP posterior on μ_k, conditional independence allows us to compute the likelihood for all proteins at once, exploiting vectorisation. If there is a particularly large number of observations in each component, it is also possible to parallelise computation over the components k = 1, …, K.
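The gain from conditional independence is concrete: given a sampled μ_k, the Gaussian log-likelihood of every protein in the component is one vectorised expression. An illustrative Python sketch (the sampled function and the protein profiles are simulated stand-ins):

```python
import numpy as np

rng = np.random.default_rng(4)
D, n, s2 = 4, 100, 0.2

mu_k = rng.standard_normal(D)            # a draw of mu_k from its GP posterior
Y = mu_k + np.sqrt(s2) * rng.standard_normal((n, D))   # n protein profiles

# Given mu_k, proteins are conditionally independent, so the Gaussian
# log-likelihood of all n proteins is a single vectorised expression:
resid = Y - mu_k
log_lik = -0.5 * (resid**2).sum(axis=1) / s2 - 0.5 * D * np.log(2 * np.pi * s2)
print(log_lik.shape)  # one log-likelihood per protein, computed at once
```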

2.4.4. Gaussian process hyperparameter inference

To complete the specification of the GP prior, we need either to fix the hyperparameters a_k^2, l_k and \sigma_k^2 at the outset, or to perform inference for these quantities. We consider two strategies for dealing with the hyperparameters: supervised optimisation and semi-supervised inference.

Supervised approach: optimising the hyperparameters

Our first strategy is to fix the hyperparameters at the outset, by maximising the marginal likelihood using only the labelled data. The marginal likelihood can be obtained quickly by recalling that

Y_k(x_1), \dots, Y_k(x_D) \mid \rho_k, \sigma_k^2 \sim N(0, C_k + \sigma_k^2 I_{n_k D}). (2.17)

Thus the log marginal likelihood is given by

\log p(Y_k \mid \tau, \rho_k, \sigma_k^2) = -\frac{1}{2} Y_k(\tau) (C_k + \sigma_k^2 I_{n_k D})^{-1} Y_k(\tau)^T - \frac{1}{2} \log |C_k + \sigma_k^2 I_{n_k D}| - \frac{n_k D}{2} \log 2\pi. (2.18)

For convenience of notation, set \hat{C}_k = C_k + \sigma_k^2 I_{n_k D}. To maximise the marginal likelihood in equation 2.18, we find the partial derivatives with respect to the parameters (Rasmussen, 2004); hence, we can use a gradient-based optimisation procedure. Positivity constraints on a_k^2, l_k and \sigma_k^2 are dealt with by re-parametrisation and so, dropping the dependence on k for notational convenience, and abusing notation, we set l = exp(ν_1), a^2 = exp(2ν_2) and σ^2 = exp(2ν_3). Application of the quasi-Newton L-BFGS algorithm (Liu and Nocedal, 1989) for numerical optimisation of the marginal likelihood with respect to the hyperparameters is now straightforward. L-BFGS can only find a local optimum, and so we initialise over a grid of values. We terminate the algorithm when successive iterations of the gradient are less than 10^{-8}. We make extensive use of high performance R packages to interface with C++ (Eddelbuettel and Francois, 2011; Eddelbuettel and Sanderson, 2014).
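The whole pipeline — reparametrised hyperparameters, marginal likelihood of equation (2.18), multiple starting values — can be sketched compactly. The Python fragment below is purely illustrative (the paper's implementation is in R/C++ with closed-form gradients; here toy marker data are simulated, finite-difference gradients are used, and bounds stand in for the grid initialisation):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
D, nk = 4, 8
tau = np.linspace(0.0, 1.0, D)
Y = np.sin(2 * np.pi * tau) + 0.1 * rng.standard_normal((nk, D))  # toy marker data
y = Y.ravel()                                                      # stacked Y_k(tau)

def neg_log_marginal(nu):
    """-log p(Y_k | tau, rho_k, sigma_k^2) as in equation (2.18), with the
    positivity-preserving reparametrisation l = e^nu1, a^2 = e^{2 nu2}, s^2 = e^{2 nu3}."""
    l, a2, s2 = np.exp(nu[0]), np.exp(2 * nu[1]), np.exp(2 * nu[2])
    A = a2 * np.exp(-(tau[:, None] - tau[None, :]) ** 2 / (2 * l))
    Chat = np.kron(np.ones((nk, nk)), A) + s2 * np.eye(nk * D)
    _, logdet = np.linalg.slogdet(Chat)
    return (0.5 * y @ np.linalg.solve(Chat, y) + 0.5 * logdet
            + 0.5 * nk * D * np.log(2 * np.pi))

# Several starting values guard against local optima
starts = [np.zeros(3), np.array([-1.0, 0.0, -1.0])]
best = min((minimize(neg_log_marginal, x0, method="L-BFGS-B",
                     bounds=[(-5.0, 5.0)] * 3) for x0 in starts),
           key=lambda r: r.fun)
print(np.exp(best.x), best.fun)
```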

Semi-supervised approach: Bayesian inference of the hyperparameters

The advantage of adopting a Bayesian approach to hyperparameter inference is that we can quantify uncertainty in these hyperparameters. Uncertainty quantification in GP hyperparameter inference is important, since different hyperparameters can have a strong effect on the GP posterior (Rasmussen, 2004). Furthermore, we consider a semi-supervised approach to hyperparameter inference. By a semi-supervised approach we mean that the hyperparameters are inferred using both the labelled and unlabelled data, rather than just the labelled data.

Consider, at some iteration of our MCMC algorithm, the data $Y_k$ associated with the $k$th component. We can partition these data into unlabelled (U) and labelled (L) subsets; in particular, $Y_k = [Y_k^{(L)}, Y_k^{(U)}]$. To clarify, the indicators $z_i$ are known for $Y_k^{(L)}$ prior to inference, whilst the allocations $z_i$ for $Y_k^{(U)}$ are sampled at each iteration of our MCMC algorithm. In our semi-supervised approach to hyperparameter inference, we use the set $Y_k$ of all data (labelled and unlabelled) currently associated with the $k$th component. We consider a Hamiltonian Monte Carlo (HMC) sampler for performing inference for these hyperparameters, as described in the supplementary article (Crook et al., 2021), where we also compare to a Metropolis-Hastings sampler.

2.5. MCMC algorithm for posterior Bayesian computation

Full details of the procedure(s) for performing inference in our model are provided in the supplementary article (Crook et al., 2021).

2.6. Summarising uncertainty in posterior localisation probabilities

Summarising uncertainty quantified by a Bayesian analysis in an interpretable way can be challenging. As always, we can summarise uncertainty using credible intervals or regions (Gelman et al., 1995). One particularly challenging quantity of interest to summarise is the uncertainty in the posterior allocations. Whilst each individual allocation of a protein to a sub-cellular niche can be summarised by a credible interval, it is not clear how best to summarise the posterior over all possible localisations for each individual protein. As in previous work (Crook et al., 2018), we propose to summarise this uncertainty using an information-theoretic approach, by computing the Shannon entropy of the localisation probabilities (Shannon, 1948) at each iteration of the MCMC algorithm:

$\left\{ H_i^{(t)} = -\sum_{k=1}^{K} p_{ik}^{(t)} \log p_{ik}^{(t)} \right\}_{t=1}^{T},$ (2.19)

where $p_{ik}^{(t)}$ is the probability that protein $i$ belongs to component $k$ at iteration $t$. We can then summarise this by a Monte-Carlo average:

$H_i \approx \frac{1}{T} \sum_{t=1}^{T} H_i^{(t)}.$ (2.20)

We note that larger values of the Shannon entropy correspond to greater uncertainty in allocations.
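As a minimal numerical sketch (in Python rather than the authors' R), equations 2.19 and 2.20 amount to computing a per-iteration entropy and averaging over the MCMC samples; the array layout below is our own convention.

```python
import numpy as np

def entropy_summary(p):
    """Monte-Carlo averaged Shannon entropy of localisation probabilities.

    p : array of shape (T, n, K) -- allocation probabilities p_ik^(t) for
        T MCMC iterations, n proteins and K components.
    Returns a length-n vector of averaged entropies; larger values
    indicate greater uncertainty in a protein's allocation.
    """
    with np.errstate(divide="ignore", invalid="ignore"):
        # H_i^(t) = -sum_k p log p, with the 0 log 0 = 0 convention
        h = -np.where(p > 0, p * np.log(p), 0.0).sum(axis=2)
    return h.mean(axis=0)  # average over the T iterations
```

A certain allocation yields entropy 0, whilst a uniform allocation over K components yields the maximum, log K.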

2.7. Proper scoring rules

The primary goal of spatial proteomics is to assign proteins with unknown localisations to sub-cellular niches based on their quantitative functional measurements. Secondary goals include inference of organelle-specific parameters and uncertainty quantification, because organelles have overlapping biochemical properties. To measure the ability of methodologies to correctly assign proteins to organelles, we desire a strictly proper and symmetric scoring rule (Gneiting and Raftery, 2007). Symmetry is a requirement because ruling out protein localisations is as important as making confident assignments. The quadratic (Brier) loss, spherical loss and logarithmic loss are the usual candidates (Gneiting and Raftery, 2007). We put equal value on whether probabilities are over- or under-estimated, and so the quadratic loss is appropriate, since the spherical loss puts more weight on lower-entropy predictions (penalising under-confident predictions) and the log loss on higher-entropy predictions (rewarding erring on the side of caution) (Gneiting and Raftery, 2007; Machete, 2013). The unboundedness of the log loss is also problematic, since assigning potentially infinite penalty to an incorrect prediction is not useful in practice. We define the quadratic loss for a set of probabilistic forecasts $p$ as:

$B(p) = \sum_{i=1}^{n} \sum_{j=1}^{K} \left\| \delta_{ij} - p_{ij} \right\|_2^2,$ (2.21)

where δij = 1 if protein i localises to component j and is 0 otherwise. It is useful to note that a penalty of size 2 is incurred for a completely incorrect prediction; that is, forecasting probability 1 for the wrong component. A smaller penalty is incurred for agnostic prediction amongst several classes. For example, suppose protein i localises to organelle 1, but we predict that it belongs to organelles 2, 3, 4 and 5 with equal probability; the penalty incurred is 1.25. This is important in practice, because we favour methodologies that avoid leading us to perform erroneous validation experiments.
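The penalties quoted above can be checked directly. The following snippet (illustrative, with our own function name) computes the quadratic loss for a single protein:

```python
import numpy as np

def quadratic_loss(delta, p):
    """Quadratic (Brier) loss for one protein: delta is the one-hot true
    allocation, p the forecast probabilities over the K components."""
    return np.sum((np.asarray(delta, float) - np.asarray(p, float)) ** 2)

# Fully confident but completely wrong forecast: penalty 2.
print(quadratic_loss([1, 0], [0, 1]))                             # 2.0
# True organelle 1, equal probability on organelles 2-5: penalty 1.25.
print(quadratic_loss([1, 0, 0, 0, 0], [0, .25, .25, .25, .25]))   # 1.25
```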

3. Results

3.1. Case Study I: Drosophila melanogaster embryos

3.1.1. Application

The first case study uses the Drosophila melanogaster (common fruit fly) embryo dataset (Tan et al., 2009), on which we compare the supervised and semi-supervised approaches to updating the model hyperparameters. In particular, we explore the effect on the component-specific noise term $\sigma^2$ of adopting different inference approaches. For each sub-cellular niche, we learn the hyperparameters either by maximising their marginal likelihood or by sampling from their posterior using MCMC. The posterior distribution for the hyperparameters can be found either solely using the labelled data for each component or by making use of both the labelled and unlabelled data.

Figure 2 demonstrates several phenomena. Reassuringly, the estimates of the noise parameters $\sigma_k^2$ for $k = 1, \ldots, K$ obtained by using the L-BFGS algorithm to maximise the marginal likelihood coincide with the posterior distributions of the noise parameters inferred using only the labelled data for each component. However, when we perform inference in a semi-supervised way, using both the labelled and unlabelled data, we make several important observations.

Fig 2.

Fig 2

Posterior distributions for the log noise parameter $\sigma^2$ on the Drosophila data. In general, we observe a shift towards 0, indicating that using only the labelled data underestimates the value of the noise term for each component. We also observe increased posterior shrinkage for many components, with the variance of the noise parameters reduced in the semi-supervised setting.

Firstly, in many cases, the posterior using both the labelled and unlabelled data is shifted right, towards 0. Recalling that we are working with the log of the hyperparameters, this indicates that the noise parameters are underestimated when solely using the labelled data. This is likely a manifestation of experimental bias, since it is reasonable to believe that proteins with known prior locations are those with less variable localisations, which are therefore easier to validate experimentally. A semi-supervised approach is able to overcome these issues by adapting to proteins in a dense region of space. In some cases the shift is pronounced, with the posteriors of the parameters using labelled and unlabelled data found in the tails of the posteriors obtained using only the labelled data. Furthermore, we notice shrinkage in the posterior distribution of the noise parameter in the semi-supervised setting. The reduction in variance reduces our uncertainty about the underlying true value of $\sigma_k^2$ for $k = 1, \ldots, K$. This variance reduction is observed in most cases, even when there is little difference in the means of the posteriors.

The primary goal of spatial proteomics is to predict the localisation of unknown proteins from data. Our modelling approach allows the allocation probability of each protein to each component to be used to predict the localisation of unknown proteins. Proteins may reside in multiple locations, and some sub-cellular niches are challenging to separate because of confounding biochemical properties, leading to uncertainty in a protein's localisation. Thus adopting a Bayesian approach and quantifying this uncertainty is of great importance. Our methods allow point estimates as well as interval estimates to be obtained for the posterior localisation probabilities. Figure 3 demonstrates the results of applying our method. Each protein in this PCA plot is scaled according to the mean of the Monte-Carlo samples from the posterior localisation probability. To visualise the allocation probabilities for proteins across organelles, we produce a heatmap, $M$, where the $(i, j)$th entry of $M$ is the Monte-Carlo estimate of the allocation probability of the $i$th protein to organelle $j$ (figure 4).

Fig 3.

Fig 3

A PCA plot for the Drosophila data, where points, representing proteins, are coloured by the component of greatest probability. The pointer for each protein is scaled according to membership probability, with larger/smaller points indicating greater/lower allocation probabilities.

Fig 4.

Fig 4

A heatmap of organelles by proteins, where the $(i, j)$th entry is the Monte-Carlo estimate of the probability that protein $i$ belongs to organelle $j$, allowing us to visualise the range of probabilities for each protein. Proteins are allocated to their most probable class and these allocations are shown in the colour bar on the left.

Further visualisations of the model and data are possible. We plot two representative examples of gradient-density profiles, for the endoplasmic reticulum (ER) and the nucleus, in figure 5. In colour, we plot the labelled proteins that were assigned to each component before our analysis. In grey, for both components, we plot the unlabelled proteins that have been allocated to these components probabilistically. We observe that they have the same gradient-density shape as the labelled proteins, in line with our beliefs about the underlying biology: proteins from the same component should co-fractionate and therefore have similar density-gradient profiles. In addition, we overlay the posterior predictive distributions for these components and observe that they represent the data well.

Fig 5.

Fig 5

A plot of the gradient-density profiles for the ER and nucleus, with labelled proteins in colour and proteins probabilistically assigned to those components in grey. The profiles of the assigned proteins closely match the profiles of the components. The posterior predictive distribution of each component is also overlaid.

3.1.2. Sensitivity analysis for hyper-prior specification

We use the Drosophila melanogaster dataset to test for sensitivity to the hyper-prior specification. To test for sensitivity, we see whether predictive performance is affected by changes in the choice of hyper-prior, using the following cross-validation schema. We split the labelled data for each experiment into class-stratified training (80%) and test (20%) partitions, with the separation formed at random. The true classes of the test profiles are withheld from the classifier whilst MCMC is performed. This 80/20 data stratification is performed 100 times in order to produce a distribution of scores. We compare the ability of the methods to probabilistically infer the true classes using the quadratic loss, also referred to as the Brier score (Gneiting and Raftery, 2007). Thus a distribution of quadratic losses is obtained for each method, with the preferred method minimising the quadratic loss. Each method is run for 10,000 MCMC iterations with 1,000 iterations for burn-in. We vary the mean of the standard normal hyper-prior for each hyperparameter in turn over a grid of values $\tilde{m} = (0, 1, 2, 3, 4)$, keeping the hyper-prior for the other hyperparameter fixed as a standard normal distribution. The results are displayed in figure 6.
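The cross-validation schema above can be sketched in Python; `fit_predict_proba` is a hypothetical stand-in for running the MCMC classifier on the training partition and returning posterior class probabilities for the test partition.

```python
import numpy as np

def stratified_split(labels, train_frac=0.8, rng=None):
    """Class-stratified random split of labelled indices into train/test."""
    rng = rng or np.random.default_rng()
    train, test = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        n_train = int(round(train_frac * len(idx)))
        train.extend(idx[:n_train])
        test.extend(idx[n_train:])
    return np.array(train), np.array(test)

def cv_quadratic_losses(X, labels, fit_predict_proba, n_splits=100, seed=0):
    """Repeat the stratified 80/20 split n_splits times, scoring each
    repeat with the quadratic (Brier) loss to obtain a distribution."""
    rng = np.random.default_rng(seed)
    K = len(np.unique(labels))
    losses = []
    for _ in range(n_splits):
        tr, te = stratified_split(labels, rng=rng)
        p = fit_predict_proba(X[tr], labels[tr], X[te])  # (n_test, K)
        delta = np.eye(K)[labels[te]]                    # one-hot truth
        losses.append(np.sum((delta - p) ** 2))
    return np.array(losses)
```

The resulting vector of losses can then be displayed as a boxplot per method, as in figure 6.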

Fig 6. Boxplots of quadratic losses to assess the sensitivity of semi-supervised hyperparameter inference to hyper-prior choices.

Fig 6

We observe only minor sensitivity to the choice of hyper-prior, with no significant difference in performance noted (KS test, threshold = 0.01). Sensitivity analysis for the hyperparameters of GPs is vital, since these hyperparameters have a strong effect on the posterior of the GP (Rasmussen, 2004). The observed lack of sensitivity in our case is advantageous, since prior information can be included without fear of overfitting. However, practitioners should always take care when specifying priors, especially for variance/covariance parameters, as many authors have noted sensitivity of Bayesian models to these parameters (Gelman et al., 1995; Lunn et al., 2000; Gelman et al., 2006; Wang and Dunson, 2011; Schuurman, Grasman and Hamaker, 2016).

3.2. Case Study II: mouse pluripotent embryonic stem cells

3.2.1. Application

Our main case study is the mouse pluripotent E14TG2a stem cell dataset of Christoforou et al. (2016). This dataset contains 5032 quantitative protein profiles and resolves 14 sub-cellular niches. We first plot the density-gradient profiles of the marker proteins for each sub-cellular niche in figure 7. We fit a Gaussian process prior regression model for each sub-cellular niche, with the hyperparameters found by maximising the marginal likelihood. A table of the unconstrained log hyperparameter values found by maximising the marginal likelihood is given in the supplementary article (Crook et al., 2021). Alternatively, placing standard normal priors on each of the log hyperparameters and using a Metropolis-Hastings update, we can infer the distributions over these hyperparameters. We perform 20,000 iterations for each sub-cellular niche, discard 15,000 iterations for burn-in and thin the remaining samples by 20. We summarise the Monte-Carlo sample by its expected value as well as the 95% equi-tailed credible interval, which can also be found in the supplementary article (Crook et al., 2021).

Fig 7.

Fig 7

Quantitative profiles of protein markers for each sub-cellular niche. A GP prior regression model is fitted to these data and the predictive distribution is displayed. We observe distinct distributions for each sub-cellular niche generated by the unique density-gradient properties of each sub-cellular niche.

We go further and predict the localisation of proteins with unknown localisation to annotated components using our proposed mixture of GP regression models. As before, we adopt a semi-supervised approach to hyperparameter inference, and again place standard normal hyper-priors on the log of the hyperparameters. We run our MCMC algorithm for 20,000 iterations, with half taken as burn-in and thinning by 5, using HMC to update the hyperparameters. The PCA plot in figure 8 visualises our results. Each pointer represents a single protein and is scaled either with the probability of membership of the coloured component (left) or with the Shannon entropy (right). As before, we also visualise the allocation probabilities for proteins across organelles in a heatmap (figure 9). In these plots we observe regions of high probability and confidence for each organelle, as well as obtaining a global view of uncertainty. In this example, we observe regions of uncertainty, as measured by the Shannon entropy, concentrating where components overlap. We also observe uncertainty in regions where there is no dominant component. This Bayesian analysis provides a wealth of information on the global patterns of protein localisation in mouse pluripotent embryonic stem cells.

Fig 8.

Fig 8

A PCA plot for the mouse pluripotent embryonic stem cell data, where points, representing proteins, are coloured by the component of greatest probability. The pointer for each protein is scaled with the membership probability (left) or with the Monte-Carlo averaged Shannon entropy (right).

Fig 9.

Fig 9

A heatmap of organelles by proteins, where the $(i, j)$th entry is the Monte-Carlo estimate of the probability that protein $i$ belongs to organelle $j$, allowing us to visualise the range of probabilities for each protein. Proteins are allocated to their most probable class and these allocations are shown in the colour bar on the left.

3.3. Assessing predictive performance

We compare the predictive performance of the methods proposed here, as well as against the fully Bayesian TAGM model of Crook et al. (2018), in which sub-cellular niches are described by multivariate Gaussian distributions rather than GPs. The following cross-validation schema is used to compare the classifiers. We split the labelled data for each experiment into class-stratified training (80%) and test (20%) partitions, with the separation formed at random. The true classes of the test profiles are withheld from the classifier whilst MCMC is performed. This 80/20 data stratification is performed 100 times in order to produce a distribution of scores. We compare the ability of the methods to probabilistically infer the true classes using the quadratic loss, also referred to as the Brier score (Gneiting and Raftery, 2007). Thus a distribution of quadratic losses is obtained for each method, with the preferred method minimising the quadratic loss. Each method is run for 10,000 MCMC iterations with 1,000 iterations for burn-in. For fair comparison, we hold the priors the same across all datasets. Prior specifications are stated in the supplementary article (Crook et al., 2021).

We compare across 5 different spatial proteomics datasets from three different organisms: Drosophila melanogaster embryos from Tan et al. (2009), the mouse pluripotent embryonic stem cell dataset of Christoforou et al. (2016), the HeLa cell line dataset of Itzhak et al. (2016), the mouse primary neuron dataset of Itzhak et al. (2017) and finally a CRISPR-Cas9 knock-out coupled to spatial proteomics analysis dataset (AP5Z1-KO1) of Hirst et al. (2018). The results are shown in figure 10.

Fig 10.

Fig 10

Boxplots of quadratic losses comparing the predictive performance of TAGM against the two semi-supervised Gaussian process models described here, where either an empirical Bayes (EB) or a fully Bayesian (FB) approach is used for hyperparameter inference. (EB) denotes the model in which the hyperparameters are fixed, learned from the labelled data only by using L-BFGS to optimise the hyperparameters with respect to the marginal likelihood. (FB) denotes the semi-supervised model in which the hyperparameters are given priors and the unlabelled data are included in the inference of the hyperparameters.

We see that in four out of five datasets there is an improvement of the GP models over the TAGM model (Kolmogorov-Smirnov (KS) two-sample test, p < 0.0001), because the GP model is provided with a more explicit correlation structure for the data. The empirical Bayes method slightly outperforms the fully Bayesian approach in three of the datasets (KS two-sample test, p < 0.01): the mouse pluripotent embryonic stem cell dataset, the HeLa dataset of Itzhak et al. (2016) and the HeLa AP5Z1 knock-out dataset of Hirst et al. (2018). However, the size of these differences is small, at most a 6-point difference. This corresponds to better assignments for at most 3 proteins, which we do not believe to be worth the loss of uncertainty quantification in the GP hyperparameters and the lost ability to provide expert prior information on the GP hyperparameters, both of which are provided by the fully Bayesian approach. Meanwhile, the improvement of the GP methods over the TAGM model is marked in the 4 datasets where we see improvement. Improvements range from score differences of roughly 16 to almost 80, which corresponds to 8 to 40 proteins with better allocations. We moreover note that the GP methods have only 3 parameters of the structured covariance to infer, whilst the TAGM model requires inference of full unstructured covariance matrices.

We observe that the TAGM model outperforms the GP methods on the Itzhak et al. (2016) dataset. The authors of this study used differential centrifugation to separate cellular content and curated a “large protein complex” class. This class could contain multiple sub-cellular structures, such as ribosomes, as well as cytosolic and nuclear proteins. In any case, our modelling assumptions are violated in both models, and this issue is exacerbated by parametrising the covariance structure. One solution would be to model this mixture of large protein complexes as its own class. However, as this class contains a quite diverse set of sub-cellular compartments, it is difficult to predict its behaviour. This class could itself be a mixture of GPs; however, the number of components of the class would be unknown and would have to be modelled carefully, perhaps using reversible jump methods (Richardson and Green, 1997) or Dirichlet process approaches (Escobar and West, 1995).

4. Discussion

This article presents semi-supervised non-parametric Bayesian methods to model spatial proteomics data. Sub-cellular niches display unique signatures along the sub-cellular fractions, and we exploit this information to construct GP regression models for each niche. The full complement of sub-cellular proteins is then described as a mixture of GP regression models, with outliers captured by an additional component in our mixture. This provides cell biologists with a fully Bayesian method to analyse spatial proteomics data in a non-parametric framework that more closely reflects the biochemical process used to generate the data. This greatly improves model interpretation and allows us to make more biologically sound inferences from our model.

We compared the proposed semi-supervised models to the state-of-the-art model on 5 different spatial proteomics datasets. Modelling the correlation structure along the sub-cellular fractions leads to competitive predictive performance over state-of-the-art models. Empirical Bayes procedures perform either equally well or better than the fully Bayesian approach, at the loss of uncertainty quantification in the hyperparameters. This performance improvement should not be over-interpreted, however, since cross-validation assessment is only performed on the labelled data and will not reflect any biased sampling mechanisms that could be at play.

To accelerate computation in our model, we note that the structure of our covariance matrix admits a tensor decomposition, which can be exploited so that fast algorithms for the inversion of Toeplitz matrices can be employed. These decompositions can then be used to derive formulae for fast computation of the likelihood and gradient of a GP. A stand-alone R package implementing these methods using high-performance C++ libraries is available at https://github.com/ococrook/toeplitz. These algorithms and associated formulae are useful beyond the spatial proteomics community, to anyone using GPs with equally spaced observations, even in the unsupervised case.
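The authors' toeplitz R package implements extended Trench and Durbin algorithms; as an illustrative analogue (in Python, with our own example values), SciPy provides a Levinson-Durbin-based Toeplitz solver that avoids forming or factorising the full covariance matrix.

```python
import numpy as np
from scipy.linalg import solve_toeplitz, toeplitz

# For equally spaced inputs, a stationary kernel yields a Toeplitz
# covariance matrix: entry (i, j) depends only on |i - j|, so the whole
# matrix is determined by its first column.
x = np.arange(8, dtype=float)
col = np.exp(-0.5 * x**2)   # squared-exponential kernel (l = a = 1)
col[0] += 0.1               # add the noise variance sigma^2 on the diagonal
y = np.sin(x)

# Levinson-Durbin-type solve in O(n^2), versus the O(n^3) dense solve:
alpha_fast = solve_toeplitz(col, y)
alpha_dense = np.linalg.solve(toeplitz(col), y)
assert np.allclose(alpha_fast, alpha_dense)
```

Solves of this form are exactly what the quadratic term and log-determinant of the GP marginal likelihood require, which is where the speed-up is realised.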

We demonstrated that, in the presence of labelled data, there are two approaches to hyperparameter inference. The first is to use empirical Bayes to optimise the hyperparameters; the other is a fully Bayesian approach, taking into account the uncertainty in these hyperparameters. We propose using HMC to update these hyperparameters, since highly correlated hyperparameters can induce high autocorrelation and exacerbate issues with random-walk MH updates. We demonstrate that, in the situation presented here, HMC updates can be up to an order of magnitude more efficient than MH updates. We further explored the sensitivity of our model to hyper-prior specification, which gives practitioners good default choices.

In two case studies, we highlighted the value of taking a semi-supervised approach to hyperparameter inference, allowing us to explore the uncertainty in our hyperparameters. In a fully Bayesian approach, the uncertainty in the hyperparameters is reflected in the uncertainty of the localisation of proteins to components. Quantifying uncertainty provides cell biologists with a wealth of information with which to make quantifiable inferences about protein sub-cellular localisation.

We plan to disseminate our method via the Bioconductor project (Gentleman et al., 2004; Huber et al., 2015) and to include our code in the pRoloc package (Gatto et al., 2014b). The pRoloc package includes methods for visualisation, processing data and disseminating code in a unified framework. All spatial proteomics data used here are freely available within the Bioconductor package pRolocdata (Gatto, Crook and Breckels, 2018).

One potential source of uncertainty in protein localisation is that proteins can be residents of multiple sub-cellular compartments. We believe that, by proposing a model that more closely reflects the underlying biochemical rationale for the experiment, we can facilitate models that infer proteins with multiple locations with greater confidence. This is the subject of further work.

Supplementary Material

1

Contributor Information

Oliver M. Crook, Email: omc25@cam.ac.uk, MRC Biostatistics Unit, School of Clinical Medicine, University of Cambridge, Cambridge Institute of Public Health, Forvie Site, Cambridge CB2 0SR.

Kathryn S. Lilley, Email: k.s.lilley@bioc.cam.ac.uk, Cambridge Centre for Proteomics, Department of Biochemistry, University of Cambridge, Cambridge.

Laurent Gatto, Email: laurent.gatto@uclouvain.be, DE Duve Institute, UCLouvain, Brussels.

Paul D.W. Kirk, Email: paul.kirk@mrc-bsu.cam.ac.uk, MRC Biostatistics Unit, School of Clinical Medicine, University of Cambridge, Cambridge Institute of Public Health, Forvie Site, Cambridge CB2 0SR.

References

  1. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. Gene Ontology: tool for the unification of biology. Nature genetics. 2000;25:25–29. doi: 10.1038/75556. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Barylyuk K, Koreny L, Ke H, Butterworth S, Crook OM, Lassadi I, Gupta V, Tromer EC, Mourier T, Stevens TJ, et al. A subcellular atlas of Toxoplasma reveals the functional context of the proteome. bioRxiv. 2020 doi: 10.1016/j.chom.2020.09.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Beltran PMJ, Mathias RA, Cristea IM. A portrait of the human organelle proteome in space and time during cytomegalovirus infection. Cell systems. 2016;3:361–373. doi: 10.1016/j.cels.2016.08.012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Beskos A, Pillai N, Roberts G, Sanz-Serna J-M, Stuart A. Optimal tuning of the hybrid Monte Carlo algorithm. Bernoulli. 2013;19:1501–1534. [Google Scholar]
  5. Blobel G. Christian de Duve (1917-2013) 2013. [DOI] [PubMed]
  6. Bouveyron C, Côme E, Jacques J, et al. The discriminative functional mixture model for a comparative analysis of bike sharing systems. The Annals of Applied Statistics. 2015;9:1726–1760. [Google Scholar]
  7. Breckels LM, Gatto L, Christoforou A, Groen AJ, Lilley KS, Trotter MW. The effect of organelle discovery upon sub-cellular protein localisation. Journal of proteomics. 2013;88:129–140. doi: 10.1016/j.jprot.2013.02.019. [DOI] [PubMed] [Google Scholar]
  8. Breckels LM, Holden SB, Wojnar D, Mulvey CM, Christoforou A, Groen A, Trotter MW, Kohlbacher O, Lilley KS, Gatto L. Learning from heterogeneous data sources: an application in spatial proteomics. PLoS computational biology. 2016;12:e1004920. doi: 10.1371/journal.pcbi.1004920. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Casella G, Robert CP. Rao-Blackwellisation of sampling schemes. Biometrika. 1996;83:81–94. [Google Scholar]
  10. Christoforou A, Mulvey CM, Breckels LM, Geladaki A, Hurrell T, Hayward PC, Naake T, Gatto L, Viner R, Arias AM, et al. A draft map of the mouse pluripotent stem cell spatial proteome. Nature communications. 2016;7:9992. doi: 10.1038/ncomms9992. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Cody NA, Iampietro C, Lécuyer E. The many functions of mRNA localization during normal development and disease: from pillar to post. Wiley Interdisciplinary Reviews: Developmental Biology. 2013;2:781–796. doi: 10.1002/wdev.113. [DOI] [PubMed] [Google Scholar]
  12. Cook KC, Cristea IM. Location is everything: protein translocations as a viral infection strategy. Current opinion in chemical biology. 2019;48:34–43. doi: 10.1016/j.cbpa.2018.09.021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Cooke EJ, Savage RS, Kirk PD, Darkins R, Wild DL. Bayesian hierarchical clustering for microarray time series data with replicates and outlier measurements. BMC bioinformatics. 2011;12:399. doi: 10.1186/1471-2105-12-399. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Coretto P, Hennig C. Robust improper maximum likelihood: tuning, computation, and a comparison with other methods for robust Gaussian clustering. Journal of the American Statistical Association. 2016;111:1648–1659. [Google Scholar]
  15. Crook OM, Mulvey CM, Kirk PDW, Lilley KS, Gatto L. A Bayesian mixture modelling approach for spatial proteomics. PLOS Computational Biology. 2018;14:1–29. doi: 10.1371/journal.pcbi.1006516. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Crook OM, Geladaki A, Nightingale DJ, Vennard O, Lilley KS, Gatto L, Kirk PD. A semi-supervised Bayesian approach for simultaneous protein sub-cellular localisation assignment and novelty detection. PLoS computational biology. 2020;16:e1008288. doi: 10.1371/journal.pcbi.1008288. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Crook OM, Lilley KS, Gatto L, Kirk P. Supplement to “Semi-supervised non-parametric Bayesian modelling of spatial proteomics”. 2021 doi: 10.1214/22-AOAS1603. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Davies AK, Itzhak DN, Edgar JR, Archuleta TL, Hirst J, Jackson LP, Robinson MS, Borner GH. AP-4 vesicles contribute to spatial control of autophagy via RUSC-dependent peripheral delivery of ATG9A. Nature Communications. 2018;9:3958. doi: 10.1038/s41467-018-06172-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. De Duve C. The peroxisome: a new cytoplasmic organelle. Proceedings of the Royal Society of London. Series B. Biological Sciences. 1969;173:71–83. [PubMed] [Google Scholar]
  20. De Duve C, Beaufay H. A short history of tissue fractionation. The Journal of cell biology. 1981;91:293. doi: 10.1083/jcb.91.3.293s. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. De Matteis MA, Luini A. Mendelian disorders of membrane trafficking. New England Journal of Medicine. 2011;365:927–938. doi: 10.1056/NEJMra0910494. [DOI] [PubMed] [Google Scholar]
  22. Duane S, Kennedy AD, Pendleton BJ, Roweth D. Hybrid monte carlo. Physics letters B. 1987;195:216–222. [Google Scholar]
  23. Dunkley TP, Watson R, Griffin JL, Dupree P, Lilley KS. Localization of organelle proteins by isotope tagging (LOPIT) Molecular & Cellular Proteomics. 2004;3:1128–1134. doi: 10.1074/mcp.T400009-MCP200. [DOI] [PubMed] [Google Scholar]
  24. Dunkley TP, Hester S, Shadforth IP, Runions J, Weimar T, Hanton SL, Griffin JL, Bessant C, Brandizzi F, Hawes C, et al. Mapping the Arabidopsis organelle proteome. Proceedings of the National Academy of Sciences. 2006;103:6518–6523. doi: 10.1073/pnas.0506958103. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Eddelbuettel D, Francois R. Rcpp: Seamless R and C++ Integration. Journal of Statistical Software, Articles. 2011;40:1–18. [Google Scholar]
  26. Eddelbuettel D, Sanderson C. RcppArmadillo: Accelerating R with High-performance C++ Linear Algebra. Comput Stat Data Anal. 2014;71:1054–1063. [Google Scholar]
  27. Escobar MD, West M. Bayesian density estimation and inference using mixtures. Journal of the american statistical association. 1995;90:577–588. [Google Scholar]
  28. Fraley C, Raftery AE. Bayesian regularization for normal mixture estimation and model-based clustering. Journal of Classification. 2007;24:155–181. [Google Scholar]
  29. Gatto L, Breckels LM, Lilley KS. Assessing sub-cellular resolution in spatial proteomics experiments. Current Opinion in Chemical Biology. 2019;48:123–149. doi: 10.1016/j.cbpa.2018.11.015. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Gatto L, Crook OM, Breckels LM. pRolocdata: Data accompanying the pRoloc package R package version 1.19.1. 2018 [Google Scholar]
  31. Gatto L, Vizcaìno JA, Hermjakob H, Huber W, Lilley KS. Organelle proteomics experimental designs and analysis. Proteomics. 2010;10:3957–3969. doi: 10.1002/pmic.201000244. [DOI] [PubMed] [Google Scholar]
  32. Gatto L, Breckels LM, Burger T, Nightingale DJ, Groen AJ, Campbell C, Mulvey CM, Christoforou A, Ferro M, Lilley KS. A foundation for reliable spatial proteomics data analysis. Molecular & Cellular Proteomics. 2014a:mcp–M113. doi: 10.1074/mcp.M113.036350. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Gatto L, Breckels LM, Wieczorek S, Burger T, Lilley KS. Mass-spectrometry based spatial proteomics data analysis using pRoloc and pRolocdata. Bioinformatics. 2014b doi: 10.1093/bioinformatics/btu013. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Geladaki A, Britovsek NK, Breckels LM, Smith TSOLV, Mulvey CM, Crook OM, Gatto L, Lilley KS. Combining LOPIT with differential ultracentrifugation for high-resolution spatial proteomics. Nature Communications. 2019;10:331. doi: 10.1038/s41467-018-08191-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Gelfand AE, Kottas A, MacEachern SN. Bayesian nonparametric spatial modeling with Dirichlet process mixing. Journal of the American Statistical Association. 2005;100:1021–1035. [Google Scholar]
  36. Gelfand AE, Smith AF. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association. 1990;85:398–409. [Google Scholar]
  37. Gelman A, et al. Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Analysis. 2006;1:515–534. [Google Scholar]
  38. Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian Data Analysis. Chapman & Hall; London: 1995. [Google Scholar]
  39. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome biology. 2004;5:R80. doi: 10.1186/gb-2004-5-10-r80. [DOI] [PMC free article] [PubMed] [Google Scholar]
  40. Gibson TJ. Cell regulation: determined to signal discrete cooperation. Trends in biochemical sciences. 2009;34:471–482. doi: 10.1016/j.tibs.2009.06.007. [DOI] [PubMed] [Google Scholar]
  41. Gneiting T, Raftery AE. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association. 2007;102:359–378. [Google Scholar]
  42. Groen AJ, Sancho-Andrés G, Breckels LM, Gatto L, Aniento F, Lilley KS. Identification of trans-Golgi network proteins in Arabidopsis thaliana root tissue. Journal of proteome research. 2014;13:763–776. doi: 10.1021/pr4008464. [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Hall SL, Hester S, Griffin JL, Lilley KS, Jackson AP. The organelle proteome of the DT40 lymphocyte cell line. Molecular & Cellular Proteomics. 2009;8:1295–1305. doi: 10.1074/mcp.M800394-MCP200. [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. Heard W, Sklenář J, Tome DF, Robatzek S, Jones AM. Identification of regulatory and cargo proteins of endosomal and secretory pathways in Arabidopsis thaliana by proteomic dissection. Molecular & Cellular Proteomics. 2015;14:1796–1813. doi: 10.1074/mcp.M115.050286. [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Heinonen M, Guipaud O, Milliat F, Buard V, Micheau B, Tarlet G, Benderitter M, Zehraoui F, d’Alché Buc F. Detecting time periods of differential gene expression using Gaussian processes: an application to endothelial cells exposed to radiotherapy dose fraction. Bioinformatics. 2014;31:728–735. doi: 10.1093/bioinformatics/btu699. [DOI] [PubMed] [Google Scholar]
  46. Hennig C. Breakdown points for maximum likelihood estimators of location-scale mixtures. Annals of Statistics. 2004:1313–1340. [Google Scholar]
  47. Hensman J, Lawrence ND, Rattray M. Hierarchical Bayesian modelling of gene expression time series across irregularly sampled replicates and clusters. BMC bioinformatics. 2013;14:252. doi: 10.1186/1471-2105-14-252. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Hirst J, Itzhak DN, Antrobus R, Borner GH, Robinson MS. Role of the AP-5 adaptor protein complex in late endosome-to-Golgi retrieval. PLoS biology. 2018;16:e2004411. doi: 10.1371/journal.pbio.2004411. [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Honkela A, Girardot C, Gustafson EH, Liu Y-H, Furlong EE, Lawrence ND, Rattray M. Model-based method for transcription factor target identification with limited data. Proceedings of the National Academy of Sciences. 2010;107:7793–7798. doi: 10.1073/pnas.0914285107. [DOI] [PMC free article] [PubMed] [Google Scholar]
  50. Huber W, Carey VJ, Gentleman R, Anders S, Carlson M, Carvalho BS, Bravo HC, Davis S, Gatto L, Girke T, et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nature methods. 2015;12:115. doi: 10.1038/nmeth.3252. [DOI] [PMC free article] [PubMed] [Google Scholar]
  51. Itzhak DN, Tyanova S, Cox J, Borner GH. Global, quantitative and dynamic mapping of protein subcellular localization. Elife. 2016;5:e16950. doi: 10.7554/eLife.16950. [DOI] [PMC free article] [PubMed] [Google Scholar]
  52. Itzhak DN, Davies C, Tyanova S, Mishra A, Williamson J, Antrobus R, Cox J, Weekes MP, Borner GH. A Mass Spectrometry-Based Approach for Mapping Protein Subcellular Localization Reveals the Spatial Proteome of Mouse Primary Neurons. Cell reports. 2017;20:2706–2718. doi: 10.1016/j.celrep.2017.08.063. [DOI] [PMC free article] [PubMed] [Google Scholar]
  53. Jadot M, Boonen M, Thirion J, Wang N, Xing J, Zhao C, Tannous A, Qian M, Zheng H, Everett JK, et al. Accounting for protein subcellular localization: A compartmental map of the rat liver proteome. Molecular & Cellular Proteomics. 2017;16:194–212. doi: 10.1074/mcp.M116.064527. [DOI] [PMC free article] [PubMed] [Google Scholar]
  54. James GM, Hastie TJ. Functional linear discriminant analysis for irregularly sampled curves. Journal of the Royal Statistical Society Series B (Statistical Methodology) 2001;63:533–550. [Google Scholar]
  55. James GM, Sugar CA. Clustering for sparsely sampled functional data. Journal of the American Statistical Association. 2003;98:397–408. [Google Scholar]
  56. Jeffreys H. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London Series A. Mathematical and Physical Sciences. 1946;186:453–461. doi: 10.1098/rspa.1946.0056. [DOI] [PubMed] [Google Scholar]
  57. Jones M, Rice JA. Displaying the important features of large collections of similar curves. The American Statistician. 1992;46:140–145. [Google Scholar]
  58. Kalaitzis AA, Lawrence ND. A simple approach to ranking differentially expressed gene expression time courses through Gaussian process regression. BMC bioinformatics. 2011a;12:180. doi: 10.1186/1471-2105-12-180. [DOI] [PMC free article] [PubMed] [Google Scholar]
  59. Kalaitzis AA, Lawrence ND. A simple approach to ranking differentially expressed gene expression time courses through Gaussian process regression. BMC bioinformatics. 2011b;12:180. doi: 10.1186/1471-2105-12-180. [DOI] [PMC free article] [PubMed] [Google Scholar]
  60. Kau TR, Way JC, Silver PA. Nuclear transport and cancer: from mechanism to intervention. Nature Reviews Cancer. 2004;4:106–117. doi: 10.1038/nrc1274. [DOI] [PubMed] [Google Scholar]
  61. Kirk PD, Stumpf MP. Gaussian process regression bootstrapping: exploring the effects of uncertainty in time course data. Bioinformatics. 2009;25:1300–1306. doi: 10.1093/bioinformatics/btp139. [DOI] [PMC free article] [PubMed] [Google Scholar]
  62. Kirk P, Griffin JE, Savage RS, Ghahramani Z, Wild DL. Bayesian correlated clustering to integrate multiple datasets. Bioinformatics. 2012;28:3290–3297. doi: 10.1093/bioinformatics/bts595. [DOI] [PMC free article] [PubMed] [Google Scholar]
  63. Latorre IJ, Roh MH, Frese KK, Weiss RS, Margolis B, Javier RT. Viral oncoprotein-induced mislocalization of select PDZ proteins disrupts tight junctions and causes polarity defects in epithelial cells. Journal of cell science. 2005;118:4283–4293. doi: 10.1242/jcs.02560. [DOI] [PMC free article] [PubMed] [Google Scholar]
  64. Laurila K, Vihinen M. Prediction of disease-related mutations affecting protein localization. BMC genomics. 2009;10:122. doi: 10.1186/1471-2164-10-122. [DOI] [PMC free article] [PubMed] [Google Scholar]
  65. Lavine M, West M. A Bayesian method for classification and discrimination. Canadian Journal of Statistics. 1992;20:451–461. [Google Scholar]
  66. Liu DC, Nocedal J. On the limited memory BFGS method for large scale optimization. Mathematical programming. 1989;45:503–528. [Google Scholar]
  67. Liu Q, Lin KK, Andersen B, Smyth P, Ihler A. Estimating replicate time shifts using Gaussian process regression. Bioinformatics. 2010;26:770–776. doi: 10.1093/bioinformatics/btq022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  68. Luheshi LM, Crowther DC, Dobson CM. Protein misfolding and disease: from the test tube to the organism. Current opinion in chemical biology. 2008;12:25–31. doi: 10.1016/j.cbpa.2008.02.011. [DOI] [PubMed] [Google Scholar]
  69. Lunn DJ, Thomas A, Best N, Spiegelhalter D. WinBUGS-a Bayesian modelling framework: concepts, structure, and extensibility. Statistics and computing. 2000;10:325–337. [Google Scholar]
  70. Machete RL. Contrasting probabilistic scoring rules. Journal of Statistical Planning and Inference. 2013;143:1781–1790. [Google Scholar]
  71. Malsiner-Walli G, Frühwirth-Schnatter S, Grün B. Identifying mixtures of mixtures using Bayesian estimation. Journal of Computational and Graphical Statistics. 2017;26:285–295. doi: 10.1080/10618600.2016.1200472. [DOI] [PMC free article] [PubMed] [Google Scholar]
  72. Mendes M, Peláez-García A, López-Lucendo M, Bartolomé RA, Calviño E, Barderas R, Casal JI. Mapping the Spatial Proteome of Metastatic Cells in Colorectal Cancer. Proteomics. 2017;17:1700094. doi: 10.1002/pmic.201700094. [DOI] [PubMed] [Google Scholar]
  73. Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E. Equation of state calculations by fast computing machines. The journal of chemical physics. 1953;21:1087–1092. [Google Scholar]
  74. Morris JS. Functional regression. Annual Review of Statistics and Its Application. 2015;2:321–359. [Google Scholar]
  75. Mulvey CM, Breckels LM, Geladaki A, Britovšek NK, Nightingale DJ, Christo-forou A, Elzek M, Deery MJ, Gatto L, Lilley KS. Using hyperLOPIT to perform high-resolution mapping of the spatial proteome. Nature Protocols. 2017;12:1110–1135. doi: 10.1038/nprot.2017.026. [DOI] [PubMed] [Google Scholar]
  76. Murphy KP. Machine learning: a probabilistic perspective. The MIT Press; 2012. [Google Scholar]
  77. Murphy K, Murphy TB. Parsimonious Model-Based Clustering with Covariates. Advances in Data Analysis and Classification. 2019 [Google Scholar]
  78. Nightingale DJ, Geladaki A, Breckels LM, Oliver SG, Lilley KS. The subcellular organisation of Saccharomyces cerevisiae. Current opinion in chemical biology. 2019;48:86–95. doi: 10.1016/j.cbpa.2018.10.026. [DOI] [PMC free article] [PubMed] [Google Scholar]
  79. Nikolovski N, Rubtsov D, Segura MP, Miles GP, Stevens TJ, Dunkley TP, Munro S, Lilley KS, Dupree P. Putative glycosyltransferases and other plant Golgi apparatus proteins are revealed by LOPIT proteomics. Plant physiology. 2012;160:1037–1051. doi: 10.1104/pp.112.204263. [DOI] [PMC free article] [PubMed] [Google Scholar]
  80. Ohta S, Bukowski-Wills J-C, Sanchez-Pulido L, De Lima Alves F, Wood L, Chen ZA, Platani M, Fischer L, Hudson DF, Ponting CP, et al. The protein composition of mitotic chromosomes determined using multiclassifier combinatorial proteomics. Cell. 2010;142:810–821. doi: 10.1016/j.cell.2010.07.047. [DOI] [PMC free article] [PubMed] [Google Scholar]
  81. Olkkonen VM, Ikonen E. When intracellular logistics fails-genetic defects in membrane trafficking. Journal of cell science. 2006;119:5031–5045. doi: 10.1242/jcs.03303. [DOI] [PubMed] [Google Scholar]
  82. Orre LM, Vesterlund M, Pan Y, Arslan T, Zhu Y, Woodbridge AF, Frings O, Fredlund E, Lehtiö J. SubCellBarCode: Proteome-wide Mapping of Protein Localization and Relocalization. Molecular Cell. 2019;73:166–182.e7. doi: 10.1016/j.molcel.2018.11.035. [DOI] [PubMed] [Google Scholar]
  83. Parsons H, Fernández-Niño S, Heazlewood J. Separation of the plant Golgi apparatus and endoplasmic reticulum by free-flow electrophoresis. Methods in molecular biology (Clifton, NJ) 2014;1072:527. doi: 10.1007/978-1-62703-631-3_35. [DOI] [PubMed] [Google Scholar]
  84. Preda C, Saporta G, Lévéder C. PLS classification of functional data. Computational Statistics. 2007;22:223–235. [Google Scholar]
  85. Ramsay JO. Functional data analysis. Encyclopedia of Statistical Sciences. 2004;4 [Google Scholar]
  86. Rasmussen CE. Advanced lectures on machine learning. Springer; 2004. Gaussian processes in machine learning; pp. 63–71. [Google Scholar]
  87. Rasmussen CE, Williams CK. Gaussian processes for machine learning. MIT Press; 2006. [Google Scholar]
  88. Richardson S, Green PJ. On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society Series B (Statistical Methodology) 1997;59:731–792. [Google Scholar]
  89. Rodriguez JA, Au WW, Henderson BR. Cytoplasmic mislocalization of BRCA1 caused by cancer-associated mutations in the BRCT domain. Experimental cell research. 2004;293:14–21. doi: 10.1016/j.yexcr.2003.09.027. [DOI] [PubMed] [Google Scholar]
  90. Rodríguez A, Dunson DB, Gelfand AE. Bayesian nonparametric functional data analysis through density estimation. Biometrika. 2009;96:149–162. doi: 10.1093/biomet/asn054. [DOI] [PMC free article] [PubMed] [Google Scholar]
  91. Sadowski PG, Dunkley TP, Shadforth IP, Dupree P, Bessant C, Griffin JL, Lilley KS. Quantitative proteomic approach to study subcellular localization of membrane proteins. Nature protocols. 2006;1:1778–1789. doi: 10.1038/nprot.2006.254. [DOI] [PubMed] [Google Scholar]
  92. Schuurman N, Grasman R, Hamaker E. A comparison of inverse-wishart prior specifications for covariance matrices in multilevel autoregressive models. Multivariate Behavioral Research. 2016;51:185–206. doi: 10.1080/00273171.2015.1065398. [DOI] [PubMed] [Google Scholar]
  93. Shannon CE. A mathematical theory of communication. The Bell System Technical Journal. 1948;27:379–423. [Google Scholar]
  94. Shin SJ, Smith JA, Rezniczek GA, Pan S, Chen R, Brentnall TA, Wiche G, Kelly KA. Unexpected gain of function for the scaffolding protein plectin due to mislocalization in pancreatic cancer. Proceedings of the National Academy of Sciences. 2013;110:19414–19419. doi: 10.1073/pnas.1309720110. [DOI] [PMC free article] [PubMed] [Google Scholar]
  95. Shin JJ, Crook OM, Borgeaud A, Cattin-Ortolá J, Peak-Chew S-Y, Chadwick J, Lilley KS, Munro S. Determining the content of vesicles captured by golgin tethers using LOPIT-DC. bioRxiv. 2019:841965. [Google Scholar]
  96. Siljee JE, Wang Y, Bernard AA, Ersoy BA, Zhang S, Marley A, Von Zastrow M, Reiter JF, Vaisse C. Subcellular localization of MC4R with ADCY3 at neuronal primary cilia underlies a common pathway for genetic predisposition to obesity. Nat Genet. 2018 doi: 10.1038/s41588-017-0020-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  97. Steel MF, Fuentes M. Non-gaussian and nonparametric models for continuous spatial data. CRC press; 2010. [Google Scholar]
  98. Stegle O, Denby KJ, Cooke EJ, Wild DL, Ghahramani Z, Borgwardt KM. A robust Bayesian two-sample test for detecting intervals of differential gene expression in microarray time series. Journal of Computational Biology. 2010;17:355–367. doi: 10.1089/cmb.2009.0175. [DOI] [PMC free article] [PubMed] [Google Scholar]
  99. Tan DJ, Dvinge H, Christoforou A, Bertone P, Martinez Arias A, Lilley KS. Mapping organelle proteins and protein complexes in Drosophila melanogaster. Journal of proteome research. 2009;8:2667–2678. doi: 10.1021/pr800866n. [DOI] [PubMed] [Google Scholar]
  100. Tardif M, Atteia A, Specht M, Cogne G, Rolland N, Brugière S, Hippler M, Ferro M, Bruley C, Peltier G, et al. PredAlgo: a new subcellular localization prediction tool dedicated to green algae. Molecular biology and evolution. 2012;29:3625–3639. doi: 10.1093/molbev/mss178. [DOI] [PubMed] [Google Scholar]
  101. Thompson A, Schäfer J, Kuhn K, Kienle S, Schwarz J, Schmidt G, Neumann T, Hamon C. Tandem mass tags: a novel quantification strategy for comparative analysis of complex protein mixtures by MS/MS. Analytical chemistry. 2003;75:1895–1904. doi: 10.1021/ac0262560. [DOI] [PubMed] [Google Scholar]
  102. Thul PJ, Åkesson L, Wiking M, Mahdessian D, Geladaki A, Ait Blal H, Alm T, Asplund A, Björk L, Breckels LM, Bäckström A, et al. A subcellular map of the human proteome. Science. 2017 doi: 10.1126/science.aal3321. [DOI] [PubMed] [Google Scholar]
  103. Topa H, Jónás Á, Kofler R, Kosiol C, Honkela A. Gaussian process test for high-throughput sequencing time series: application to experimental evolution. Bioinformatics. 2015;31:1762–1770. doi: 10.1093/bioinformatics/btv014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  104. Wang J-L, Chiou J-M, Müller H-G. Functional data analysis. Annual Review of Statistics and Its Application. 2016;3:257–295. [Google Scholar]
  105. Wang L, Dunson DB. Fast Bayesian inference in Dirichlet process mixture models. Journal of Computational and Graphical Statistics. 2011;20:196–216. doi: 10.1198/jcgs.2010.07081. [DOI] [PMC free article] [PubMed] [Google Scholar]
  106. Williams CK, Rasmussen CE. Gaussian processes for regression. Advances in Neural Information Processing Systems. 1996:514–520. [Google Scholar]
  107. Zhang Y, Leithead WE, Leith DJ. Time-series Gaussian process regression based on Toeplitz computation of O(N²) operations and O(N)-level storage. Proceedings of the 44th IEEE Conference on Decision and Control and the 2005 European Control Conference (CDC-ECC '05); 2005. pp. 3711–3716. [Google Scholar]
  108. Zhu H, Brown PJ, Morris JS. Robust classification of functional and quantitative image data using functional mixed models. Biometrics. 2012;68:1260–1268. doi: 10.1111/j.1541-0420.2012.01765.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  109. Zhu H, Vannucci M, Cox DD. A Bayesian hierarchical model for classification with selection of functional predictors. Biometrics. 2010;66:463–473. doi: 10.1111/j.1541-0420.2009.01283.x. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data


Supplementary Materials
