Abstract
Necrotic enteritis (NE) is a serious disease of poultry caused by the bacterium C. perfringens. To identify proteins of C. perfringens that confer virulence with respect to NE, the protein secretions of four NE disease-producing strains and one baseline non-disease-producing strain of C. perfringens were examined. The problem then becomes a clustering task, for the identification of two extreme groups of proteins that were produced at either concordantly higher or concordantly lower levels across all four disease-producing strains compared to the baseline, when most of the proteins do not exhibit significant change across all strains. However, the existence of some nuisance proteins of discordant change may severely distort any biologically meaningful cluster pattern. We develop a tailored multivariate clustering approach to robustly identify the proteins of concordant change. Using a three-component normal mixture model as the skeleton, our approach incorporates several constraints to account for biological expectations and data characteristics. More importantly, we adopt a sparse mean-shift parameterization in the reference distribution, coupled with a regularized estimation approach, to flexibly accommodate proteins of discordant change. We explore the connections and differences between our approach and other robust clustering methods, and resolve the issue of unbounded likelihood under an eigenvalue-ratio condition. Simulation studies demonstrate the superior performance of our method compared with a number of alternative approaches. Our protein analysis along with further biological investigations may shed light on the discovery of the complete set of virulence factors in NE.
Keywords: Clustering, Multivariate mixture model, Penalized estimation, Proteomics, Robust estimation
1 Introduction
Necrotic enteritis (NE) is a serious and often fatal disease of chickens and turkeys, with reports of up to 50% mortality (Wijewanta and Seneviratna, 1971) before effective control measures were introduced. The disease is caused by the bacterium Clostridium perfringens (C. perfringens), which proliferates in the small intestine and damages the gut lining, leading to extensive necrosis (death of tissue) of the mucosa and inflammation (Figure 1). NE is one of the most important diseases of commercially raised poultry and poses major welfare and economic concerns for the broiler chicken and turkey industries (Lovland and Kaldhusdal, 2001). Even with widespread use of antibiotic growth promoters (AGPs), NE was estimated to cost $0.05 per broiler chicken in the U.S. in the year 2000 (McDevitt et al., 2006), which would translate to a loss of more than $400 million to the U.S. chicken industry in 2014.
Figure 1.
Left panel: normal jejunum. Right panel: jejunum affected by necrotic enteritis. The uniformly pink material in the upper half is dead tissue (necrotic) which is coated with, and admixed with pockets of, the bacterium, C. perfringens (blue). Both images have been captured at the same magnification.
AGPs have been banned in many countries and their use is under threat in U.S. poultry production, not least because of public perception and increased organic production of birds. If AGPs are removed, NE is expected to become a much bigger problem in the U.S. Therefore, there is an urgent need to develop alternative non-antibiotic based preventative measures for NE. To do this, it is critically important to understand the mechanisms by which disease-producing strains of C. perfringens become established and how they produce disease. C. perfringens has a large genome and produces hundreds of proteins (Shimizu et al., 2002). NetB, a novel pore-forming toxin, has been strongly incriminated as the key virulence factor in development of NE in chickens (Keyburn et al., 2008). However, Smyth and Martin (2010) discovered that while NetB appears to be essential for virulence, some other as yet unidentified virulence factors may also be involved. Comparison of the proteins produced by a netB-positive non-disease-producing strain to those produced by the disease-producing strains should allow identification of additional virulence factors.
The objective of this study is thus to develop a sound statistical approach to identify proteins of C. perfringens that could confer virulence with respect to NE, using available protein relative intensity data produced from mass spectrometry experiments. We examined the entire secretome of four NE-producing strains as well as one non-NE-producing strain of C. perfringens, all of which are netB positive. The relative intensity levels of over 2,500 proteins were measured in three replicated experiments. Sometimes a protein was not observed in all three replications, mainly due to technological limitations, e.g., low-abundance proteins will not reach the detection threshold set by the protein identification software. Our primary interest is to identify proteins with concordantly increased or concordantly decreased intensity levels across all the disease-producing strains compared to the non-disease-producing strain, which can be, biologically, considered as candidate virulence factors.
Indeed, biological expectations and empirical evidence obtained in the present analysis both suggest that the proteins mainly fall into three categories: (i) similar intensity levels across all the strains, (ii) lower intensity levels across all the disease-producing strains compared to the non-disease-producing strain, and (iii) higher intensity levels across all the disease-producing strains compared to the non-disease-producing strain. Moreover, in the observed data it appears that the protein intensity levels in the four disease-producing strains are positively related. In order to identify the proteins of concordant change, it is thus both necessary and desirable to analyze all the strains simultaneously. Using a four-dimensional vector with each component being the log intensity ratio (LIR) between each of the disease-producing strains and the non-disease-producing strain, the task can then be treated as a clustering problem. For the proteins belonging to category (i), it is expected that their observed four-dimensional vectors of LIR values tend to be centered around the zero vector, which may be modeled collectively by a multivariate distribution of zero mean. Similarly, for proteins belonging to category (ii) or (iii), their LIR profiles can be modeled by a certain multivariate distribution with a mean vector of negative components or a mean vector of positive components, respectively.
This motivates us to consider a multivariate mixture model (Everitt and Hand, 1981; Lindsay and Lesperance, 1995; McLachlan and Peel, 2000) for clustering the proteins into three categories based on the replicated but incomplete LIR data. Arguably, the most popular mixture model for model-based clustering is the finite normal mixture model. The Expectation-Maximization (EM) algorithm (Dempster et al., 1977) can be used to obtain maximum likelihood estimates for the mixture-model parameters. Many other mixture models and estimation approaches have been developed, mainly due to robustness concerns. To list a few, Peel and McLachlan (2000) studied a mixture model of multivariate t distributions, which provides a heavy-tailed alternative to the normal mixture; see also Andrews et al. (2011), Lee and McLachlan (2013) and Bai et al. (2016). Song et al. (2014) considered a multivariate mixture model of Laplace distributions. Markatou (2000) developed a weighted likelihood approach to robustify mixture model estimation, Neykov et al. (2007) proposed to fit robust mixtures using trimmed likelihood, and Yu et al. (2015) proposed a mean-shift penalization method to robustify the univariate normal mixture model. In the case of univariate data, a mixture model of three components is commonly used for modeling heavy-tailed distributions arising from multiple testing or multiple comparison problems (Allison et al., 2002; Bar and Lillard, 2012; Bar et al., 2012), in which one component represents the reference/null distribution and the other two are designated to accommodate the extreme observations at the two tails.
As soon as we venture into the multivariate data analysis territory, the mixture modeling task becomes much more complicated. In particular, a three-component multivariate mixture model mentioned above, consisting of a reference distribution at the center and two distributions of concordant change at the “tails”, may become far from adequate to characterize the entire heavy-tailed behavior of a multivariate distribution. This is simply because the two extra components can only cover two specific corners of the multivariate feature space. This phenomenon in our protein clustering problem is revealed by the existence of proteins of discordant change, i.e., the intensity of the protein increases in some of the disease-producing strains but decreases or stays the same in others. Such proteins are nuisances as they are not of direct interest to biologists, but failing to accommodate them may jeopardize both the model estimation and the detection of the proteins of concordant change. This issue is certainly not trivial. Generally it is not feasible to address the problem by simply introducing more components into the mixture model, as the proteins of discordant change are expected to be relatively uncommon and, more importantly, they could appear in any corner of the multivariate feature space. Another potential approach is to use a mixture of linear mixed models with certain protein-specific random effects; see, e.g., Celeux et al. (2005) and Ng et al. (2006) where such models were proposed to cluster gene expression profiles across experimental conditions and replications. However, the individual random effects were usually modeled by a continuous distribution such as normal in these mixed-effects models. Such a continuous distribution is unsuitable for our problem because, while the few proteins of discordant change may exhibit drastic individual outlying effects, the majority of proteins are expected to follow the three-component mixture model so that their individual effects are negligible.
We develop a new multivariate clustering approach, tailored to robustly identify the proteins of concordant intensity change using replicated LIR data. Our method uses the simple three-component multivariate normal mixture model as its skeleton, but we incorporate several parameter constraints and adopt a mean-shift reference distribution, in order to inherit the simplicity of the normal mixture and yet make it suitable for handling real-world multivariate clustering problems. As the protein experiments were replicated three times but a protein may not be observed in all replications, our method is built to handle data with missing replicates. Built upon the classical normal mixture model with certain application-motivated parameter constraints, we further adopt a mean-shift model formulation coupled with regularized estimation, to conveniently accommodate the observations of discordant change. Motivated by She and Owen (2011), Lee et al. (2012) and Yu et al. (2015), a protein-specific mean-shift vector is added to the mean structure of each multivariate observation in the reference component of the mixture model. If the mean-shift vector of a protein is a zero vector, the model for this protein reduces to the regular three-component mixture model. On the other hand, if a mean-shift vector is nonzero, it suggests that the corresponding protein deviates from the three specified components, and its deviation is automatically adjusted by the inclusion of the mean-shift vector. Therefore, sparse estimation of the mean-shift vectors realizes the robustness of our tailored multivariate mixture approach. We propose a general penalized likelihood estimation criterion, and develop an efficient generalized EM algorithm for model estimation. The connection between our approach and the trimmed likelihood method (García-Escudero et al., 2008) is also explored.
We describe the protein data and the motivations behind using a clustering approach in Section 2. In Section 3 we develop our proposed constrained multivariate mixture model with sparse mean-shift, and present details on the regularized estimation criterion and its optimization. The efficacy of the proposed method is further demonstrated via simulation studies in Section 4. The protein clustering problem is thoroughly studied in Section 5. We conclude the paper with a brief discussion in Section 6.
2 Protein Data and Problem Setup
The data consist of three sets of relative intensity levels of over 2,500 proteins along the entire secretomes of four disease-producing strains as well as one non-disease-producing strain of C. perfringens, all of which are netB positive. The experiment was replicated 3 times. Each set represents a biological replicate. A protein may not be observed in all three replications. The relative intensity was observed as zero in a small fraction of observations; these proteins were examined by the biologists but were not used in the statistical analysis, to avoid unbounded values of the log intensity ratios introduced below. There were a total of n = 2,492 proteins in our statistical investigation. More details on the data are provided in the Supplementary Materials.
Let zi,j,s be the relative intensity level of the ith protein in the jth strain at the sth replicated experiment, for s ∈ 𝒮i, j = 0, …, m, and i = 1, …, n. Here m = 4 with j = 0 corresponding to the non-disease-producing strain, and 𝒮i ⊂ {1, …, S} with S = 3; that is, 𝒮i is the index set of the replicates where protein i is observed. We have for each s ∈ 𝒮i and i = 1, …, n. We profile each protein using its vector of log intensity ratios (LIR) between the four disease-producing strains and the non-disease-producing strain, as the latter establishes the baseline for examining the intensity changes. Specifically, the LIR value, denoted as yi,j,s, is defined as
$$ y_{i,j,s} = \log\left(\frac{z_{i,j,s}}{z_{i,0,s}}\right), \qquad j = 1, \ldots, m. $$
Denote yi·s = (yi,1,s, …, yi,m,s) ∈ ℝm as the observed LIR vector of the ith protein in the sth experiment, for s ∈ 𝒮i, i = 1, …, n. Let |𝒮i| denote the cardinality of the index set 𝒮i, so that the total sample size of the multivariate LIR vectors is N = ∑i |𝒮i|, summing over i = 1, …, n. The percentage of missing replications, i.e., 1 − N/(nS), is 12.6%.
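The construction of the LIR vectors and the missing-replication rate can be illustrated with a short sketch. It assumes the intensities are stored in an array of shape (n, m + 1, S) with the baseline strain in position 0, that unobserved replicates are marked with NaN, and that the natural logarithm is used; the function names and the toy data are hypothetical.

```python
import numpy as np

def log_intensity_ratios(z):
    """z has shape (n, m + 1, S); z[:, 0, :] is the baseline strain.
    Unobserved replicates are NaN. Returns LIR values of shape (n, m, S)."""
    z = np.asarray(z, dtype=float)
    # log(z_j / z_0) for each disease-producing strain j = 1, ..., m
    return np.log(z[:, 1:, :]) - np.log(z[:, :1, :])

def missing_fraction(y):
    """Fraction 1 - N / (n * S) of (protein, replicate) pairs without an LIR vector."""
    observed = ~np.isnan(y).any(axis=1)  # shape (n, S)
    n, S = observed.shape
    return 1.0 - observed.sum() / (n * S)

# Toy data: 2 proteins, baseline plus m = 2 strains, S = 3 replicates.
z = np.array([
    [[1.0, 1.0, np.nan], [2.0, 4.0, np.nan], [1.0, 1.0, np.nan]],
    [[2.0, 2.0, 2.0], [2.0, 2.0, 2.0], [1.0, 1.0, 1.0]],
])
y = log_intensity_ratios(z)
```

In this toy example the first protein is unobserved in the third replicate, so five of the six possible LIR vectors are available.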
Figure 2(a) displays the normal QQ-plots of the average LIR values (averaged over replications) for the four different disease-producing strains. The pairwise scatterplots, marginal density plots and pairwise correlation coefficients are shown in Figure 2(b). The LIR values are mostly concentrated around zero, indicating that in each disease-producing strain, the intensity levels of the majority of the proteins are similar to those in the non-disease-producing strain. In each of the four strains, the marginal distribution exhibits much heavier tails than a normal distribution. The LIR values appear to be statistically positively correlated, and in all the pairwise plots, it appears that the dominating variation indeed tends to be approximately along the 45° line, i.e., the direction of concordant change.
Figure 2.
Protein relative intensity analysis: (a) normal QQ-plots and (b) pairwise scatterplots of log intensity ratio by strains. The plots are constructed from data averaged across replications. The marginal density plots and the pairwise correlation coefficients are also shown in (b).
Our primary interest is to detect proteins of concordant change, i.e., the proteins with concordantly increased or decreased intensity levels across all the disease-producing strains. Indeed, the above empirical evidence suggests that most of the proteins fall into the aforementioned three classes (i)–(iii), so the problem becomes how to infer the class memberships of the proteins by jointly analyzing all the strains and proteins together. Naturally, the task can be thought of as a clustering problem with the multivariate data {yi·s; s ∈ 𝒮i, i = 1, …, n}.
3 Multivariate Mixture Approach for Robust Protein Clustering
3.1 A Naive Normal Mixture Model with Replicated Data
Consider a multivariate normal mixture model, where the joint probability density function (pdf) of yi·s ∈ ℝm, s ∈ 𝒮i, is given by
$$ f(\mathbf{y}_{i\cdot s}, s \in \mathcal{S}_i; \boldsymbol{\theta}) = \sum_{k=1}^{K} \pi_k \prod_{s \in \mathcal{S}_i} \phi(\mathbf{y}_{i\cdot s} - \alpha_s \mathbf{1}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k). \tag{1} $$
Here K is the number of mixture components, πk’s are the mixing proportion parameters satisfying π1 + ⋯ + πK = 1, αs’s are the overall replication effects, 1 is the m-vector of ones, ϕ(·; μ, Σ) denotes the pdf of the normal distribution N(μ, Σ) with mean μ and a positive definite (p.d.) covariance matrix Σ, and θ = {αs; πk; μk; Σk} denotes the collection of all the parameters in the model. This model assumes that the LIR vectors observed on the ith protein, i.e., yi·s, s ∈ 𝒮i, after adjusting for the overall replication effects, are considered as independent samples from the normal distribution N(μk, Σk) with probability πk, k = 1, …, K. Since most of the protein profiles are expected to belong to either the reference component or one of the two components of concordant change, in the sequel the number of components is chosen as K = 3 unless otherwise noted.
The normal mixture likelihood is given by L(θ) = ∏i f(yi·s, s ∈ 𝒮i; θ), with the product taken over i = 1, …, n, where f is defined in (1). The EM algorithm (Dempster et al., 1977; Meng and Rubin, 1993) is commonly used to conduct maximum likelihood estimation (MLE), which usually results in a local solution of the problem. It is well known that the unconstrained normal mixture likelihood is unbounded, which may make the MLE unstable. In practice one can still run the EM algorithm and, with reasonable initial values, the sequence of iterates typically stays away from singularities on the boundary of the parameter space. Alternatively, several authors considered imposing certain constraints on the scales of the clusters to address the issue. For example, Chen and Li (2009) discussed the drawbacks of a normal mixture model: in addition to the unbounded likelihood, they showed that the Fisher information can be infinite and that the strong identifiability condition is not satisfied. They proposed a modified log-likelihood function in order to solve these problems. In this paper, we adopt an eigenvalue-ratio (ER) constraint (Hathaway, 1985; García-Escudero et al., 2008):
(A0) Let δjk = δj(Σk), j = 1, …, m, be the eigenvalues of Σk, for k = 1, …, K. Let δmax = maxj,k δjk and δmin = minj,k δjk. Assume that, for a fixed constant c ≥ 1, it holds that δmax/δmin ≤ c.
This condition implies that the Σk’s are positive definite. It was shown that under A0, the constrained maximum likelihood estimator becomes well-defined, as long as the observed data points are not concentrated on only K points. The constant c allows flexible control for handling the different scaling of the clusters.
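A minimal sketch of how the eigenvalue-ratio condition can be checked for a given set of covariance matrices is shown here. The function names are hypothetical, and the default c = 10 is only an illustrative choice.

```python
import numpy as np

def eigenvalue_ratio(covs):
    """Ratio of the largest to the smallest eigenvalue, pooled over all
    component covariance matrices (each assumed symmetric)."""
    eigs = np.concatenate([np.linalg.eigvalsh(cov) for cov in covs])
    return eigs.max() / eigs.min()

def satisfies_er(covs, c=10.0):
    """True when the pooled eigenvalue ratio is at most c."""
    return bool(eigenvalue_ratio(covs) <= c)

# Two toy covariance matrices with pooled eigenvalues {1, 1, 0.5, 4}.
covs = [np.eye(2), np.diag([0.5, 4.0])]
```

With these toy matrices the pooled ratio is 4 / 0.5 = 8, so the condition holds for c = 10 but fails for c = 5.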
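The cluster membership of each protein is unknown, but can be inferred probabilistically. Let zik = 1 if the ith protein is from the kth component, and zik = 0 otherwise. Then given the data, the conditional probability of the ith protein being from the kth mixture component (k = 1, …, K) is given by
$$ \Pr(z_{ik} = 1 \mid \mathbf{y}_{i\cdot s}, s \in \mathcal{S}_i; \boldsymbol{\theta}) = \frac{\pi_k \prod_{s \in \mathcal{S}_i} \phi(\mathbf{y}_{i\cdot s} - \alpha_s \mathbf{1}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{l=1}^{K} \pi_l \prod_{s \in \mathcal{S}_i} \phi(\mathbf{y}_{i\cdot s} - \alpha_s \mathbf{1}; \boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l)}. $$
To cluster the proteins, the cluster membership of the ith protein can be determined based upon Bayes’ rule, i.e.,
$$ \hat{k}_i = \arg\max_{1 \le k \le K} \Pr(z_{ik} = 1 \mid \mathbf{y}_{i\cdot s}, s \in \mathcal{S}_i; \boldsymbol{\theta}). $$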
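The posterior membership probabilities and the Bayes-rule assignment can be sketched as follows for a single protein. This is an illustration under stated assumptions rather than the authors' implementation: the replication effects are taken to be scalars subtracted from every coordinate, the computation is done on the log scale for numerical stability, and all function names and parameter values are hypothetical.

```python
import numpy as np

def log_mvn(x, mean, cov):
    """Log density of N(mean, cov) evaluated at each row of x."""
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("ij,ij->i", d, np.linalg.solve(cov, d.T).T)
    return -0.5 * (x.shape[1] * np.log(2.0 * np.pi) + logdet + maha)

def posterior_probs(y_obs, alpha_obs, pis, mus, covs):
    """y_obs: (r, m) observed LIR vectors of one protein.
    alpha_obs: (r,) replication effects of the observed replicates.
    Returns the K posterior membership probabilities."""
    resid = y_obs - alpha_obs[:, None]
    # log pi_k plus the sum of replicate log densities for each component
    logw = np.array([np.log(p) + log_mvn(resid, mu, cov).sum()
                     for p, mu, cov in zip(pis, mus, covs)])
    w = np.exp(logw - logw.max())  # subtract the max before exponentiating
    return w / w.sum()

pis = [0.8, 0.1, 0.1]
mus = [np.zeros(2), np.array([-3.0, -3.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), np.eye(2), np.eye(2)]
y_obs = np.array([[3.1, 2.9], [2.8, 3.2]])  # two observed replicates
probs = posterior_probs(y_obs, np.zeros(2), pis, mus, covs)
label = int(np.argmax(probs))  # Bayes rule: most probable component (0-based)
```

Because the replicate densities are multiplied, a protein observed in more replicates generally receives a sharper posterior than one observed only once.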
The model in (1) is a naive multivariate extension of the univariate analysis, in which each strain is separately analyzed using a univariate mixture model. Although empirical evidence suggests that most of the proteins fall into three categories (i)–(iii), there are severe drawbacks to using a naive three-component normal mixture model such as (1). The normal mixture model is not robust against data outliers, namely, proteins of discordant change. Due to nonconvexity, maximum likelihood estimation of its parameters could be unstable. The large number of parameters, including the mixing proportions, the mean vectors and the covariance matrices, makes reliable estimation even more challenging.
3.2 A Tailored Multivariate Mixture Model with Sparse Mean Shift
To facilitate the protein analysis application, we develop a tailored multivariate clustering approach to robustly perform mixture model estimation and to identify the proteins of concordant intensity change.
Our proposed model still uses the three-component multivariate normal mixture model in (1) as its skeleton, under the same eigenvalue-ratio condition given in A0. It is important to take into account the prior biological expectations as well as some unique features of the observed data, in order to reduce the number of free parameters and enhance the model stability and interpretability. Building upon (1), we introduce several structural constraints on the parameters. We set
(A1) μ1 = 0,
so that the first component represents the reference distribution of the proteins with no significantly different intensity levels across all the strains. For the reference component, we allow Σ1 to be an arbitrary positive definite covariance matrix, to take advantage of the potential correlation among the disease-producing strains. The other two components are meant to capture the proteins of concordant change. Without loss of generality, we assume
(A2) μ2 < 0 and μ3 > 0, i.e., μ2j < 0, μ3j > 0 for all j = 1, …, m.
Henceforth, we refer to the reference component, the component of concordantly decreased intensities, and the component of concordantly increased intensities as components 1, 2 and 3, respectively. Now consider the setup of the covariance matrices. As the proteins in components 2 and 3 are considered as anomalies to the reference component, there is no reason to believe that their LIRs have exactly the same covariance structure as that of the reference; this can be seen from the observed data patterns shown in Section 2. On the other hand, allowing the two covariance matrices to be completely arbitrary may cause instability in estimation, since the total number of proteins in the components of concordant change is expected to be much smaller than that in the reference. Furthermore, the estimation of an unconstrained covariance matrix becomes even less stable as m (the number of strains) increases. Therefore, we consider
(A3) Σ2 and Σ3 are diagonal matrices.
While the variances of the tail observations are allowed to vary across both the strains and the components, the correlation structures at the two tails are no longer incorporated; consequently, the potential extreme behaviors of the tail observations do not have any direct effect on the estimation of the correlation pattern in the reference distribution. Assumptions A1–A3 are all motivated by the problem of interest and are based on strong empirical evidence as shown in Section 2. As the clustering problem in general is nonconvex and may admit multiple local solutions, the adoption of the above assumptions can help guide the clustering analysis to recover the desired mixture pattern. (We remark that our implementation is able to handle general forms of linear constraints on the mean vectors, e.g., μ2 = −μ3, and other forms of covariance patterns, e.g., shared correlation structure in all components, which could be more desirable in some other applications.)
At first glance, with the proposed constraints, model (1) appears to be well suited to handle the protein clustering problem. However, it is soon realized that the constrained three-component model still does not meet the needs of the application due to its inability to fully characterize the potential heavy-tailed behavior of a multivariate distribution. As discussed in Section 1, the three-component model may easily fail when there exist proteins whose multivariate intensity profiles severely deviate from the three major profiles described above. Indeed, in our application, due to intrinsic heterogeneity among the disease-producing strains, it is plausible that the intensity of a certain protein may significantly increase in some strains and decrease in the others. These proteins in general are not of direct interest to biologists, but leaving them to enter either one of the three specified components may severely distort model estimation and inference. The presence of such points may inflate the estimated variance terms, which in turn may decrease the power of discovering the proteins of interest. Furthermore, the reference distribution may not necessarily be normal in real applications, so even without proteins of discordant change, it may still exhibit certain heavy-tailed behaviors. To this end, introducing extra components into the mixture model may also easily fail, because there are essentially infinite numbers of corners or directions in the multivariate space where the extreme but uncommon observations may appear.
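To resolve these difficulties in the context of our protein clustering problem, we further generalize model (1) to the following mean-shift multivariate mixture model,
$$ f(\mathbf{y}_{i\cdot s}, s \in \mathcal{S}_i; \boldsymbol{\theta}, \boldsymbol{\Gamma}) = \pi_1 \prod_{s \in \mathcal{S}_i} \phi(\mathbf{y}_{i\cdot s} - \alpha_s \mathbf{1} - \boldsymbol{\gamma}_i; \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) + \sum_{k=2}^{K} \pi_k \prod_{s \in \mathcal{S}_i} \phi(\mathbf{y}_{i\cdot s} - \alpha_s \mathbf{1}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \tag{2} $$
where 1 is the m-vector of ones and we have defined Γ = {γi; i = 1, …, n} to collect the extra mean-shift parameters. We assume that the sets of observations {yi·s, s ∈ 𝒮i}, i = 1, …, n, for different proteins are independent of each other. All the other settings are exactly the same as in model (1), as are the assumptions on μk and Σk presented in A1–A3.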
Here an extra mean-shift term, γi, is added to the mean structure in the reference distribution. Without any additional assumptions on the γi’s, it is apparent that the model is severely over-parameterized. The key is to assume that the γi vectors admit certain low-dimensional structures. Specifically, in our protein problem, it is expected that most of the γi vectors are zero while a small fraction of them contain nonzero values. Since the majority of observed protein profiles fit either the reference profile or the two tails of concordant change, it is adequate to use the regular three-component mixture model to characterize these proteins, and consequently their corresponding mean-shift vectors are all zero. On the other hand, when a protein belongs to neither the reference distribution nor the two distributions of concordant change, its LIR profile is then modeled as deviating from the reference distribution with a nonzero mean shift γi, which in turn automatically captures and adjusts for the outlying effects of that protein. Our method thus generalizes the robust mixture model proposed by Yu et al. (2015), in order to handle general multivariate clustering tasks with replicated data, missing values and flexible parameter constraints. Moreover, the mean-shift is only applied in the reference distribution in our model rather than in all the cluster components, which is motivated by the protein application.
3.3 Estimation Criterion
Using model (2), the problem of accommodating the nuisance but extreme observations becomes the sparse estimation problem of the mean-shift vectors. Let 𝒴 = {yi·s, s ∈ 𝒮i; i = 1, …, n} denote the observed data. We propose to conduct model estimation by maximizing a penalized log-likelihood criterion of the form
$$ \max_{\boldsymbol{\theta}, \boldsymbol{\Gamma}} \left\{ \ell(\boldsymbol{\theta}, \boldsymbol{\Gamma}; \mathcal{Y}) - \sum_{i=1}^{n} \rho(\boldsymbol{\gamma}_i; \lambda) \right\}, \tag{3} $$
where ℓ(θ, Γ; 𝒴) = ∑i log f(yi·s, s ∈ 𝒮i; θ, Γ) is the log-likelihood function with the mixture density f defined in (2), γi is the mean-shift vector of the ith protein, and ρ(·; λ) is a sparsity-inducing penalty function with λ > 0 being a tuning parameter (Tibshirani, 1996; Fan and Li, 2001). The sparse learning and regularization estimation approaches have undergone exciting developments; see, e.g., Bühlmann and van de Geer (2009), Huang et al. (2008) and Fan and Lv (2011) for some comprehensive reviews. In the above criterion, the joint log-likelihood function of the observed data reflects the goodness of fit of the model, while the regularization term reflects the model complexity in terms of the sparsity pattern or other low-dimensional structures of the γi’s. The tuning parameter λ controls the amount of regularization and thus balances the model goodness of fit against the model complexity.
There are many choices of the penalty function in (3). For inducing entry-wise sparsity, we may consider the lasso penalty of the form ρ(γi; λ) = λ||γi||1 (Tibshirani, 1996) or the ℓ0 penalty of the form ρ(γi; λ) = (λ²/2)||γi||0 = (λ²/2) ∑j I(γij ≠ 0), where we use || · ||q to denote the ℓq norm for q ≥ 0 and use I(·) to denote an indicator function. These penalties penalize each entry of a mean-shift vector, which induces sparsity of the vector. Adopting such entrywise penalties may be helpful in examining exactly from which dimensions a discordant protein deviates from the reference. In the protein application, since the main purpose of penalized estimation is to accommodate the nuisance proteins of discordant intensity change for facilitating the identification of the other components of biological interest, it is preferable to directly induce the vectorwise sparsity of each γi. Specifically, we focus on the group ℓ0 penalty (She, 2012)
$$ \rho(\boldsymbol{\gamma}_i; \lambda) = \frac{\lambda^2}{2} I(\|\boldsymbol{\gamma}_i\|_2 \neq 0). \tag{4} $$
When ||γi||2 is shrunk to zero, the entire mean-shift vector γi becomes a zero vector; that is, the profile of protein i follows the regular three-component mixture model in (1). When ||γi||2 is not zero, its contribution to the penalty term is always λ²/2, which means that its estimate can be chosen to fully adjust for the outlying effects of the protein. In contrast, when a convex penalty such as group ℓ1 is used, the penalty grows with the magnitude of γi, so that the outlying effect of a discordant protein may not be fully adjusted. We refer to Yu et al. (2015) and She and Chen (2017) for more discussions on the drawbacks of using a convex penalty for handling gross outliers due to its non-vanishing shrinkage effect.
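The contrast between the two penalties can be made concrete in the simplest setting of minimizing ½||v − γ||² plus a penalty, where v plays the role of an unpenalized estimate. The sketch below assumes the group ℓ0 penalty (λ²/2)·I(||γ||2 ≠ 0) and the group ℓ1 penalty λ||γ||2; the function names are hypothetical.

```python
import numpy as np

def group_hard_threshold(v, lam):
    """Minimizer of 0.5*||v - g||^2 + (lam^2 / 2) * I(||g||_2 != 0):
    keep v untouched when its norm exceeds lam, otherwise return zero."""
    return v.copy() if np.linalg.norm(v) > lam else np.zeros_like(v)

def group_soft_threshold(v, lam):
    """Minimizer of 0.5*||v - g||^2 + lam * ||g||_2:
    shrink the whole vector toward zero by lam in Euclidean norm."""
    norm = np.linalg.norm(v)
    if norm <= lam:
        return np.zeros_like(v)
    return (1.0 - lam / norm) * v

v = np.array([6.0, 8.0])            # Euclidean norm 10
hard = group_hard_threshold(v, 2.0)
soft = group_soft_threshold(v, 2.0)
```

Here the hard rule returns v unchanged, whereas the soft rule returns 0.8·v: the shrinkage does not vanish however large the outlying effect is.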
3.4 Computational Algorithm
We derive a generalized EM algorithm to maximize the proposed regularized log-likelihood criterion in (3), with vectorwise sparsity-inducing penalties given in (4). Here we shall mainly consider the three-component model setup used in the protein problem with constraints A0–A3. The method can be readily extended to other settings and penalty forms; some examples are given in the Supplementary Materials.
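Recall that we have defined
$$ z_{ik} = \begin{cases} 1, & \text{if the } i\text{th protein is from the } k\text{th component}, \\ 0, & \text{otherwise}. \end{cases} $$
Let zi = (zi1, …, ziK) and 𝒵 = {zi; i = 1, …, n}. The complete-data log-likelihood function given (𝒴, 𝒵) can be written as
$$ \ell_c(\boldsymbol{\theta}, \boldsymbol{\Gamma}; \mathcal{Y}, \mathcal{Z}) = \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \left\{ \log \pi_k + \sum_{s \in \mathcal{S}_i} \log \phi(\mathbf{y}_{i\cdot s} - \alpha_s \mathbf{1} - \boldsymbol{\gamma}_i I(k = 1); \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}. \tag{5} $$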
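In the E-step of the EM algorithm, we compute the conditional expectation of the complete-data log-likelihood given 𝒴 and the current parameter estimates (θᶜ, Γᶜ), i.e., Q(θ, Γ | θᶜ, Γᶜ) ≡ 𝔼{ℓc(θ, Γ; 𝒴, 𝒵) | θᶜ, Γᶜ, 𝒴}. The problem boils down to computing p̂ik ≡ 𝔼(zik | θᶜ, Γᶜ, 𝒴):
$$ \hat{p}_{ik} = \frac{\pi_k^{c} \prod_{s \in \mathcal{S}_i} \phi(\mathbf{y}_{i\cdot s} - \alpha_s^{c} \mathbf{1} - \boldsymbol{\gamma}_i^{c} I(k = 1); \boldsymbol{\mu}_k^{c}, \boldsymbol{\Sigma}_k^{c})}{\sum_{l=1}^{K} \pi_l^{c} \prod_{s \in \mathcal{S}_i} \phi(\mathbf{y}_{i\cdot s} - \alpha_s^{c} \mathbf{1} - \boldsymbol{\gamma}_i^{c} I(l = 1); \boldsymbol{\mu}_l^{c}, \boldsymbol{\Sigma}_l^{c})}. \tag{6} $$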
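It then follows that, with p̂ik denoting the conditional expectation in (6),
$$ Q(\boldsymbol{\theta}, \boldsymbol{\Gamma} \mid \boldsymbol{\theta}^{c}, \boldsymbol{\Gamma}^{c}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \hat{p}_{ik} \log \pi_k + \sum_{i=1}^{n} \sum_{k=1}^{K} \hat{p}_{ik} \sum_{s \in \mathcal{S}_i} \log \phi(\mathbf{y}_{i\cdot s} - \alpha_s \mathbf{1} - \boldsymbol{\gamma}_i I(k = 1); \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k). \tag{7} $$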
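In the M-step, to update (θ, Γ), the following penalized criterion needs to be maximized,
$$ Q(\boldsymbol{\theta}, \boldsymbol{\Gamma} \mid \boldsymbol{\theta}^{c}, \boldsymbol{\Gamma}^{c}) - \sum_{i=1}^{n} \rho(\boldsymbol{\gamma}_i; \lambda). $$
From (7), it is clear that the criterion is separable in the mixing parameters πk’s, for which the maximizers are obtained explicitly from the conditional expectations p̂ik in (6),
$$ \hat{\pi}_k = \frac{1}{n} \sum_{i=1}^{n} \hat{p}_{ik}, \qquad k = 1, \ldots, K. \tag{8} $$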
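For updating other parameters, the problem becomes
$$ \max_{\alpha_s, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k, \boldsymbol{\Gamma}} \left\{ -\frac{1}{2} \sum_{i=1}^{n} \sum_{k=1}^{K} \hat{p}_{ik} \sum_{s \in \mathcal{S}_i} \left( \log |\boldsymbol{\Sigma}_k| + \mathbf{e}_{isk}^{\top} \boldsymbol{\Sigma}_k^{-1} \mathbf{e}_{isk} \right) - \sum_{i=1}^{n} \rho(\boldsymbol{\gamma}_i; \lambda) \right\} \tag{9} $$
subject to constraints A1–A3, where p̂ik is the conditional expectation in (6) and
$$ \mathbf{e}_{isk} = \mathbf{y}_{i\cdot s} - \alpha_s \mathbf{1} - \boldsymbol{\mu}_k - \boldsymbol{\gamma}_i I(k = 1). $$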
The above problem has several blocks of parameters, and the presence of both multiple constraints and nonsmooth regularization makes the optimization even more challenging. Here we consider a blockwise coordinate ascent algorithm to iteratively optimize each block of parameters while holding others fixed at their most current solutions.
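With other parameters held fixed, the replication effects αs are updated as
$$ \hat{\alpha}_s = \frac{\sum_{i:\, s \in \mathcal{S}_i} \sum_{k=1}^{K} \hat{p}_{ik} \mathbf{1}^{\top} \boldsymbol{\Sigma}_k^{-1} \{\mathbf{y}_{i\cdot s} - \boldsymbol{\mu}_k - \boldsymbol{\gamma}_i I(k = 1)\}}{\sum_{i:\, s \in \mathcal{S}_i} \sum_{k=1}^{K} \hat{p}_{ik} \mathbf{1}^{\top} \boldsymbol{\Sigma}_k^{-1} \mathbf{1}}, \qquad s = 1, \ldots, S, \tag{10} $$
where p̂ik is the conditional expectation in (6) and 1 is the m-vector of ones.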
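From A1, we always set μ1 = 0, in order to place the reference distribution centered at the origin. The problem (9) is quadratic in each μk; without any constraints, they are updated as
$$ \hat{\boldsymbol{\mu}}_k = \frac{\sum_{i=1}^{n} \hat{p}_{ik} \sum_{s \in \mathcal{S}_i} (\mathbf{y}_{i\cdot s} - \alpha_s \mathbf{1})}{\sum_{i=1}^{n} \hat{p}_{ik} |\mathcal{S}_i|}, \qquad k = 2, \ldots, K, \tag{11} $$
where p̂ik is the conditional expectation in (6).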
When the sign constraints in A2 are imposed, the problem remains a convex quadratic programming problem, for which many optimization methods are available. In the protein application, we find that the unconstrained solutions in (11) almost always satisfy the sign constraints, indicating that A2 is compatible with the observed data structure.
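The unconstrained mean update can be sketched as a posterior-weighted average of the replicate-adjusted LIR vectors. The sketch assumes an array layout of shape (n, S, m) with NaN rows for unobserved replicates, scalar replication effects, and posterior weights p of shape (n, K) with the reference component in column 0; it does not enforce the sign constraints in A2, and the function name is hypothetical.

```python
import numpy as np

def update_means(y, alpha, p):
    """y: (n, S, m) LIR data with NaN rows for unobserved replicates.
    alpha: (S,) replication effects. p: (n, K) posterior probabilities.
    Returns (K, m) component means with the reference mean fixed at zero."""
    obs = ~np.isnan(y).any(axis=2)                      # (n, S) observed mask
    resid = np.where(obs[:, :, None], y - alpha[None, :, None], 0.0)
    sums = resid.sum(axis=1)                            # (n, m) per-protein sums
    counts = obs.sum(axis=1)                            # (n,) number of replicates
    K = p.shape[1]
    mus = np.zeros((K, y.shape[2]))                     # row 0 stays zero (A1)
    w = p[:, 1:]                                        # weights of the tail components
    mus[1:] = (w.T @ sums) / (w.T @ counts)[:, None]
    return mus

nan = np.nan
y = np.array([[[1.0, 1.0], [nan, nan]],
              [[-2.0, -2.0], [-4.0, -4.0]]])
p = np.array([[0.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
mus = update_means(y, np.zeros(2), p)
```

In the toy example the first protein is assigned entirely to the third component and the second protein to the second component, so the updated means are their respective replicate averages.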
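Now we deal with the estimation of the covariance matrices. The problem simplifies to minimizing
$$ \sum_{k=1}^{K} N_k \left\{ \log |\boldsymbol{\Sigma}_k| + \mathrm{tr}(\boldsymbol{\Sigma}_k^{-1} \mathbf{S}_k) \right\}, \tag{12} $$
where Nk = ∑i p̂ik|𝒮i| and Sk = Nk⁻¹ ∑i p̂ik ∑s∈𝒮i eisk eiskᵀ, with p̂ik the conditional expectation in (6) and eisk = yi·s − αs1 − μk − γi I(k = 1). Under A3, when Σ2 = diag{δ12, …, δm2} and Σ3 = diag{δ13, …, δm3}, where diag{·} denotes a diagonal matrix with the enclosed elements on its diagonal, all three covariance matrices admit explicit solutions:
$$ \hat{\boldsymbol{\Sigma}}_1 = \mathbf{S}_1, \qquad \hat{\delta}_{jk} = (\mathbf{S}_k)_{jj}, \quad j = 1, \ldots, m, \; k = 2, 3. \tag{13} $$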
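A sketch of the unconstrained covariance update is given next, assuming each component's weighted scatter matrix is formed from residuals that subtract the replication effect, the component mean and, for the reference component only, the mean shift. The array layout, the scalar replication effects and the function name are assumptions, and the eigenvalue-ratio condition A0 is not enforced here.

```python
import numpy as np

def update_covariances(y, alpha, mus, gamma, p):
    """y: (n, S, m) with NaN rows for unobserved replicates; alpha: (S,);
    mus: (K, m); gamma: (n, m) mean shifts; p: (n, K) posterior probabilities.
    Returns a full matrix for component 0 and diagonal matrices otherwise."""
    obs = ~np.isnan(y).any(axis=2)                     # (n, S)
    n, S, m = y.shape
    counts = obs.sum(axis=1)
    covs = []
    for k in range(p.shape[1]):
        shift = mus[k] + (gamma if k == 0 else 0.0)    # mean shift only for k = 0
        shift = np.broadcast_to(shift, (n, m))
        e = y - alpha[None, :, None] - shift[:, None, :]
        e = np.where(obs[:, :, None], e, 0.0)          # drop unobserved replicates
        w = p[:, k]
        N_k = (w * counts).sum()
        S_k = np.einsum("i,isj,isl->jl", w, e, e) / N_k
        covs.append(S_k if k == 0 else np.diag(np.diag(S_k)))
    return covs

y = np.array([[[1.0, 1.0]], [[-1.0, -1.0]], [[2.0, 2.0]], [[-2.0, -2.0]]])
p = np.full((4, 3), 1.0 / 3.0)
covs = update_covariances(y, np.zeros(1), np.zeros((3, 2)), np.zeros((4, 2)), p)
```

In the toy data the two coordinates are perfectly correlated; the reference covariance keeps the off-diagonal entries, while the two tail components retain only the variances.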
Recall that an ER condition A0 is also imposed to avoid a degenerate solution. In the protein application and our numerical studies, we have used c = 10, and it turns out that the “naive update” above in (13) always satisfies A0. However, in practice, it is possible that the above solution violates A0. A fast algorithm was proposed by Fritz et al. (2013) for solving (12) under A0, and here we briefly outline the method under the additional constraint A3. Let Σk = UkΔkUkᵀ be the spectral decomposition of Σk, where Δk is a diagonal matrix with eigenvalues δjk’s on its diagonal. Note that under A3, Σk = Δk, for k = 2, …, K. Also let S1 = V1D1V1ᵀ be the spectral decomposition of S1, the scatter matrix of the reference component in (12). When all the eigenvalues δjk are fixed, minimizing (12) becomes the estimation problem of the eigenvectors U1; it can be readily shown that the optimal solution is simply the eigenvectors V1 of S1. Then, when fixing U1 = V1, (12) becomes
| (14) |
where dj1’s are the eigenvalues of §1, but djk’s, k = 2, …, K, are the corresponding diagonal elements of §k due to A3; and the parameter space is a cone as Ωδ = {(δ11, …, δmK); δjk ≤ cδj′k′, for any 1 ≤ j, j′ ≤ m and 1 ≤ k, k′ ≤ K}. This problem can be solved by Dykstra’s alternating projection algorithm (Boyle and Dykstra, 1986), which, however, may be time consuming. Fast computation can be achieved by restricting the djk’s to take a specific form, such as certain thresholded versions of the djk’s; we refer to Fritz et al. (2013) for details. In either case, the update of the covariance matrices ensures that the M-step objective function in (9) is non-decreasing.
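As a rough illustration of how A0 can be checked and enforced, the sketch below clips the eigenvalues to an interval [t, c·t] and picks t by a crude grid search. The function names, the grid search, and the objective Σk nk Σj (log δjk + djk/δjk) are assumptions made for this example (the objective follows the usual form in the setting of Fritz et al. (2013)); it is not the authors' implementation and not the fast algorithm of that paper.

```python
import numpy as np

def satisfies_er(eigvals, c):
    """True if (largest eigenvalue) <= c * (smallest eigenvalue) over all components."""
    eigvals = np.asarray(eigvals, dtype=float)
    return eigvals.max() <= c * eigvals.min()

def truncate_eigenvalues(d, weights, c, n_grid=200):
    """Clip positive eigenvalues d (m x K) into [t, c*t], choosing t on a grid.

    weights: length-K array of hypothetical component weights n_k.
    Assumed objective: sum_k n_k * sum_j (log(delta_jk) + d_jk / delta_jk).
    """
    d = np.asarray(d, dtype=float)
    w = np.asarray(weights, dtype=float)
    if satisfies_er(d, c):
        return d.copy()          # the naive update is already feasible
    best, best_val = None, np.inf
    for t in np.geomspace(d.min(), d.max(), n_grid):
        delta = np.clip(d, t, c * t)
        val = np.sum(w * np.sum(np.log(delta) + d / delta, axis=0))
        if val < best_val:
            best, best_val = delta, val
    return best

# Example: two components, two dimensions, ratio bound c = 10.
d = np.array([[0.01, 2.0],
              [1.00, 3.0]])
delta = truncate_eigenvalues(d, weights=[10, 5], c=10)
```

Any clipped solution has ratio at most c by construction; only the choice of t affects how close it stays to the unconstrained eigenvalues.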
It remains to consider the optimization of the shift vectors γi, i = 1, …, n. The problem is separable in each γi,
After some algebra, the problem can be equivalently expressed in the following penalized regression form:
| (15) |
where
and denotes the square root matrix of so that . When there is no penalty, i.e., λ = 0, the least square solution is given by , the sample mean of the observations on the ith protein after adjusting for the replication effects. With the penalty in (4), i.e., the group ℓ0 penalty, the solution of (15) is explicitly given by a simple groupwise thresholding rule (She, 2012),
| (16) |
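The groupwise thresholding idea can be sketched generically as follows. This is only the basic hard-thresholding form (keep a vector when its Euclidean norm exceeds a threshold, otherwise set it to zero); the exact scaling used in (16), which depends on the model weights and the covariance of the reference component, is not reproduced here, and the function names are hypothetical.

```python
import numpy as np

def group_hard_threshold(z, lam):
    """Return z if its Euclidean norm exceeds lam, otherwise the zero vector."""
    z = np.asarray(z, dtype=float)
    return z if np.linalg.norm(z) > lam else np.zeros_like(z)

def threshold_rows(Z, lam):
    """Apply the same rule to every row of an n x m matrix of candidate shifts."""
    Z = np.asarray(Z, dtype=float)
    keep = np.linalg.norm(Z, axis=1) > lam
    return Z * keep[:, None]

# Example: the first row is small and is zeroed out; the second row is kept.
Z = np.array([[0.1, 0.1, 0.0, 0.0],
              [2.0, -2.0, 2.0, -2.0]])
shifts = threshold_rows(Z, lam=1.0)
```

Because a whole vector is either kept or removed, a protein is flagged as shifted in all strains at once or not at all, which matches the group structure of the penalty.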
We summarize the derived (generalized) EM algorithm in Algorithm 1. In the M-step, blockwise coordinate ascent iteratively optimizes each block of parameters while holding the others fixed at their most current solutions. The M-step optimization can be solved either fully or partially, i.e., by iterating the inner steps either until convergence or for only a few iterations. The latter approach leads to a generalized EM algorithm (Dempster et al., 1977), or more precisely an Expectation Conditional Maximization (ECM) algorithm (Meng and Rubin, 1993), which may be even faster than the standard EM. The penalized log-likelihood criterion in (3) never decreases along the iterations of the proposed algorithm, and thus its convergence is guaranteed. Because the mixture problem is non-convex, Algorithm 1 searches for a local solution of (3). In our experience, the algorithm is fast and stable and produces satisfactory solutions. In some applications it may be desirable to use other parameter constraints and penalty forms; we discuss these briefly in the Supplementary Materials.
Algorithm 1.
Multivariate Mixture Model with Sparse Mean Shift
| Initialize Θ(0) and Γ(0) (e.g., obtain Θ(0) from K-means clustering and set Γ(0) = 0). |
| Set l ← 0. |
| repeat |
| E-Step: obtain the conditional expectation of the complete-data log-likelihood Q(Θ, Γ | Θ(l), Γ(l)) in (7), i.e., compute pik by (6). |
| M-Step: update π by (8), and update the other parameters by (9) via blockwise coordinate ascent: |
| repeat |
| a. Update αs, s = 1, …, S, by (10); |
| b. Update μk, k = 1, …, K, by (11); |
| c. Update Σk, k = 1, …, K, by (13) if the solution satisfies A0; otherwise update Σk by solving (14); |
| d. Update γi, i = 1, …, n, by solving (15). |
| until convergence, or for a fixed number of inner iterations. |
| l ← l + 1. |
| until convergence, e.g., ||Θ(l+1) − Θ(l)||2/||Θ(l+1)||2 < ε with ε = 10−4. |
| output: the final estimates of Θ and Γ. |
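To make the E-step concrete, here is a minimal sketch under simplifying assumptions that are not in the paper: each protein has a single observation vector (already adjusted for replication effects), and the mean shift enters only the first (reference) component. The names `log_mvn_pdf`, `e_step`, and `gamma` are hypothetical.

```python
import numpy as np

def log_mvn_pdf(x, mean, cov):
    """Log-density of N(mean, cov) evaluated at each row of x (n x m)."""
    m = x.shape[1]
    L = np.linalg.cholesky(cov)
    z = np.linalg.solve(L, (x - mean).T)          # whitened residuals, m x n
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (m * np.log(2 * np.pi) + log_det + np.sum(z**2, axis=0))

def e_step(y, pis, mus, covs, gamma):
    """Posterior component probabilities; gamma (n x m) shifts component 0 only."""
    n, K = y.shape[0], len(pis)
    log_w = np.empty((n, K))
    for k in range(K):
        shift = gamma if k == 0 else 0.0
        log_w[:, k] = np.log(pis[k]) + log_mvn_pdf(y - shift, mus[k], covs[k])
    log_w -= log_w.max(axis=1, keepdims=True)     # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)

# Toy example with two strains and three components.
y = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, -3.0]])
pis = np.array([0.8, 0.1, 0.1])
mus = [np.zeros(2), np.array([-3.0, -3.0]), np.array([3.0, 3.0])]
covs = [np.eye(2)] * 3
p = e_step(y, pis, mus, covs, gamma=np.zeros_like(y))
```

Working on the log scale and subtracting the row maximum avoids underflow when a protein lies far from every component.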
The proposed Algorithm 1 works for any fixed tuning parameter λ. A solution path is obtained by fitting models over a sequence of λ values covering a whole spectrum of possible models, and the optimal solution then needs to be determined along the path. In practice, certain prior knowledge about the model structure may be available, which can be used to facilitate model selection. For example, it is expected that only a small proportion of proteins exhibit discordant intensity changes. This suggests that the main focus should be on sparse models, in which only a small fraction of the mean-shift vectors are non-zero. More generally, we use a corrected Akaike information criterion (AICc) (Cavanaugh, 1997; Flynn et al., 2013) for tuning parameter selection,
| (17) |
where Θ̂(λ) and Γ̂(λ) are the regularized estimates of the parameters at penalty level λ, δ1 is the degrees of freedom of the three-component mixture model without mean shift, which does not depend on the amount of regularization, and δ2(λ) is the degrees of freedom due specifically to the inclusion of the mean-shift vectors, which depends on λ through γ̂i(λ). In our application, with the model constraints A1–A3 imposed, we estimate δ1 by counting the number of free parameters in the αs’s, μ2, μ3, the Σj’s, and the πk’s:
As for δ2, following Yuan and Lin (2006), we estimate it by
which ranges from 0 to nm. For simplicity, we have omitted the argument λ in γ̂i(λ) and other similar expressions. We note that μ1 = 0 from A1.
In all our numerical studies, we use Bayes’ rule for clustering based on the probabilities pik produced by the model selected by AICc, i.e., k̂i = arg maxk pik, for i = 1, …, n. The proteins of discordant change are easily distinguished from the reference distribution by checking whether ||γ̂i|| = 0 holds.
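The assignment rule just described can be sketched in a few lines. The array names `p` (n x K posterior probabilities) and `gamma_hat` (n x m estimated shift vectors) are hypothetical, as is the numerical tolerance.

```python
import numpy as np

def assign_clusters(p, gamma_hat, tol=0.0):
    """Bayes-rule labels (1..K) and a flag for proteins with a non-zero shift."""
    labels = np.argmax(p, axis=1) + 1             # components numbered from 1
    discordant = np.linalg.norm(gamma_hat, axis=1) > tol
    return labels, discordant

# Toy example: three proteins, three components, four strains.
p = np.array([[0.90, 0.05, 0.05],
              [0.10, 0.10, 0.80],
              [0.70, 0.20, 0.10]])
gamma_hat = np.array([[0.0, 0.0, 0.0, 0.0],
                      [0.0, 0.0, 0.0, 0.0],
                      [1.5, -1.5, 1.5, -1.5]])
labels, discordant = assign_clusters(p, gamma_hat)
```

In this toy example the third protein stays in the reference component but is flagged as discordant because its estimated shift vector is non-zero.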
We have shown that our method closely relates to the robust mixture model approach proposed by Fraley and Raftery (1998), in which an additional mixture component is introduced to account for the noise, and the trimmed clustering approach proposed by García-Escudero et al. (2008), in which the “worst” points are trimmed. Therefore, our method provides a new perspective on conducting robust clustering through the celebrated regularized estimation, in which the determination of the trimming proportion naturally translates to the problem of tuning parameter selection. Moreover, our formulation provides more flexibility in incorporating parameter constraints and controlling for the appearance of extreme observations based on application needs; in the protein application, we are able to restrict the anomaly detection only around the reference component, in order to account for the discordant proteins. We have also shown that the problem of unbounded likelihood in our model setup is resolved under A0 and some mild conditions similar to García-Escudero et al. (2008), so the solution of the proposed method is well defined. To save space, the details and all the proofs are provided in the Supplementary Materials.
4 Simulation
4.1 Simulation Setup
In our simulation setup, we set n = 2000, m = 4, and S = 3 to mimic the protein application. The data are generated from a three-component mixture model. We set μ1 = 0, so that the first component corresponds to the reference distribution. The entries of μ2 are generated from Unif(−2.5, −1.5) and the entries of μ3 are generated from Unif(1.5, 2.5), where Unif(a, b) denotes the uniform distribution on the interval [a, b]. The covariance matrix Σ1 has a compound symmetry structure with diagonal elements 1 and off-diagonal elements 0.5, and we let Σ2 = Σ3 = 2I. These parameter values are similar to the estimates from the actual protein application, described in Section 5.
We assume there are three replicated experiments, i.e., S = 3, and the replication effects are given by α = (−0.5, 0, 0.5). For each i = 1, …, n, we first generate the component label zi according to the mixing proportions π = (0.8, 0.1, 0.1). When zi = 2, we generate S independent samples from N(μ2 + αs, Σ2), for s = 1, …, S. Similarly, when zi = 3, S independent samples are generated from N(μ3 + αs, Σ3), for s = 1, …, S. When zi = 1, i.e., when the protein comes from the reference distribution, there is an η% probability that the data experience a mean shift. Specifically, when the mean shift occurs, instead of generating S samples from N(μ1 + αs, Σ1), we generate them from N(μ1 + γi + αs, Σ1), where γi ∈ ℝ4 is randomly picked as one of the six distinct permutations of (−β, −β, β, β), and β is a constant indicating the magnitude of the shift. As such, these mean-shifted observations represent discordant change from the reference distribution. Finally, for each protein, there is a 50% chance that one of its three replicated observations is randomly removed, so that the replicated data become incomplete, similar to the actual protein data. We consider different shift proportions, i.e., η% ∈ {5%, 10%, 15%}, and different shift magnitudes, i.e., β ∈ {1.0, 1.5, 2.0, 2.5}. The simulation under each setting is repeated 500 times.
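The data-generating process described above can be sketched as follows. This is an illustrative reading of the setup, not the authors' simulation code: the function name and the 0-based component coding (0 = reference) are hypothetical, missing replications are encoded as NaN, and the covariance of the two tail components is assumed to be 2I.

```python
import numpy as np

def simulate(n=2000, eta=0.10, beta=1.5, seed=0):
    """Generate an n x 4 x 3 array of replicated data, true labels, and true shifts."""
    rng = np.random.default_rng(seed)
    m, S = 4, 3
    alpha = np.array([-0.5, 0.0, 0.5])                  # replication effects
    pi = np.array([0.8, 0.1, 0.1])                      # mixing proportions
    mus = np.vstack([np.zeros(m),
                     rng.uniform(-2.5, -1.5, m),
                     rng.uniform(1.5, 2.5, m)])
    sigma1 = np.full((m, m), 0.5) + 0.5 * np.eye(m)     # compound symmetry
    covs = [sigma1, 2.0 * np.eye(m), 2.0 * np.eye(m)]   # assumed 2I for the tails
    z = rng.choice(3, size=n, p=pi)                     # 0 = reference component
    shifted = (z == 0) & (rng.random(n) < eta)
    base = np.array([-beta, -beta, beta, beta])
    gamma = np.zeros((n, m))
    for i in np.flatnonzero(shifted):
        gamma[i] = rng.permutation(base)                # one of six distinct vectors
    y = np.full((n, m, S), np.nan)
    for i in range(n):
        for s in range(S):
            mean = mus[z[i]] + gamma[i] + alpha[s]
            y[i, :, s] = rng.multivariate_normal(mean, covs[z[i]])
        if rng.random() < 0.5:                          # drop one replication
            y[i, :, rng.integers(S)] = np.nan
    return y, z, gamma

y, z, gamma = simulate(n=300, eta=0.5, beta=1.5, seed=1)
```

A uniformly random permutation of (−β, −β, β, β) yields each of the six distinct sign patterns with equal probability, which matches the description of the shift vectors.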
4.2 Performance Measures and Methods
We compare the estimation and clustering performance of various methods. To evaluate the estimation performance, we report the mean squared error (MSE) of estimated mean vectors, i.e., , and the MSE of estimated covariance matrices, i.e, , averaged over all the replications. The MSEs of estimated replication effects or the mixing probabilities reveal a similar message, so we omit them for simplicity. For the clustering performance, we focus on the detection of the two components of concordant change. We report the false positive rate (FPR), defined as the percentage of incorrectly identified proteins in all the proteins of no concordant change, and the false negative rate (FNR), defined as the percentage of undetected proteins in all the true proteins of concordant change, averaged over all the replications.
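The two clustering measures can be computed as in the sketch below. It assumes components are labelled 1 (reference), 2 and 3 (concordant change), and it pools the two concordant components into a single "detected" class; whether a protein assigned to the wrong tail component should count as detected is not specified in the text, so this pooling is an assumption.

```python
import numpy as np

def fpr_fnr(true_labels, pred_labels, concordant=(2, 3)):
    """False positive and false negative rates (as fractions) for concordant change."""
    true_pos = np.isin(true_labels, concordant)
    pred_pos = np.isin(pred_labels, concordant)
    fpr = np.mean(pred_pos[~true_pos])      # flagged among truly non-concordant
    fnr = np.mean(~pred_pos[true_pos])      # missed among truly concordant
    return fpr, fnr

true_labels = np.array([1, 1, 1, 1, 2, 3])
pred_labels = np.array([1, 2, 1, 1, 2, 1])
fpr, fnr = fpr_fnr(true_labels, pred_labels)
```

Proteins with a discordant mean shift belong to the reference component here, so flagging one of them as concordant counts toward the FPR.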
Several mixture modeling and clustering methods are compared with our tailored penalized normal mixture model approach (PenN-Mix). We first consider a commonly used univariate mixture model approach (Uni-Mix): a three-component normal mixture model with fixed μ1 = 0 is fitted to the data from each strain separately, and each pair of tail components is used to detect the proteins of either increased or decreased intensity levels in that strain. The set of proteins of concordant change is then identified as the intersection of the sets identified from these m univariate mixture analyses. For multivariate methods, we consider the constrained normal mixture model without mean-shift (PenN-Mix(0)), K-means clustering (K-Means), a multivariate normal mixture model (N-Mix), a trimmed multivariate normal mixture model (TrimN-Mix), and a multivariate t-mixture model (T-Mix). Except for PenN-Mix(0), the existing implementations of these methods do not directly handle replicated and incomplete data, so they are applied to averaged data: the original data are first adjusted for the replication effects by centering within replication, and the adjusted data are then averaged over the replications. K-Means is a commonly used model-free clustering method; since it does not directly produce covariance estimates, we estimate them by the within-cluster sample covariance matrices. We also use the K-Means solution to initialize our PenN-Mix method. The TrimN-Mix method, implemented in the R package tclust, is designed to capture the major cluster components in the presence of data contamination; the proportion of trimmed observations needs to be specified, for which we use the true proportion of discordant proteins. The T-Mix and N-Mix methods are implemented in the R package teigen, in which a parsimonious eigen-decomposed covariance structure is applied.
The T-Mix method estimates the degrees of freedom values of the multivariate t components and allows them to vary. In mixture modeling, the “label-switching” problem is well known. In our numerical study, we determine the component labels by comparing the fitted components with the true components to minimize the number of mismatched proteins. Finally, we also include an oracle method, i.e., the three-component normal mixture model presented in Section 3.1 is fitted using the clean data without the proteins of discordant change.
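One simple way to resolve label switching in a simulation, where the true labels are known, is to try every relabelling of the fitted components and keep the one with the fewest mismatches. The sketch below assumes labels coded 0, …, K − 1 and a small K, so that enumerating all K! permutations is cheap; the function name is hypothetical.

```python
from itertools import permutations
import numpy as np

def align_labels(true_labels, fitted_labels, K=3):
    """Relabel fitted components to minimize the number of mismatched proteins."""
    true_labels = np.asarray(true_labels)
    fitted_labels = np.asarray(fitted_labels)
    best_perm, best_mismatch = None, None
    for perm in permutations(range(K)):
        relabeled = np.array(perm)[fitted_labels]   # fitted label j becomes perm[j]
        mismatch = int(np.sum(relabeled != true_labels))
        if best_mismatch is None or mismatch < best_mismatch:
            best_perm, best_mismatch = perm, mismatch
    return np.array(best_perm)[fitted_labels], best_mismatch

true_labels = [0, 0, 1, 2, 2]
fitted_labels = [2, 2, 0, 1, 1]
aligned, n_mismatch = align_labels(true_labels, fitted_labels)
```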
4.3 Simulation Results
Table 1 reports the simulation results for β = 1.5 and η% ∈ {5%, 10%, 15%}. The results for other settings and some graphical displays are given in the Supplementary Materials. The performance of PenN-Mix is stable across all scenarios and is close to that of the oracle procedure. The method outperforms all the other alternative methods in both model estimation and protein discovery. The other two methods based on the normal distribution, N-Mix and PenN-Mix(0), may work only when both the shift magnitude and the shift proportion are small; otherwise they lead to much higher FNR/FPR. Comparing N-Mix, PenN-Mix(0) and PenN-Mix shows that the gain from using PenN-Mix comes not only from the tailored parameter constraints but also from the sparse mean-shift, which plays a key role in alleviating the influence of the discordant proteins. The K-Means method has very high FPR in general. Owing to the heavy-tailed t distribution, the T-Mix method performs well when the shift proportion and the shift magnitude are low; otherwise the method fails and its estimation becomes quite unstable, as seen from the enlarged standard errors. The TrimN-Mix method is the most competitive with PenN-Mix among the alternative methods. It performs well in estimating the mean centers and has very low FPR. The main difference between PenN-Mix and TrimN-Mix is that the latter tries to capture data contamination uniformly over the entire mixture data cloud, and consequently it consistently leads to higher FNR because many concordant proteins at the tails may also be trimmed. In contrast, PenN-Mix is more flexible in restricting the potential mean shift to the reference distribution.
Table 1.
Simulation: the magnitude of the mean-shift is set to β = 1.5 and the probability that the data experiences a mean shift is varied, η% ∈ {5%, 10%, 15%}.
| Measure | Statistic | Oracle | PenN-Mix | PenN-Mix(0) | Uni-Mix | K-Means | T-Mix | N-Mix | TrimN-Mix |
|---|---|---|---|---|---|---|---|---|---|
| η% = 5%, β = 1.5 | | | | | | | | | |
| MSE(μ) | mean | 0.005 | 0.022 | 0.107 | 0.263 | 0.057 | 0.218 | 0.333 | 0.007 |
| | sd | 0.002 | 0.015 | 0.038 | 0.204 | 0.109 | 0.083 | 0.092 | 0.004 |
| MSE(Σ) | mean | 0.003 | 0.017 | 0.066 | 0.252 | 0.299 | 0.351 | 0.346 | 0.325 |
| | sd | 0.001 | 0.010 | 0.023 | 0.016 | 0.015 | 0.095 | 0.080 | 0.017 |
| FNR | mean | 2.38% | 1.80% | 2.12% | 65.09% | 0.82% | 2.49% | 3.08% | 9.08% |
| | sd | 0.96% | 0.80% | 0.95% | 8.59% | 0.52% | 1.25% | 1.58% | 2.61% |
| FPR | mean | 0.44% | 2.80% | 5.40% | 0.18% | 8.24% | 7.81% | 7.67% | 0.90% |
| | sd | 0.19% | 0.98% | 0.65% | 0.20% | 5.97% | 0.91% | 0.83% | 0.33% |
| η% = 10%, β = 1.5 | | | | | | | | | |
| MSE(μ) | mean | 0.004 | 0.019 | 0.328 | 0.409 | 0.090 | 0.877 | 0.877 | 0.008 |
| | sd | 0.002 | 0.013 | 0.082 | 0.225 | 0.267 | 0.893 | 0.879 | 0.004 |
| MSE(Σ) | mean | 0.003 | 0.014 | 0.151 | 0.244 | 0.288 | 0.666 | 0.646 | 0.350 |
| | sd | 0.001 | 0.009 | 0.034 | 0.016 | 0.015 | 0.680 | 0.693 | 0.018 |
| FNR | mean | 2.50% | 2.03% | 2.57% | 60.78% | 0.83% | 9.40% | 8.74% | 13.77% |
| | sd | 1.06% | 0.83% | 1.14% | 9.70% | 0.49% | 15.94% | 15.03% | 3.54% |
| FPR | mean | 0.41% | 2.94% | 10.61% | 0.29% | 10.11% | 12.80% | 12.89% | 0.74% |
| | sd | 0.17% | 1.08% | 0.81% | 0.28% | 12.90% | 1.83% | 1.72% | 0.32% |
| η% = 15%, β = 1.5 | | | | | | | | | |
| MSE(μ) | mean | 0.004 | 0.015 | 0.525 | 0.527 | 0.113 | 1.374 | 1.049 | 0.017 |
| | sd | 0.003 | 0.010 | 0.094 | 0.245 | 0.319 | 0.990 | 0.861 | 0.096 |
| MSE(Σ) | mean | 0.003 | 0.011 | 0.186 | 0.240 | 0.278 | 0.685 | 0.673 | 0.375 |
| | sd | 0.001 | 0.006 | 0.033 | 0.016 | 0.015 | 0.556 | 0.512 | 0.017 |
| FNR | mean | 2.55% | 2.26% | 2.88% | 55.10% | 0.87% | 21.43% | 10.96% | 19.21% |
| | sd | 1.16% | 0.95% | 1.39% | 9.24% | 0.47% | 19.18% | 16.07% | 4.07% |
| FPR | mean | 0.43% | 2.77% | 15.53% | 0.58% | 11.72% | 30.32% | 15.34% | 1.31% |
| | sd | 0.18% | 1.00% | 0.86% | 0.59% | 16.15% | 30.56% | 2.70% | 6.85% |
5 Identification of Proteins of Concordant Intensity Change
We apply the proposed multivariate mixture modeling approach to analyze the protein LIR data, for identifying proteins with concordantly changed intensity levels in all four disease-producing strains when compared to the non-disease-producing strain. Both the group ℓ0 penalty and the group ℓ1 penalty are considered, and the AICc criterion in (17) is used for tuning parameter selection and model comparison. The AICc values of the best selected models from using ℓ0, ℓ1 penalties and the model without mean-shift penalization are 62528.81, 64911.00, and 68431.83, respectively. As expected, the mean-shift penalization indeed significantly improves model estimation, and the ℓ0 penalization performs better than the ℓ1. These results agree with our simulation studies. Therefore, the final model is chosen as the fitted model using the group ℓ0 penalty.
The parameter estimates of the mixture model support the presumed model structures and reveal some interesting general behaviors of the proteins. The estimated mixing proportions are π̂ = (0.859, 0.080, 0.061), confirming that only a small fraction of proteins experienced concordant intensity changes across all four disease-producing strains. The estimated replication effects are α̂ = (−0.302, −0.336, −0.342), which are small and similar to each other. We have fixed μ1 = 0 in model estimation (see A1), and the other two mean vectors are estimated as
This shows that the degree of intensity change varies across the four strains: for all the proteins belonging to component 2, the mean folds of decrease in their intensity levels (on the original scale) are 3.177, 7.402, 6.448 and 15.461 in the four disease-producing strains compared to the non-disease-producing strain; while for component 3, the mean folds of increase in their intensity levels (on the original scale) are 6.921, 9.189, 3.460, and 10.256. The estimated component variances are (0.287, 0.527, 0.523, 0.524), (2.103, 0.638, 2.255, 2.319), and (1.731, 1.526, 2.176, 2.606), and the estimated correlation matrix for component 1 is
The variabilities in component 1 are smaller than those in the two tail components, which is partly due to the adjustments of the extreme observations in component 1 from the sparse mean shift. Disease-producing Strains 2, 3 and 4 are moderately or highly correlated, but they appear to be less correlated with disease-producing Strain 1. This finding corresponds well with the fact that disease-producing Strains 2, 3 and 4 all have high disease-producing capabilities while that of strain 1 is moderate (Smyth and Martin, 2010). The results confirm that multivariate clustering can be beneficial here as it allows information-borrowing from correlated data.
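The fold changes quoted above for components 2 and 3 are on the original intensity scale, while the model is fitted on the log-intensity-ratio scale. Assuming natural logarithms (consistent with the description of Table 2, where fold changes are computed as exponentials of average LIR values), the implied mean vectors can be recovered approximately as follows. The variable names are hypothetical, and the results are approximate because the quoted folds are rounded.

```python
import numpy as np

# Mean fold changes quoted in the text, for Strains 1 to 4.
fold_decrease = np.array([3.177, 7.402, 6.448, 15.461])   # component 2
fold_increase = np.array([6.921, 9.189, 3.460, 10.256])   # component 3

# A k-fold decrease corresponds to a mean LIR of -log(k); an increase to +log(k).
mu2_approx = -np.log(fold_decrease)
mu3_approx = np.log(fold_increase)

print(np.round(mu2_approx, 3))
print(np.round(mu3_approx, 3))
```

All entries of the first vector are negative and all entries of the second are positive, in line with the sign constraints in A2.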
We now turn to the estimated cluster patterns, focusing mostly on the proteins with concordantly increased intensity levels. Figure 3 shows the cluster patterns obtained from our tailored penalized normal mixture model approach either with mean-shift (PenN-Mix) or without mean-shift (PenN-Mix(0)), a multivariate t mixture model (T-Mix) with either three components or with the number of components selected by AIC, and a trimmed multivariate normal mixture model (TrimN-Mix) with 10% trimming. (The 3D plots can only show patterns of three strains at a time; the other views deliver a similar message and hence are omitted.) PenN-Mix(0) identifies 377 proteins, whereas the proposed PenN-Mix method identifies only 151. Two proteins are selected by PenN-Mix only, namely, H1CPH9 and Q0TP85, so 149 proteins are selected by both methods. Among the 228 proteins identified by PenN-Mix(0) but not by PenN-Mix, most have both positive and negative LIR values. For example, 199 proteins have more than 25% negative values among their LIR values, indicating that they are most likely not proteins of concordant change and hence are false positives. Among the 228 proteins, only 12 have all positive observed LIR values; however, their mean (median) fold-change is 3.23 (3.24), which is much lower than the 9.02 (5.92) of the 151 proteins identified by PenN-Mix. Therefore, while these 12 proteins could be proteins of concordant change, they are boundary cases. Since neither T-Mix nor TrimN-Mix allows replicated data or parameter constraints, we use the LIR values averaged over the three replications to fit an unconstrained model, and the mean vectors and the cluster patterns estimated by our proposed PenN-Mix method are used for initialization. It is clear from Figure 3 that neither the T-Mix nor the TrimN-Mix cluster pattern meets our needs in this problem.
Figure 3.
Protein relative intensity analysis: 3D plots of protein clustering patterns by various methods. The top panels show the results from the proposed multivariate normal mixture models with sparse mean shift (left) and without sparse mean shift (right). The bottom panels show the results from a multivariate t mixture model with predetermined three components (left) and with AIC-selected five components (middle), and a three-component trimmed normal mixture model (right). The strains 2, 3, and 4 are used in constructing the 3D plots. The gray line in each panel indicates the 45° direction of concordant change. The proteins clustered to different components are indicated by their labels in number. In the top-left panel, the “*” labels represent the proteins with non-zero sparse mean shift in the proposed method, while in the bottom-right panel, the “*” labels represent the proteins that are trimmed in the trimmed normal mixture model.
In Figure 4, a heatmap of the LIR values of some randomly selected proteins is shown, categorized based on the clustering results from our PenN-Mix approach. A strong contrast between the proteins in components 2 and 3 is apparent. The proteins of no intensity change have very light color in general as their LIR values are relatively close to the origin. For the proteins with non-zero mean shift, the positive and negative values appear to be mixed and there exist some large entries, which indicate discordant intensity changes.
Figure 4.
Protein relative intensity analysis: heatmap of LIR values of proteins, categorized based on the proposed multivariate normal mixture model with sparse mean shift. The four blocks from left to right show proteins of no intensity change, proteins exhibiting mean shift from the reference, proteins of concordantly decreased intensity levels and proteins of concordantly increased intensity levels; 100 randomly selected proteins are shown in each category. The horizontal lines correspond to strains 1 to 4, from top to bottom; in each horizontal block, the LIR vectors from the three replications for each protein are shown as a single long column vector, with the missing replications indicated by the gray color. All the original LIR values are divided by their maximum absolute value, so that the scaled values are between −1 and 1.
In practice, the proteins identified from the mixture model analysis can be further ranked and selected to facilitate subsequent biological investigations. The inference can be based on the estimated conditional probabilities. Alternatively, since the effect size, i.e., the fold change of protein intensity, is more informative to biologists, here we rank the proteins identified by PenN-Mix by the overall fold-change in intensity level comparing the four disease-producing strains to the non-disease-producing strain, based on the adjusted average LIR values Σs∈𝒮i(yi,j,s − α̂s)/|𝒮i|, for j = 1, …, 4, where the α̂s’s are the replication effects estimated by PenN-Mix. Table 2 reports the 15 proteins with the largest fold-changes in each of the two groups, i.e., those with concordantly increased and those with concordantly decreased intensities compared to the non-disease-producing strain. Their estimated conditional probabilities of belonging to the corresponding component of concordant change are all very close to 1, and hence are omitted. The present work has identified proteins that serve as potential targets for further investigation of their role in the virulence of C. perfringens with respect to NE in poultry. To our knowledge, most of these proteins have not been previously considered as potential virulence factors in NE in poultry. For example, three of the listed proteins with larger fold change (Table 2) have functions related to adhesion/binding (H1CNB6, B1RBR0, G5DS62). Adhesion/binding proteins of this kind are implicated in the virulence of other bacteria such as Streptococcus, Clostridium difficile (Wren, 1991; Banas et al., 1990; Ferretti et al., 1987) and Staphylococcus aureus (Deivanayagam et al., 2000). The adhesion/binding mechanism may therefore be crucial for the virulence of C. perfringens with respect to NE in poultry. In this context, it is interesting that there is independent evidence that the ability of C. perfringens to bind was associated with virulence with respect to NE (Martin and Smyth, 2010). 
In vitro and in vivo studies (e.g., gene knockout studies) would then be required to establish these proteins as virulence factors. Once virulence factors are definitively identified, these can become targets for development of control strategies based on vaccination.
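The ranking quantity used for Table 2 can be sketched as follows. The sketch assumes the data are stored as an n x m x S array with NaN marking a missing replication (a storage convention chosen for this example) and that `alpha_hat` holds the estimated replication effects. Reporting the fold change as the exponential of the absolute overall mean, so that decreases appear as positive folds, mirrors how Table 2 appears to present them and is an assumption.

```python
import numpy as np

def overall_fold_change(y, alpha_hat):
    """Adjusted per-strain mean LIR, overall mean LIR, and overall fold change."""
    adjusted = y - np.asarray(alpha_hat, dtype=float)[None, None, :]
    strain_means = np.nanmean(adjusted, axis=2)   # average over observed replications
    overall = strain_means.mean(axis=1)           # average over the m strains
    fold = np.exp(np.abs(overall))                # natural-log scale assumed
    return strain_means, overall, fold

# Toy example: two proteins, four strains, three replications.
alpha_hat = np.array([-0.3, -0.3, -0.3])
y = np.empty((2, 4, 3))
y[0] = 1.0 + alpha_hat                            # LIR of +1 in every strain
y[1] = -2.0 + alpha_hat                           # LIR of -2 in every strain
y[1, :, 2] = np.nan                               # third replication missing
strain_means, overall, fold = overall_fold_change(y, alpha_hat)
ranking = np.argsort(-fold)                       # largest fold change first
```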
Table 2.
Protein relative intensity analysis: identified proteins of concordantly increased or decreased intensities from the multivariate normal mixture model with sparse mean shift. Columns 1–2 give gene names and UniProt IDs. Columns 3–6 report the average log-intensity ratio for each protein in each disease-producing strain, adjusted and averaged over replications, computed as Σs∈𝒮i (yi,j,s − α̂s)/|𝒮i|, for j = 1, …, 4, where the α̂s’s are the replication effects estimated by the PenN-Mix method. Column 7 reports the overall average log-intensity ratio for each protein, averaged over replications and strains (and its s.d. in Column 8), computed as {Σj Σs∈𝒮i(yi,j,s − α̂s)}/(m|𝒮i|). Column 9 reports the overall fold-change on the original scale of the intensity, computed as the exponential of the values in Column 7. For each of the two groups, i.e., proteins of concordantly increased and of concordantly decreased intensities, the 15 proteins with the largest overall fold-changes are shown.
| Gene name | UniProt ID | Strain 1 mean | Strain 2 mean | Strain 3 mean | Strain 4 mean | Overall mean | Overall s.d. | Overall fold |
|---|---|---|---|---|---|---|---|---|
| Concordantly increased intensities | | | | | | | | |
| HMPREF9476_02666 | H1CVU5 | 5.06 | 3.74 | 4.81 | 5.34 | 4.74 | 0.70 | 113.90 |
| HMPREF9476_00037 | H1CNB6 | 3.45 | 1.90 | 5.44 | 4.74 | 3.88 | 1.53 | 48.47 |
| AC1_0754 | B1RBR0 | 2.54 | 1.39 | 5.54 | 5.18 | 3.66 | 2.00 | 38.88 |
| cna pBeta2_00013 | G5DS62 | 5.26 | 3.42 | 2.64 | 3.11 | 3.61 | 1.27 | 36.81 |
| HMPREF9476_00044 | H1CNC3 | 1.60 | 1.66 | 5.44 | 5.23 | 3.48 | 2.14 | 32.46 |
| CPF_1489 | Q0TR08 | 3.40 | 3.45 | 3.04 | 3.15 | 3.26 | 0.48 | 26.09 |
| CPF_2364 | Q0TNK3 | 3.77 | 2.61 | 4.61 | 1.86 | 3.21 | 1.71 | 24.79 |
| HMPREF9476_00886 | H1CQR5 | 0.71 | 3.96 | 4.51 | 3.62 | 3.20 | 1.63 | 24.48 |
| AC1_0760 | B1RBV1 | 3.18 | 0.44 | 4.63 | 4.07 | 3.08 | 1.71 | 21.73 |
| HMPREF9476_00045 | H1CNC4 | 4.17 | 1.50 | 1.30 | 5.21 | 3.04 | 1.90 | 20.98 |
| AC7_1062 | B1RMG8 | 0.30 | 4.87 | 0.94 | 6.06 | 3.04 | 2.65 | 20.95 |
| HMPREF9476_02600 | H1CVM9 | 4.51 | 3.12 | 1.00 | 3.24 | 2.97 | 1.39 | 19.40 |
| AC1_2779 | B1RA50 | 3.39 | 3.25 | 1.15 | 4.06 | 2.96 | 1.17 | 19.33 |
| CJD_1615 | B1V549 | −0.24 | 2.92 | 2.18 | 6.74 | 2.90 | 2.89 | 18.14 |
| mtnN AC7_1301 | B1RRI1 | 3.59 | 1.82 | 1.99 | 3.92 | 2.83 | 1.07 | 16.91 |
| Concordantly decreased intensities | | | | | | | | |
| CPC_2588 | B1BMX8 | −4.34 | −3.46 | −6.52 | −7.36 | −5.42 | 1.70 | 225.27 |
| AC1_A0288 | B1R9D0 | −4.22 | −1.97 | −6.80 | −8.05 | −5.26 | 2.71 | 192.12 |
| AC5_0521 | B1RDD4 | −4.24 | −3.01 | −5.94 | −5.96 | −4.79 | 1.38 | 120.16 |
| pyrF AC1_1576 | B1R5U6 | −4.34 | −3.01 | −4.20 | −6.06 | −4.40 | 1.26 | 81.73 |
| comEC | Q4ZFT0 | −4.22 | −2.72 | −5.40 | −5.21 | −4.39 | 1.28 | 80.39 |
| HMPREF9476_02060 | H1CUB1 | −3.73 | −2.29 | −4.94 | −6.59 | −4.38 | 1.83 | 80.20 |
| pmbA | Q8XNJ3 | −3.98 | −2.83 | −5.69 | −5.03 | −4.38 | 1.25 | 79.95 |
| AC3_2280 | B1BW51 | −3.74 | −2.50 | −5.53 | −5.66 | −4.36 | 1.52 | 78.02 |
| CJD_1905 | B1V1K7 | −4.48 | −3.25 | −1.85 | −5.64 | −3.80 | 1.54 | 44.89 |
| HA1_04034 | H7CTN9 | −3.77 | −2.83 | −4.36 | −3.86 | −3.71 | 0.64 | 40.68 |
| prtP CPR_0968 | Q0SUB6 | −3.29 | −2.41 | −4.20 | −4.88 | −3.70 | 2.06 | 40.31 |
| AC3_1208 | B1BQF1 | −3.83 | −1.95 | −4.98 | −3.86 | −3.65 | 1.26 | 38.60 |
| HMPREF9476_01453 | H1CSD2 | −3.02 | −2.83 | −3.86 | −4.45 | −3.54 | 0.94 | 34.38 |
| murF AC5_A0061 | B1RKF8 | −3.48 | −2.82 | −3.74 | −3.88 | −3.48 | 0.50 | 32.43 |
| CPC_0476 | B1BGT0 | −2.83 | −2.68 | −4.32 | −3.95 | −3.45 | 1.13 | 31.33 |
6 Discussion
Typically, detecting differential proteins involves a two-group comparison, in which one tests for a treatment effect relative to a control group. In this paper, however, we discussed a different setting, where the “treatment group” consists of four different strains that are expected to be correlated. To the best of our knowledge, no other method tackles this setting directly. Urfer et al. (2006) described a simple approach which relies on testing one protein at a time (after a suitable normalization step), and then correcting for multiple testing. Their approach does not induce information-borrowing across proteins, nor does it account for the correlation between the disease-producing strains. To apply it to the setting discussed in this paper, one would have to compare each strain to the control group, and then find which proteins are differential in all four comparisons. Our simulations show that this approach lacks power. A more general approach appears in Oberg et al. (2008), who used an ANOVA model that allows for multiple treatment groups and controls for peptide effects. However, the treatments are assumed to be independent.
Our approach is related to Booth et al. (2011) which also uses a mixture model approach. Their model is applied to the spectral counts, and it uses a Bayesian modeling and inference approach. Thus, like our method, the Booth et al. (2011) method does not require an adjustment for multiple testing, and it induces information-borrowing across proteins, which leads to greater statistical power. However, it is only applicable to the two-group scenario, and in order to apply it to the situation described here one would have to either perform four two-group comparisons and combine the results, or combine the four disease-inducing strains into one group.
Our multivariate mixture model approach allows us to borrow information across proteins and strains and to account for the correlation between treatment groups. This results in increased power and accuracy. More importantly, we are able to eliminate nuisance proteins whose relative abundance levels differ from the control for only a subset of the strains. Thus, our method is tailored to situations in which the objective is to detect proteins of concordant change. Although the formulation is motivated by the observed data, we believe it is relevant to a broader set of experiments and can be adapted to detect different patterns of change between the groups.
There are many directions for future research. Depending on the needs in applications, we can extend the model to other mixture distributions, such as a multivariate t mixture with a mean shift. We will consider how to extend the method to handle a larger number of strains. In principle, assumptions A0–A3 and the penalization form make the model parsimonious, and the estimation is not expected to be more difficult, except that additional regularization may be desirable in estimating the association structure among the strains. Another useful extension is to implement a “repeated measures” model for the analysis of time-dependent protein expression. In our approach, the likelihood function is built only from the observed intensity profiles, so a protein with missing replications naturally contributes less to the likelihood, and in turn there is more uncertainty associated with the inference of its cluster membership. When more information regarding the missingness mechanism is available, explicit modeling of the missingness could be more beneficial. Finally, the proposed mixture model with sparse mean shift can be extended to the robust mixture regression setup, in order to incorporate additional knowledge available on the proteins, e.g., certain covariate information, grouping structure or association pattern among the proteins.
Supplementary Material
Acknowledgments
Kun Chen’s research was partially supported by the National Science Foundation grant DMS-1613295 and the National Institutes of Health (NIH) grant #U01HL114494. Haim Bar’s research was partially supported by the National Science Foundation grant DMS-1612625. M.-H. Chen’s research was partially supported by NIH grants #GM70335 and #P01CA142538. The authors gratefully acknowledge funding from the U. S. Poultry & Egg Association which enabled the generation of the proteomics data used in this study (Project #F052). The authors are grateful to the Editor, the Associate Editor, and the three referees for their valuable comments and suggestions, which have led to significant improvement of the paper.
Footnotes
The online Supplementary Materials include an R implementation of the proposed approach, the log-intensity dataset, and a document covering the use of other constraint types and penalty forms, the connection to trimmed likelihood clustering, the remedy for the unbounded likelihood problem, more details on data generation, and additional simulation results.
References
- Allison DB, Gadbury GL, Heo M, Fernández JR, Lee CK, Prolla TA, Weindruch R. A mixture model approach for the analysis of microarray gene expression data. Computational Statistics & Data Analysis. 2002;39:1–20.
- Andrews JL, McNicholas PD, Subedi S. Model-based classification via mixtures of multivariate t-distributions. Computational Statistics & Data Analysis. 2011;55:520–529.
- Bai X, Chen K, Yao W. Mixture of linear mixed models using multivariate t distribution. Journal of Statistical Computation and Simulation. 2016;86:771–787.
- Banas JA, Russell RR, Ferretti JJ. Sequence analysis of the gene for the glucan-binding protein of Streptococcus mutans Ingbritt. Infection and Immunity. 1990;58:667–673. doi: 10.1128/iai.58.3.667-673.1990.
- Bar H, Booth J, Wells M. A mixture-model approach for parallel testing for unequal variances. Statistical Applications in Genetics and Molecular Biology. 2012;11:1–21. doi: 10.2202/1544-6115.1762.
- Bar HY, Lillard DR. Accounting for heaping in retrospectively reported event data – a mixture-model approach. Statistics in Medicine. 2012;31:3347–3365. doi: 10.1002/sim.5419.
- Booth JG, Eilertson KE, Olinares PDB, Yu H. A Bayesian mixture model for comparative spectral count data in shotgun proteomics. Molecular & Cellular Proteomics. 2011;10:M110.007203. doi: 10.1074/mcp.M110.007203.
- Boyle JP, Dykstra RL. Advances in Order Restricted Statistical Inference: Proceedings of the Symposium on Order Restricted Statistical Inference; Iowa City, Iowa. September 11–13, 1985; New York: Springer; 1986. pp. 28–47. chap. A Method for Finding Projections onto the Intersection of Convex Sets in Hilbert Spaces.
- Bühlmann P, van de Geer S. Statistics for High-Dimensional Data. New York: Springer; 2009.
- Cavanaugh JE. Unifying the derivations for the Akaike and corrected Akaike information criteria. Statistics & Probability Letters. 1997;33:201–208.
- Celeux G, Martin O, Lavergne C. Mixture of linear mixed models for clustering gene expression profiles from repeated microarray experiments. Statistical Modelling. 2005;5:243–267.
- Chen J, Li P. Hypothesis test for normal mixture models: The EM approach. The Annals of Statistics. 2009;37:2523–2542.
- Deivanayagam CC, Rich RL, Carson M, Owens RT, Danthuluri S, Bice T, Höök M, Narayana SV. Novel fold and assembly of the repetitive B region of the Staphylococcus aureus collagen-binding surface protein. Structure. 2000;8:67–78. doi: 10.1016/s0969-2126(00)00081-2.
- Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B. 1977;39:1–38.
- Everitt B, Hand D. Monographs on applied probability and statistics. New York: Springer; 1981. Finite Mixture Distributions.
- Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
- Fan J, Lv J. Nonconcave penalized likelihood with NP-dimensionality. IEEE Transactions on Information Theory. 2011;57:5467–5484. doi: 10.1109/TIT.2011.2158486.
- Ferretti JJ, Gilpin ML, Russell RR. Nucleotide sequence of a glucosyltransferase gene from Streptococcus sobrinus MFe28. Journal of Bacteriology. 1987;169:4271–4278. doi: 10.1128/jb.169.9.4271-4278.1987.
- Flynn CJ, Hurvich CM, Simonoff JS. Efficiency for regularization parameter selection in penalized likelihood estimation of misspecified models. Journal of the American Statistical Association. 2013;108:1031–1043.
- Fraley C, Raftery AE. How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal. 1998;41:578–588.
- Fritz H, García-Escudero LA, Mayo-Iscar A. A fast algorithm for robust constrained clustering. Computational Statistics & Data Analysis. 2013;61:124–136.
- García-Escudero LA, Gordaliza A, Matrán C, Mayo-Iscar A. A general trimming approach to robust cluster analysis. The Annals of Statistics. 2008;36:1324–1345.
- Hathaway RJ. A constrained formulation of maximum-likelihood estimation for normal mixture distributions. The Annals of Statistics. 1985;13:795–800.
- Huang J, Ma S, Zhang CH. Adaptive lasso for high-dimensional regression models. Statistica Sinica. 2008;18:1603–1618.
- Keyburn AL, Boyce JD, Vaz P, Bannam TL, Ford ME, Parker D, Di Rubbo A, Rood JI, Moore RJ. NetB, a new toxin that is associated with Avian Necrotic Enteritis caused by Clostridium perfringens. PLoS Pathogens. 2008;4:e26. doi: 10.1371/journal.ppat.0040026.
- Lee S, McLachlan G. On mixtures of skew normal and skew t-distributions. Advances in Data Analysis and Classification. 2013;7:241–266.
- Lee Y, MacEachern SN, Jung Y. Regularization of case-specific parameters for robustness and efficiency. Statistical Science. 2012;27:350–372.
- Lindsay BG, Lesperance ML. A review of semiparametric mixture models. Journal of Statistical Planning and Inference. 1995;47:29–39.
- Lovland A, Kaldhusdal M. Severely impaired production performance in broiler flocks with high incidence of Clostridium perfringens-associated hepatitis. Avian Pathology. 2001;30:73–81. doi: 10.1080/03079450020023230.
- Markatou M. Mixture models, robustness, and the weighted likelihood methodology. Biometrics. 2000;56:483–486. doi: 10.1111/j.0006-341x.2000.00483.x.
- Martin TG, Smyth JA. The ability of disease and non-disease producing strains of Clostridium perfringens from chickens to adhere to extracellular matrix molecules and Caco-2 cells. Anaerobe. 2010;16:533–539. doi: 10.1016/j.anaerobe.2010.07.003.
- McDevitt R, Brooker J, Acamovic T, Sparks N. Necrotic enteritis; a continuing challenge for the poultry industry. World’s Poultry Science Journal. 2006;62:221–247.
- McLachlan GJ, Peel D. Finite mixture models. New York: Wiley; 2000.
- Meng XL, Rubin DB. Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika. 1993;80:267–278.
- Neykov N, Filzmoser P, Dimova R, Neytchev P. Robust fitting of mixtures using the trimmed likelihood estimator. Computational Statistics & Data Analysis. 2007;52:299–308.
- Ng SK, McLachlan GJ, Wang K, Ben-Tovim Jones L, Ng SW. A mixture model with random-effects components for clustering correlated gene-expression profiles. Bioinformatics. 2006:1745–1752. doi: 10.1093/bioinformatics/btl165.
- Oberg AL, Mahoney DW, Eckel-Passow JE, Malone CJ, Wolfinger RD, Hill EG, Cooper LT, Onuma OK, Spiro C, Therneau TM, et al. Statistical analysis of relative labeled mass spectrometry data from complex samples using ANOVA. Journal of Proteome Research. 2008;7:225–233. doi: 10.1021/pr700734f.
- Peel D, McLachlan G. Robust mixture modelling using the t distribution. Statistics and Computing. 2000;10:339–348.
- She Y. An iterative algorithm for fitting nonconvex penalized generalized linear models with grouped predictors. Computational Statistics & Data Analysis. 2012;56:2976–2990.
- She Y, Chen K. Robust reduced-rank regression. Biometrika. 2017. doi: 10.1093/biomet/asx032. In press. arXiv:1509.03938.
- She Y, Owen AB. Outlier detection using nonconvex penalized regression. Journal of the American Statistical Association. 2011;106:626–639.
- Shimizu T, Ohtani K, Hirakawa H, Ohshima K, Yamashita A, Shiba T. Complete genome sequence of Clostridium perfringens, an anaerobic flesh-eater. Proceedings of the National Academy of Sciences. 2002;99:996–1001. doi: 10.1073/pnas.022493799.
- Smyth JA, Martin TG. Disease producing capability of netB positive isolates of C. perfringens recovered from normal chickens and a cow, and netB positive and negative isolates from chickens with necrotic enteritis. Veterinary Microbiology. 2010;146:76–84. doi: 10.1016/j.vetmic.2010.04.022.
- Song W, Yao W, Xing Y. Robust mixture regression model fitting by Laplace distribution. Computational Statistics and Data Analysis. 2014;71:128–137.
- Tibshirani RJ. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B. 1996;58:267–288.
- Urfer W, Grzegorczyk M, Jung K. Statistics for proteomics: a review of tools for analyzing experimental data. Proteomics. 2006;6:48–55. doi: 10.1002/pmic.200600554.
- Wijewanta EA, Seneviratna P. Bacteriological studies of fatal Clostridium perfringens infection in chickens. Avian Diseases. 1971;15:654–661.
- Wren BW. A family of clostridial and streptococcal ligand-binding proteins with conserved C-terminal repeat sequences. Molecular Microbiology. 1991;5:797–803. doi: 10.1111/j.1365-2958.1991.tb00752.x.
- Yu C, Chen K, Yao W. Outlier detection and robust mixture modeling using nonconvex penalized likelihood. Journal of Statistical Planning and Inference. 2015;164:27–38.
- Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B. 2006;68:49–67.