Psychometrika. 2016 Jun 21;81:611–624. doi: 10.1007/s11336-016-9503-3

Biclustering Models for Two-Mode Ordinal Data

Eleni Matechou 1, Ivy Liu 2, Daniel Fernández 2, Miguel Farias 3, Bergljot Gjelsvik 4,5
PMCID: PMC4978779  PMID: 27329648

Abstract

The work in this paper introduces finite mixture models that can be used to simultaneously cluster the rows and columns of two-mode ordinal categorical response data, such as those resulting from Likert scale responses. We use the popular proportional odds parameterisation and propose models which provide insights into major patterns in the data. Model-fitting is performed using the EM algorithm, and a fuzzy allocation of rows and columns to corresponding clusters is obtained. The clustering ability of the models is evaluated in a simulation study and demonstrated using two real data sets.

Electronic supplementary material

The online version of this article (doi:10.1007/s11336-016-9503-3) contains supplementary material, which is available to authorized users.

Keywords: EM algorithm, fuzzy clustering, Likert scale, proportional odds

Introduction

Measurement data with ordinal categories occur frequently and in many fields of application. For example in medicine, a continuous clinical response is often categorised into ordered subtypes based on histological or morphological terms. In a questionnaire, Likert scale responses might be “better”, “unchanged” or “worse”. When analysing such data, it is of interest to link the ordinal responses to a set of explanatory variables.

Despite being introduced more than 3 decades ago, the proportional odds model (PO, McCullagh, 1980) is still frequently employed in analysing ordinal response data in, for example, agriculture (Lanfranchi, Giannetto, & Zirilli, 2014), medicine (Skolnick et al., 2014; Tefera & Sharma, 2015) and socioeconomic studies (Pechey, Monsivais, Ng, & Marteau, 2015).

One motivation for the PO model assumes that the ordinal response has an underlying continuous variable (Anderson & Philips, 1981), called a latent variable, that follows a logistic distribution. The extensive use of the PO model is due to its parsimony for modelling the effect of covariates on the response, compared to other similar models such as the baseline-category logit model, thanks to the use of the proportional odds property (Agresti, 2010, Sect. 3.3.1). Additionally, the model parameters are invariant to the way the categories for the ordinal response are formed (Agresti, 2010, Sect. 3.3.3).

In the analysis of two-mode data matrices, with the modes being for example subjects and questions and with all of the elements being ordered categorical responses, one might be interested in modelling the effect of both the rows and columns on the response. An example of such data is an n by p matrix that summarises the responses of n individuals to p questions, each with q possible (ordered) responses. In this case, the PO model can be fitted to identify, for example, individuals and questions that tend to be linked with higher values of the ordinal response.
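
To make this concrete, the non-clustered PO model with separate row and column effects can be fitted in R by putting the matrix in long format; the sketch below uses MASS::polr on a toy matrix, which is an illustrative assumption rather than the analysis reported in this paper.

```r
## Illustrative fit of the PO model with additive row and column effects.
## The toy matrix Y and the use of MASS::polr are assumptions for illustration.
library(MASS)

set.seed(1)
Y <- matrix(sample(1:4, 12 * 16, replace = TRUE), nrow = 12)  # toy 12 x 16 ordinal matrix

long <- data.frame(y   = factor(as.vector(Y), ordered = TRUE),
                   row = factor(rep(seq_len(nrow(Y)), times = ncol(Y))),
                   col = factor(rep(seq_len(ncol(Y)), each  = nrow(Y))))

## polr parameterises logit P(Y <= k) = zeta_k - eta, i.e. mu_k - alpha_i - beta_j here
fit <- polr(y ~ row + col, data = long)
length(coef(fit)) + length(fit$zeta)    # (n - 1) + (p - 1) + (q - 1) parameters
```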

However, the number of parameters in the PO model increases as the number of rows or columns in the data set increases. As a result, interpretation becomes problematic for large data sets. Identifying patterns related to the heterogeneity of the data, for example clusters of rows or columns that have a similar effect on the response, is challenging. Therefore, modelling approaches that take the row and column cluster structure of the data into account are needed.

The work in this paper has been motivated by this need to model potential heterogeneity among the, assumed independent, ordinal responses in two-mode data by identifying row and/or column clusters. In addition to single-mode clustering, our proposed models provide a two-mode clustering, or biclustering, with fuzzy allocation of the rows and/or columns to corresponding clusters. This way, the number of parameters can be reduced considerably as rows and/or columns are clustered into corresponding homogeneous groups assumed to have the same effect on the response. The results provide insights into major patterns in the data, and row/column clusters can be compared and ranked according to their effect on the ordinal response.

A number of model-based or distance-minimising biclustering methods exist that allocate, probabilistically or not, the rows and columns of a data set containing continuous, binary or count data to corresponding clusters. Examples include the double k-means method of Vichi (2001) and Rocci and Vichi (2008) which, as the name suggests, resembles the k-means algorithm (Hartigan & Wong, 1979), and the block mixture models of Govaert and Nadif (2003, 2010). Pledger and Arnold (2014) have recently proposed a group of likelihood-based models fitted using the Expectation–Maximisation algorithm (EM) (Dempster, Laird, & Rubin, 1977) for simultaneous fuzzy clustering of the rows and columns of binary or count data.

The cluster analysis given by Pledger and Arnold (2014) can be considered as a multivariate approach using latent modelling. For both ordered and unordered categorical variables, Desantis, Houseman, Coull, Stemmet-Rachamimiv, and Betensky (2008) proposed a one-mode clustering method based on latent modelling, which has been widely applied in many fields (e.g. Desantis, Andrés Houseman, Coull, Nutt, & Betensky, 2012; Eluru, Bagheri, & Miranda-Moreno, 2012; Molitor, Papathomas, Jerrett, & Richardson, 2010; Scharoun-Lee et al., 2011).

In this paper, we generalise the Pledger and Arnold (2014) work to the case of ordinal categorical response data, specifically using the PO model parameterisation. The proposed model structure is an extension of the one-mode clustering model given by Desantis et al. (2008).

Section 2 describes the model structure. The performance of several model selection criteria in selecting the true number of clusters in the data when our proposed model is used is assessed in Sect. 3.1. The reliability of the clustering resulting from our proposed model is evaluated, using simulation, in Sect. 3.2. Finally, applications to two real data sets are shown in Sects. 4.1 and 4.2 and the resulting clusters are compared to those obtained by double k-means (Vichi, 2001).

Materials and Methods

Background: Proportional Odds Model

Consider the data set as an n × p matrix Y with entry y_ij the realisation of a categorical distribution with q cells and probabilities θ_ij1, …, θ_ijq, where Σ_{k=1}^{q} θ_ijk = 1 for all i, j. Let the set of model parameters be denoted by ϕ.

Under the PO model, and in the case where the additive effect of rows and columns on the response is considered

\[
\theta_{ijk} =
\begin{cases}
\dfrac{\exp(\mu_k - \alpha_i - \beta_j)}{1 + \exp(\mu_k - \alpha_i - \beta_j)}, & k = 1,\\[1.5ex]
\dfrac{\exp(\mu_k - \alpha_i - \beta_j)}{1 + \exp(\mu_k - \alpha_i - \beta_j)} - \dfrac{\exp(\mu_{k-1} - \alpha_i - \beta_j)}{1 + \exp(\mu_{k-1} - \alpha_i - \beta_j)}, & 1 < k < q,\\[1.5ex]
1 - \sum_{k=1}^{q-1} \theta_{ijk}, & k = q,
\end{cases}
\tag{1}
\]

or alternatively,

\[
\operatorname{logit} P(Y_{ij} \le k) =
\begin{cases}
\mu_k - \alpha_i - \beta_j, & 1 \le k < q,\\
+\infty, & k = q,
\end{cases}
\tag{2}
\]

where μ_k is the kth cut-off point, with μ_1 < μ_2 < ⋯ < μ_{q-1}, and α_i, β_j are, respectively, the effects of row i and column j on the response, with α_1 = β_1 = 0. The total number of model parameters is ν = (q - 1) + (n - 1) + (p - 1).
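
To make the link between the cumulative logits in (2) and the cell probabilities in (1) concrete, the following minimal R sketch evaluates θ_ij1, …, θ_ijq for one cell; the parameter values are illustrative, not estimates from the paper.

```r
## Cell probabilities theta_ij1, ..., theta_ijq of Eq. (1) from the cumulative
## logits of Eq. (2).  mu has length q - 1 and must be increasing.
po_cell_probs <- function(mu, alpha_i, beta_j) {
  cum <- c(plogis(mu - alpha_i - beta_j), 1)   # P(Y_ij <= k), k = 1, ..., q
  diff(c(0, cum))                              # theta_ijk = P(Y_ij = k)
}

mu <- c(-1, 0, 1)                              # q = 4 ordered categories (illustrative)
po_cell_probs(mu, alpha_i = 0.5, beta_j = -0.2)   # probabilities sum to 1
```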

Biclustering: Simultaneous Clustering of Rows and Columns

Suppose that the rows come from a finite mixture with R components or row clusters while the columns come from a finite mixture with C components or column clusters. Rows that belong to the same row cluster, r, are assumed to have the same effect on the response, modelled using parameter αr. Similarly, columns that belong to the same column cluster c have the same effect on the response modelled by parameter βc. If cell ij belongs to row group r and column group c then, under the PO model and assuming an additive effect of the clusters on the response,

\[
\operatorname{logit} P(Y_{ij} \le k) = \mu_k - \alpha_r - \beta_c \quad \text{if } 1 \le k < q, \text{ and } +\infty \text{ otherwise}.
\tag{3}
\]

The proportion of rows in row group r is π_r and the proportion of columns in column group c is κ_c, with Σ_{r=1}^{R} π_r = Σ_{c=1}^{C} κ_c = 1. As rows in the same row cluster share the parameter α_r and columns in the same column cluster share the parameter β_c, there are now (q - 1) + 2(R - 1) + 2(C - 1) parameters in the model, where R ≤ n and C ≤ p. Choosing R ≪ n and C ≪ p ensures that the number of independent parameters in this model is lower than the number of parameters in the proportional odds model formulated in expression (2).

However, cluster membership is typically unknown and hence the (incomplete data) likelihood sums over all possible partitions of rows into R clusters and over all possible partitions of columns into C clusters

\[
\ell(\phi, \pi, \kappa \mid Y) = \log \sum_{c_1=1}^{C} \cdots \sum_{c_p=1}^{C} \kappa_{c_1} \cdots \kappa_{c_p} \sum_{r_1=1}^{R} \cdots \sum_{r_n=1}^{R} \pi_{r_1} \cdots \pi_{r_n} \prod_{i=1}^{n} \prod_{j=1}^{p} \prod_{k=1}^{q} \theta_{r_i c_j k}^{I(y_{ij}=k)},
\tag{4}
\]

where π_{r_i} and κ_{c_j} are, respectively, the proportions of rows and columns that belong to row group r_i and column group c_j under the particular partition of the rows into R clusters and of the columns into C clusters.

Here, following Pledger and Arnold (2014, Sect. 2.2.2), we adopt a finite mixture model which, assuming row-based conditional independence, can be described by the following (incomplete-data) log-likelihood

\[
\ell(\phi, \pi, \kappa \mid Y) = \log \sum_{c_1=1}^{C} \cdots \sum_{c_p=1}^{C} \kappa_{c_1} \cdots \kappa_{c_p} \prod_{i=1}^{n} \sum_{r=1}^{R} \pi_r \prod_{j=1}^{p} \prod_{k=1}^{q} \theta_{r c_j k}^{I(y_{ij}=k)},
\tag{5}
\]

which sums over the possible column cluster partitions only. Equation (5) is obtained from Eq. (4) by passing the terms of the product over i through the sums over r_1, …, r_n, so that the sum over all row partitions factorises into a product, over rows, of sums over the R row clusters.
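
To see why Eq. (5) is still expensive to evaluate directly, the sketch below computes it by brute force, enumerating all C^p column-cluster partitions; this is feasible only for very small p and C and is not the authors' implementation, which instead uses the EM algorithm with a variational approximation (Appendix A).

```r
## Brute-force evaluation of the incomplete-data log-likelihood of Eq. (5).
## Enumerates all C^p column-cluster partitions, so only usable for tiny p.
loglik_eq5 <- function(Y, mu, alpha, beta, pi_r, kappa_c) {
  n <- nrow(Y); p <- ncol(Y); R <- length(pi_r); C <- length(kappa_c)
  theta <- function(r, c, k) {                       # theta_{rck} as in Eq. (1)
    cum <- c(plogis(mu - alpha[r] - beta[c]), 1)
    diff(c(0, cum))[k]
  }
  parts <- as.matrix(expand.grid(rep(list(1:C), p))) # all column partitions
  total <- 0
  for (s in seq_len(nrow(parts))) {
    cj   <- parts[s, ]
    term <- prod(kappa_c[cj])                        # kappa_{c_1} ... kappa_{c_p}
    for (i in seq_len(n)) {                          # product over rows
      inner <- 0
      for (r in seq_len(R)) {                        # sum over row clusters
        inner <- inner + pi_r[r] *
          prod(vapply(seq_len(p), function(j) theta(r, cj[j], Y[i, j]), numeric(1)))
      }
      term <- term * inner
    }
    total <- total + term
  }
  log(total)
}
```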

The additive model shown in Eq. (3) can be extended to a model which allows for an interaction between the row and column cluster effects, denoted by parameters γ, by modelling the logits of the cumulative probabilities as

\[
\operatorname{logit} P(Y_{ij} \le k) = \mu_k - \alpha_r - \beta_c - \gamma_{rc} \quad \text{if } 1 \le k < q, \text{ and } +\infty \text{ otherwise},
\tag{6}
\]

and, assuming the constraints Σ_r γ_rc = 0 for all c and Σ_c γ_rc = 0 for all r, increasing the number of parameters by (R - 1)(C - 1) compared to the additive case.

The model can also be altered to consider one-mode clustering, and the set of different models that can be fitted is shown in Table 1, with details given in Appendix A. The first two columns in Table 1, labelled “R” and “C”, denote, respectively, the number of row and column clusters assumed in the model: when R=1 and C=1, all rows/columns are homogeneous, forming a single row/column cluster; when R=n and C=p, all rows/columns are heterogeneous, each forming its own row/column cluster; when R=r and C=c, there are r and c homogeneous row/column clusters, respectively. Additionally, models incorporating an interaction term are indicated by the associated parameters γ, with the first subscript indexing the row clusters (or rows) and the second the column clusters (or columns).

Table 1.

Model set with corresponding number of parameters ν.

R   C   logit P(Y_ij ≤ k), 1 ≤ k < q        ν
r   1   μ_k - α_r                           (q - 1) + 2R - 2
r   p   μ_k - α_r - β_j                     (q - 1) + 2R + p - 3
r   p   μ_k - α_r - β_j - γ_rj              (q - 1) + Rp + R - 2
1   c   μ_k - β_c                           (q - 1) + 2C - 2
n   c   μ_k - α_i - β_c                     (q - 1) + 2C + n - 3
n   c   μ_k - α_i - β_c - γ_ic              (q - 1) + Cn + C - 2
r   c   μ_k - α_r - β_c                     (q - 1) + 2R + 2C - 4
r   c   μ_k - α_r - β_c - γ_rc              (q - 1) + RC + R + C - 3

The following constraints are placed, where appropriate: α_1 = 0, β_1 = 0, Σ_k γ_kl = 0 for all l, Σ_l γ_kl = 0 for all k, Σ_{r=1}^{R} π_r = 1, Σ_{c=1}^{C} κ_c = 1. R=1: a single row cluster; R=r: r row clusters; R=n: each row is in its own cluster. Similarly, C=1: a single column cluster; C=c: c column clusters; C=p: each column is in its own cluster. For example, when R=1, C=c, the rows form one cluster while the columns form c clusters, and the logits of the cumulative probabilities in the PO model for column cluster c and 1 ≤ k < q are logit P(Y_ij ≤ k) = μ_k - β_c, for all rows. If, on the other hand, R=n, C=c, the cumulative logits for row i and column cluster c are, assuming an interaction between row and column effects and 1 ≤ k < q, logit P(Y_ij ≤ k) = μ_k - α_i - β_c - γ_ic.

We denote by Z_ir and X_jc the indicator random variables for group membership of row i in row group r and of column j in column group c, respectively. We use the EM algorithm (Dempster et al., 1977), treating cluster membership as the missing data, and derive estimates of the posterior probabilities of allocation of row i to row cluster r and of column j to column cluster c, given respectively by E(Z_ir) = ẑ_ir and E(X_jc) = x̂_jc, for i = 1, …, n, j = 1, …, p, r = 1, …, R and c = 1, …, C, with Σ_{r=1}^{R} ẑ_ir = Σ_{c=1}^{C} x̂_jc = 1 for all i, j.

The Z_ir and X_jc are not independent a posteriori, which makes the evaluation of the expected value of their product computationally expensive, as it requires a sum either over all possible allocations of rows to row groups or over all possible allocations of columns to column groups. The variational approximation (Govaert & Nadif, 2005), which we employ (see Appendix A.3.1 for details), is a solution to this problem.

We give details of the EM algorithm steps in Appendix A for all models listed in Table 1.

All the computer code is written in R (R Core Team, 2014), and the (complete data) log-likelihood (given in Appendix A) is maximised using the Newton–Raphson algorithm provided as an option in optim to estimate parameters μ1,,μq-1 and the effects of row and column clusters, as well as their interaction, if these exist in the model being fitted. Since the likelihood surface is multimodal, the EM algorithm is started from a number of different points and the iteration with the highest obtained likelihood value is retained (Everitt, Landau, Leese, & Stahl, 2011). The R code to fit the models is available upon request from the first author.
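
As an illustration of the fuzzy allocation produced by the E-step, the sketch below computes the posterior row-membership probabilities ẑ_ir for the one-mode row-clustering model of Table 1 (R = r, C = p) at the current parameter values. It is the standard finite-mixture E-step and only one ingredient of the full algorithm in Appendix A; the biclustering case additionally requires the variational approximation mentioned above.

```r
## E-step for the one-mode row-clustering model (R = r, C = p in Table 1):
## posterior probabilities z_ir that row i belongs to row cluster r.
## alpha: length-R row-cluster effects; beta: length-p column effects.
estep_rows <- function(Y, mu, alpha, beta, pi_r) {
  n <- nrow(Y); p <- ncol(Y); R <- length(pi_r)
  logz <- matrix(NA_real_, n, R)
  for (r in seq_len(R)) {
    logtheta <- sapply(seq_len(p), function(j) {     # q x p matrix of log theta_{rjk}
      cum <- c(plogis(mu - alpha[r] - beta[j]), 1)
      log(diff(c(0, cum)))
    })
    logz[, r] <- log(pi_r[r]) +
      rowSums(sapply(seq_len(p), function(j) logtheta[Y[, j], j]))
  }
  z <- exp(logz - apply(logz, 1, max))               # normalise on the log scale
  z / rowSums(z)                                     # rows sum to 1
}
```

In the corresponding M-step these weights would enter the complete-data log-likelihood that is maximised with optim, and, as noted above, the whole run is repeated from several random starting values.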

Simulation Studies

We have performed two simulation studies: one to evaluate the performance of 10 model selection criteria in recovering the true number of clusters when our proposed models are used (Sect. 3.1) and one to evaluate the reliability of our proposed models (Sect. 3.2).

Model Selection

Since these are likelihood-based models, likelihood-based model selection criteria, such as AIC (Akaike, 1973), its small-sample modification (AICc, Akaike, 1973; Burnham & Anderson, 2002; Hurvich & Tsai, 1989), BIC (Schwarz, 1978) and its Integrated Classification Likelihood version (ICL-BIC, Biernacki, Celeux, & Govaert, 2000), can be used to select amongst them.

Following Fernández, Arnold, and Pledger (2014), we set up a simulation study to empirically establish a relationship between our likelihood-based models for ordinal data, specifically using the PO model, and the performance of 10 information criteria (Table 2) in recovering the true number of cluster components.

Table 2.

Information criteria summary table.

Criteria                                       Definition                        Proposed for    Depending on
AIC (Akaike, 1973)                             -2ℓ + 2ν                          Regression      ν
AICc (Akaike, 1973)                            AIC + 2ν(ν + 1)/(np - ν - 1)      Regression      ν and np
AICu (McQuarrie, Shumway, & Tsai, 1997)        AICc + np log[np/(np - ν - 1)]    Regression      ν and np
CAIC (Bozdogan, 1987)                          -2ℓ + ν(1 + log(np))              Regression      ν and np
BIC (Schwarz, 1978)                            -2ℓ + ν log(np)                   Regression      ν and np
AIC3 (Bozdogan, 1994)                          -2ℓ + 3ν                          Clustering      ν
CLC (Biernacki & Govaert, 1997)                -2ℓ + 2EN                         Clustering      EN
NEC(R) (Biernacki, Celeux, & Govaert, 1999)    EN/(ℓ - ℓ(1))                     Clustering      EN
ICL-BIC (Biernacki et al., 2000)               -2ℓ_c + ν log(np)                 Clustering      ν, np and EN
AWE (Banfield & Raftery, 1993)                 -2ℓ_c + 2ν(3/2 + log(np))         Clustering      ν, np and EN

ℓ is the maximised incomplete-data log-likelihood (see Eq. 5); ℓ(1) is the maximised incomplete-data log-likelihood without clustering structure; and ℓ_c is the maximised complete-data log-likelihood given in Appendix A. The third column categorises the criteria according to whether they were proposed for model selection in a regression setting or for clustering. The last column indicates whether the penalty depends on the number of parameters ν, the total sample size np (the number of elements in the response matrix Y), and/or the entropy function EN = ℓ - ℓ_c.
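
Assuming the definitions in Table 2, the criteria used most heavily below can be computed from the maximised log-likelihoods with a small helper such as the following sketch; ll and ll_c stand for ℓ and ℓ_c of a fitted model.

```r
## A subset of the criteria in Table 2, from the maximised incomplete-data
## log-likelihood ll, the maximised complete-data log-likelihood ll_c,
## the number of parameters nu and the total sample size np (= n * p).
ic_table <- function(ll, ll_c, nu, np) {
  c(AIC     = -2 * ll + 2 * nu,
    AICc    = -2 * ll + 2 * nu + 2 * nu * (nu + 1) / (np - nu - 1),
    BIC     = -2 * ll + nu * log(np),
    AIC3    = -2 * ll + 3 * nu,
    ICL_BIC = -2 * ll_c + nu * log(np))
}
```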

We set n=150, p=15, q=4, R=3 and C=2. We specified five scenarios by varying the row and column mixing proportions: a data set with similar dimensions (n=150 and p=15) to the data analysed in the example in Sect. 4.2 (Scenario 1), balanced row and column mixing proportions (Scenario 2), balanced column mixing proportions but unbalanced row proportions (Scenario 3), unbalanced row and column mixing proportions (Scenario 4) and one of the row mixing proportions close to zero (Scenario 5).

For each scenario, we simulated 100 data sets and noted the selected model using each of the 10 criteria out of models with R=1,2,3,4,5 and C=1,2,3,4,5. For each simulated data set, the EM algorithm was repeated 10 times with random starting points and the best ML estimates (those that led to highest log–likelihood value) were kept.

Figure 1 displays the percentage of cases in which each information criterion correctly recovered the true number of row and column clusters, i.e. the true model that generated the data, averaged across the five scenarios. AIC3 has the best performance (selecting the correct model in 78 % of cases), followed by BIC (75 %), AIC, AICc, AICu and CAIC (73 %).

Fig. 1.

Simulation study to assess the performance of model selection criteria in recovering the true number of clusters for our proposed biclustering finite mixture PO (POFM) model. Bars depict the percentage of cases in which the true model is correctly identified by each criterion, averaged across the five scenarios.

Our results are in accordance with Fonseca and Cardoso (2007) for the categorical case. ICL-BIC underestimates the number of clusters (selecting too few clusters in 32 % of cases), while CLC overestimates the number of clusters in 29 % of cases. AWE and NEC perform very poorly (selecting the correct model in 46 and 24 % of cases, respectively).

It is important to highlight that these results simply evaluate the ability of the model selection criteria to select the right number of clusters in the mixture, not necessarily to identify the best clustering structure for the data.

Model Evaluation

In this section, we evaluate the performance of our proposed method in (i) biclustering, varying the cluster sizes and the sample size and (ii) one-dimensional row clustering, compared to that of double k-means (Vichi, 2001) and standard k-means, respectively.

(i) We set R = 3, C = 2 and q = 3 or 5. The cutpoint values are obtained such that the response categories have equal probabilities for the baseline row and column cluster; that is, P(Y_ij = 1) = P(Y_ij = 2) = ⋯ = P(Y_ij = q) when row i belongs to the first row cluster and column j belongs to the first column cluster. The cutpoint values are {μ_1 = log(1/2), μ_2 = log(2)} when q = 3, and {μ_1 = log(1/4), μ_2 = log(2/3), μ_3 = log(3/2), μ_4 = log(4)} when q = 5. We consider (α_1, α_2, α_3) = (0, 1, 2), (β_1, β_2) = (0, -1) and π_1 = π_2 = π_3 = 1/3. We vary n, p, q and (κ_1, κ_2) as n ∈ {9, 30, 99}, p ∈ {10, 20, 100}, q ∈ {3, 5} and (κ_1, κ_2) ∈ {(0.5, 0.5), (0.4, 0.6), (0.3, 0.7), (0.2, 0.8)}. The case with balanced column clusters assumes (κ_1, κ_2) = (0.5, 0.5); the unbalanced scenarios range from (0.4, 0.6) to (0.2, 0.8).

The responses {Y_ij} are generated from a categorical distribution with size 1 and probabilities as in expression (1). We assign the first third of the rows to row cluster 1, the second third to row cluster 2 and the last third to row cluster 3. Similarly, the first proportion κ_1 of the columns is assigned to column cluster 1 and the remaining columns to column cluster 2. We simulate 100 data sets for each scenario.
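
The sketch below reproduces this data-generating mechanism under the stated assumptions: equiprobable baseline categories, so that μ_k = logit(k/q), and fixed cluster memberships. The particular n, p and seed are illustrative.

```r
## Cut-offs making the q categories equiprobable at the baseline cluster:
## mu_k = logit(k / q), which recovers the values quoted above.
equiprob_cutoffs <- function(q) qlogis(seq_len(q - 1) / q)
equiprob_cutoffs(3)   # log(1/2), log(2)
equiprob_cutoffs(5)   # log(1/4), log(2/3), log(3/2), log(4)

## Generate an n x p ordinal matrix from the biclustering PO model (3).
simulate_bicluster <- function(n, p, mu, alpha, beta, row_cl, col_cl) {
  q <- length(mu) + 1
  Y <- matrix(NA_integer_, n, p)
  for (i in seq_len(n)) for (j in seq_len(p)) {
    cum     <- c(plogis(mu - alpha[row_cl[i]] - beta[col_cl[j]]), 1)
    Y[i, j] <- sample.int(q, 1, prob = diff(c(0, cum)))
  }
  Y
}

set.seed(1)
n <- 30; p <- 20
row_cl <- rep(1:3, each = n / 3)         # first thirds of the rows per cluster
col_cl <- rep(1:2, c(0.5 * p, 0.5 * p))  # balanced column clusters
Y <- simulate_bicluster(n, p, equiprob_cutoffs(3),
                        alpha = c(0, 1, 2), beta = c(0, -1), row_cl, col_cl)
```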

Table 3 shows the mean of the parameter estimates obtained for α_2, α_3 and β_2 from 100 simulated data sets. The estimates are biased when n or p is small. This is because the clusters are not fixed, and hence their effect on the response is not fixed either: a group of subjects who belong to a certain cluster in the true model might be allocated to a different cluster for a simulated data set, or they might be split across clusters. However, when both n and p are large, the means are close to the true parameter values, because it is then less likely that a large number of subjects are allocated to the wrong cluster and, hence, the estimated clusters are closer to the true ones.

Table 3.

The average estimate obtained for each parameter over 100 simulations.

n    p    True value    (κ_1, κ_2) = (0.5, 0.5)    (0.4, 0.6)    (0.3, 0.7)    (0.2, 0.8)
                        q=3    q=5                 q=3    q=5    q=3    q=5    q=3    q=5
9 10 α2=1 1.40 1.46 1.43 1.58 1.43 1.56 1.46 1.49
10 α3=2 3.03 1.99 2.30 2.22 2.40 1.95 2.37 1.99
10 -β2=1 1.33 1.02 0.98 0.90 0.76 0.86 0.73 0.71
20 α2=1 1.42 1.38 1.42 1.43 1.41 1.38 1.45 1.40
20 α3=2 1.88 1.91 1.95 1.90 2.07 1.84 2.00 1.92
20 -β2=1 0.95 0.91 1.43 0.84 1.14 0.93 0.71 0.69
100 α2=1 1.31 1.42 1.34 1.43 1.38 1.44 1.37 1.44
100 α3=2 1.88 1.97 1.90 2.00 1.92 1.99 1.92 2.00
100 -β2=1 1.07 0.88 0.93 0.81 1.24 1.02 0.98 0.88
30 10 α2=1 1.41 1.44 1.43 1.37 1.38 1.45 1.40 1.38
10 α3=2 2.47 2.23 2.70 2.30 2.54 2.09 2.90 1.94
10 -β2=1 1.01 0.96 1.07 0.93 0.96 0.92 0.94 0.78
20 α2=1 1.26 1.18 1.15 1.19 1.19 1.22 1.19 1.23
20 α3=2 1.96 1.98 2.02 2.05 2.06 1.96 2.08 2.04
20 -β2=1 0.95 0.96 1.02 1.00 1.02 1.02 0.91 1.00
100 α2=1 1.11 1.30 1.16 1.34 1.16 1.34 1.17 1.32
100 α3=2 1.96 1.98 1.92 1.98 1.93 1.99 1.95 1.99
100 -β2=1 0.97 0.95 0.96 0.95 0.98 0.97 0.97 0.96
99 10 α2=1 1.22 1.24 1.42 1.31 1.22 1.22 1.39 1.19
10 α3=2 2.28 2.16 2.32 2.22 2.33 2.21 2.47 2.16
10 -β2=1 1.00 0.97 1.01 0.99 1.01 1.00 0.96 0.98
20 α2=1 1.05 1.02 1.03 1.03 1.06 1.01 1.06 1.06
20 α3=2 2.04 1.99 2.04 2.04 2.05 1.97 2.06 2.01
20 -β2=1 1.01 0.99 1.00 1.00 0.98 0.99 0.99 0.98
100 α2=1 1.03 1.13 1.04 1.14 1.05 1.19 1.04 1.17
100 α3=2 1.99 1.99 1.99 2.00 1.97 2.00 1.99 1.99
100 -β2=1 0.99 1.00 1.00 0.99 1.00 0.99 1.00 0.99

Despite this bias, the overall result shows that for balanced cases with (κ_1, κ_2) = (0.5, 0.5), the estimates of the column effects are closer to the truth than for highly unbalanced cases with (κ_1, κ_2) = (0.2, 0.8) when n is small. Unbalanced column clusters do not affect the quality of the row cluster effect estimates. In general, when both n and p increase, the quality of the row cluster effect estimates improves. The standard errors are between 0.05 and 0.5 for the cases with p = 10; for the other cases, they range from 0.001 to 0.08.

To evaluate the clustering ability of our proposed method, we calculate the average proportion of pairs for which the pairwise grouping is correct (Rand index; Rand, 1971) over 100 simulated data sets. For example, if two rows are in the same cluster under the true model but the proposed method allocates them to different clusters, then this pair is mis-clustered, and vice versa. We report the average Rand index for all row/column pairs in Table 4 when (κ_1, κ_2) = (0.5, 0.5) and (0.2, 0.8), for both our proposed approach and the double k-means algorithm (Vichi, 2001). The two approaches have similar performance, which improves as n and p increase and when the column clusters are balanced. For our approach, the largest standard error is 0.03 for the highly unbalanced cases and most standard errors are between 0.001 and 0.01.
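
For reference, a minimal sketch of the Rand index computed directly from two label vectors:

```r
## Rand index: proportion of row (or column) pairs on which two partitions
## agree, i.e. both place the pair in the same cluster or both separate it.
rand_index <- function(labels_true, labels_est) {
  same_true <- outer(labels_true, labels_true, "==")
  same_est  <- outer(labels_est,  labels_est,  "==")
  agree     <- same_true == same_est
  mean(agree[upper.tri(agree)])          # average over distinct pairs
}

rand_index(c(1, 1, 2, 2, 3), c(1, 1, 1, 2, 3))   # 0.7
```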

Table 4.

The average Rand index for 100 simulated data sets based on our proposed (POFM) and double k-means (dkm) methods.

n p q=3 q=5
(κ1,κ2)= (0.5, 0.5) (0.2, 0.8) (0.5, 0.5) (0.2, 0.8)
Cluster POFM dkm POFM dkm POFM dkm POFM dkm
9 10 Row 0.61 0.75 0.63 0.72 0.65 0.76 0.64 0.74
10 Col. 0.64 0.63 0.60 0.54 0.65 0.59 0.59 0.52
20 Row 0.74 0.78 0.73 0.80 0.75 0.76 0.71 0.78
20 Col. 0.64 0.59 0.60 0.55 0.65 0.60 0.61 0.53
100 Row 0.81 0.99 0.79 0.97 0.77 0.98 0.76 0.97
100 Col. 0.66 0.62 0.62 0.55 0.70 0.64 0.65 0.57
30 10 Row 0.65 0.70 0.66 0.70 0.66 0.70 0.67 0.72
10 Col. 0.75 0.76 0.80 0.60 0.86 0.75 0.73 0.67
20 Row 0.76 0.77 0.78 0.78 0.78 0.79 0.78 0.80
20 Col. 0.90 0.80 0.86 0.65 0.91 0.83 0.86 0.71
100 Row 0.92 0.99 0.91 0.99 0.85 0.99 0.84 0.99
100 Col. 0.91 0.84 0.93 0.74 0.93 0.87 0.94 0.79
99 10 Row 0.68 0.70 0.68 0.71 0.69 0.71 0.69 0.71
10 Col. 0.99 0.96 0.95 0.85 0.99 0.97 0.93 0.88
20 Row 0.78 0.80 0.80 0.81 0.82 0.81 0.81 0.82
20 Col. 0.99 0.99 0.99 0.97 1.00 0.99 0.97 0.98
100 Row 0.98 0.99 0.97 0.99 0.92 0.99 0.91 0.99
100 Col. 0.99 0.99 1.00 0.98 1.00 0.99 1.00 0.99

(ii) We set R = 3 and C = 1, i.e. logit P(Y_ij ≤ k) = μ_k - α_r if 1 ≤ k < q, and +∞ otherwise. The cutpoint values are calculated as in simulation setting (i) above. We vary n and p as n ∈ {9, 30, 99} and p ∈ {10, 20, 100}, with π_1 = π_2 = π_3 = 1/3, (α_1, α_2, α_3) = (0, 1, 2), (0, 2, 4) or (0, 1, 4), and q ∈ {3, 5, 7}.

When p is large, there are more data points for each row. When q is large, the ordered categorical response has a finer scale. For the row cluster effects {αr,r=1,2,3}, the last setting (0,1,4) gives an unbalanced effect where the difference between the first two clusters is small, but the first two clusters are quite different from the third cluster.
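
A plausible version of the k-means comparator treats the ordinal scores as numeric row profiles; this is an assumption about the comparison, not necessarily the exact configuration used here. It reuses Y, row_cl and rand_index from the earlier sketches.

```r
## k-means on the raw ordinal scores as a baseline row-clustering method.
km <- kmeans(Y, centers = 3, nstart = 25)
rand_index(row_cl, km$cluster)           # agreement with the true row clusters
```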

Table 5 shows the average Rand index for 1000 simulated data sets for each of the scenarios, comparing the proposed method (POFM) with k-means. All standard errors for the index are less than 0.0026. Most of them are around 0.001. POFM performs better than k-means when the cluster effects are balanced. In general, the greater n, p, q or the cluster effects are, the better the performance. The only case when k-means considerably outperforms POFM is when (α1,α2,α3)=(0,1,4) and p is large. For this particular case, POFM fails to distinguish between Clusters 1 and 2, and partitions the individuals into only two clusters, leaving one of the clusters empty. However, the quality of the row clustering is still satisfactory, with the average Rand index greater than 70% in all cases.

Table 5.

The average Rand index based on our proposed (POFM) and k-means methods for 1000 simulated data sets.

n p Method (α2,α3)=(1,2) (α2,α3)=(2,4) (α2,α3)=(1,4)
q=3 5 7 3 5 7 3 5 7
9 10 POFM 0.61 0.63 0.64 0.73 0.78 0.80 0.74 0.75 0.75
k-means 0.68 0.69 0.69 0.70 0.72 0.73 0.72 0.74 0.75
20 POFM 0.70 0.72 0.73 0.79 0.86 0.88 0.77 0.76 0.75
k-means 0.70 0.71 0.72 0.71 0.73 0.74 0.74 0.77 0.78
100 POFM 0.85 0.84 0.83 0.94 0.94 0.86 0.75 0.75 0.75
k-means 0.74 0.77 0.78 0.74 0.77 0.78 0.79 0.88 0.90
30 10 POFM 0.65 0.67 0.68 0.75 0.81 0.84 0.76 0.77 0.77
k-means 0.66 0.67 0.68 0.70 0.72 0.73 0.71 0.74 0.76
20 POFM 0.73 0.76 0.77 0.84 0.93 0.95 0.78 0.78 0.78
k-means 0.70 0.72 0.72 0.72 0.75 0.76 0.75 0.80 0.81
100 POFM 0.94 0.92 0.91 0.95 0.99 0.92 0.77 0.77 0.77
k-means 0.79 0.83 0.86 0.76 0.84 0.87 0.93 0.97 0.98
99 10 POFM 0.67 0.68 0.69 0.76 0.84 0.88 0.76 0.77 0.78
k-means 0.67 0.68 0.68 0.70 0.72 0.73 0.72 0.75 0.76
20 POFM 0.75 0.78 0.80 0.86 0.95 0.97 0.79 0.78 0.78
k-means 0.71 0.73 0.74 0.73 0.77 0.80 0.79 0.85 0.86
100 POFM 0.98 0.97 0.96 0.97 1.00 0.97 0.78 0.78 0.78
k-means 0.88 0.92 0.93 0.82 0.87 0.89 0.99 0.99 0.99

Results: Case Studies

Religious Beliefs

We consider part of the data set from a study first published by Wiech et al. (2008). Twelve individuals, self-classified as religious, replied to 16 questions, shown in Appendix B, all rated on a 6-point Likert scale: (1) “Strongly disagree”, ..., (6) “Strongly agree”. The questions were designed to assess an individual’s beliefs about the level of control that god (first eight questions) and other powerful individuals (last eight questions) have over their lives.

The biclustering model proposed in Sect. 2 was fitted to the 12 by 16 matrix, considering R, C = 2, …, 4. The model with the greatest support from AIC3 has R = 3, C = 2 and an interaction between row group and column group effects.

The two column clusters separate the questions into the two categories (god and others) almost perfectly. Cluster 1 includes questions {1, 2, 3, 4, 5, 6, 8, 10}, while Cluster 2 includes questions {7, 9, 11, 12, 13, 14, 15, 16}. The three row clusters are {3, 4, 5, 6, 8, 9, 10, 12}, {1, 2, 11} and {7}. Double k-means (Vichi, 2001) gives the same row clusters and similar column clusters {2, 3, 4, 5, 6, 8}, {1, 7, 9, 10, 11, 12, 13, 14, 15, 16}.

The estimated probabilities of replying 3 or above to each of the two question clusters for the three row clusters are shown in Figure 2. All row groups tend to agree more with god-related questions than with questions related to the effect of other powerful people. The estimated probabilities of agreeing with the god-related questions do not vary considerably between the three row clusters. However, that is not the case for the second column group, since individuals in Row Cluster 1, and particularly in Row Cluster 3, which consists of Individual 7 alone, tend to give lower scores than individuals in Row Cluster 2. Note that, in addition, Individual 7 strongly agrees with the questions in Cluster 1, demonstrating more extreme views than individuals belonging to the other clusters, who tend to be more moderate in their answers.

Fig. 2.

Estimated probabilities of replying 3 or above to each of the 2 column clusters for all 3 row clusters, as derived by the biclustering model with R=3, C=2.

Attempted Suicides

The data set was collected as part of a study of patients admitted for deliberate self-harm (DSH) at the Acute Medical Departments of three major hospitals in Eastern Norway. We consider the answers of 151 individuals to 13 questions, shown in Appendix C, that were designed to assess the level of depression of the respondent by means of the Beck Depression Inventory-Short Form (BDI-SF) (Furlanetto, Mendlowicz, & Romildo Bueno, 2005). Response options range from 1 to 4, with higher scores indicating higher levels of depression (Beck, Schuyler, & Herman, 1974).

We fitted biclustering models with R = 2, …, 5 and C = 2 or 3. The model supported by AIC3 has R = 5, C = 2 and an additive effect of the row and column groups on the response.

The two column clusters are {1, 2, 3, 4, 5, 7, 8, 10, 13} and {6, 9, 11, 12}, with the first cluster receiving higher scores than the second (β̂_2 = -0.99 (0.10)), suggesting that the nine questions of Cluster 1 are possibly markers of more severe forms of depression. The allocation of individuals to the five row groups is in the proportions 0.211, 0.266, 0.208, 0.030 and 0.285. Double k-means (Vichi, 2001) gives the following column clusters: {2, 3, 4, 5, 6, 7, 8} and {1, 9, 10, 11, 12, 13}. For the row clusters, we present the proportion of individuals from each of our clusters that are allocated to each of the double k-means clusters in Table 6, where it can be seen that, with the exception of Cluster 4, the highest proportions appear on the diagonal of the table.

Table 6.

Percent of individuals from the five POFM clusters, represented in the rows, that are clustered in the corresponding five double k-means (Vichi, 2001) clusters.

POFM cluster Double k-means cluster
1 2 3 4 5
1 100 0 0 0 0
2 26 72 2 0 0
3 0 10 48 23 0
4 0 0 0 0 100
5 0 0 21 30 49

The fourth row cluster, which consists of four individuals, is believed to show the most signs of depression, since α̂_4 = 1.8 (0.32). The first cluster follows with α̂_1 = 0, since it is the baseline, followed by Clusters 5 (α̂_5 = -1.14 (0.12)), 2 (α̂_2 = -2.37 (0.13)) and 3 (α̂_3 = -3.79 (0.16)). In fact, no one in Cluster 4 contacted someone for help after their attempt, while the corresponding proportions for the other four clusters are all greater than 25%, which demonstrates the greater determination of individuals in Cluster 4 to succeed in their attempt. Of course, the size of Cluster 4 is possibly too small to make meaningful comparisons of this type. However, the proportion of individuals in Clusters 1, 5, 2 and 3 that had at least one episode of DSH within three months after the study is, respectively, equal to 30, 24, 16 and 3.4%. DSH is one of the most robust predictors of subsequent death by suicide (Hawton, Casanas, Comabella, Haw, & Saunders, 2013). The risk of suicide among DSH patients treated at hospital is 30- to 200-fold higher in the year following an episode than among individuals with no history of DSH (Cooper et al., 2005; Hawton et al., 2012; Owens, Horrocks, & House, 2002). Our model has successfully ordered the groups in terms of their risk of DSH within three months of the data being collected.

Discussion

Our biclustering models identify homogeneous groups of both rows and columns in two-mode data sets of ordinal responses, reducing the number of parameters needed to adequately describe the data and therefore easing interpretation. They fully account for the ordinal nature of the responses and, being likelihood-based, give access to standard tools for selecting between candidate models.

We have performed an extensive simulation study to compare the performance of a number of model selection criteria in identifying the correct number of mixture components for models and data such as the ones we considered in our applications, conditional on using the EM algorithm and the variational approximation of Govaert and Nadif (2005). The variational approximation is known to produce local optima, and hence it is recommended to use different random starting values for several runs of the EM algorithm. Recently, Keribin, Brault, Celeux, and Govaert (2014) developed latent block models for categorical data, considering a Bayesian approach, which do not require the aforementioned approximation. The potential to develop such models for the PO parameterization is a matter of future research.

In the two real data applications considered, both including questionnaire-type data designed to gain knowledge about the participants’ personality, feelings and way of thinking, the clusters identified by the model agree with our knowledge of the system and provide useful insight into the characteristics of the participants. In particular, in the example of Sect. 4.2, the way the participants were clustered agrees with information collected three months after the study was conducted.

In the analysis presented in Sect. 4.2 we considered only individuals with complete records, excluding participants with missing data. Missing data are often present in similar studies and, hence, future work could extend the models to deal with this issue. Fitting the models using a Bayesian approach could provide a way of dealing with missing data and also of choosing the right number of clusters, as, for example, in van Dijk, van Rosmalen, and Paap (2009) and Wyse and Friel (2012), or of appropriately averaging over models, for example using reversible jump MCMC (Green, 1995).

Substantial developments in specialised methods for ordinal data have been made in recent years (see Liu & Agresti, 2005, for an overview). For instance, Fernández et al. (2014) developed likelihood-based one- and two-dimensional clustering models for ordinal data. They did this by using the ordinal stereotype model, which allows the determination of a new spacing of the ordinal categories, as dictated by the data. The models presented in this paper may be extended to other ordinal models, such as adjacent-categories logit models, continuation-ratio logit models and mean response models (see Agresti, 2012, for details on these models). Similarly, incorporating covariates into the model, when these are available, is straightforward by adjusting the linear predictor accordingly.

We have presented the case when q, i.e. the number of levels, is the same for all variables. However, the models are easily extended to allow for a set of cutpoints to be calculated for each unique value of q observed in the data set.

The area of application of these models is extremely wide and includes market research, where questions of the type “How likely are you to buy this product in the future?” have possible responses “Very likely to buy”, “Likely to buy”, “May or may not buy”, etc. Additionally, the models are useful for services, such as websites, that review products, such as books, music albums and hotels, and provide recommendations to users based on their past reviews, as they can simultaneously cluster the individuals according to their taste and the products according to the reviews they have received from all users.

Future research will develop a graphical method for matrix visualisation, taking the resulting probabilities of allocation for each individual data point into account. The existing graphical methods rely on the use of ad hoc distance metrics and similarity measures which, as we have noted above, do not respect the full ordinal nature of the data.


Acknowledgments

We are grateful to Shirley Pledger and Richard Arnold for the discussions about the Pledger and Arnold (2014) paper and to Maurizio Vichi for sharing his double k-means Matlab code.

References

  1. Agresti A. Analysis of ordinal categorical data (2nd ed.). New Jersey: Wiley; 2010. [Google Scholar]
  2. Agresti A. Categorical data analysis. New Jersey: Wiley; 2012. [Google Scholar]
  3. Akaike, H., (1973). Information theory and an extension of the maximum likelihood principle. B. N. Petrov, and F. Caski, (eds.) Proceeding of the Second International Symposium on Information Theory. Akademiai Kiado, Budapest, pp. 267–281.
  4. Anderson JA, Philips PR. Regression, discrimination and measurement models for ordered categorical variables. Applied Statistics. 1981;30:22–31. doi: 10.2307/2346654. [DOI] [Google Scholar]
  5. Banfield JD, Raftery AE. Model-based Gaussian and non-Gaussian clustering. Biometrics. 1993;49:803–821. doi: 10.2307/2532201. [DOI] [Google Scholar]
  6. Beck, A. T., Schuyler, D., & Herman, I. (1974). Development of suicidal intent scales. In A. T. Beck, H. L. Resnik, & D. J. Lettieri (Eds.), The prediction of suicide. Charles Press.
  7. Biernacki C, Celeux G, Govaert G. An improvement of the NEC criterion for assessing the number of clusters in mixture model. Pattern Recognition Letters. 1999;20:267–272. doi: 10.1016/S0167-8655(98)00144-5. [DOI] [Google Scholar]
  8. Biernacki, C., Celeux, G., & Govaert, G. (2000). Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(7).
  9. Biernacki C, Govaert G. Using the classification likelihood to choose the number of clusters. Computing Science and Statistics. 1997;29:451–457. [Google Scholar]
  10. Bozdogan H. Model selection and Akaike’s information criterion (AIC): The general theory and its analytical extensions. Psycometrika. 1987;52:345–370. doi: 10.1007/BF02294361. [DOI] [Google Scholar]
  11. Bozdogan H. Mixture-model cluster analysis using model selection criteria and a new informational measure of complexity. Proceedings of the First US/Japan Conference on the Frontiers of Statistical Modeling: An Informational Approach. 1994;1:69–113. [Google Scholar]
  12. Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference. Springer.
  13. Cooper J, Kapur N, Webb R, Lawlor M, Guthrie E, Mackway-Jones K, Appleby L. Suicide after deliberate self-harm: a 4-year cohort study. American Journal of Psychiatry. 2005;162(2):297–303. doi: 10.1176/appi.ajp.162.2.297. [DOI] [PubMed] [Google Scholar]
  14. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B. 1977;39:1–38. [Google Scholar]
  15. Desantis SM, Andrés Houseman E, Coull BA, Nutt CL, Betensky RA. Supervised bayesian latent class models for high-dimensional data. Statistics in medicine. 2012;31:1342–1360. doi: 10.1002/sim.4448. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Desantis SM, Houseman EA, Coull BA, Stemmet-Rachamimiv AS, Betensky RA. A penalized latent class model for ordinal data. Biostatistics. 2008;9:249–262. doi: 10.1093/biostatistics/kxm026. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Eluru N, Bagheri M, Miranda-Moreno LF, Fu L. A latent class modeling approach for identifying vehicle driver injury severity factors at highway-railway crossings. Accident Analysis & Prevention. 2012;47:119–127. doi: 10.1016/j.aap.2012.01.027. [DOI] [PubMed] [Google Scholar]
  18. Everitt, B. S., Landau, S., Leese, M., & Stahl, D. (2011). Cluster analysis. Wiley.
  19. Fernández, D., Arnold, R., & Pledger, S. (2014). Mixture-based clustering for the ordered stereotype model. Computational Statistics and Data Analysis.
  20. Fonseca JRS, Cardoso M. Mixture-model cluster analysis using information theoretical criteria. Intelligent Data Analysis. 2007;11:155–173. [Google Scholar]
  21. Furlanetto LM, Mendlowicz MV, Romildo Bueno J. The validity of the Beck Depression Inventory-Short Form as a screening and diagnostic instrument for moderate and severe depression in medical inpatients. Journal of Affective Disorders. 2005;86:87–91. doi: 10.1016/j.jad.2004.12.011. [DOI] [PubMed] [Google Scholar]
  22. Govaert G, Nadif M. Clustering with block mixture models. Pattern Recognition. 2003;36:463–473. doi: 10.1016/S0031-3203(02)00074-2. [DOI] [Google Scholar]
  23. Govaert, G., & Nadif, M. (2005). An EM algorithm for the block mixture model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27. [DOI] [PubMed]
  24. Govaert G, Nadif M. Latent block model for contingency table. Communications in Statistics - Theory and Methods. 2010;39:416–425. doi: 10.1080/03610920903140197. [DOI] [Google Scholar]
  25. Green PJ. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika. 1995;82:711–732. doi: 10.1093/biomet/82.4.711. [DOI] [Google Scholar]
  26. Hartigan JA, Wong MA. A k-means clustering algorithm. Applied Statistics. 1979;28:100–108. [Google Scholar]
  27. Hawton K, Bergen H, Kapur N, Cooper J, Steeg S, Ness J, Waters K. Repetition of self-harm and suicide following self-harm in children and adolescents: findings from the Multicentre Study of Self-harm in England. Journal of Child Psychology and Psychiatry. 2012;53(12):1212–1219. doi: 10.1111/j.1469-7610.2012.02559.x. [DOI] [PubMed] [Google Scholar]
  28. Hawton K, Casanas I, Comabella C, Haw C, Saunders K. Risk factors for suicide in individuals with depression: A systematic review. Journal of Affective Disorders. 2013;147(1–3):17–28. doi: 10.1016/j.jad.2013.01.004. [DOI] [PubMed] [Google Scholar]
  29. Hurvich CM, Tsai CL. Regression and time series model selection in small samples. Biometrika. 1989;76:297–307. doi: 10.1093/biomet/76.2.297. [DOI] [Google Scholar]
  30. Keribin, C., Brault, V., Celeux, G., & Govaert, G. (2014). Estimation and selection for the latent block model on categorical data. Statistics and Computing, 1–16.
  31. Lanfranchi M, Giannetto C, Zirilli A. Analysis of demand determinants of high quality food products through the application of the cumulative proportional odds model. Applied Mathematical Sciences. 2014;8:3297–3305. [Google Scholar]
  32. Liu I, Agresti A. The analysis of ordered categorical data: an overview and a survey of recent developments. Test. 2005;14:1–73. doi: 10.1007/BF02595397. [DOI] [Google Scholar]
  33. McCullagh P. Regression models for ordinal data. Journal of the Royal Statistical Society. Series B. 1980;42:109–142. [Google Scholar]
  34. McQuarrie A, Shumway R, Tsai CL. The model selection criterion AICu. Statistics and Probability Letters. 1997;34:285–292. doi: 10.1016/S0167-7152(96)00192-7. [DOI] [Google Scholar]
  35. Molitor J, Papathomas M, Jerrett M, Richardson S. Bayesian profile regression with an application to the national survey of children’s health. Biostatistics. 2010;11:484–498. doi: 10.1093/biostatistics/kxq013. [DOI] [PubMed] [Google Scholar]
  36. Owens D, Horrocks J, House A. Fatal and non-fatal repetition of self-harm. Systematic review. Br J Psychiatry. 2002;181:193–199. doi: 10.1192/bjp.181.3.193. [DOI] [PubMed] [Google Scholar]
  37. Pechey R, Monsivais P, Ng YL, Marteau TM. Why don’t poor men eat fruit? Socioeconomic differences in motivations for fruit consumption. Appetite. 2015;84:271–279. doi: 10.1016/j.appet.2014.10.022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Pledger S, Arnold R. Multivariate methods using mixtures: Correspondence analysis, scaling and pattern-detection. Computational Statistics & Data Analysis. 2014;71:241–261. doi: 10.1016/j.csda.2013.05.013. [DOI] [Google Scholar]
  39. R Core Team, 2014. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria. http://www.R-project.org/.
  40. Rand WM. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association. 1971;66:846–850. doi: 10.1080/01621459.1971.10482356. [DOI] [Google Scholar]
  41. Rocci R, Vichi M. Two-mode multi-partitioning. Computational Statistics and Data Analysis. 2008;52:1984–2003. doi: 10.1016/j.csda.2007.06.025. [DOI] [Google Scholar]
  42. Scharoun-Lee M, Gordon-Larsen P, Adair LS, Popkin BM, Kaufman JS, Suchindran CM. Intergenerational profiles of socioeconomic (dis) advantage and obesity during the transition to adulthood. Demography. 2011;48:625–651. doi: 10.1007/s13524-011-0024-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Schwarz G. Estimating the dimension of a model. Annals of Statistics. 1978;6:461–464. doi: 10.1214/aos/1176344136. [DOI] [Google Scholar]
  44. Skolnick BE, Maas AI, Narayan RK, van der Hoop RG, MacAllister T, Ward JD, Nelson NR, Stocchetti N. A clinical trial of progesterone for severe traumatic brain injury. New England Journal of Medicine. 2014;371:2467–2476. doi: 10.1056/NEJMoa1411090. [DOI] [PubMed] [Google Scholar]
  45. Tefera M, Sharma M. Determinants of immunization among children aged 12–23 months in ethiopia: A proportional odds model approach. International Journal of Statistics in Medical Research. 2015;4:140–155. doi: 10.6000/1929-6029.2015.04.01.15. [DOI] [Google Scholar]
  46. van Dijk, B., van Rosmalen, J., & Paap, R. (2009). A Bayesian approach to two-mode clustering. Econometric Institute Research Papers: Technical Report.
  47. Vichi, M., (2001). Double k-means clustering for simultaneous classification of objects and variables, in: Borra, S., Rocci, R., Vichi, M., Schader, M. (Eds.), Advances in Classification and Data Analysis. Springer Berlin Heidelberg. Studies in Classification, Data Analysis, and Knowledge Organization, pp. 43–52.
  48. Wiech K, Farias M, Kahane G, Shackel N, Tiede W, Tracey I. An fMRI study measuring analgesia enhanced by religion as a belief system. PAIN. 2008;139(2):467–476. doi: 10.1016/j.pain.2008.07.030. [DOI] [PubMed] [Google Scholar]
  49. Wyse J, Friel N. Block clustering with collapsed latent block models. Statistics and Computing. 2012;22:415–428. doi: 10.1007/s11222-011-9233-4. [DOI] [Google Scholar]
