PLOS Computational Biology. 2023 Apr 17;19(4):e1011044. doi: 10.1371/journal.pcbi.1011044

Multi-view clustering by CPS-merge analysis with application to multimodal single-cell data

Lixiang Zhang 1, Lin Lin 2, Jia Li 1,*
Editor: Mark Alber 3
PMCID: PMC10138214  PMID: 37068097

Abstract

Multi-view data can be generated from diverse sources, by different technologies, and in multiple modalities. In various fields, integrating information from multi-view data has pushed the frontier of discovery. In this paper, we develop a new approach for multi-view clustering, which overcomes the limitations of existing methods such as the need of pooling data across views, restrictions on the clustering algorithms allowed within each view, and the disregard for complementary information between views. Our new method, called CPS-merge analysis, merges clusters formed by the Cartesian product of single-view cluster labels, guided by the principle of maximizing clustering stability as evaluated by CPS analysis. In addition, we introduce measures to quantify the contribution of each view to the formation of any cluster. CPS-merge analysis can be easily incorporated into an existing clustering pipeline because it only requires single-view cluster labels instead of the original data. We can thus readily apply advanced single-view clustering algorithms. Importantly, our approach accounts for both consensus and complementary effects between different views, whereas existing ensemble methods focus on finding a consensus for multiple clustering results, implying that results from different views are variations of one clustering structure. Through experiments on single-cell datasets, we demonstrate that our approach frequently outperforms other state-of-the-art methods.

Author summary

Advances in single-cell profiling technologies have made it possible to measure various types of features from a single cell. In this new type of data, known as multimodal single-cell data, each cell has numerical measurements from multiple views. Analyzing multimodal data has opened up new horizons for single-cell genomics, where clustering is a fundamental analysis for validating existing hypotheses or discovering insights when little prior knowledge is available. Existing clustering methods either combine data from different modalities for simultaneous processing or use integration algorithms to aggregate clustering results from multiple views. In this paper, we propose a new approach called CPS-merge analysis, which considers both consensus and complementary effects among clustering results across views and provides a quantified contribution of each view. The approach operates on single-view cluster labels, enabling the use of advanced clustering algorithms in any individual view. Furthermore, since CPS-merge analysis does not require pooling the original data, it can be applied to distributed sources or data with sharing concerns. This new approach tackles the problem of multi-view clustering from a novel combinatorial perspective and has the potential to become a widely used and effective tool.

Introduction

Multi-view data are becoming increasingly prevalent in real-world applications. For example, a data entry of a subject can contain image, audio, and text data. Single-cell genomics is a prominent biomedical area where multi-view data often arise. In the literature on single-cell data analysis, the term “multimodal” is usually used instead of “multi-view”. In this paper, we use them interchangeably. To apply our methods developed here, data in different views can be of different modalities. For instance, RNA expression levels of cells can reveal much of the cellular heterogeneity, and many advanced techniques and tools have been developed to analyze such data, e.g., Matrix factorization [1] for revealing low-dimensional structure, CIDER [2] for clustering. However, other data modalities, e.g., DNA, protein expression, are found necessary to fully understand the cellular mechanics [3, 4]. RNA expression is often inadequate to separate immune cells that are molecularly similar but functionally distinct, and many subpopulations of T cells, indistinguishable by scRNA-seq data, are identifiable in other modalities [5, 6]. These examples present the following case: each view contains useful information about an instance, and the information in different views is complementary to some degree. A well-designed learning algorithm that leverages all the views can greatly improve performance. In particular, analytical tools for multimodal single-cell data have helped reconstruct gene regulatory networks, a significant leap forward for revealing the inner workings of biological systems [7, 8]. In biomedical multi-view learning, several related but different tasks have been pursued, e.g., multi-view classification [9], multi-view clustering [10], multi-view deconvolution [11], and multi-view data integration [12]. Here, we focus on unsupervised multi-view clustering, used to reveal the underlying cellular structure that can assist downstream analysis.

The authors of [13] proposed to divide multi-view clustering methods for genomics data into three types: early, intermediate, and late integration. The early integration type contains methods that concatenate variables across all the views. Many drawbacks, e.g., the sharp increase in dimension and the neglect of special statistical properties of particular views, have been noted for such methods [14]. These issues, especially severe for multimodal single-cell data, are mitigated to some extent by intermediate integration methods. According to a few highly regarded surveys on multi-view learning [15–18], intermediate integration methods include well-known multi-view clustering algorithms that belong to four schools: co-training, multiple kernel clustering, multi-view subspace clustering, and multi-view graph clustering. These methods combine data from multiple views into one set using weights, transformations, or simplification based on similarity or dimension reduction. In contrast, the late integration methods, which mostly belong to ensemble clustering methods, generate aggregated clusters based on clustering results obtained in every single view. Example methods of late integration ensemble clustering include [19–22]. Specifically, in [21, 22], dissimilarity measures between clusters in different results are computed based on the cluster membership of samples in each result. Dendrogram clustering is then applied to yield an integrated clustering result.

Ensemble clustering methods are popular for treating multi-view data, for example, the fast multi-view clustering via ensembles (FastMICE) method [23]. However, not all ensemble methods strictly adhere to the early, intermediate, and late integration taxonomy. FastMICE [23] is of particular interest because it employs a hybrid of early and late integration. This method of mixed early-late integration aims to identify a consensus among view-group clustering results, with each view group containing a random number of views, such as a single view or multiple views. The clusters in multi-view groups may be established using early integration.

Late integration methods overcome some disadvantages of early or intermediate integration [19–22]. It is straightforward for such methods to incorporate advanced clustering methods developed for single-view data since they operate on single-view cluster labels instead of the original data. Consequently, late integration methods are easier to adopt when data in multiple views cannot be pooled, for instance, due to privacy concerns. On the other hand, the advantages of late integration come at a cost. Since such methods only examine the single-view cluster labels for integration, relevant information that is present in the original data but not retained in the cluster labels cannot be leveraged.

There are two primary principles for multi-view clustering, namely, the consensus principle [24] and the complementary principle [15]. The consensus principle assumes a shared clustering structure across all views, so clustering results from different views are considered variations of a single clustering result. Methods developed under this principle seek a plausible “average” of the clustering results. In contrast, the complementary principle emphasizes that clusters may only emerge when data from all views are analyzed together, as is often observed with single-cell data. Early integration methods naturally follow this principle since original data from all views are combined. However, incorporating the complementary principle into late integration methods is challenging because they only have access to cluster labels from different views. In fact, existing late integration clustering methods ignore the complementary effect between views.

Most methods in the intermediate integration type also ignore the complementary principle. For instance, multi-view clustering algorithms by co-training [25, 26] make the underlying assumptions of sufficiency and compatibility: (a) each view is sufficient for clustering on its own, (b) the target functions of both views predict the same labels for co-occurring features with a high probability, and so on. Under the sufficiency assumption, co-training methods aim at maximizing agreement between two views (consensus principle). In addition, the compatibility assumption restricts the clustering algorithms allowed. Specifically, similar algorithms are used in different views. Not only co-training methods but also other methods in the intermediate integration type, e.g., multiple kernel clustering [27–29], multi-view subspace clustering [30–33], and multi-view graph clustering [34–36], are by construction not ready to leverage state-of-the-art algorithms for clustering single-view data. Furthermore, the multiple kernel clustering methods do not scale well with the sample size due to the quadratic complexity (in sample size) of computing the kernel matrix. The multi-view subspace clustering methods assume implicitly that a shared latent subspace across the views determines the clusters (the spirit of the consensus principle). Multi-view graph clustering methods, aiming at finding a fusion graph from multiple views, are vulnerable to noisy datasets because they ignore inconsistency between views [37]. A number of algorithms have been designed specifically for multimodal single-cell data, e.g., weighted-nearest neighbor (WNN) analysis [5], totalVI [38], and multi-omics factor analysis v2 (MOFA+) [39]. Based on the comparison in [5], WNN is the state-of-the-art multimodal single-cell clustering algorithm, but it is a multi-view graph clustering approach with the disadvantages discussed previously. Last but not least, the intermediate integration methods must centralize the multiple views at the data level and hence are not applicable to distributed data.

In this paper, we aim at developing a late integration method that accounts for both the consensus and complementary principles. Although late integration has many benefits for single-cell data, existing methods overlook the importance of the complementary principle. We illustrate the significance of this principle through three simulated scenarios, with detailed findings provided in Fig A, Fig B, and Fig C in S1 Appendix. Our results demonstrate that the complementary effect between views plays an essential role for identifying meaningful clusters.

Our novel algorithm called Covering Point Set Merge (CPS-merge) analysis contributes to the paradigm of late integration by combining the two principles, namely, the consensus and complementary principles. In CPS-merge analysis, we create Cartesian product clusters based on single-view clusters. Many product clusters may arise due to randomness and may not represent meaningful subgroups. To address this issue, we have developed a computationally efficient approach to merge product clusters by considering the uncertainty level of each cluster. Furthermore, we propose a new measure to quantify the contribution of each view to the identification of any final cluster. This measure is valuable for understanding cell heterogeneity in single-cell studies.

Materials and methods

Overview of analysis pipeline

The pipeline of CPS-merge analysis is shown in Fig 1. CPS-merge analysis generates an aggregated multi-view clustering result by the following modules.

Fig 1. The pipeline of CPS-merge analysis.


When there are more than two views, users can either directly treat the Cartesian product clusters with higher orders or conduct step-wise merging such that two views are treated at each step. Current multimodal single-cell datasets only contain two views.

  • Module 1: Data are perturbed by random noises in each view. A collection of clustering results (aka, partitions) are obtained from the perturbed data using a view-specific and user-specified clustering algorithm. The same algorithm is applied to the original data to yield a clustering result which we call reference partition. Then clusters in different clustering results are aligned with the reference partition via optimal transport, a step to remove inconsistency in the cluster labels used in different results.

  • Module 2: We form new clusters by the Cartesian product of the clusters from two or more views, that is, each ordered pair (or k-tuples in general) of cluster labels from the two views defines one cluster.

  • Module 3: To obtain a final clustering result, we merge unstable clusters progressively to maximize tightness given a specified number of final clusters. The tightness measure, defined in [40], quantifies clustering stability. For a comprehensive review of clustering stability, see [41]. If the number of Cartesian product clusters at the start of merging is large (for example, more than 100), we conduct a first-stage merging by bipartite clustering. Otherwise, we directly begin the second-stage merging using Covering Point Set (CPS) analysis, available via the R package OTclust [40].

The output of CPS-merge analysis contains an integrated clustering result and quantities that measure the contribution of each view to the final clusters. The Cartesian product clusters from multiple views capture the complementary effects between the views. On the other hand, these clusters are subject to merging based on cross-view correspondence between clusters. We establish this correspondence by CPS analysis under the consensus principle. The cross-view correspondence exists as a mapping between clusters or between the so-called super-clusters, the former by optimal transport and the latter by bipartite clustering.

The computational complexity of CPS-merge analysis depends on all the modules. In Module 1, the complexity of generating perturbed data is linear in sample size. The complexity of the single-view clustering algorithms used can vary. However, many clustering algorithms have linear complexity in sample size. In Modules 2 and 3, CPS-merge analysis involves optimal transport or bipartite clustering applied to the single-view clusters instead of the original data points. Hence the complexity is quadratic in the number of clusters, which is usually much smaller than the sample size. In summary, if the single-view clustering algorithms have linear complexity in sample size, the complexity of CPS-merge analysis will also be linear unless the number of clusters is of the same order as the sample size.

Although there are usually two views in current multimodal single-cell datasets, our method extends straightforwardly to more than two views. In such a case, our method can be applied either directly to the Cartesian product clusters across all the views or progressively to two views at a time, e.g., aggregating the first two views into one, then taking the third view as the other view, and so on. Without loss of generality, we assume there are two views in our discussion. Next, we elaborate on each module in CPS-merge analysis.

Module 1: Generate coherent cluster labels within each view

Because our algorithm aims at maximizing the overall tightness of a clustering result, we need to generate random variations of a clustering result within each view to evaluate tightness. The tightness measure is defined for both individual clusters and the entire partition to quantify the level of uncertainty. We first explain in the steps listed below how to obtain random variations of a clustering result. Details for the definition and computation of the tightness of a cluster are provided at the end of this Section.

  1. Apply clustering to the original single-view data and call the result reference partition.

  2. Perturb the original single-view data by adding random noise to each point. The noise is sampled from a Gaussian distribution with mean zero and variances adjusted with the data. We usually set the variance to be 10% of the average within-cluster variance. Repeat the perturbation step to obtain multiple perturbed versions of the whole dataset.

  3. Obtain a collection of partitions by applying a user-chosen clustering algorithm to every perturbed dataset.

  4. Align clusters in those partitions with the clusters in the reference partition.

Note that our algorithm works with any baseline clustering algorithm chosen for a single view. Thus users can easily incorporate state-of-the-art clustering algorithms. Because of the unsupervised nature of clustering, Step (4), which aligns clusters using the reference partition as the benchmark, is necessary since cluster labels used in different partitions are not automatically consistent. For instance, even if two partitions are identical, the cluster labels may be permuted. In practice, precise correspondence between clusters in two partitions rarely exists. We use a cluster alignment algorithm based on optimal transport [42]. Next, we explain how clustering results are aligned within each view. More details can be found in [42].
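The perturbation in Step 2 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and the reading of "average within-cluster variance" as the per-coordinate population variance averaged over coordinates and then over clusters is an assumption.

```python
import math
import random

def average_within_cluster_variance(points, labels):
    """Per-coordinate population variance, averaged over coordinates and clusters.

    This particular averaging scheme is an assumption made for illustration.
    """
    clusters = {}
    for x, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(x)
    cluster_vars = []
    for members in clusters.values():
        dim = len(members[0])
        means = [sum(x[d] for x in members) / len(members) for d in range(dim)]
        total = sum((x[d] - means[d]) ** 2 for x in members for d in range(dim))
        cluster_vars.append(total / (len(members) * dim))
    return sum(cluster_vars) / len(cluster_vars)

def perturb(points, labels, fraction=0.1, rng=None):
    """Return a noisy copy of `points`.

    Zero-mean Gaussian noise is added to every coordinate, with variance equal
    to `fraction` times the average within-cluster variance.
    """
    rng = rng or random.Random()
    sigma = math.sqrt(fraction * average_within_cluster_variance(points, labels))
    return [[v + rng.gauss(0.0, sigma) for v in x] for x in points]

# Toy data: two well-separated 2-D clusters with reference labels 0 and 1.
points = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [5.0, 5.0], [5.2, 4.9], [4.9, 5.3]]
reference_labels = [0, 0, 0, 1, 1, 1]
rng = random.Random(0)
perturbed_datasets = [perturb(points, reference_labels, rng=rng) for _ in range(3)]
```

Each perturbed dataset would then be passed to the user-chosen clustering algorithm (Step 3), and the resulting partitions aligned with the reference partition (Step 4).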

Suppose there are two partitions denoted by $\mathcal{P}^{(p)}$, $p = 1, 2$, each containing $k_p$ clusters $C_1^{(p)}, \dots, C_{k_p}^{(p)}$. The alignment between clusters is captured by a so-called cluster aligning matrix:

$$W = (w_{i,j})_{i=1,\dots,k_1;\; j=1,\dots,k_2}, \quad w_{i,j} \in [0, 1].$$

The entry $w_{i,j}$ is a coupling/matching weight between $C_i^{(1)}$ and $C_j^{(2)}$, with a higher value indicating a stronger match. For example, if $\mathcal{P}^{(2)}$ contains the same clusters as $\mathcal{P}^{(1)}$ but with permuted labels, $W$ will encode the permutation by having $w_{i,j} > 0$ if the $i$th cluster in $\mathcal{P}^{(1)}$ is the $j$th cluster in $\mathcal{P}^{(2)}$ and $w_{i,j'} = 0$ if $j' \neq j$. In general, $W$ can specify partial matching between clusters in order to handle more complicated situations, e.g., $k_1 \neq k_2$, one cluster splitting into multiple clusters, etc. $W$ is solved by optimal transport (OT) with the objective of minimizing the weighted sum of distances between every pair of clusters across the two partitions. OT ensures that if the two clustering results are permuted versions of each other, the permutation will be identified through the solution for $W$.

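Suppose each cluster $C_i^{(p)}$ is assigned a significance weight $q_i^{(p)}$, with $\sum_{i=1}^{k_p} q_i^{(p)} = 1$. Usually, $q_i^{(p)}$ is the proportion of data points in $C_i^{(p)}$. Let $d(\cdot, \cdot)$ be the distance between two clusters. We solve for $W$ by OT:

$$
\begin{aligned}
D(\mathcal{P}^{(1)}, \mathcal{P}^{(2)}) := \min_{W} \; & \sum_{i=1}^{k_1} \sum_{j=1}^{k_2} w_{i,j}\, d(C_i^{(1)}, C_j^{(2)}) \\
\text{s.t.} \; & \sum_{j=1}^{k_2} w_{i,j} = q_i^{(1)}, \quad i = 1, \dots, k_1, \\
& \sum_{i=1}^{k_1} w_{i,j} = q_j^{(2)}, \quad j = 1, \dots, k_2, \\
& w_{i,j} \ge 0, \quad i = 1, \dots, k_1;\; j = 1, \dots, k_2.
\end{aligned} \tag{1}
$$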

The Jaccard distance is adopted as the distance between clusters, i.e.,

$$d(C_i^{(1)}, C_j^{(2)}) = 1 - |C_i^{(1)} \cap C_j^{(2)}| \,/\, |C_i^{(1)} \cup C_j^{(2)}|,$$

where $|\cdot|$ means the cardinality of a set, "∩" the intersection of sets, and "∪" the union of sets. The first two constraints on the $w_{i,j}$'s ensure that the total influence of any cluster is determined by its proportion. The objective is to minimize the weighted sum of the matching costs between clusters. The minimized objective function $D(\mathcal{P}^{(1)}, \mathcal{P}^{(2)})$ is defined as the distance between the two partitions, often called the Wasserstein distance. Consider $\mathcal{P}^{(2)}$ as the reference partition. After obtaining $W$, we normalize its $i$th row and define $\gamma_{i,j} = w_{i,j} / q_i^{(1)}$ ($q_i^{(1)}$ is the proportion of data points in $C_i^{(1)}$), which indicates the proportion of cluster $C_i^{(1)}$ mapped to cluster $C_j^{(2)}$. Denote this cluster mapping matrix as $\Gamma^{(1)} = (\gamma_{i,j}^{(1)})_{i=1,\dots,k_1;\; j=1,\dots,k_2}$.
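To make the ingredients of the alignment concrete, the sketch below builds the Jaccard cost matrix for two toy partitions and row-normalizes a coupling matrix into the cluster mapping matrix. The OT solve itself is not implemented here; the matrix `W` is written down by hand as the feasible coupling that places all mass on the zero-cost pairs, and all function names are hypothetical.

```python
def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b| for two sets of point indices."""
    a, b = set(a), set(b)
    union = a | b
    return (1.0 - len(a & b) / len(union)) if union else 0.0

def clusters_from_labels(labels):
    """List of clusters (sets of point indices), ordered by label."""
    out = {}
    for idx, lab in enumerate(labels):
        out.setdefault(lab, set()).add(idx)
    return [out[lab] for lab in sorted(out)]

def cost_matrix(part1, part2):
    """Pairwise Jaccard distances between the clusters of two partitions."""
    return [[jaccard_distance(ci, cj) for cj in part2] for ci in part1]

def row_normalize(W):
    """Gamma[i][j] = W[i][j] / q_i, where q_i is the i-th row sum of W.

    Under the constraints of Eq (1), the i-th row sum equals q_i^(1).
    """
    return [[w / sum(row) for w in row] for row in W]

# The second partition is the first one with labels 0 and 1 swapped.
p1 = clusters_from_labels([0, 0, 0, 1, 1])
p2 = clusters_from_labels([1, 1, 1, 0, 0])
D = cost_matrix(p1, p2)        # zeros mark identical clusters
W = [[0.0, 0.6], [0.4, 0.0]]   # hand-written coupling; rows sum to q^(1), columns to q^(2)
Gamma = row_normalize(W)       # a permutation matrix in this toy case
```

In this example the coupling has zero total cost, so the distance between the two partitions is 0 and the mapping matrix recovers the label permutation.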

For the general case of aligning with more than two partitions, suppose we have a reference partition $\mathcal{P}^{(r)}$ that contains $\kappa$ clusters: $C_1^{(r)}, \dots, C_\kappa^{(r)}$. Let the proportion of points in $C_j^{(r)}$ be $q_j^{(r)}$, $j = 1, \dots, \kappa$. Similarly, suppose we have $m$ other partitions to align with the reference, and each partition $\mathcal{P}^{(p)}$ contains $k_p$ clusters $C_1^{(p)}, \dots, C_{k_p}^{(p)}$. Let the proportion of points in cluster $C_i^{(p)}$ be $q_i^{(p)}$, $i = 1, \dots, k_p$, $p = 1, \dots, m$. We align each $\mathcal{P}^{(p)}$ with $\mathcal{P}^{(r)}$. Let the cluster aligning matrix from $\mathcal{P}^{(p)}$ to $\mathcal{P}^{(r)}$ be $W^{(p)} = (w_{i,j}^{(p)})_{i=1,\dots,k_p;\; j=1,\dots,\kappa}$ and the cluster mapping matrix be $\Gamma^{(p)} = (\gamma_{i,j}^{(p)})_{i=1,\dots,k_p;\; j=1,\dots,\kappa}$. Denote the cluster-posterior matrix of partition $\mathcal{P}^{(p)}$ by $P^{(p)} = (\mathrm{pr}_{h,i}^{(p)})_{h=1,\dots,n;\; i=1,\dots,k_p}$, where $\mathrm{pr}_{h,i}^{(p)}$ is the posterior probability of the $h$th data point belonging to cluster $C_i^{(p)}$. We then define the aligned cluster-posterior matrix based on $P^{(p)}$ but "translated" to the cluster labels of the reference $\mathcal{P}^{(r)}$ by

$$P^{(p \to r)} = (\mathrm{pr}_{h,j}^{(p \to r)})_{h=1,\dots,n;\; j=1,\dots,\kappa} = P^{(p)} \Gamma^{(p)}.$$

Each row of the cluster-posterior matrix $P^{(p \to r)}$ is regarded as the posterior probabilities of the corresponding point belonging to each cluster in the reference clustering result. We then use the maximum a posteriori (MAP) rule to assign the aligned cluster labels (i.e., the cluster labels used in the reference partition) to the sample points in partition $\mathcal{P}^{(p)}$, $p = 1, \dots, m$. In this way, we obtain a collection of clustering results with consistent cluster labels under each view. Next, three key definitions are provided below.
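The translation of labels described above amounts to one matrix product followed by a row-wise argmax. The following sketch illustrates it on a hard partition written as a one-hot posterior matrix; the mapping matrix values are made up for illustration, and the helper names are hypothetical.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def map_labels(posterior):
    """MAP rule: index of the largest entry in each row (ties go to the lowest index)."""
    return [max(range(len(row)), key=row.__getitem__) for row in posterior]

# Hard partition of n = 4 points into k_p = 3 clusters, as a one-hot posterior matrix.
P = [[1, 0, 0],
     [0, 1, 0],
     [0, 0, 1],
     [0, 0, 1]]

# Hypothetical cluster mapping matrix (k_p x kappa with kappa = 2); each row sums to 1.
Gamma = [[1.0, 0.0],
         [0.8, 0.2],
         [0.1, 0.9]]

P_aligned = matmul(P, Gamma)            # posterior over the reference clusters
aligned_labels = map_labels(P_aligned)  # labels expressed in the reference labeling
```

Here the first two clusters of the partition are both mapped to reference cluster 0 and the third to reference cluster 1, which is how a "merge" relationship shows up after alignment.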

Covering Point Set (CPS)

Suppose we have already obtained the cluster mapping matrix $\Gamma^{(p)}$. Similarly, we normalize $W^{(p)}$ column-wise to obtain

$$\tilde{\Gamma}^{(p)} = (\tilde{\gamma}_{i,j}^{(p)})_{i=1,\dots,k_p;\; j=1,\dots,\kappa}, \quad \tilde{\gamma}_{i,j}^{(p)} = w_{i,j}^{(p)} / q_j^{(r)},$$

where $q_j^{(r)}$ is the proportion of data points in $C_j^{(r)}$. Based on $\Gamma^{(p)}$ and $\tilde{\Gamma}^{(p)}$, four types of topological relationships between clusters are defined: "match", "split", "merge", and "lack of correspondence". For example, $C_i^{(p)}$ and $C_j^{(r)}$ match if $\gamma_{i,j}^{(p)} \ge \zeta$ and $\tilde{\gamma}_{i,j}^{(p)} \ge \zeta$, where $\zeta$ is a relaxation threshold set between 0.5 and 1. If the "match" relationship holds between $C_i^{(p)}$ and $C_j^{(r)}$, they are considered to be the same cluster but possibly labeled differently in $\mathcal{P}^{(p)}$ and $\mathcal{P}^{(r)}$. For detailed definitions of all topological relationships, see [42].

Suppose for the $k$th cluster in the reference clustering result, there is a collection of matched clusters $S_i$, $i = 1, \dots, m$, each a subset of the whole dataset $\{x_1, \dots, x_n\}$. Then the covering point set (CPS) $S_\alpha$ of cluster $k$ at a coverage level $\alpha$ is defined as the smallest set such that at least $100(1 - \alpha)\%$ of the $S_i$'s are subsets of $S_\alpha$, that is, the solution of the optimization problem $\min_S |S|$, s.t. $\sum_{i=1}^{m} \mathbb{1}(S_i \subseteq S) \ge m(1 - \alpha)$ (we use the Least Impact First Targeted-removal algorithm developed in [42]). In summary, CPS, a counterpart of the confidence interval of a numerical estimation, is a set of possible points for one cluster at a certain level of coverage.
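For intuition, the CPS optimization can be solved exactly on tiny inputs by brute force: any feasible set must contain the union of at least ⌈m(1 − α)⌉ matched sets, so it suffices to search over unions of exactly that many sets. The sketch below does this with exhaustive enumeration. It is not the Least Impact First Targeted-removal algorithm of [42], it scales exponentially in m, and the function name is hypothetical.

```python
import math
from itertools import combinations

def covering_point_set(matched_sets, alpha):
    """Smallest union of ceil(m * (1 - alpha)) of the matched sets (brute force)."""
    m = len(matched_sets)
    # Small tolerance guards against floating-point error in m * (1 - alpha).
    need = math.ceil(m * (1 - alpha) - 1e-12)
    best = None
    for chosen in combinations(matched_sets, need):
        union = set().union(*chosen)
        if best is None or len(union) < len(best):
            best = union
    return best

# Five matched versions of one cluster; the last one contains two stray points.
matched = [{1, 2, 3}, {1, 2, 3, 4}, {2, 3}, {1, 2, 3}, {1, 2, 3, 9, 10}]
cps = covering_point_set(matched, alpha=0.2)   # must cover at least 4 of the 5 sets
```

With α = 0.2, the set containing the stray points 9 and 10 is the one left uncovered, so the CPS is {1, 2, 3, 4}.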

Tightness

Suppose there are $m$ other partitions in total, and the proportion of partitions that have a cluster "matched" with the $k$th cluster in the reference partition is $p_k$ (e.g., some partitions can have "lack of correspondence" or other relationships for reference cluster $k$). For those partitions that contain a cluster matched to cluster $k$, let the matched clusters be the sets $S_i$, $i = 1, \dots, m_k$, with $m_k \le m$ and $p_k = m_k / m$. At the coverage level $\alpha$, let $S_\alpha$ be the CPS of cluster $k$. The tightness of cluster $k$ is defined as

$$R_t(k) = p_k \cdot \frac{\sum_{i=1}^{m_k} |S_i| / |S_\alpha|}{m_k}.$$

Also, the overall tightness of the whole partition, denoted by $\bar{R}_t$, is defined as the average over the tightness values of individual clusters. A larger value of tightness indicates more stable clustering.
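The tightness formula can be evaluated directly once the matched sets and the CPS of a cluster are available. The sketch below is a minimal illustration with hypothetical function names; returning 0 when no partition has a matched cluster is a convention chosen here (it agrees with p_k = 0).

```python
def tightness(matched_sets, cps, m):
    """R_t(k) = p_k * mean(|S_i| / |S_alpha|), with p_k = m_k / m."""
    m_k = len(matched_sets)
    if m_k == 0:
        return 0.0  # convention for this sketch: p_k = 0 gives zero tightness
    p_k = m_k / m
    mean_ratio = sum(len(s) / len(cps) for s in matched_sets) / m_k
    return p_k * mean_ratio

def overall_tightness(per_cluster_values):
    """Average of the per-cluster tightness values."""
    return sum(per_cluster_values) / len(per_cluster_values)

# m = 5 perturbed partitions, of which m_k = 4 contain a cluster matched to cluster k.
matched = [{1, 2, 3}, {1, 2, 3, 4}, {1, 2, 3}, {1, 2, 3, 4}]
cps = {1, 2, 3, 4}
r_k = tightness(matched, cps, m=5)       # 0.8 * mean(0.75, 1, 0.75, 1)
r_bar = overall_tightness([r_k, 1.0])    # averaged with a perfectly stable second cluster
```

A cluster reproduced identically in every perturbed partition attains tightness 1, while missing matches (smaller p_k) or matched sets much smaller than the CPS pull the value toward 0.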

Cluster Alignment and Points based (CAP) separability

We first compute the CPS of each cluster in the reference partition, denoted by $S_\alpha(C_j^{(r)})$. Large overlap between the CPSs of different clusters indicates poor separation between them. The Cluster Alignment and Points based (CAP) separability between two clusters $C_j^{(r)}$ and $C_{j'}^{(r)}$ is defined as

$$\delta_{\mathrm{cap}}(C_j^{(r)}, C_{j'}^{(r)}) = d(S_\alpha(C_j^{(r)}), S_\alpha(C_{j'}^{(r)})),$$

where $d(\cdot, \cdot)$ can be any distance between two sets of points. We use the Jaccard distance, which lies in $[0, 1]$.

Module 2: Form Cartesian product clusters

Suppose a collection of aligned clustering results has been obtained in each of the two views. Let the number of clusters in the first view be $\kappa_A$ and that in the second view be $\kappa_B$. Denote the reference partition for the first view by $\mathcal{A}$ and the collection of $m$ clustering results (obtained from the perturbed data) by $\mathbb{A} = \{\mathcal{A}^{(1)}, \dots, \mathcal{A}^{(m)}\}$, where $\mathcal{A}^{(p)}$ is the $p$th clustering result and $p = 1, \dots, m$. For brevity of notation, we record each clustering result by the cluster labels for the $n$ sample points: $\mathcal{A}^{(p)} = (a_1^{(p)}, \dots, a_n^{(p)})$, where $a_h^{(p)} \in \{1, \dots, \kappa_A\}$, $h = 1, \dots, n$. Similarly, for the second view, let the reference partition be $\mathcal{B}$ and the collection of clustering results be $\mathbb{B} = \{\mathcal{B}^{(1)}, \dots, \mathcal{B}^{(m)}\}$, where $\mathcal{B}^{(p)} = (b_1^{(p)}, \dots, b_n^{(p)})$, $p = 1, \dots, m$, and $b_h^{(p)} \in \{1, \dots, \kappa_B\}$, $h = 1, \dots, n$. After cluster alignment with the reference partition $\mathcal{A}$ (or $\mathcal{B}$), described in the previous subsection, the cluster labels in any $\mathcal{A}^{(p)}$ (or $\mathcal{B}^{(p)}$) are consistent with those used in $\mathcal{A}$ (or $\mathcal{B}$). In the subsequent discussion, we assume this is always the case.

Consider a random pair of clustering results in the two views: $\mathcal{A}^{(p_a)}$ and $\mathcal{B}^{(p_b)}$. For every point $h$, the pair of cluster labels $(a_h^{(p_a)}, b_h^{(p_b)})$ determines the Cartesian product cluster to which the point belongs. The total number of Cartesian product clusters is $\kappa_A \times \kappa_B$. We denote the Cartesian product clustering result of these two clustering results by

$$(\mathcal{A}^{(p_a)}, \mathcal{B}^{(p_b)}) = ((a_1^{(p_a)}, b_1^{(p_b)}), \dots, (a_n^{(p_a)}, b_n^{(p_b)})).$$

We simply call $(\mathcal{A}^{(p_a)}, \mathcal{B}^{(p_b)})$ a product partition and $(a_h^{(p_a)}, b_h^{(p_b)})$ a product label. Let the Cartesian product of the two sets $\mathbb{A}$ and $\mathbb{B}$ be

$$\mathbb{A} \times \mathbb{B} = \{(\mathcal{A}^{(p_a)}, \mathcal{B}^{(p_b)}),\; p_a = 1, \dots, m;\; p_b = 1, \dots, m\}.$$

To reduce computation, our algorithm uses a subset of $\mathbb{A} \times \mathbb{B}$ to carry out the analysis: $\mathbb{C} = \{(\mathcal{A}^{(p)}, \mathcal{B}^{(p)}),\; p = 1, \dots, m\}$. Since clustering results in different views are obtained independently, $\mathbb{C}$ is formed essentially by randomly pairing up the partitions across the two views and keeping $m$ pairs.
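Forming product partitions is a point-wise pairing of labels. The sketch below illustrates it on toy labels, keeping only the m index-matched pairs rather than all m × m combinations; the function name and the toy label vectors are hypothetical.

```python
def product_partition(labels_a, labels_b):
    """Pair the two views' cluster labels point by point."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both views must label the same n points")
    return list(zip(labels_a, labels_b))

# m = 2 aligned partitions per view, n = 5 points; kappa_A = 2, kappa_B = 3.
view_a = [[1, 1, 2, 2, 2], [1, 1, 1, 2, 2]]
view_b = [[1, 2, 2, 3, 3], [1, 2, 2, 2, 3]]

# Keep only the m pairs (A^(p), B^(p)), p = 1..m, instead of all m * m pairs.
paired = [product_partition(a, b) for a, b in zip(view_a, view_b)]

# Product labels actually observed in the first product partition.
observed = sorted(set(paired[0]))
```

At most κA × κB = 6 product labels are possible here, but only four are observed in the first product partition, which illustrates why some product clusters can be empty.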

Module 3: Integration across multiple views

Optimization objective of multi-view clustering

If we assume the clustering results in the two views are fully complementary, the product clusters induced by $(\mathcal{A}, \mathcal{B})$ (the product partition of reference clustering results in the two views) can be taken as the final clusters, an example shown in Fig 2c. In practice, however, the views are usually not fully complementary. Moreover, the number of clusters in the product partition is often too large (roughly in the exponential order of the number of views). Due to randomness in data and nuances in the clustering algorithms, an observed product cluster, e.g., all the points with product label (1, 2), may not truly exist. We thus propose to merge the product clusters such that the overall tightness of the final clusters is maximized. The perturbed versions of $(\mathcal{A}, \mathcal{B})$, specifically, $(\mathcal{A}^{(p)}, \mathcal{B}^{(p)})$, provide the basis for computing the tightness of product clusters. Let the product clusters generated by $(\mathcal{A}, \mathcal{B})$ be denoted by labels $(\xi_A, \xi_B)$, $\xi_A \in \{1, \dots, \kappa_A\}$, $\xi_B \in \{1, \dots, \kappa_B\}$. Suppose the desired number of clusters is $\kappa_1$. How the product clusters are merged into $\kappa_1$ clusters is given by a many-to-one mapping from a product label to a label in the set $\{1, 2, \dots, \kappa_1\}$. Denote the mapping from the product clusters to the final clusters by $f$, where $f(\xi_A, \xi_B) \in \{1, \dots, \kappa_1\}$. Denote the tightness of the $k$th cluster by $R_t(k)$, $k = 1, \dots, \kappa_1$. The $k$th cluster is formed by the points in all the product clusters $(\xi_A, \xi_B)$ such that $f(\xi_A, \xi_B) = k$. Then the optimization objective is:

$$\arg\max_{f} \sum_{k=1}^{\kappa_1} R_t(k). \tag{2}$$

The above optimization problem is intrinsically combinatorial. We thus propose a greedy algorithm that exploits a two-stage merging procedure. The first stage is optional and aims at improving computational efficiency. If the number of clusters in the product partition is small to begin with (e.g., fewer than 100), we can skip the first-stage merging and thus bipartite clustering.

Fig 2. First-stage merging of Cartesian product clusters based on bipartite clustering.


(a) Bipartite clustering yields super-clusters, each containing multiple clusters in every view. A super-cluster is marked by a given color, and the same super color is shown by different shapes in the two views. Any super-cluster of interaction effects will be treated as a merged product cluster in later analysis. (b) The off-diagonal white blocks correspond to unmatched product (UP) super-clusters. The diagonal colored blocks correspond to matched product (MP) super-clusters. (c) A simple case that the true clusters are the product clusters from two views. The information from the two views is fully complementary.

First-stage: Generating and aligning super-clusters across views

In the first-stage merging, we use bipartite clustering to generate the so-called super-clusters. Correspondence between clusters in different views can happen at a structural level higher than the original clusters. For instance, a cluster may split into several clusters in another view, or vice versa, multiple clusters may merge into one. Bipartite clustering aims at finding groups of clusters (aka, super-clusters) for which cross-view correspondence is sharp. With details to be explained shortly, super-clusters help decrease the number of Cartesian product clusters that proceed to the second-stage merging, thus improving computational efficiency. For large product clusters containing high proportions of data points, we determine how they aggregate mostly in the second-stage merging, while smaller product clusters are more likely to be combined based on super-clusters. Note that we do not replace the original clusters by super-clusters. The restrictive usage of super-clusters reflects a careful balance of applying the consensus and complementary principles.

We build a bipartite graph [43, 44] for the clusters in the reference partitions $\mathcal{A}$ and $\mathcal{B}$ under the two views respectively. Let the nodes in set $U$ correspond one-to-one with the clusters in $\mathcal{A}$ and the nodes in $V$ correspond with those in $\mathcal{B}$. Edges exist only between a node in $U$ and a node in $V$. Recall that there are $\kappa_A$ clusters $\phi_1, \dots, \phi_{\kappa_A}$ in the first view, and $\kappa_B$ clusters $\psi_1, \dots, \psi_{\kappa_B}$ in the second view. A cluster aligning matrix $W$ (a $\kappa_A \times \kappa_B$ matrix) is computed to indicate the extent of matching between any $\phi_i$, $i = 1, \dots, \kappa_A$, and $\psi_j$, $j = 1, \dots, \kappa_B$. We compute $W$ using OT in the same way as that described in Section "Module 1: Generate coherent cluster labels within each view". For each $(\mathcal{A}^{(p)}, \mathcal{B}^{(p)}) \in \mathbb{C}$, let the cluster aligning matrix between $\mathcal{A}^{(p)}$ and $\mathcal{B}^{(p)}$ be $W^{(p)}$, $p = 1, \dots, m$. We define the average of the $W^{(p)}$'s, $\bar{W} = \sum_{p=1}^{m} W^{(p)} / m$, as the matching weight matrix. Write $\bar{W} = (\bar{w}_{i,j})_{i=1,\dots,\kappa_A;\; j=1,\dots,\kappa_B}$; the entry $\bar{w}_{i,j}$ is taken as the edge weight between nodes $\phi_i$ and $\psi_j$, a larger $\bar{w}_{i,j}$ indicating a stronger connection between $\phi_i$ and $\psi_j$. Then we use the Leiden algorithm [45, 46] for bipartite clustering. Each cluster generated in this way is what we call a super-cluster, containing in general multiple clusters in both views.
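The construction of the weighted bipartite graph can be sketched as an element-wise average of the cluster aligning matrices followed by an edge list. The community detection step (Leiden) is not run here, the aligning matrices are toy values rather than OT solutions, and the function names and the `min_weight` filter are hypothetical.

```python
def average_matrices(matrices):
    """Element-wise mean of m equally shaped matrices (lists of rows)."""
    m = len(matrices)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(W[i][j] for W in matrices) / m for j in range(cols)]
            for i in range(rows)]

def bipartite_edges(W_bar, min_weight=0.0):
    """Edges (phi_i, psi_j, weight) for every entry above min_weight."""
    return [(f"phi{i + 1}", f"psi{j + 1}", w)
            for i, row in enumerate(W_bar)
            for j, w in enumerate(row)
            if w > min_weight]

# Toy cluster aligning matrices W^(p) for m = 2 paired partitions (kappa_A = kappa_B = 2).
W_list = [
    [[0.5, 0.0], [0.0, 0.5]],
    [[0.25, 0.25], [0.0, 0.5]],
]
W_bar = average_matrices(W_list)   # matching weight matrix
edges = bipartite_edges(W_bar)     # input for a bipartite community detection method
```

The resulting weighted edge list is the kind of input a graph clustering routine would take to produce super-clusters.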

Next, we illustrate the first-stage merging with an example shown in Fig 2. Suppose 4 super-clusters S1, …, S4 are formed, each containing multiple clusters in every view. For instance, suppose ϕ1, ϕ2, ψ1, and ψ2 belong to S1 (two from each view). Product clusters formed by two cluster labels belonging to the same super-cluster—those shown in the colored diagonal blocks in Fig 2b—are kept for further analysis in the second-stage merging, e.g., (ϕ1, ψ1), (ϕ1, ψ2). A product cluster will lie in an off-diagonal white block if the two cluster labels belong to different super-clusters, e.g., (ϕ1, ψ3), (ϕ2, ψ3). We call (Si, Sj), ij, an unmatched product (UP) super-cluster, and (Si, Si) a matched product (MP) super-cluster.

In a nutshell, we will analyze the product clusters belonging to a UP super-cluster at the granularity of the super-cluster but those belonging to an MP super-cluster at the granularity of the original product clusters. Specifically, we merge all the product clusters in any UP super-cluster, so that they become a single cluster in the second-stage. Because of the nature of bipartite clustering, small product clusters tend to locate in UP super-clusters. For example, in Fig 2a, only the following clusters proceed to the second-stage: all (Si, Sj), i ≠ j, and all (ϕm, ψn) with ϕm and ψn belonging to the same Si. Since the product clusters (ϕi, ψj)’s capture the interaction effects between the two views, in our approach, the interaction effect between clusters in the same super-cluster will be examined at a more refined granularity.
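The first-stage rule can be illustrated with a short sketch that builds a label mapping for all product clusters. The function `first_stage_mapping` and its inputs `super_a` and `super_b` (the super-cluster index of each single-view cluster, for example as returned by a bipartite clustering step) are hypothetical names introduced for illustration; empty product clusters are not treated specially here.

```python
def first_stage_mapping(super_a, super_b):
    """Map each product cluster (i, j) to a first-stage label.

    super_a[i] is the super-cluster index of cluster i in view 1,
    super_b[j] is the super-cluster index of cluster j in view 2.
    """
    mapping = {}
    up_label = {}      # one shared label per unmatched (UP) super-cluster pair
    next_label = 0
    for i, si in enumerate(super_a):
        for j, sj in enumerate(super_b):
            if si == sj:
                # MP super-cluster: the product cluster keeps its own label.
                mapping[(i, j)] = next_label
                next_label += 1
            else:
                # UP super-cluster: all its product clusters share one label.
                if (si, sj) not in up_label:
                    up_label[(si, sj)] = next_label
                    next_label += 1
                mapping[(i, j)] = up_label[(si, sj)]
    return mapping

# Toy example: 3 clusters in view 1, 4 clusters in view 2, 2 super-clusters.
g0_example = first_stage_mapping([0, 0, 1], [0, 0, 1, 1])
n_first_stage = len(set(g0_example.values()))
```

In this toy case the 12 product clusters reduce to 8 first-stage clusters: 4 + 2 product clusters inside the two MP super-clusters, plus one merged cluster for each of the two UP super-clusters.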

In practice, it is possible that some product clusters are empty. Obviously, empty clusters will not feature in later analysis. Furthermore, we often observe clusters that hardly arise, which we call “rare clusters”. In particular, suppose there are m clustering results. If a product cluster label is not taken by sample points at least m times across the m results (less than one time per result on average), we say it is “rare”. Points assigned with a rare cluster label are re-labeled by a majority vote. For any such point, we find its most frequent cluster label among the m results and assign this label to this point in all the results.
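The rare-cluster rule can be sketched as below. This is one possible reading of the description, under two stated assumptions: labels are already aligned across the m results, and the majority vote for a point is taken among its non-rare labels only (a point whose labels are all rare is left unchanged). The function name `relabel_rare` is hypothetical.

```python
from collections import Counter

def relabel_rare(results):
    """results: list of m label lists, one per clustering, same point order."""
    m = len(results)
    totals = Counter(lab for res in results for lab in res)
    # A label is "rare" if it is used fewer than m times across the m results.
    rare = {lab for lab, count in totals.items() if count < m}
    out = [list(res) for res in results]
    for x in range(len(results[0])):
        labels_x = [res[x] for res in results]
        if not any(lab in rare for lab in labels_x):
            continue
        # Assumption: vote among the non-rare labels this point received.
        votes = Counter(lab for lab in labels_x if lab not in rare)
        if votes:
            winner = votes.most_common(1)[0][0]
            for res in out:
                res[x] = winner   # same label in all m results
    return out

# Three results on four points; label 2 appears once in total, so it is rare.
cleaned = relabel_rare([[0, 0, 1, 1], [0, 0, 1, 1], [0, 2, 1, 1]])
```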

Second-stage: Separability-based merging to maximize tightness

Recall that the reference partitions in the two views are A and B, containing κA and κB clusters respectively. In addition, the two views have multiple aligned clustering results, randomly paired up to produce m Cartesian product clustering results: C = {(A(p), B(p)), p = 1, …, m}. For the convenience of the following discussion, suppose the first-stage merging has generated κ0 clusters assigned with labels 1, …, κ0. Let the mapping from (ξA, ξB), ξA ∈ {1, …, κA}, ξB ∈ {1, …, κB} to those κ0 labels be g0(ξA, ξB) ∈ {1, …, κ0}. Note that if the first-stage merging is skipped, g0 is just the one-to-one identical mapping (otherwise, a many-to-one mapping). Denote the clustering result induced by g0 on (A, B) by C, which contains clusters C1, …, Cκ0. We call C the combined reference partition. To generate the combined partition based on any (A(p), B(p)), p = 1, …, m, we apply OT to align the Cartesian product clusters of (A(p), B(p)) with those of C, in the same way as described in Section “Module 1: Generate coherent cluster labels within each view”. Denote the pth combined partition obtained from (A(p), B(p)) by C(p), and let C̃ = {C(p), p = 1, …, m}.

To solve optimization problem Eq (2), clusters in C are merged based on a quantity called Cluster Alignment and Points based (CAP) separability [42]. A higher separability corresponds to a lower similarity. These tools are collectively called CPS (Covering Point Set) analysis. To conduct CPS analysis, we only need cluster membership information for a collection of clustering results. In our case, the reference partition is C and the collection of clustering results obtained from perturbed datasets is C˜. Since it is computationally infeasible to examine all the possible ways of merging the clusters into κ1 final clusters, we propose to recursively merge clusters, two at a time, in the same manner as creating a dendrogram.

CPS merging

We use the pair-wise separability measure between clusters, provided by CPS analysis, as the cluster distance. We also compute the tightness of every cluster, a higher value of tightness indicating higher stability (or lower uncertainty). Suppose Ci is the most unstable cluster, that is, the one with the lowest tightness. Ci usually yields low separability from many other clusters, but the lowest pair-wise separability overall does not necessarily involve Ci. To increase the overall tightness, we first merge Ci with the cluster closest to it, that is, the one from which it has the lowest separability. After every merge, the per-cluster tightness and pair-wise separability measures are updated. The merging continues recursively, producing a dendrogram. In our experiments, we stop the process when the required number of clusters is reached, at which point the average tightness usually exceeds 0.8. The computation required is more intensive than the usual way of generating dendrograms using a linkage scheme because we cannot update the separability or tightness recursively based on a linkage function. These quantities are computed from scratch after every merge. We thus have designed an accelerated version of the merging process, which is presented below.
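The control flow of this merging loop can be sketched as follows. The sketch is illustrative only: `cps_merge` is a hypothetical name, and the tightness and separability measures are passed in as callables because computing them requires CPS analysis on the perturbed clustering results, which is not reproduced here. The two `toy_*` functions are stand-ins used solely to make the example runnable; they are not the CPS quantities.

```python
def cps_merge(clusters, tightness, separability, n_final):
    """Greedy merging: repeatedly merge the least tight cluster with the
    cluster from which it has the lowest separability.

    clusters:      dict label -> set of point ids
    tightness:     callable, clusters -> dict label -> float
    separability:  callable, clusters -> dict (a, b) -> float with a < b
    """
    clusters = {k: set(v) for k, v in clusters.items()}
    history = []
    while len(clusters) > max(n_final, 1):
        tight = tightness(clusters)        # recomputed from scratch
        sep = separability(clusters)       # recomputed from scratch
        worst = min(tight, key=tight.get)  # most unstable cluster
        others = [k for k in clusters if k != worst]
        nearest = min(others, key=lambda k: sep[(min(k, worst), max(k, worst))])
        clusters[nearest] |= clusters.pop(worst)
        history.append((worst, nearest))   # one dendrogram step
    return clusters, history

# Stand-in measures for demonstration only (NOT the CPS definitions).
def toy_tightness(clusters):
    total = sum(len(c) for c in clusters.values())
    return {k: len(c) / total for k, c in clusters.items()}

def toy_separability(clusters):
    mean = {k: sum(c) / len(c) for k, c in clusters.items()}
    keys = sorted(clusters)
    return {(a, b): abs(mean[a] - mean[b]) for a in keys for b in keys if a < b}

start = {0: {0, 1}, 1: {2}, 2: {10, 11, 12}}
merged, steps = cps_merge(start, toy_tightness, toy_separability, n_final=2)
```

Because both measures are recomputed inside the loop, the cost per merge is dominated by those two calls, which is the overhead the accelerated scheme is meant to reduce.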

Accelerated CPS merging

At each step of merging, we set a threshold for the tightness of clusters. Any cluster with tightness below the threshold will be merged with its closest cluster. This rule essentially allows multiple merges to occur in one round without updating separability or tightness. After all such clusters are processed, we update tightness and separability measures. If some of the updated tightness measures still fall below the same threshold, we repeat the procedure, sometimes going through several rounds under a fixed threshold. If the tightness of every cluster is above the current threshold, merging can also continue if we gradually increase the threshold. In practice, the thresholds are usually set as 0.35, 0.5, 0.65, 0.8. We can apply other stopping criteria, for instance, each cluster reaching a minimum size, or a certain number of clusters having been reached (excluding singletons or tiny clusters). In most cases, the result converges at threshold 0.8 or meets another stopping criterion, e.g., reaching a required total number of clusters. If computational efficiency is a concern, this accelerated merging scheme is a close substitute to the first scheme. Users also have the option to combine the two schemes, for instance, applying the second scheme first to reduce the number of clusters to a certain level and then switching to the first scheme.
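A sketch of the accelerated scheme is given below, again with the tightness and separability measures supplied as callables and with toy stand-ins that are not the CPS quantities. The name `accelerated_merge`, the `n_min` stopping parameter, and the choice to skip a cluster that has already absorbed another cluster in the current round are assumptions made for this illustration.

```python
def accelerated_merge(clusters, tightness, separability,
                      thresholds=(0.35, 0.5, 0.65, 0.8), n_min=1):
    """Merge every cluster whose tightness is below the current threshold,
    updating the measures only once per round."""
    clusters = {k: set(v) for k, v in clusters.items()}
    for thr in thresholds:
        while len(clusters) > n_min:
            tight = tightness(clusters)    # updated once per round
            sep = separability(clusters)
            low = [k for k in sorted(clusters) if tight[k] < thr]
            if not low:
                break                      # all clusters pass: raise threshold
            touched = set()                # clusters enlarged in this round
            for k in low:
                if k in touched or k not in clusters or len(clusters) <= n_min:
                    continue
                others = [o for o in clusters if o != k]
                nearest = min(others, key=lambda o: sep[(min(o, k), max(o, k))])
                clusters[nearest] |= clusters.pop(k)
                touched.add(nearest)
    return clusters

# Stand-in measures for demonstration only (NOT the CPS definitions).
def toy_tightness(clusters):
    total = sum(len(c) for c in clusters.values())
    return {k: len(c) / total for k, c in clusters.items()}

def toy_separability(clusters):
    mean = {k: sum(c) / len(c) for k, c in clusters.items()}
    keys = sorted(clusters)
    return {(a, b): abs(mean[a] - mean[b]) for a in keys for b in keys if a < b}

start = {0: {0, 1}, 1: {2}, 2: {10, 11, 12}, 3: {13}}
result = accelerated_merge(start, toy_tightness, toy_separability,
                           thresholds=(0.2, 0.4))
```

In the toy run, two clusters fall below the first threshold and are merged in a single round using one evaluation of the measures, whereas the one-at-a-time scheme would need two.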

After the second-stage merging, we obtain a many-to-one mapping of cluster labels from {1, …, κ0} to {1, …, κ1}, κ1 ≤ κ0, which is denoted by g1. Applying g1 to C, the combined reference partition, we obtain the final clustering result, denoted by F, which contains κ1 clusters F1, …, Fκ1. In summary, the composite mapping f = g1 ∘ g0 is a solution to the optimization problem Eq (2). Next, we present an approach to quantify the contribution of each view to the formation of any cluster. Understanding the role of each view in the generation of clusters is helpful in single-cell studies.

Evaluating cluster-wise contribution of each view

Clusters in the final result F usually do not correspond well with clusters in any single view, e.g., A or B. We propose two methods to assess the contribution of each view to the existence of any cluster in F. The two methods are suitable for different scenarios.

In the first scenario, we assume that final clusters Fk, k = 1, …, κ1, have been approximately captured in a single view, which generally varies with the cluster. We treat F as the reference partition and the raw partitions $\{\tilde{A}^{(1)}, \ldots, \tilde{A}^{(m)}\}$ (or $\{\tilde{B}^{(1)}, \ldots, \tilde{B}^{(m)}\}$) that have not been aligned with the single-view reference partition from the first view (or the second) as perturbed clustering results of F. We can then carry out CPS analysis and compute the tightness for each cluster Fk. Let the tightness for Fk computed from the results in the lth view be Rt(k, l), l = 1, 2. Extension to more than two views is straightforward. Then we define the contribution of the lth view to cluster Fk by

$$\zeta_{k,l} = \frac{R_t(k,l)}{\sum_{l'} R_t(k,l')}.$$

If ∑l′ Rt(k, l′) = 0, let ζk,l = 0.5. We call ζk,l tightness-based contribution. The rationale for the definition of ζk,l is that if Fk is stable in one view but not in another, the former view plays the dominant role in the rise of Fk. Apparently, the defined cluster-wise contribution of each view is between 0 and 1, a higher score indicating a higher contribution.
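As a small worked example of the tightness-based contribution, the sketch below normalizes per-view tightness values for each final cluster. The function name and the tightness values are invented for illustration; the 0.5 fallback is the two-view convention stated in the text.

```python
def tightness_contribution(rt):
    """rt[k][l] is the tightness of final cluster k computed from view l.
    Returns zeta[k][l], the share of view l in the total tightness of cluster k."""
    zeta = []
    for row in rt:
        total = sum(row)
        if total == 0:
            zeta.append([0.5] * len(row))   # convention for two views
        else:
            zeta.append([r / total for r in row])
    return zeta

# Two final clusters, two views; tightness values are made up.
zeta = tightness_contribution([[0.9, 0.3], [0.0, 0.0]])
```

Here the first cluster is attributed mostly to the first view (0.9 out of a total of 1.2), while the second cluster, with zero tightness in both views, receives the neutral value 0.5 for each view.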

The definition of contribution presented above is based on the notion that the degree of uncertainty reflects the level of significance or contribution. This same concept has been utilized by existing methods that use a local weighting strategy, as seen in [47, 48]. However, these methods presuppose that different partitions are independently generated. In our scenario, since different partitions within a single view are obtained by the same method on slightly perturbed data, the independence assumption is not appropriate, making those methods unsuitable for direct application.

We note, however, ζk,l is not a good choice to quantify the contribution of each view when Rt(k, l)’s across all the views are low. In such a case, no cluster in any single view corresponds reasonably well with Fk. For instance, cluster Fk is identified due to interaction effects. It is thus questionable to compare the contribution of views based on stability or tightness measures. We will use a different measure described below.

CPS analysis applied to the reference partition F and the partitions from the lth view, e.g., $\{\tilde{A}^{(1)}, \ldots, \tilde{A}^{(m)}\}$ from the first view, provides us with the cluster aligning matrices Wl(p), p = 1, …, m, l = 1, 2. Wl(p) is a matrix of size κp,l × κ1, where κp,l is the number of clusters in the pth partition from the lth view. Then we calculate the cluster aligning vector $V_l^{(p)} = \mathbf{1}_{\kappa_{p,l}}^{T} W_l^{(p)} / \kappa_{p,l}$, where $\mathbf{1}_{\kappa_{p,l}}$ is the all-ones vector of length κp,l, and the matching weight vector $\bar{V}_l = \sum_{p=1}^{m} V_l^{(p)}/m$. Let the kth element of $\bar{V}_l$ be vk,l, k = 1, …, κ1, which is the average matching weight of cluster Fk under the lth view. Similar to ζk,l, we define the contribution of the lth view to Fk by

$$\eta_{k,l} = \frac{v_{k,l}}{\sum_{l'} v_{k,l'}}.$$

If ∑l′ vk,l′ = 0, let ηk,l = 0.5. We call ηk,l matching-weight-based contribution. The rationale for the definition of ηk,l is that if Fk receives a larger weight in one view compared to another, we assume that the former view is more important for the existence of Fk. In our experiments, we use ηk,l instead of ζk,l if more than 40% of the clusters in the reference partition have tightness 0.
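The matching-weight-based contribution can be traced through in code as follows. This is a sketch under the definitions above: the cluster aligning vector is taken as the column means of each aligning matrix, these are averaged over the m partitions, and the result is normalized across views. Function names and matrix values are hypothetical.

```python
def matching_weight_vector(w_list):
    """Average over p of the column means of each aligning matrix W_l^(p).
    Each matrix has kappa_{p,l} rows (may vary with p) and kappa_1 columns."""
    m = len(w_list)
    k1 = len(w_list[0][0])
    v_bar = [0.0] * k1
    for W in w_list:
        n_rows = len(W)
        for k in range(k1):
            v_bar[k] += sum(row[k] for row in W) / n_rows / m
    return v_bar

def matching_weight_contribution(v_by_view):
    """v_by_view[l][k] -> eta[k][l]; uses 0.5 when the weights sum to zero."""
    n_views, k1 = len(v_by_view), len(v_by_view[0])
    eta = []
    for k in range(k1):
        total = sum(v_by_view[l][k] for l in range(n_views))
        if total == 0:
            eta.append([0.5] * n_views)
        else:
            eta.append([v_by_view[l][k] / total for l in range(n_views)])
    return eta

# One partition per view (m = 1), two final clusters; values are made up.
v_view1 = matching_weight_vector([[[1.0, 0.0], [0.0, 1.0]]])
v_view2 = matching_weight_vector([[[0.5, 0.0], [0.5, 0.0]]])
eta = matching_weight_contribution([v_view1, v_view2])
```

In this toy case the second final cluster receives weight only in the first view, so its contribution is attributed entirely to that view.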

Results

In this section, we present the experimental results of CPS-merge analysis and its accelerated version (referred to as A-CPS-merge) on three multimodal scRNA-seq datasets. Table 1 summarizes basic information about the three datasets. Details on how the datasets are pre-processed are provided in their respective sub-sections. We also conducted extensive simulation studies with results provided in Table A and Table B in S1 Appendix.

Table 1. Summary of the three multi-view datasets after pre-processing.

| Dataset | # Instances | Dimensions (View 1, View 2) | # Clusters |
|---|---|---|---|
| HBMC | 30672 | (50, 24) | 27 |
| PBMC1 | 10032 | (50, 49) | 14 |
| PBMC2 | 161764 | (50, 50) | 31 |

We compare results with the following popular methods for multi-view clustering.

  1. Co-training clustering (Co-train): The EM algorithm for mixture of categorical data is used (implemented in R package mvc [49]).

  2. Multiple kernel clustering (MKC): This method is based on localized multiple kernel k-means (implemented in R package klic [50]).

  3. Multiple subspace clustering (MSC): Two-level weighted subspace clustering (implemented in R package wskm [51]) is used.

  4. Ensemble clustering: This method is based on a hybrid bipartite graph formulation [52] (implemented in GitHub repository ClusterEnsembles [53]). We input the entire collection of clustering results used in CPS-merge analysis to the ensemble algorithm so that the comparison with CPS-merge is fair. We also performed ensemble clustering using the default input of the software and obtained similar results.

  5. Deep co-clustering (DeepCC): DeepCC [54] utilizes a deep autoencoder to learn a low-dimensional representation of the multi-view data, and employs a variant of Gaussian Mixture Model (GMM) for clustering (implemented in GitHub repository Deep-Co-Clustering [55]).

  6. Weighted-nearest neighbor (WNN): WNN is developed for multimodal single-cell clustering [5] (implemented in R package Seurat [56]). Briefly speaking, WNN generates weights for every modality based on within-modality prediction and cross-modality prediction of each cell and uses them to create a k-nearest neighbor (KNN) graph, based on which clustering is performed.

  7. Concatenation cluster analysis (CCA): We concatenated features from all the views. Then function FindClusters in the R package Seurat is applied to cluster the concatenated feature vectors.

We measure clustering performance by three metrics: the adjusted Rand index (ARI) [57], normalized mutual information (NMI) [58] and F-measure [59]. ARI measures pair-wise agreement between two clustering results, corrected for chance. NMI measures the amount of information shared by two clustering results. F-measure is the harmonic mean of precision and recall, assuming the ground truth is provided. NMI and F-measure lie in [0, 1]; ARI is bounded above by 1 and can fall slightly below 0 when agreement is worse than expected by chance. For all three metrics, 1 indicates identical clustering. We use UMAP [60] to visualize the clustering result in each view.
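For readers who want to see how ARI is obtained from two label vectors, the following is a minimal pure-Python sketch based on the standard contingency-table formula. It is provided for illustration and is not the evaluation code used for the reported results; the function name and the toy label vectors are made up.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(truth, pred):
    """ARI computed from pair counts in the contingency table."""
    n = len(truth)
    if n < 2:
        return 1.0  # no pairs to compare
    sum_ij = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_a = sum(comb(c, 2) for c in Counter(truth).values())
    sum_b = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (sum_ij - expected) / (max_index - expected)

# Same partition with permuted label names, and a partially agreeing one.
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
ari_partial = adjusted_rand_index([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1])
```

The first call illustrates that ARI is invariant to how clusters are named, which is why no label matching is needed before evaluation.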

The hyperparameters in our method include α (CPS analysis coverage level), δ2 (the variance of Gaussian noise to generate perturbed data), κ1 (the number of final clusters), and m (the number of perturbed clustering results). The effects of α and δ2 have been studied in [40]. We suggest α = 0.1 and δ2 be set as 10% of the average within-cluster variance of the original data. We have also conducted a sensitivity analysis to study the performance under different values of κ1 and m. Results show that the performance of CPS-merge is stable when κ1 deviates from the true value. There is a general trend of better performance at a larger m until m reaches a certain level. Detailed results with discussion are provided in Table C and Fig D in S1 Appendix.
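The suggested choice of δ2 can be sketched as below. The text does not spell out how the average within-cluster variance is computed, so this sketch assumes one reading: the mean, over clusters and coordinates, of the per-coordinate population variance. The function names, the toy data, and the use of independent noise per coordinate are likewise assumptions.

```python
import random
from statistics import pvariance

def noise_variance(data, labels, fraction=0.1):
    """fraction x average within-cluster variance (assumed: mean of the
    per-coordinate population variances over all clusters)."""
    by_cluster = {}
    for x, lab in zip(data, labels):
        by_cluster.setdefault(lab, []).append(x)
    variances = []
    for pts in by_cluster.values():
        for d in range(len(pts[0])):
            variances.append(pvariance([p[d] for p in pts]))
    return fraction * sum(variances) / len(variances)

def perturb(data, delta2, rng):
    """Add independent Gaussian noise with variance delta2 to every coordinate."""
    sd = delta2 ** 0.5
    return [[v + rng.gauss(0.0, sd) for v in x] for x in data]

data = [[0.0, 0.0], [2.0, 0.0], [10.0, 4.0], [10.0, 8.0]]
labels = [0, 0, 1, 1]
delta2 = noise_variance(data, labels)          # 10% of 1.25
perturbed = perturb(data, delta2, random.Random(0))
```

Each of the m perturbed datasets would be produced by a separate call to `perturb` and then clustered with the same single-view algorithm.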

Multimodal single-cell data

We now analyze three gold-standard multimodal single-cell datasets to demonstrate the competitive performance of CPS-merge analysis and the quantification of the cluster-wise contribution of each view. In Table 2, the performance of CPS-merge on the three datasets is compared with the seven other algorithms listed previously. For the competing methods, default parameter settings in the algorithms are used. The algorithm MKC becomes computationally infeasible for HBMC and PBMC2 because of the quadratic (in sample size) complexity of computing the kernel matrix. We thus cannot report its performance. For single-view clustering (View 1, View 2 in Table 2), data are pre-processed to reduce dimension before being clustered. Details on the dimension reduction methods used in each view will be provided shortly when we discuss each dataset separately. The single-view data are then clustered using the FindClusters function in the R package Seurat.

Table 2. Clustering results on three multi-view datasets (HBMC, PBMC1 and PBMC2) obtained by nine methods (first nine columns).

Performance is measured by ARI, NMI and F-measure. Columns View 1 and View 2 are single-view clustering results on each dataset, where View 1 refers to RNA data in datasets and View 2 refers to ADT (protein) data in HBMC and PBMC2, and ATAC data in PBMC1. The highest ARI, NMI and F-measure achieved for each dataset are in bold.

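Each dataset has three rows: ARI, (NMI) in parentheses, and [F-measure] in brackets. MKC could not be run on HBMC and PBMC2 (n/a).

| Dataset | Metric | Co-train | MKC | MSC | Ensemble | DeepCC | WNN | CCA | CPS-merge | A-CPS-merge | View 1 | View 2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HBMC | ARI | 0.695 | n/a | 0.014 | 0.270 | 0.416 | 0.733 | 0.706 | 0.823 | 0.819 | 0.672 | 0.654 |
| HBMC | (NMI) | (0.774) | n/a | (0.041) | (0.565) | (0.457) | (0.812) | (0.812) | (0.815) | (0.815) | (0.768) | (0.758) |
| HBMC | [F-measure] | [0.723] | n/a | [0.084] | [0.303] | [0.482] | [0.756] | [0.732] | [0.841] | [0.838] | [0.707] | [0.681] |
| PBMC1 | ARI | 0.635 | 0.003 | 0.241 | 0.484 | 0.203 | 0.764 | 0.744 | 0.829 | 0.829 | 0.829 | 0.668 |
| PBMC1 | (NMI) | (0.748) | (0.025) | (0.438) | (0.650) | (0.309) | (0.804) | (0.812) | (0.839) | (0.839) | (0.839) | (0.738) |
| PBMC1 | [F-measure] | [0.673] | [0.096] | [0.320] | [0.532] | [0.289] | [0.795] | [0.776] | [0.850] | [0.850] | [0.850] | [0.707] |
| PBMC2 | ARI | 0.463 | n/a | 0.006 | 0.204 | 0.142 | 0.649 | 0.620 | 0.824 | 0.764 | 0.600 | 0.643 |
| PBMC2 | (NMI) | (0.692) | n/a | (0.011) | (0.484) | (0.310) | (0.791) | (0.761) | (0.808) | (0.764) | (0.740) | (0.764) |
| PBMC2 | [F-measure] | [0.499] | n/a | [0.071] | [0.241] | [0.209] | [0.678] | [0.651] | [0.846] | [0.796] | [0.634] | [0.675] |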

CITE-seq Human Bone Marrow Cells (HBMC)

This dataset [61] is generated by the CITE-seq technology [62]. CITE-seq can simultaneously quantify RNA and surface protein abundance at the single-cell level by sequencing antibody-derived tags (ADTs). Thus each individual cell is measured in two views: RNA and protein (ADT). Moreover, each view individually is inadequate to identify all the cell types [61]. The data consist of 30,672 human bone marrow cells (HBMC) of 27 different cell types.

For this example, we use one of the most popular single-cell clustering R packages Seurat for analyzing both views. Specifically, in each view, we follow the default Seurat clustering procedure. We first log normalize the RNA data and perform the centered log-ratio transformation for ADT. We then perform dimension reduction using PCA, keeping the first 50 components for RNA and the first 24 components for ADT. For both RNA and ADT, the number of components follows the default setting in Seurat. Lastly, we perform cluster analysis using Seurat default functions FindNeighbors and FindClusters. The argument called resolution in FindClusters controls the number of clusters obtained. In multimodal single-cell data analysis, we either use the default resolution or slightly adjust it so that the number of clusters in a single view is similar to the ground-truth number of clusters. The single-view clustering results are visualized in Fig 3b and 3e. When comparing with the true cell type labels, shown in Fig 3a and 3d, we see that neither view can precisely identify all the clusters.

Fig 3. UMAP visualization for HBMC data and the clustering results.

(a) True clusters on RNA. (b) Single-view clustering result on RNA. (c) CPS-merge analysis result on RNA. (d) True clusters on Protein (ADT). (e) Single-view clustering result on Protein (ADT). (f) CPS-merge analysis result on Protein (ADT).

Results by CPS-merge analysis are shown in Fig 3c and 3f. ARI is 0.823 for CPS-merge and 0.819 for A-CPS-merge. Compared with the two single-view results, the most obvious improvement is the identification of CD14 Mono cells. As noted above, MKC failed to run due to the large sample size. MSC, Ensemble clustering and DeepCC yield poor accuracy. Co-train and WNN achieve relatively high ARI, but lower than that of CPS-merge analysis. CCA performs slightly better than single-view clustering, but not as accurately as CPS-merge analysis in terms of ARI.

To evaluate the cluster-wise contribution of each view, we find that more than 40% of the clusters in the final result have tightness 0, suggesting that the matching-weight-based contribution is more appropriate here. As studied in [5, 6, 63], RNA is more informative for recognizing the progenitor populations (GMP, HSC, LMPP, Prog_B1, Prog_B2, Prog_DC, Prog_MK, Prog_RBC), while protein is more informative for distinguishing T cells (CD4 Memory, CD4 Naive, CD8 Effector_1, CD8 Effector_2, CD8 Memory_1, CD8 Memory_2, CD8 Naive, gdT, MAIT, Treg). By our analysis, the contribution of RNA to clusters corresponding to progenitor populations is on average 0.653, and the contribution of protein to clusters corresponding to T cells is on average 0.688. Therefore, our measures of the cluster-wise contribution of each view indicate that the RNA view plays a dominant role in separating progenitor populations while the protein view is more important for separating T cells. These findings are consistent with existing domain insights.

10x Multiome Human Peripheral Blood Mononuclear Cells (PBMC1)

This dataset is generated by the 10x Genomics Multiome (https://support.10xgenomics.com/single-cell-multiome-atac-gex/datasets) ATAC + RNA kit [39], which contains 10032 peripheral blood mononuclear cells (PBMC) of 14 different cell types (Fig 4). Each cell has measurements in two views: RNA and ATAC (Assay for Transposase-Accessible Chromatin). As described in [64], ATAC-seq has much lower coverage and worse signal-to-noise than RNA-seq. Therefore, RNA provides most of the information for the clusters to be revealed, while ATAC can be used as an ancillary view. Motivated by the prior information, we use RNA as the dominant view and do not perturb the RNA data.

Fig 4. UMAP visualization for PBMC1 data and the clustering results.

(a) True clusters on RNA. (b) Single-view clustering result on RNA. (c) CPS-merge analysis result on RNA. (d) True clusters on ATAC. (e) Single-view clustering result on ATAC. (f) CPS-merge analysis result on ATAC.

In each view, we follow the standard clustering procedure, which is slightly different between RNA and ATAC. For RNA data, we perform clustering as we have done with the HBMC data. For ATAC, we pre-process the data using R package Signac [65], which runs term frequency inverse document frequency (TF-IDF) normalization on the data and carries out dimension reduction by singular value decomposition (SVD) (we keep the 2nd to 50th principal components according to Seurat as the first component is typically correlated with the sequencing depth). Next, we use the default FindNeighbors and FindClusters functions in Seurat to cluster data.

The single-view clustering results are visualized in Fig 4b and 4e. The ARI of the single-view clustering result is 0.829 for RNA and 0.668 for ATAC. CPS-merge and A-CPS-merge yield the same ARI of 0.829. This result suggests that ATAC does not contribute extra information about the clusters in this dataset. As shown in Table 2, Co-train, MKC, MSC, Ensemble clustering, and DeepCC all perform worse than clustering in either single view. Moreover, the ARIs obtained by WNN and CCA are 0.764 and 0.744 respectively, worse than the result obtained in the RNA view. This comparison indicates that ATAC is not very useful for clarifying the true clusters.

CITE-seq Human Peripheral Blood Mononuclear Cells (PBMC2)

The last multimodal single-cell dataset is also from peripheral blood mononuclear cells (PBMC). It is generated by CITE-seq and provided in [5]. It consists of 161,764 cells with 31 different cell types. As with the previous dataset, there are two views, here RNA and protein (ADT), but both views contribute substantially to the identification of clusters. This dataset has already been pre-processed. We simply apply Seurat to perform clustering in each view. The single-view clustering results are shown in Fig 5b and 5e. CPS-merge analysis yields an ARI value of 0.823 (A-CPS-merge analysis achieves ARI 0.764). Again, MKC failed to run because of the large sample size, and the other methods do not yield substantially better results than those obtained in any single view.

Fig 5. UMAP visualization for PBMC2 data and the clustering results.

(a) True clusters on RNA. (b) Single-view clustering result on RNA. (c) CPS-merge analysis result on RNA. (d) True clusters on Protein (ADT). (e) Single-view clustering result on Protein (ADT). (f) CPS-merge analysis result on Protein (ADT).

As for evaluating the cluster-wise contribution of each view, we find again that more than 40% of the clusters in the final result have tightness 0. Thus the matching-weight-based contribution is more suitable to use. The contribution of the protein view to clusters corresponding to CD8+ and CD4+ T cells is on average 0.622, consistent with the fact that these two cell types are usually mixed in the transcriptome data but separated clearly in the protein data.

Discussion

In this paper, we have introduced CPS-merge analysis, a new method for multi-view data clustering that is guided by both the consensus and complementary principles. As a late integration method, CPS-merge only requires cluster labels obtained from single views, making it compatible with advanced clustering algorithms designed for single-view data. We have also proposed novel measures to quantify the contribution of each view to the formation of any cluster. These measures have been validated using real datasets and domain knowledge.

However, CPS-merge has limitations in two scenarios where additional information is required for accurate results. The first scenario is when the final partition is highly unstable (e.g., the average cluster tightness falls below 0.65). While such cases can be easily identified, caution is necessary when interpreting the results. The second scenario is when stable but incorrect clustering results are generated in certain single views. As our algorithm uses cluster stability to perform merging, it cannot address this issue. One potential remedy is to identify which views are ancillary a priori, allowing the algorithm to adjust accordingly.

As suggested by one reviewer, exploring online learning for multi-view clustering is a promising direction for future research. Since CPS-merge analysis only uses cluster memberships but not the original data, it can be employed in an incremental learning mode as long as the clustering algorithms used in individual views allow online learning. Numerous clustering algorithms can be easily adapted to online learning, for instance, by representing previous data using per-cluster statistics, e.g., mean vectors and covariance matrices. Based on these stored representations, new data batches can be clustered or assigned to new clusters without accessing past data. Neural networks can also assist with online clustering. For instance, deep autoencoders can encode the original data in lower dimensions, which are typically easier to cluster, particularly under an online learning paradigm. Additionally, neural networks are frequently trained in batch mode, making them naturally suited for online learning. One challenge to consider for biomedical data, such as single-cell data, is that various data batches often contain batch effects that must be eliminated. Current methods for removing batch effects typically require processing all data in one view together, preventing effective online learning. Albeit interesting, how to overcome this issue in online learning is beyond the scope of our method here.

Supporting information

S1 Appendix. This file contains the description of the simulation study and sensitivity analysis.

(PDF)

Data Availability

All relevant datasets are described within the manuscript and Supporting information. The code is available at https://github.com/LixiangZhang/CPS-merge.

Funding Statement

The research is supported by the National Science Foundation (NSF) under grant DMS-2013905. LZ and JL received summer salary from this NSF grant. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Stein-O’Brien GL, Arora R, Culhane AC, Favorov AV, Garmire LX, Greene CS, et al. Enter the matrix: factorization uncovers knowledge from omics. Trends in Genetics. 2018;34(10):790–805. doi: 10.1016/j.tig.2018.07.003 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Hu Z, Ahmed AA, Yau C. CIDER: an interpretable meta-clustering framework for single-cell RNA-seq data integration and evaluation. Genome Biology. 2021;22(1):1–21. doi: 10.1186/s13059-021-02561-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Gomes T, Teichmann SA, Talavera-López C. Immunology driven by large-scale single-cell sequencing. Trends in immunology. 2019;40(11):1011–1021. doi: 10.1016/j.it.2019.09.004 [DOI] [PubMed] [Google Scholar]
  • 4. Chen H, Ye F, Guo G. Revolutionizing immunology with single-cell RNA sequencing. Cellular & molecular immunology. 2019;16(3):242–249. doi: 10.1038/s41423-019-0214-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5. Hao Y, Hao S, Andersen-Nissen E, Mauck WM III, Zheng S, Butler A, et al. Integrated analysis of multimodal single-cell data. Cell. 2021;. doi: 10.1016/j.cell.2021.04.048 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Mereu E, Lafzi A, Moutinho C, Ziegenhain C, McCarthy DJ, Álvarez-Varela A, et al. Benchmarking single-cell RNA-sequencing protocols for cell atlas projects. Nature Biotechnology. 2020;38(6):747–755. doi: 10.1038/s41587-020-0469-4 [DOI] [PubMed] [Google Scholar]
  • 7. Zhu C, Preissl S, Ren B. Single-cell multimodal omics: the power of many. Nature methods. 2020;17(1):11–14. doi: 10.1038/s41592-019-0691-5 [DOI] [PubMed] [Google Scholar]
  • 8. Efremova M, Teichmann SA. Computational methods for single-cell omics across modalities. Nature methods. 2020;17(1):14–17. doi: 10.1038/s41592-019-0692-4 [DOI] [PubMed] [Google Scholar]
  • 9. Yan K, Fang X, Xu Y, Liu B. Protein fold recognition based on multi-view modeling. Bioinformatics. 2019;35(17):2982–2990. doi: 10.1093/bioinformatics/btz040 [DOI] [PubMed] [Google Scholar]
  • 10. Yu Y, Zhang LH, Zhang S. Simultaneous clustering of multiview biomedical data using manifold optimization. Bioinformatics. 2019;35(20):4029–4037. doi: 10.1093/bioinformatics/btz217 [DOI] [PubMed] [Google Scholar]
  • 11. Schmid B, Huisken J. Real-time multi-view deconvolution. Bioinformatics. 2015;31(20):3398–3400. doi: 10.1093/bioinformatics/btv387 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Demetci P, Santorella R, Sandstede B, Noble WS, Singh R. Gromov-Wasserstein optimal transport to align single-cell multi-omics data. BioRxiv. 2020;. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Hamid JS, Hu P, Roslin NM, Ling V, Greenwood CM, Beyene J. Data integration in genetics and genomics: methods and challenges. Human genomics and proteomics: HGP. 2009;2009. doi: 10.4061/2009/869093 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Rappoport N, Shamir R. Multi-omic and multi-view clustering algorithms: review and cancer benchmark. Nucleic acids research. 2018;46(20):10546–10562. doi: 10.1093/nar/gky889 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Xu C, Tao D, Xu C. A survey on multi-view learning. arXiv preprint arXiv:13045634. 2013;.
  • 16. Zhao J, Xie X, Xu X, Sun S. Multi-view learning overview: Recent progress and new challenges. Information Fusion. 2017;38:43–54. doi: 10.1016/j.inffus.2017.02.007 [DOI] [Google Scholar]
  • 17. Yang Y, Wang H. Multi-view clustering: A survey. Big Data Mining and Analytics. 2018;1(2):83–107. doi: 10.26599/BDMA.2018.9020003 [DOI] [Google Scholar]
  • 18.Chao G, Sun S, Bi J. A survey on multi-view clustering. arXiv preprint arXiv:171206246. 2017;.
  • 19.Fred A. Finding consistent clusters in data partitions. In: International Workshop on Multiple Classifier Systems. Springer; 2001. p. 309–318.
  • 20. Strehl A, Ghosh J. Cluster ensembles—a knowledge reuse framework for combining multiple partitions. Journal of machine learning research. 2002;3(Dec):583–617. [Google Scholar]
  • 21. Amiri S, Clarke BS, Clarke JL. Clustering categorical data via ensembling dissimilarity matrices. Journal of Computational and Graphical Statistics. 2018;27(1):195–208. doi: 10.1080/10618600.2017.1305278
  • 22. Amiri S, Clarke BS, Clarke JL, Koepke H. A general hybrid clustering technique. Journal of Computational and Graphical Statistics. 2019;28(3):540–551. doi: 10.1080/10618600.2018.1546593
  • 23. Huang D, Wang CD, Lai JH. Fast multi-view clustering via ensembles: Towards scalability, superiority, and simplicity. IEEE Transactions on Knowledge and Data Engineering. 2023. doi: 10.1109/TKDE.2023.3236698
  • 24. Vega-Pons S, Ruiz-Shulcloper J. A survey of clustering ensemble algorithms. International Journal of Pattern Recognition and Artificial Intelligence. 2011;25(03):337–372. doi: 10.1142/S0218001411008683
  • 25. Bickel S, Scheffer T. Multi-view clustering. In: ICDM. vol. 4. Citeseer; 2004. p. 19–26.
  • 26. Kumar A, Daumé H. A co-training approach for multi-view spectral clustering. In: Proceedings of the 28th international conference on machine learning (ICML-11); 2011. p. 393–400.
  • 27. De Sa VR, Gallagher PW, Lewis JM, Malave VL. Multi-view kernel construction. Machine learning. 2010;79(1-2):47–71. doi: 10.1007/s10994-009-5157-z
  • 28. Gönen M, Margolin AA. Localized data fusion for kernel k-means clustering with application to cancer biology. Advances in Neural Information Processing Systems. 2014;27:1305–1313.
  • 29. Lu Y, Wang L, Lu J, Yang J, Shen C. Multiple kernel clustering based on centered kernel alignment. Pattern Recognition. 2014;47(11):3656–3664. doi: 10.1016/j.patcog.2014.05.005
  • 30. Zhao X, Evans N, Dugelay JL. A subspace co-training framework for multi-view clustering. Pattern Recognition Letters. 2014;41:73–82. doi: 10.1016/j.patrec.2013.12.003
  • 31. Deng Q, Yang Y, He M, Xing H. Locally adaptive feature weighting for multiview clustering. In: Uncertainty Modelling in Knowledge Engineering and Decision Making: Proceedings of the 12th International FLINS Conference. World Scientific; 2016. p. 139–145.
  • 32. Cao X, Zhang C, Fu H, Liu S, Zhang H. Diversity-induced multi-view subspace clustering. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2015. p. 586–594.
  • 33. Quadrianto N, Lampert CH. Learning multi-view neighborhood preserving projections. In: ICML; 2011.
  • 34. Nie F, Li J, Li X, et al. Parameter-free auto-weighted multiple graph learning: a framework for multiview clustering and semi-supervised classification. In: IJCAI; 2016. p. 1881–1887.
  • 35. Hou C, Nie F, Tao H, Yi D. Multi-view unsupervised feature selection with adaptive similarity and view weight. IEEE Transactions on Knowledge and Data Engineering. 2017;29(9):1998–2011. doi: 10.1109/TKDE.2017.2681670
  • 36. Zhan K, Nie F, Wang J, Yang Y. Multiview consensus graph clustering. IEEE Transactions on Image Processing. 2018;28(3):1261–1270. doi: 10.1109/TIP.2018.2877335
  • 37. Liang Y, Huang D, Wang CD, Yu PS. Multi-view graph learning by joint modeling of consistency and inconsistency. IEEE Transactions on Neural Networks and Learning Systems. 2022. doi: 10.1109/TNNLS.2022.3192445
  • 38. Gayoso A, Steier Z, Lopez R, Regier J, Nazor KL, Streets A, et al. Joint probabilistic modeling of single-cell multi-omic data with totalVI. Nature Methods. 2021;18(3):272–282. doi: 10.1038/s41592-020-01050-x
  • 39. Argelaguet R, Arnol D, Bredikhin D, Deloro Y, Velten B, Marioni JC, et al. MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biology. 2020;21(1):111. doi: 10.1186/s13059-020-02015-1
  • 40. Zhang L, Lin L, Li J. CPS analysis: self-contained validation of biomedical data clustering. Bioinformatics. 2020;36(11):3516–3521. doi: 10.1093/bioinformatics/btaa165
  • 41. Liu T, Yu H, Blair RH. Stability estimation for unsupervised clustering: A review. Wiley Interdisciplinary Reviews: Computational Statistics. 2022; p. e1575. doi: 10.1002/wics.1575
  • 42. Li J, Seo B, Lin L. Optimal transport, mean partition, and uncertainty assessment in cluster analysis. Statistical Analysis and Data Mining: The ASA Data Science Journal. 2019;12(5):359–377. doi: 10.1002/sam.11418
  • 43. Diestel R, Schrijver A, Seymour P. Graph theory. Oberwolfach Reports. 2010;7(1):521–580. doi: 10.4171/OWR/2010/11
  • 44. Asratian AS, Denley TM, Häggkvist R. Bipartite graphs and their applications. vol. 131. Cambridge university press; 1998.
  • 45. Zha H, He X, Ding C, Simon H, Gu M. Bipartite graph partitioning and data clustering. In: Proceedings of the tenth international conference on Information and knowledge management; 2001. p. 25–32.
  • 46. Traag VA, Waltman L, Van Eck NJ. From Louvain to Leiden: guaranteeing well-connected communities. Scientific reports. 2019;9(1):1–12. doi: 10.1038/s41598-019-41695-z
  • 47. Huang D, Wang CD, Lai JH. Locally weighted ensemble clustering. IEEE Transactions on Cybernetics. 2017;48(5):1460–1473. doi: 10.1109/TCYB.2017.2702343
  • 48. Huang D, Wang CD, Lai JH, Kwoh CK. Toward multidiversified ensemble clustering of high-dimensional data: From subspaces to metrics and beyond. IEEE Transactions on Cybernetics. 2021;52(11):12231–12244. doi: 10.1109/TCYB.2021.3049633
  • 49. Andreas M. mvc: Multi-View Clustering; 2014. Available from: https://cran.r-project.org/web/packages/mvc/index.html.
  • 50. Cabassi A, Kirk P, Gonen M. klic: Kernel Learning Integrative Clustering; 2020. Available from: https://cran.rstudio.com/web/packages/klic/index.html.
  • 51. Williams G, Huang J, Chen X, Wang Q, Xiao L, Zhao H. wskm: Weighted k-Means Clustering; 2020. Available from: https://cran.r-project.org/web/packages/wskm/index.html.
  • 52. Fern XZ, Brodley CE. Solving cluster ensemble problems by bipartite graph partitioning. In: Proceedings of the twenty-first international conference on Machine learning; 2004. p. 36.
  • 53. Sano T. ClusterEnsembles; 2021. Available from: https://github.com/tsano430/ClusterEnsembles.
  • 54. Xu D, Cheng W, Zong B, Ni J, Song D, Yu W, et al. Deep co-clustering. In: Proceedings of the 2019 SIAM International Conference on Data Mining. SIAM; 2019. p. 414–422.
  • 55. Dongkuan X. Deep-Co-Clustering; 2021. Available from: https://github.com/dongkuanx27/Deep-Co-Clustering.
  • 56. Butler A, Choudhary S, Darby C, Farrell J, Hafemeister C, Hao Y, et al. Seurat: Tools for Single Cell Genomics; 2021. Available from: https://cran.r-project.org/web/packages/Seurat/index.html.
  • 57. Rand WM. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association. 1971;66(336):846–850. doi: 10.1080/01621459.1971.10482356
  • 58. Shannon CE. A mathematical theory of communication. The Bell system technical journal. 1948;27(3):379–423. doi: 10.1002/j.1538-7305.1948.tb01338.x
  • 59. Pfitzner D, Leibbrandt R, Powers D. Characterization and evaluation of similarity measures for pairs of clusterings. Knowledge and Information Systems. 2009;19(3):361–394. doi: 10.1007/s10115-008-0150-6
  • 60. McInnes L, Healy J, Melville J. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426. 2018.
  • 61. Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM III, et al. Comprehensive integration of single-cell data. Cell. 2019;177(7):1888–1902. doi: 10.1016/j.cell.2019.05.031
  • 62. Stoeckius M, Hafemeister C, Stephenson W, Houck-Loomis B, Chattopadhyay PK, Swerdlow H, et al. Simultaneous epitope and transcriptome measurement in single cells. Nature methods. 2017;14(9):865–868. doi: 10.1038/nmeth.4380
  • 63. Villani AC, Satija R, Reynolds G, Sarkizova S, Shekhar K, Fletcher J, et al. Single-cell RNA-seq reveals new types of human blood dendritic cells, monocytes, and progenitors. Science. 2017;356(6335). doi: 10.1126/science.aah4573
  • 64. Przytycki PF, Pollard KS. CellWalker integrates single-cell and bulk data to resolve regulatory elements across cell types in complex tissues. Genome biology. 2021;22(1):1–16. doi: 10.1186/s13059-021-02279-1
  • 65. Stuart T, Srivastava A, Hoffman P, Satija R. Signac: Analysis of Single-Cell Chromatin Data; 2021. Available from: https://cran.r-project.org/web/packages/Signac/index.html.
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011044.r001

Decision Letter 0

Mark Alber, Pedro Larranaga

8 Feb 2023

Dear Professor Li,

Thank you very much for submitting your manuscript "Multi-view clustering by CPS-merge analysis with application to multimodal single-cell data" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response that, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Pedro Larranaga

Guest Editor

PLOS Computational Biology

Lucy Houghton

Staff

PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The topic is appropriate for publication, and the paper is somewhat novel technically. Its contribution is moderately significant and the coverage of the problem is sufficiently comprehensive and balanced. The overall organization of the paper could be improved. The experimental results show that it obtains better performance than the state-of-the-art. However, I have the following minor questions that the authors should address to improve this work:

1. I would advise the authors to use GNNs or other new methods rather than subspace learning. There are many similar methods. How can novelty be shown, by using new techniques or by applying the proposed method to large-scale datasets?

2. There are many hyperparameters; how should they be tuned?

3. How can the algorithm be used in an online learning setting? Could neural networks be used to optimize such ideas?

4. The work seems somewhat incremental; I would still like to know the main difference from existing work and what is new.

5. Why can the proposed method obtain better performance than others? I would appreciate a broader discussion of why the proposed method performs better than the others.

6. I strongly recommend that authors release the source code along with the submission since the learning-based projects are typically open-source oriented to facilitate a fair assessment of the performance of the proposed methods for the community.

7. Can you give a toy-data figure to show the motivation more clearly?

Reviewer #2: This paper presents a late integration based multi-view clustering method for multi-modal single-cell data. In the Introduction, it should be better explained how the proposed method advances late integration research and especially which limitations of the previous late integration based multi-view clustering methods have been tackled by the proposed method.

On page 11, the contribution of each view to the clusters in the final result is evaluated. The weighting problem has been investigated in quite a few ensemble clustering methods, such as the local weighting strategy in Locally weighted ensemble clustering and Multidiversified ensemble clustering. Please explain whether the existing weighting strategies in ensemble clustering are feasible for the proposed work and what the advantages of the proposed weighting method are.

The computational complexity of the proposed method should be analyzed. Recently, some large-scale ensemble clustering techniques have been proposed, such as ultra-scalable spectral clustering and ensemble clustering. Whether the proposed framework can be extended to large-scale scenarios could be discussed as future work.

Some minor issues in the References: (i) For Ref. [36], it is suggested to use its journal version (https://doi.org/10.1109/TNNLS.2022.3192445) rather than its arXiv version. (ii) For the discussion of the late integration methods in the third paragraph of the Introduction, it is strange that no references have been provided. Some late integration based multi-view clustering methods, such as [https://doi.org/10.1109/TKDE.2023.3236698], should be discussed specifically. In fact, the late integration methods merge the multiple partitions from multiple views, which resembles the ensemble clustering technique. Some discussion of the relationship between late integration and ensemble clustering is also suggested.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

References:

Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.

If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011044.r003

Decision Letter 1

Mark Alber, Pedro Larranaga

22 Mar 2023

Dear authors,

We are pleased to inform you that your manuscript 'Multi-view clustering by CPS-merge analysis with application to multimodal single-cell data' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Pedro Larranaga

Guest Editor

PLOS Computational Biology

Lucy Houghton

Staff

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors have addressed my concerns well.

Reviewer #2: The authors have well addressed my previous concerns. I have no further comments.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: None

**********

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011044.r004

Acceptance letter

Mark Alber, Pedro Larranaga

4 Apr 2023

PCOMPBIOL-D-22-01585R1

Multi-view clustering by CPS-merge analysis with application to multimodal single-cell data

Dear Dr Li,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Zsofi Zombor

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    Supplementary Materials

    S1 Appendix. This file contains the description of the simulation study and sensitivity analysis.

    (PDF)

    Attachment

    Submitted filename: ResponseLetter.pdf

    Data Availability Statement

    All relevant datasets are described within the manuscript and Supporting information. The code is available at https://github.com/LixiangZhang/CPS-merge.

