Author manuscript; available in PMC: 2015 Mar 1.
Published in final edited form as: IEEE J Biomed Health Inform. 2014 Mar;18(2):548–554. doi: 10.1109/JBHI.2013.2281362

Multi-view Co-modeling to Improve Subtyping and Genetic Association of Complex Diseases

Jiangwen Sun 1, Jinbo Bi 2, Henry R Kranzler 3
PMCID: PMC4158610  NIHMSID: NIHMS621260  PMID: 24235312

Abstract

Genetic association analysis of complex diseases has been limited by heterogeneity in their clinical manifestations and genetic etiology. Research has made it possible to differentiate homogeneous subtypes of the disease phenotype. Currently, the most sophisticated subtyping methods perform unsupervised cluster analysis using only clinical features of a disorder, resulting in subtypes for which genetic association may be limited. In this work, we seek to derive a novel multi-view data analytic method that integrates two views of the data: the clinical features and the genetic markers of the same set of patients. Our method is based on multi-objective programming that is capable of clinically categorizing a disease phenotype so as to discover genetically different subtypes. We optimize two objectives jointly: (1) in cluster analysis, the derived clusters should differ significantly in clinical features; (2) these clusters can be well separated using genetic markers by constructed classifiers. Extensive computational experiments with two substance-use disorders using two populations show that the proposed algorithm is superior to existing subtyping methods.

Index Terms: Cluster analysis, Classification, Co-training, Multi-view analysis, Phenotypic subtyping, Genetic association

I. Introduction

Many disease traits are a collection of subtypes demonstrating heterogeneity at the molecular and clinical syndrome levels [1]. Categorizing a disease phenotype clinically has been hindered by the inconsistency of subtyping methods and a lack of validation with objective metrics [2]. There is currently no empirically derived, statistically rigorous method to identify and select optimal subtypes of a disease [3]. We propose an approach aimed at finding homogeneous subtypes that can be of use in clinical diagnosis and at the same time be of value in gene finding efforts.

Multivariate cluster analysis has been the most sophisticated method used in subtyping [4], [5], [6], [7]. Three main steps have been used in prior subtyping studies: (1) collecting both clinical and genetic data for a group of subjects, (2) identifying subgroups by the application of cluster analysis with either k-means, k-medoids, or hierarchical clustering or their combination to clinical features [4], [5], [6], and (3) conducting linkage or association analysis for the subtypes derived from the sample [7]. Because the creation of subgroups in the second step is independent of the genetic analysis in the third step, the resultant subtypes may be suboptimal and the association analysis may fail.

In a subtyping study, an objective function may be used to evaluate how strongly the subtypes derived from the grouping are associated with a given set of genetic markers, or how well the subtypes can be separated by the genetic markers. Mathematically, given two sets of variables, clinical features Z and genetic markers X from the same sample, the goal is to partition the sample into subgroups based on pairwise similarities between subjects in Z so that the resultant subgroups y can be classified by X. This problem is different from traditional supervised or unsupervised machine learning problems where labels of subjects are either given precisely or not given at all. In our problem, the labels of subjects need to be derived from the clinical features Z so they can be used to train a classifier with the genetic data X.

In the machine learning literature, the most related work might be the set of multi-view data analysis methods, co-training methods [8] and co-clustering methods [9] where multiple groups of input variables are collected for the same set of subjects. When only a small portion of the data is labeled, co-training improves the classification accuracy by enforcing consistency between the classification decisions of the unlabeled data determined by the models learned independently from each of the views. Nevertheless, co-training methods are not applicable to the subtyping problem because there are no labeled data to start with. Multi-view clustering methods seek groupings of subjects that are consistent across different views. These methods treat the data from the two views equally as the input variables. In the subtyping problem, however, the two views have to be treated differently in that one is used to define the subtypes y and the other is used to explain them. For instance, only a sparse set of genetic risk markers are identified to be associated with a subtype but the subtypes may be defined using many clinical features.

The paper is organized as follows. Section II presents the proposed subtyping methodology, based on which a multi-objective program is derived in Section III together with an algorithm to solve it. Computational results on the problems of subtyping opioid dependence and cocaine dependence are examined in Section IV and we conclude in Section V.

II. The Proposed Methodology

We propose a multi-objective optimization framework to solve the subtyping problem. For a set of cluster labels y, each assigned to one subject, we construct a model as a function of a subject’s genetic markers X to approximate the subject’s label. The model M is built by minimizing a loss function ℓ(y, X|Mθ) where Mθ is a specific inference model, such as a support vector machine (SVM) or logistic regression, and θ denotes the set of its parameters. Since the labels y of subjects are not given beforehand, the labels themselves need to be derived. In other words, we optimize the objective as follows

$$\min_{y,\theta}\ \ell(y, X \mid M_\theta) + \lambda R(M_\theta) \tag{1}$$

for the best y and θ where R(Mθ) defines the regularization term that controls the complexity of the model M, and λ is a tuning factor to balance between ℓ and R. Notice that not every possible labeling y of subjects is a feasible solution of Problem (1). The search space of y is confined by the similarity measure defined on the features Z.

Suppose that the classification of subjects y is obtained by partitioning subjects based on a similarity measure that is pre-specified on Z. The parameters used in the similarity measure often need to be tuned, such as the parameter σ if a Gaussian similarity exp(−||Zi − Zj||²/σ²) is used, where Zi and Zj are the vectors of clinical features for Subjects i and j. Choosing different values of σ or other relevant parameters will produce different clusters of the subjects. In general, we expect that the resultant clusters will be well differentiated from each other and that subjects in the same cluster will be closer to one another than to those from other clusters in the Z space. Many metrics have been derived in the literature to measure the quality of clusters, such as Dunn’s Validity Index [10], the Davies-Bouldin Validity Index [11], and Silhouette Validation [12]. If a metric ε(y|σ, Z) is employed to measure the quality of clusters when using a specific value of σ, the metric corresponds to another objective of the subtyping problem. We hence optimize simultaneously two objectives as follows

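$$\min_{y,\theta,\sigma}\ \begin{cases} \text{Obj}_1: & \varepsilon(y \mid \sigma, Z) \\ \text{Obj}_2: & \ell(y, X \mid M_\theta) + \lambda R(M_\theta) \end{cases} \tag{2}$$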

We assume that ε(y|σ, Z) is a metric to be minimized; otherwise it can be inverted or negated. The two objectives of Problem (2) are generally not optimized by the same solution, so Problem (2) is a multi-objective optimization problem.
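To make the role of σ concrete, the following minimal Python sketch computes the Gaussian similarity for a few subjects. The function name `gaussian_similarity` and the toy feature vectors are illustrative assumptions and do not come from the study data.

```python
import math

def gaussian_similarity(Z, sigma):
    """Return the matrix a_ij = exp(-||Z_i - Z_j||^2 / sigma^2)."""
    n = len(Z)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((zi - zj) ** 2 for zi, zj in zip(Z[i], Z[j]))
            A[i][j] = math.exp(-sq_dist / sigma ** 2)
    return A

# Three hypothetical subjects, each described by two clinical features.
Z = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
narrow = gaussian_similarity(Z, sigma=1.0)  # distant subjects look dissimilar
wide = gaussian_similarity(Z, sigma=5.0)    # distant subjects look more similar
```

With a small σ only near neighbours receive a non-negligible similarity, whereas a large σ flattens the differences, which is why different values of σ can lead to different partitions.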

Multi-objective programming (MOP) is a technique that was developed to solve optimization problems with multiple conflicting objectives. Solving a multi-objective program requires the search for Pareto-optimal solutions [13]. Traditional methods convert multiple objectives into a single objective using certain schemes and user-specified parameters. Two simple and widely used methods for such conversions are the weighted sum method and the constraint method [13]. The weighted sum method transforms two objectives into a single objective by multiplying each objective with a pre-defined weight and adding them together as follows

$$\min\ c_1\,\text{Obj}_1 + c_2\,\text{Obj}_2 \tag{3}$$

where the weights c1 and c2 are non-negative and at least one of them is not zero. If the MOP is not convex, the non-convex frontier of the Pareto-optimal set cannot be obtained by the weighted sum method. The constraint method reformulates the MOP by keeping one of the objectives and restricting the rest of the objectives within user-specified limits, such as,

$$\min\ \text{Obj}_2 \quad \text{subject to:}\quad \text{Obj}_1 \le \delta. \tag{4}$$

Our MOP-based subtyping framework follows the constraint method, and can be implemented using any proper cluster analysis algorithm to optimize Obj1, and any suitable classification algorithm to optimize Obj2. In the next section, we will instantiate this methodology by utilizing a spectral clustering method [14] and the one-norm SVM [15] in the MOP.
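The difference between the two conversions can be illustrated on a toy problem. The sketch below is only an illustration: the three candidate solutions and their objective values are made up, and the helper names are hypothetical. Candidate C is Pareto-optimal but lies on a non-convex part of the frontier.

```python
def weighted_sum_choice(candidates, c1, c2):
    """Pick the candidate minimizing c1 * Obj1 + c2 * Obj2."""
    return min(candidates,
               key=lambda name: c1 * candidates[name][0] + c2 * candidates[name][1])

def constraint_choice(candidates, delta):
    """Pick the candidate minimizing Obj2 among those with Obj1 <= delta."""
    feasible = {name: objs for name, objs in candidates.items() if objs[0] <= delta}
    if not feasible:
        return None
    return min(feasible, key=lambda name: feasible[name][1])

# Hypothetical (Obj1, Obj2) values; both objectives are to be minimized.
candidates = {"A": (0.0, 1.0), "B": (1.0, 0.0), "C": (0.6, 0.6)}

# Sweep the weights over a grid with c1 + c2 = 1.
weights = [(w / 10, 1 - w / 10) for w in range(11)]
picked_by_weighted_sum = {weighted_sum_choice(candidates, c1, c2) for c1, c2 in weights}

# The constraint method with a limit of 0.7 on Obj1.
picked_by_constraint = constraint_choice(candidates, delta=0.7)
```

In this toy setting no choice of weights selects C, because its weighted sum is always 0.6 while either A or B scores at most 0.5; the constraint method reaches C once the limit on Obj1 excludes B.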

III. A Multi-objective Optimization Formulation

A spectral clustering method [14] is employed to search for the cluster assignments of subjects by varying the parameter σ in its Gaussian similarity measure. The Davies-Bouldin Validity Index [11], measuring how significantly the resultant clusters differ from each other, serves as Obj1. The one-norm SVM [15] is used to build a classifier, as a function of the genetic variables X, that separates subjects in different clusters. The loss function used in the one-norm SVM serves as Obj2. Notice that the framework (2) can be realized in conjunction with other choices of clustering and classification methods.

A. The Clustering Algorithm

Spectral clustering is a method based on an undirected similarity graph G = (V, E) in which each node in V represents a data point (a subject) and each edge in E is weighted by the similarity between the two connected data points. Partitions of data points represented in the similarity graph can be obtained by cutting the graph into unconnected components with the minimum cost. In a balanced cut, the sizes of these unconnected components should be comparable. Two methods have been proposed to achieve this kind of balanced cut, RatioCut [16] and Ncut [17], which minimize the following objectives, respectively,

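$$\begin{aligned} \text{RatioCut}(C_1,\ldots,C_k) &:= \frac{1}{2}\sum_{i=1}^{k}\frac{A(C_i,\bar{C}_i)}{|C_i|} = \mathrm{Tr}(H^T L H) \\ \text{Ncut}(C_1,\ldots,C_k) &:= \frac{1}{2}\sum_{i=1}^{k}\frac{A(C_i,\bar{C}_i)}{\mathrm{vol}(C_i)} = \mathrm{Tr}(T^T D^{-1/2} L D^{-1/2} T) \end{aligned} \tag{5}$$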

where Ci is one of the identified components (clusters), |Ci| and vol(Ci) denote the number of nodes and the sum of edge weights in Ci respectively, and C̄i consists of the nodes that are not in Ci. The matrix A = {aij} is the adjacency matrix and aij measures the similarity between the nodes i and j, D is a diagonal matrix whose ith diagonal element dii = Σj:j≠i aij, L is the graph Laplacian defined by L = D − A, Tr(·) denotes the matrix trace, and both H and T are matrices consisting of indicator vectors as columns defined as follows:

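$$H = \left[\frac{1}{\sqrt{|C_1|}}\mathbb{1}_1,\ldots,\frac{1}{\sqrt{|C_k|}}\mathbb{1}_k\right], \qquad T = D^{1/2}\left[\frac{1}{\sqrt{\mathrm{vol}(C_1)}}\mathbb{1}_1,\ldots,\frac{1}{\sqrt{\mathrm{vol}(C_k)}}\mathbb{1}_k\right] \tag{6}$$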

where $\mathbb{1}_i$ is an indicator vector whose entries equal 1 if the corresponding nodes are in Ci, or 0 otherwise. Finding the global optimal solution to either of these two objectives is NP-hard [18]. Their relaxed versions have been defined by allowing the indicator vectors in H and T to take real values. It has been shown that the optimal solutions to the relaxed problems of RatioCut and Ncut are the matrices composed of the eigenvectors corresponding to the k smallest eigenvalues of L and $D^{-1/2} L D^{-1/2}$, respectively [14].

In spectral clustering, the clusters are determined by the adjacency matrix A which is further determined by a pre-chosen similarity measure. Spectral clustering is sensitive to changes in the similarity measure [14]. In our approach, we search for the most suitable similarity measure, more precisely, the best value of σ in the Gaussian similarity, to optimize Obj1 and Obj2.
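For intuition, the relaxed Ncut problem can be sketched for the special case k = 2 with NumPy. This is a simplified illustration and not the implementation used in the study: it thresholds the sign of the second eigenvector instead of running a k-way assignment step, and the function names and toy data are assumptions.

```python
import numpy as np

def gaussian_adjacency(Z, sigma):
    """Adjacency matrix a_ij = exp(-||Z_i - Z_j||^2 / sigma^2) with a zero diagonal."""
    sq_dist = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=2)
    A = np.exp(-sq_dist / sigma ** 2)
    np.fill_diagonal(A, 0.0)  # d_ii sums similarities over j != i only
    return A

def ncut_bipartition(A):
    """Split the nodes into two groups using the relaxed Ncut solution."""
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = np.diag(d) - A                                   # graph Laplacian L = D - A
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]  # D^{-1/2} L D^{-1/2}
    _, eigvecs = np.linalg.eigh(L_sym)                   # eigenvalues in ascending order
    second = d_inv_sqrt * eigvecs[:, 1]                  # map back through D^{-1/2}
    return (second > 0).astype(int)

# Two hypothetical groups of subjects on a single clinical dimension.
Z = np.array([[0.0], [0.2], [0.4], [3.0], [3.2], [3.4]])
labels = ncut_bipartition(gaussian_adjacency(Z, sigma=1.0))
```

Changing `sigma` changes the adjacency matrix and hence the eigenvectors, which is the sensitivity that the search over σ exploits.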

B. The Objectives in Our Multi-objective Program

(1) First Objective

Spectral clustering requires an adjacency matrix A that encodes the pairwise similarities between subjects and the desired number of clusters k as its inputs, and outputs the clusters Ci of subjects, i = 1, · · ·, k. In our approach, we search for the best value of σ in the Gaussian similarity measure to optimize the Davies-Bouldin Validity Index (DBVI) [11] that measures the quality of the clusters. DBVI is a measure related to the ratio of within-cluster distance to between-cluster distance. A lower DBVI value indicates better cluster quality. Hence, we minimize the DBVI as follows using Ncut [17] for the best σ

$$\min_{\sigma}\ \text{DBVI} = \frac{1}{k}\sum_{i=1}^{k}\max_{j \ne i}\frac{\text{Dist}(C_i)+\text{Dist}(C_j)}{\text{Dist}(C_i,C_j)} \tag{7}$$

where Dist(Ci) is the average distance from each data point in Ci to the cluster center, Dist(Ci, Cj) is the distance between the center of Ci and the center of Cj. These distances are calculated in the Z dimension.
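The index can be computed directly from its definition. The sketch below is a minimal pure-Python version under the assumption that distances are Euclidean; the function name and the toy clusters are hypothetical.

```python
import math

def davies_bouldin(clusters):
    """clusters: list of clusters, each a list of points given as tuples of floats."""
    # Cluster centers: coordinate-wise means.
    centers = [tuple(sum(coord) / len(c) for coord in zip(*c)) for c in clusters]
    # Dist(C_i): average distance from each point to its cluster center.
    scatter = [sum(math.dist(p, ctr) for p in c) / len(c)
               for c, ctr in zip(clusters, centers)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case ratio of within-cluster scatter to between-center distance.
        total += max((scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
                     for j in range(k) if j != i)
    return total / k

# Hypothetical one-dimensional clusters: well separated versus close together.
tight = [[(0.0,), (2.0,)], [(10.0,), (12.0,)]]
loose = [[(0.0,), (2.0,)], [(3.0,), (5.0,)]]
dbvi_tight = davies_bouldin(tight)
dbvi_loose = davies_bouldin(loose)
```

In the first configuration each cluster has scatter 1 and the centers are 10 apart, so the index is 0.2; moving the clusters closer raises it.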

(2) Second Objective

For each cluster Ci, without loss of generality, we construct a classifier in the linear form f(X) = WᵀX + b to separate the subjects in Ci from the remaining subjects. The model WiᵀX + bi specific to Cluster Ci is obtained by minimizing the regularized empirical error ℓ(yi, X, Wi) + λR(Wi), where we use a binary vector yi to indicate the cluster membership: yij = 1 if subject Xj is in Ci, and yij = −1 otherwise, j = 1, · · ·, n, for all n subjects. We employ the hinge loss commonly used in SVMs, e.g., $\ell(y_i, X, W_i) = \sum_{j=1}^{n} [1 - y_{ij}(W_i^T X_j + b_i)]_+$ where [a]+ = 0 if a < 0, otherwise [a]+ = a, and R(Wi) takes a sparsity-favoring form in order to select among features, in particular the ℓ1-norm ||Wi||1 = Σd |Wid|. The ℓ1-norm shrinks the coefficients W of irrelevant variables to zero [15]. Constructing all of the k classifiers together corresponds to minimizing the overall regularized error as follows:

$$\min_{W_i,\,b_i,\ i=1,\ldots,k}\ \sum_{i=1}^{k}\left[\ell(y_i, X, W_i) + \lambda R(W_i)\right] \tag{8}$$
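As a small worked example, the regularized error for a single cluster can be evaluated as follows. The sketch only evaluates the objective for given parameters; it does not solve the optimization problem, and the function name and toy numbers are assumptions.

```python
def hinge_l1_objective(X, y, W, b, lam):
    """Hinge loss summed over subjects plus lam times the l1-norm of W."""
    hinge = 0.0
    for x_j, y_j in zip(X, y):
        score = sum(w * x for w, x in zip(W, x_j)) + b
        hinge += max(0.0, 1.0 - y_j * score)  # [1 - y_j * (W^T x_j + b)]_+
    l1_penalty = sum(abs(w) for w in W)
    return hinge + lam * l1_penalty

# Two hypothetical subjects with two genetic features each.
X = [[2.0, 0.0], [-0.5, 1.0]]
y = [1, -1]  # +1: in the cluster, -1: not in the cluster
value = hinge_l1_objective(X, y, W=[1.0, 0.0], b=0.0, lam=0.08)
```

The first subject is classified with margin 2 and contributes no loss; the second has margin 0.5 and contributes 0.5; the penalty adds 0.08 × 1, giving 0.58 in total. The second weight being exactly zero reflects the kind of sparse solution that the ℓ1-norm favors.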

(3) Constrained Conversion

Clearly, the first objective is not convex, which leads to a non-convex multi-objective program. The constraint conversion method is more suitable to find the Pareto-optimal solutions to this problem. As the subtyping problem seeks to obtain clusters that are interpretable in the X dimension (genetic markers), we model the first objective as a constraint. In other words, we search for solutions that minimize the second objective subject to an acceptable quality of clusters in the Z dimension (clinical features). The following problem (9) is the problem we will solve.

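$$\begin{aligned} \min_{\sigma,\,W_i,\,b_i,\ i=1,\ldots,k}\quad & \sum_{i=1}^{k}\left(\sum_{j=1}^{n}\left[1 - y_{ij}(W_i^T X_j + b_i)\right]_+ + \lambda\|W_i\|_1\right) \\ \text{subject to}\quad & \frac{1}{k}\sum_{i=1}^{k}\max_{j \ne i}\frac{\text{Dist}(C_i)+\text{Dist}(C_j)}{\text{Dist}(C_i,C_j)} \le \delta, \\ & l_\sigma \le \sigma \le u_\sigma \end{aligned} \tag{9}$$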

where δ, lσ and uσ are tuning parameters: δ is the upper limit on the DBVI, and lσ and uσ are the lower and upper bounds on σ.

C. The Proposed Algorithm

Traditional methods for finding the optimal solution to a constrained optimization problem include deterministic approaches such as gradient-based methods, Newton’s methods, and non-deterministic approaches such as simulated annealing [19]. To avoid the difficulty of computing derivatives of the objective function, we design an efficient algorithm based on simulated annealing to solve the converted MOP (9) as depicted in Algorithm 1.

Algorithm 1.

Simulated Annealing for MOP (9)

Input: Z, X, k, δ, MI
Initialize: σ, T, h = 0;
for t = 0 to MI do
  Calculate Temperature T;
  Find a neighbor of σ to obtain σnew based on T;
  Construct adjacency matrix A using Z and the Gaussian similarity with σnew;
  Obtain clusters Ci, i = 1, · · ·, k, by running Ncut with A and k;
  Calculate Obj1 in (7) and assign its value to q;
  if q ≤ δ then
    Compute Wi, bi for each Ci separately by the one-norm SVM;
    Calculate Obj2 in (8) and assign its value to hnew;
  else
    Continue;
  end if
  if probability(h, hnew, T) > random(0, 1) then
    h = hnew, σ = σnew;
  end if
end for
Output: clusters Ci:1,…,k, the values of Obj1 and Obj2.

In Algorithm 1, the temperature T starts from a high value, and decreases gradually at each iteration. A probability density function defined according to T is used to search for σnew. The first objective is evaluated after the clusters are obtained. If this objective is within the pre-specified limit δ, an SVM model is constructed for each cluster, and the second objective is evaluated. The probability of accepting σnew is calculated via the acceptance probability density function discussed in [20] and defined by the objective values h, hnew and the temperature T. If this probability is larger than a number randomly drawn from [0, 1], σnew is accepted; otherwise the previous value of σ is retained. Readers can consult [20] for further discussion of simulated annealing.
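The control flow of Algorithm 1 can be sketched in Python as follows. This is a simplified stand-in and not the adaptive simulated annealing of [20]: the Metropolis acceptance rule, the geometric cooling schedule, the Gaussian proposal, and the tracking of the best solution seen so far are all assumptions made for illustration, and `obj1` and `obj2` are hypothetical callables that would wrap the clustering plus DBVI step and the one-norm SVM step.

```python
import math
import random

def anneal_sigma(obj1, obj2, delta, lower, upper, sigma0,
                 max_iter=200, t0=1.0, cooling=0.97, seed=0):
    """Search sigma in [lower, upper] to minimize obj2 subject to obj1 <= delta."""
    rng = random.Random(seed)
    sigma = sigma0
    h = obj2(sigma) if obj1(sigma) <= delta else math.inf
    best_sigma, best_h = sigma, h
    for t in range(max_iter):
        temp = t0 * cooling ** t  # assumed geometric cooling
        # Propose a neighbor; the step size shrinks with the temperature.
        step = rng.gauss(0.0, temp * (upper - lower))
        sigma_new = min(upper, max(lower, sigma + step))
        if obj1(sigma_new) > delta:
            continue  # cluster-quality constraint violated, skip this proposal
        h_new = obj2(sigma_new)
        # Always accept improvements; accept worse values with Metropolis probability.
        if h_new <= h or rng.random() < math.exp(-(h_new - h) / temp):
            sigma, h = sigma_new, h_new
            if h < best_h:
                best_sigma, best_h = sigma, h
    return best_sigma, best_h

# Toy stand-ins: the constraint holds for sigma in [1, 8]; the loss is lowest at 3.
toy_obj1 = lambda s: 0.5 if 1.0 <= s <= 8.0 else 1.0
toy_obj2 = lambda s: (s - 3.0) ** 2
best_sigma, best_h = anneal_sigma(toy_obj1, toy_obj2, delta=0.7,
                                  lower=0.1, upper=10.0, sigma0=7.0)
```

Because each evaluation of the objectives is treated as a black box, no derivatives with respect to σ are needed, which is the motivation given above for using simulated annealing.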

IV. Computational Results

We applied the proposed algorithm to two real-world data sets that were aggregated from genetic studies of opioid dependence (OD) and cocaine dependence (CD) [4], [5], [6], [7]. We limited the analysis to European Americans to avoid confounding by population differences in allele frequencies and structure. We compared our approach to an existing subtyping method that performed a sequence of two separate steps: spectral clustering and one-norm SVM classification in the same fashion as in [4]. We refer to this as the sequential subtyping method. The two approaches were compared in terms of the separability of their resultant clusters based on genetic markers.

A. Data sets

Subjects were recruited from multiple sites, including Yale University School of Medicine, University of Connecticut Health Center, University of Pennsylvania School of Medicine, McLean Hospital and Medical University of South Carolina. All subjects gave written, informed consent to participate, using procedures approved by the institutional review board at each participating site.

Opioid use and cocaine use behaviors were assessed by two separate components dedicated to the diagnosis of OD and CD respectively in a computer-assisted interview process, called the Semi-Structured Assessment for Drug Dependence and Alcoholism (SSADDA) [21]. The SSADDA variables selected by previous OD and CD subtyping studies [6], [4] were used in the present analysis. Multiple Correspondence Analysis (MCA) [22] was performed to reduce data. The top MCA dimensions that overall explained more than 80% of data variance were used in cluster analysis.
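The selection of the top dimensions by explained variance can be sketched as follows. The function name and the toy values are hypothetical, and the sketch assumes that the per-dimension variances (inertias) are already sorted in decreasing order.

```python
def dims_for_variance(inertias, threshold=0.80):
    """Smallest number of leading dimensions whose variance share exceeds threshold."""
    total = sum(inertias)
    running = 0.0
    for count, inertia in enumerate(inertias, start=1):
        running += inertia
        if running / total > threshold:
            return count
    return len(inertias)

# Hypothetical inertias: cumulative shares are 50%, 80%, 90% and 100%.
n_dims = dims_for_variance([5.0, 3.0, 1.0, 1.0])
```

Here three dimensions are needed, since two dimensions explain exactly 80% and the criterion asks for more than 80%.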

A total of 1350 single nucleotide polymorphisms (SNPs) selected from 130 candidate genes were genotyped for association tests [23]. For each dataset, we performed quality control as follows. SNPs for which data were available for less than 95% of the subjects, or for which the P-value for Hardy-Weinberg equilibrium was less than 10⁻⁷, were excluded from further analysis. The minor allele frequency (MAF) of each SNP was calculated within each population. SNPs with MAF less than 0.5% in a population were removed from the association tests for the respective population. The remaining missing entries in the SNP data were imputed.
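The quality-control thresholds above translate into a simple filter. In the sketch below the record layout, the SNP identifiers, and the function names are hypothetical; only the three thresholds come from the text.

```python
def minor_allele_frequency(n_aa, n_ab, n_bb):
    """MAF from genotype counts for alleles A and B."""
    total_alleles = 2 * (n_aa + n_ab + n_bb)
    freq_b = (2 * n_bb + n_ab) / total_alleles
    return min(freq_b, 1.0 - freq_b)

def passes_qc(snp, min_call_rate=0.95, min_hwe_p=1e-7, min_maf=0.005):
    """Keep a SNP only if it meets all three quality-control criteria."""
    return (snp["call_rate"] >= min_call_rate
            and snp["hwe_p"] >= min_hwe_p
            and snp["maf"] >= min_maf)

# Hypothetical SNP summaries.
snps = [
    {"id": "snp_a", "call_rate": 0.99, "hwe_p": 0.30,
     "maf": minor_allele_frequency(81, 18, 1)},
    {"id": "snp_b", "call_rate": 0.90, "hwe_p": 0.30, "maf": 0.20},   # low call rate
    {"id": "snp_c", "call_rate": 0.99, "hwe_p": 1e-9, "maf": 0.20},   # fails HWE
    {"id": "snp_d", "call_rate": 0.99, "hwe_p": 0.30, "maf": 0.001},  # too rare
]
kept = [snp["id"] for snp in snps if passes_qc(snp)]
```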

For the OD data set, we treat opioid users as cases and healthy subjects as controls. For the CD data set, subjects who were diagnosed with cocaine dependence were treated as cases and healthy subjects who had been exposed to illicit drugs were regarded as controls. Table I summarizes the statistics of the two data sets in terms of the numbers of cases, controls, SSADDA variables (Vars), MCA dimensions (Dims), and SNPs used in the subtyping analysis.

TABLE I.

Summary of the OD and CD data sets

| Dataset | #cases | #controls | #Vars | #MCA Dims | #SNPs |
|---|---|---|---|---|---|
| opioid | 827 | 643 | 69 | 13 | 1185 |
| cocaine | 1279 | 187 | 68 | 25 | 1248 |

B. Experimental settings

We utilized the CPLEX optimization package to solve the one-norm SVM, and implemented spectral clustering in MATLAB. Adaptive simulated annealing, an open source variant of simulated annealing, together with its MATLAB gateway (ASAMIN v1.39) was used to search for the value of σ that optimizes the multi-objective program (9). The parameters δ and λ were set to 0.7 and 0.08, respectively. The upper bound of σ, uσ, was set to a number that led to a pairwise similarity value of at least 0.99, and the lower bound of σ, lσ, was set to the value producing a similarity matrix with a median value less than 0.0001. These tuning steps were based on 3-fold cross validation.

A typical way to choose a value for σ is to use the median value of all entries in the pairwise distance matrix [14]. We fixed σ to the median value in the sequential method. For both the proposed and sequential methods, cluster analysis was only applied to cases. The resultant clusters were characterized based on important clinical features related to drug use and related behaviors. A generalized estimating equation (GEE) Wald Type 3 χ2-test was employed to test the significance of the difference between the resultant clusters in these clinical variables with Bonferroni correction for multiple comparisons.
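The median heuristic for σ can be sketched in a few lines. The sketch assumes Euclidean distances and takes the median over distinct pairs of subjects only, which may differ slightly from taking it over all entries of the distance matrix; the function name and the toy data are hypothetical.

```python
import math
import statistics
from itertools import combinations

def median_pairwise_distance(Z):
    """Median Euclidean distance over all distinct pairs of subjects."""
    return statistics.median(math.dist(a, b) for a, b in combinations(Z, 2))

# Three hypothetical subjects on one dimension: pairwise distances are 1, 3 and 2.
sigma_median = median_pairwise_distance([(0.0,), (1.0,), (3.0,)])
```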

For each obtained cluster, an SVM model was built to separate cases in the cluster labeled as +1 from controls labeled as −1. SVM is sensitive to unbalanced data where the size of a sample with one label is significantly larger than that with another label. To address this problem, we duplicated subjects in the smaller group to make the sample size of the two groups comparable. Let a and b be the dominating and minor groups, respectively, na and nb be their sample sizes, and t = ⌊na/nb⌋. We first duplicated each subject labeled by b t times, and then randomly selected na − t · nb subjects from the sample pool composed of all subjects with label b. Ten-fold cross validation with stratified case-control split was conducted for every cluster, and receiver operating characteristic (ROC) curves were obtained using the test data combined from all folds to evaluate the classification performance. We provide the Area Under the ROC Curve (AUC) in our results to compare the two methods. The AUC reflects the cluster separability based on genetic markers.
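The balancing scheme can be sketched as follows. The sketch reads "duplicated t times" as each minority subject appearing t times in total, which is an interpretation rather than something the text states; the function name and subject identifiers are hypothetical.

```python
import random

def oversample_minority(minority, n_majority, seed=0):
    """Replicate the minority group until it matches the majority group in size."""
    rng = random.Random(seed)
    t = n_majority // len(minority)                  # t = floor(n_a / n_b)
    balanced = [subject for subject in minority for _ in range(t)]
    extra = n_majority - t * len(minority)           # n_a - t * n_b, below n_b
    balanced.extend(rng.sample(minority, extra))     # random top-up without repeats
    return balanced

# Hypothetical example: 3 minority subjects against 10 majority subjects.
minority = ["c1", "c2", "c3"]
balanced = oversample_minority(minority, n_majority=10)
```

With these sizes t = 3, so every minority subject appears three times and one randomly chosen subject appears a fourth time, giving 10 entries in total.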

Moreover, different analytic approaches, such as SVM or logistic regression, may identify important SNPs of different associative effects. A larger coefficient for a SNP in the SVM models does not necessarily translate into a smaller p-value in logistic regression. We further tested each of the selected SNPs, i.e., those with nonzero coefficients in the SVM models, by a separate logistic regression and evaluated their corresponding p-values to determine the significance of the association with the identified subtypes. Here, logistic regression models were obtained using the same sampling scheme introduced above to balance the data.

C. Opioid use subtypes

We set the desired number of clusters to 2, so that the resultant clusters were sufficiently large and gave adequate statistical power. The optimal value of σ found by our approach was 5.8.

1) Cluster Clinical Characteristics

We characterized the two clusters obtained with σ = 5.8 based on 11 important clinical variables depicting opioid use and its consequences. Table II shows that the two clusters differ significantly on almost all of these clinical features, except the mean age of first opioid use. Subjects in Cluster 1 have used opioids more heavily than those in Cluster 2. For example, they had heavier daily use and more intravenous injections. The negative consequences of opioid use, such as “interfering with work” and “been arrested” among subjects in Cluster 1 were much more severe than those for subjects in Cluster 2. Thus, Cluster 1 was a heavy opioid user group whereas Cluster 2 was composed of moderate opioid users.

TABLE II.

Clinical Opioid-related Characteristics of opioid user clusters [N(%)]

| Behaviors | Cluster 1, 657(79.44) | Cluster 2, 170(20.56) | χ²(df) | p-value |
|---|---|---|---|---|
| Age of first use [Mean (SD) in year] | 21.15(6.59) | 21.67(7.71) | 0.58 | 0.45 |
| Used opioids daily or almost daily | 653(99.39) | 107(62.94) | 65.48 | 5.55 × 10⁻¹⁶ |
| Injected opioids intravenously | 526(80.06) | 50(29.41) | 134.40 | < 1 × 10⁻¹⁶ |
| Stayed high from opioids for a whole day or more | 599(91.17) | 103(60.59) | 78.05 | < 1 × 10⁻¹⁶ |
| Strong desire for opioids made it hard to think of anything else | 617(93.91) | 50(29.41) | 245.63 | < 1 × 10⁻¹⁶ |
| Opioid use interfered with work, school, or home life | 574(87.37) | 39(22.94) | 201.13 | < 1 × 10⁻¹⁶ |
| Family members, friends, doctor, clergy, boss, or people at work or school objected to opioid use | 611(93.00) | 52(30.59) | 187.13 | < 1 × 10⁻¹⁶ |
| Been arrested or had trouble with the police because of opioid use | 444(67.58) | 23(13.53) | 114.34 | < 1 × 10⁻¹⁶ |
| Gave up or greatly reduced important activities due to opioid use | 600(91.32) | 48(28.24) | 212.67 | < 1 × 10⁻¹⁶ |
| Ever treated for an opioid-related problem | 610(92.85) | 37(21.76) | 260.89 | < 1 × 10⁻¹⁶ |
| Ever attended self-help group for opioid use | 505(76.86) | 23(13.53) | 141.76 | < 1 × 10⁻¹⁶ |

2) Associated Genetic Markers

Eight SNPs were associated with Cluster 1 at p < 1 × 10⁻³ as shown in Table III. One SNP (rs915906) was very close to the empirical threshold (p < 0.05/1154 = 4.34 × 10⁻⁵) after Bonferroni correction was applied to address the inflation of type I error due to multiple tests. For Cluster 2, SNP rs6957496 on gene CHRM2 was significant with a p-value close to 10⁻⁵, and it remained significant after Bonferroni correction (empirical threshold: p < 0.05/1154 = 4.34 × 10⁻⁵). Odds ratios and the genes where the corresponding SNPs are located are also shown in Table III.
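The Bonferroni comparison can be reproduced with a short calculation. The significance level, the number of tests, and the two p-values are the ones quoted above; the function and variable names are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests

threshold = bonferroni_threshold(0.05, 1154)

# p-values reported for the top SNP of each opioid-use cluster.
p_values = {"rs915906": 5.32e-5, "rs6957496": 1.09e-5}
significant = {snp: p < threshold for snp, p in p_values.items()}
```

Under this threshold rs6957496 passes while rs915906 falls just short, which matches the description of the two SNPs in the text.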

TABLE III.

Risk factors (SNPs) associated with opioid-use subtypes

| Cluster | SNP | p-value | Odds Ratio | Gene |
|---|---|---|---|---|
| Cluster 1 | rs915906 | 5.32 × 10⁻⁵ | 0.6595 | CYP2E1 |
| | rs10896065 | 3.32 × 10⁻⁴ | 2.0537 | FOSL1 |
| | rs7940700 | 4.15 × 10⁻⁴ | 2.2496 | FOSL1 |
| | rs755203 | 5.18 × 10⁻⁴ | 0.7617 | CHRNA4 |
| | rs2581206 | 5.56 × 10⁻⁴ | 0.7594 | SLC6A11 |
| | rs698 | 5.59 × 10⁻⁴ | 0.7615 | ADH1C |
| | rs4077851 | 7.69 × 10⁻⁴ | 1.5542 | GABRB2 |
| | rs2515642 | 8.02 × 10⁻⁴ | 0.7294 | CYP2E1 |
| Cluster 2 | rs6957496 | 1.09 × 10⁻⁵ | 2.25 | CHRM2 |

3) Comparison

For the sequential method, we followed the standard approach to selecting σ for spectral clustering [14] and computed the median value of the pairwise distances, which was 1.07. When σ = 1.07, a very unbalanced partition resulted: 826 subjects in one cluster and 1 in the other, which was not of practical value. In order to find a σ value that gives clusters of similar size, we increased the value of σ in steps of 1 until a proper σ was found. The final value was 6.07. Two classifiers were built to separate each subtype from controls. The AUC results were compared to evaluate the cluster separability in the genetic view as shown in Table IV. Genetic markers had better predictive power for the clusters obtained by the proposed approach than for those obtained by the sequential method. More significant associations were found for the clusters created by the proposed method. These results demonstrate the effectiveness of the proposed method.

TABLE IV.

Comparison on genetic separability of opioid user clusters

| | Optimal σ = 5.8: N(%) | Optimal σ = 5.8: AUC | σ = 6.07: N(%) | σ = 6.07: AUC |
|---|---|---|---|---|
| Cluster 1 | 657(79.4) | 0.59 | 600(72.6) | 0.50 |
| Cluster 2 | 170(20.6) | 0.85 | 227(27.4) | 0.80 |

D. Cocaine use subtypes

Since a large number of cases were available, we set the desired number of clusters to 3. The optimal value of σ found by our approach here was 1.76.

1) Cluster Clinical Characteristics

The three clusters obtained with σ = 1.76 were characterized in Table V based on 12 important features related to cocaine use and its consequences. Table V shows that the three clusters differ significantly on all 12 clinical features. Both Clusters 1 and 3 were heavy cocaine user groups compared to Cluster 2, as indicated by almost all of the features. For example, 96.76% and 94.71% of the subjects in Clusters 1 and 3, respectively, had used cocaine daily or almost daily, in comparison with only 76.52% of the subjects in Cluster 2. Even though Clusters 1 and 3 were both heavy user groups, they were distinct on several features, especially on the age of onset and on cocaine intravenous injection rates. Subjects in Cluster 1 started cocaine use at a much younger age than those in Cluster 3, and also began their heaviest use earlier. Cluster 1 had a high proportion of subjects (91.47%) who had injected cocaine intravenously, in contrast to a much lower rate (9.19%) in Cluster 3.

TABLE V.

Clinical cocaine-related Characteristics of cocaine user clusters [N(%)]

| Behaviors | Cluster 1, 340(33.11) | Cluster 2, 328(31.94) | Cluster 3, 359(34.96) | χ²(df) | p-value |
|---|---|---|---|---|---|
| Age of first cocaine use [Mean (SD) in year] | 17.61(4.13) | 19.53(5.16) | 21.28(6.22) | 79.50(2) | < 1 × 10⁻¹⁶ |
| Age of onset of heaviest cocaine use [Mean (SD) in year] | 25.95(8.09) | 25.82(8.12) | 29.47(7.70) | 44.48(2) | 2.19 × 10⁻¹⁰ |
| Used cocaine daily or almost daily | 329(96.76) | 251(76.52) | 340(94.71) | 73.32(2) | 1.11 × 10⁻¹⁶ |
| Injected cocaine intravenously | 311(91.47) | 132(40.24) | 33(9.19) | 298.77(2) | < 1 × 10⁻¹⁶ |
| Stayed high from cocaine for a whole day or more | 304(89.41) | 210(64.02) | 327(91.09) | 83.49(2) | < 1 × 10⁻¹⁶ |
| Strong desire for cocaine made it hard to think of anything else | 308(90.59) | 176(53.66) | 332(92.48) | 162.45(2) | < 1 × 10⁻¹⁶ |
| Cocaine interfered with work, school, or home life | 312(91.76) | 139(42.38) | 311(86.63) | 198.06(2) | < 1 × 10⁻¹⁶ |
| Family members, friends, doctor, clergy, boss, or people at work or school objected to cocaine use | 310(91.18) | 173(52.74) | 324(90.25) | 159.72(2) | < 1 × 10⁻¹⁶ |
| Been arrested or had trouble with the police because of cocaine use | 223(65.59) | 69(21.04) | 175(48.75) | 127.35(2) | < 1 × 10⁻¹⁶ |
| Gave up or greatly reduced important activities due to cocaine use | 321(94.41) | 179(54.57) | 340(94.71) | 177.31(2) | < 1 × 10⁻¹⁶ |
| Ever treated for a cocaine-related problem | 264(77.65) | 91(27.74) | 249(69.36) | 178.74(2) | < 1 × 10⁻¹⁶ |
| Ever attended self-help group for cocaine use | 250(73.53) | 89(27.13) | 227(63.23) | 139.27(2) | < 1 × 10⁻¹⁶ |

2) Associated Genetic Markers

The results from association tests for the three clusters are provided in Table VI, in which only those SNPs with tested p-values less than 1 × 10⁻³ are shown. SNP rs3802280 on gene OPRK1 was associated with Cluster 1 at p < 1 × 10⁻³. Four SNPs were identified to be nominally associated with Cluster 3 at p < 1 × 10⁻³. None of the SNPs was identified to be associated with Cluster 2 at p < 1 × 10⁻³.

3) Comparison

For the CD data, the median value of the pairwise distances was 1.45, which was used as the value of σ in the sequential method. We ran spectral clustering based on the similarity matrix and also obtained three clusters. We compared these three clusters against those obtained by our approach in terms of the cluster separability based on genetic data. We built three classifiers, each used to separate subjects in one cluster from the controls. We averaged the AUC of the three classifiers with standard deviation over the 10-fold cross validation. A box plot was drawn for each method as shown in Figure 1. Classifiers trained on the clusters obtained by the proposed method have a slightly better average AUC value (i.e., separability) and a substantially smaller spread than those trained on the clusters from the sequential method, which suggests that the proposed approach is better at finding genetically separable clinical clusters than the existing approach.
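The AUC used as the separability measure can be computed from classifier scores as the fraction of case-control pairs in which the case receives the higher score. The sketch below uses this pairwise definition; the function name and the toy scores are hypothetical.

```python
def auc_from_scores(scores, labels):
    """AUC as the share of (case, control) pairs ranked correctly; ties count half."""
    cases = [s for s, l in zip(scores, labels) if l == 1]
    controls = [s for s, l in zip(scores, labels) if l == -1]
    wins = sum(1.0 if c > d else 0.5 if c == d else 0.0
               for c in cases for d in controls)
    return wins / (len(cases) * len(controls))

# Hypothetical scores for two cases (+1) and two controls (-1).
auc = auc_from_scores([0.9, 0.8, 0.3, 0.1], [1, -1, 1, -1])
```

Three of the four case-control pairs are ordered correctly in this toy example, so the AUC is 0.75; a classifier that cannot separate the groups at all would score 0.5.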

Fig. 1. The comparison of genetic separability of the cocaine user clusters obtained by the proposed method and the sequential method in [4].

V. Conclusion

Identifying genes that contribute to risk of complex diseases has been challenging due to two major issues: (1) The diseases have diverse clinical manifestations and complex etiology with both genetic and environmental risk factors. (2) Disease phenotypes are heterogeneous and homogeneous subtypes have not been optimized empirically. To address these issues, researchers have sought to leverage the technology of cluster analysis to identify clinically homogeneous subtypes that correlate to homogeneous genetic risk factors. Although encouraging results have been obtained, success remains limited because existing methods mismatch the clinical cluster analysis to the goal of genetic association.

We have developed a novel multi-objective programming approach that optimizes two objectives: (1) the cluster-derived subtypes should differ significantly in clinical features; (2) the subtypes can be classified using genetic markers. Our method forms a novel multi-view data analytic method that treats the different views differently instead of equally as input views. In our method, the view of clinical features was used to define and derive subtypes of the disease based on cluster analysis, and the view of genetic markers was used to interpret the subtypes based on sparse modeling. Two case studies of subtyping of opioid use and cocaine use, and related behaviors in aggregated samples of European Americans were performed. A comparison between our proposed approach and a typical subtyping method [4] demonstrated the superiority of our approach.

TABLE VI.

Risk factors (SNPs) associated with cocaine-use subtypes

Cluster     SNP         p-value        Odds Ratio   Gene
Cluster 1   rs3802280   7.98 × 10^-4   1.8265       OPRK1
Cluster 3   rs511895    3.03 × 10^-4   0.6456       CAT
Cluster 3   rs722651    4.95 × 10^-4   1.5062       MPDZ
Cluster 3   rs7940700   5.87 × 10^-4   0.6585       CAT
Cluster 3   rs494024    6.22 × 10^-4   0.6602       CAT
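To help read the odds ratios in Table VI, the following minimal sketch computes an odds ratio from a 2×2 table of counts. The function name and all counts are hypothetical and are not taken from the study; how the reported odds ratios were estimated is not specified in this section, so this shows only the standard definition.

```python
def odds_ratio(case_with, case_without, control_with, control_without):
    """Odds ratio from a 2x2 table of counts.

    Odds of carrying the allele among cluster members divided by the
    odds of carrying it among controls.
    """
    return (case_with * control_without) / (case_without * control_with)


# Hypothetical counts: allele more common among cluster members.
or_risk = odds_ratio(60, 40, 45, 55)

# Hypothetical counts: allele less common among cluster members.
or_protective = odds_ratio(40, 60, 55, 45)
```

An odds ratio above 1 indicates that the allele is more frequent in the cluster than in controls, and a value below 1 indicates that it is less frequent.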

Acknowledgments

We thank Dr. J. Gelernter who provided us genetic data from NIAAA arrays that enabled the experimental analysis in this study. This work was supported by NIH grants DA12849, DA12690, DA22288, AA03510, AA11330, and AA13736.

Biographies

Jiangwen Sun received a bachelor's degree in Clinical Medicine and a master's degree in Computer Engineering. He has been studying for the Ph.D. in Computer Engineering at the University of Connecticut since 2010. His research interests include health informatics, bioinformatics, data mining, and machine learning.

Jinbo Bi, Ph.D., received a Ph.D. in Mathematics and an M.Sc. in Electrical Engineering. She is an associate professor of Computer Science and Engineering at the University of Connecticut. Prior to her current appointment, she worked with Siemens Medical Solutions on computer-aided diagnosis research and with Partners Healthcare on clinical decision support systems. Her research interests lie in machine learning, data mining, bioinformatics, and biomedical informatics.

Henry R. Kranzler, M.D., is a clinical addiction psychiatrist and Director of the Center for Studies of Addiction at the University of Pennsylvania. His research focuses on the genetics and pharmacological treatment of alcohol and drug dependence.

Contributor Information

Jiangwen Sun, Email: javon@engr.uconn.edu, Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, 06269 USA.

Jinbo Bi, Email: jinbo@engr.uconn.edu, Department of Computer Science and Engineering, University of Connecticut, 371 Fairfield Way, Unit 4155, Storrs, CT 06269.

Henry R. Kranzler, Email: kranzler@mail.med.upenn.edu, Treatment Research Center, University of Pennsylvania, Perelman School of Medicine and Philadelphia VAMC, Philadelphia, PA, USA

References

  • 1. Sorlie T. Introducing molecular subtyping of breast cancer into the clinic? Journal of clinical oncology : official journal of the American Society of Clinical Oncology. 2009;27(8):1153. doi: 10.1200/JCO.2008.20.6276.
  • 2. Godfrey A, Leonard M, Donnelly S, Conroy M, Laighin Gi, Meagher D. Validating a new clinical subtyping scheme for delirium with electronic motion analysis. Psychiatry Research. 2010;178(1):186–190. doi: 10.1016/j.psychres.2009.04.010.
  • 3. Glahn DC, Curran JE, Winkler AM, Carless MA, Kent JW, Charlesworth JC, Johnson MP, Goring HHH, Cole SA, Dyer TD, Moses EK, Olvera RL, Kochunov P, Duggirala R, Fox PT, Almasy L, Blangero J. High dimensional endophenotype ranking in the search for major depression risk genes. Biological Psychiatry. 2012;71(1):6–14. doi: 10.1016/j.biopsych.2011.08.022.
  • 4. Chan G, Gelernter J, Oslin D, Farrer L, Kranzler HR. Empirically derived subtypes of opioid use and related behaviors. Addiction. 2011;106(6):1146–1154. doi: 10.1111/j.1360-0443.2011.03390.x.
  • 5. Sun J, Bi J, Chan G, Anton RF, Oslin D, Farrer L, Gelernter J, Kranzler HR. Improved methods to identify stable, highly heritable subtypes of opioid use and related behaviors. Addictive Behaviors. 2012;37(10):1138–1144. doi: 10.1016/j.addbeh.2012.05.010.
  • 6. Kranzler HR, Wilcox M, Weiss RD, Brady K, Hesselbrock V, Rounsaville B, Farrer L, Gelernter J. The validity of cocaine dependence subtypes. Addictive Behaviors. 2008;33(1):41–53. doi: 10.1016/j.addbeh.2007.05.011.
  • 7. Gelernter J, Panhuysen C, Wilcox M, Hesselbrock V, Rounsaville B, Poling J, Weiss R, Sonne S, Zhao H, Farrer L, Kranzler HR. Genomewide linkage scan for opioid dependence and related traits. Am J Hum Genet. 2006;78(5):759–69. doi: 10.1086/503631.
  • 8. Blum A, Mitchell T. Combining labeled and unlabeled data with co-training. Proceedings of the 11th Annual Conference on Computational Learning Theory; New York, NY, USA: ACM; 1998. pp. 92–100.
  • 9. Kumar A, Daumé H III. A co-training approach for multi-view spectral clustering. In: Getoor L, Scheffer T, editors. Proceedings of the 28th International Conference on Machine Learning; New York, NY, USA: ACM; 2011. pp. 393–400.
  • 10. Dunn JC. Well separated clusters and optimal fuzzy-partitions. Journal of Cybernetics. 1974;4:95–104.
  • 11. Davies DL, Bouldin DW. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1979 Apr;PAMI-1(2):224–227.
  • 12. Rousseeuw P. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987 Nov;20(1):53–65.
  • 13. Bi J. Multi-objective programming in SVMs. Proceedings of the 20th International Conference on Machine Learning; 2003; pp. 35–42.
  • 14. Luxburg U. A tutorial on spectral clustering. Statistics and Computing. 2007 Dec;17:395–416.
  • 15. Zhu J, Rosset S, Hastie T, Tibshirani R. 1-norm support vector machines. Neural Information Processing Systems. MIT Press; 2003. p. 16.
  • 16. Hagen L, Kahng AB. New spectral methods for ratio cut partitioning and clustering. IEEE Trans Comput-Aided Design Integr Circuits Syst. 1992 Sep;11(9):1074–1085.
  • 17. Shi J, Malik J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2000;22(8):888–905.
  • 18. Wagner D, Wagner F. Between min cut and graph bisection. Mathematical Foundations of Computer Science. 1993:744–750.
  • 19. Kirkpatrick S, Gelatt CD, Vecchi MP. Optimization by simulated annealing. Science. 1983;220(4598):671–680. doi: 10.1126/science.220.4598.671.
  • 20. Ingber L. Very fast simulated re-annealing. Mathematical and Computer Modelling. 1989;12(8):967–973.
  • 21. Pierucci-Lagha A, Gelernter J, Chan G, Arias A, Cubells JF, Farrer L, Kranzler HR. Reliability of DSM-IV diagnostic criteria using the Semi-Structured Assessment for Drug Dependence and Alcoholism (SSADDA). Drug Alcohol Depend. 2007;91(1):85–90. doi: 10.1016/j.drugalcdep.2007.04.014.
  • 22. Murtagh F. Multiple correspondence analysis and related methods. Psychometrika. 2007;72(2):275–277.
  • 23. Hodgkinson CA, Yuan Q, Xu K, Shen PH, Heinz E, Lobos EA, Binder EB, Cubells J, Ehlers CL, Gelernter J, Mann J, Riley B, Roy A, Tabakoff B, Todd RD, Zhou Z, Goldman D. Addictions biology: haplotype-based analysis for 130 candidate genes on a single array. Alcohol Alcohol. 2008;43(5):505–15. doi: 10.1093/alcalc/agn032.
