Abstract
Background
Analyzing longitudinal gene expression data is extremely challenging due to limited prior information, high dimensionality, and heterogeneity. Similar difficulties arise in research on multifactorial diseases such as Type 2 Diabetes. Clustering methods can be applied to automatically group similar observations, and clinical values shared within the resulting groups suggest potential associations. However, applying traditional clustering methods directly to gene expression over time fails to capture variations in the response. Shape-based clustering can therefore be applied to identify patient groups by gene expression variation during a long-term metabolic compensatory intervention.
Objectives
To search for clinical grouping patterns among subjects who showed a similar structure in the variation of IL-1β gene expression over time.
Methods
A new approach for shape-based clustering by IL-1β expression behavior was applied to a real longitudinal database of Type 2 Diabetes patients. To correctly capture variations in the response, we applied traditional clustering methods to the slopes between measurements.
Results
In this setting, the application of K-Medoids using the Manhattan distance yielded the best results for the corresponding database. Among the resulting groups, one of the clusters presented significant differences in many key clinical values regarding the metabolic syndrome in comparison to the rest of the data.
Conclusions
The proposed method can be used to group patients according to variation patterns in gene expression (or in other applications) and thus provide clinical insights even when there is no previous knowledge of the subjects' clinical profile and there are few timepoints for each individual.
Keywords: Shape-based clustering, Longitudinal data, Gene expression, Type 2 Diabetes, Knowledge Discovery in Databases
Introduction
Type 2 Diabetes (T2D) is one of the most complex, prevalent and heterogeneous diseases, and its etiology involves multiple interactions between predisposing genetic factors and environmental triggers [1]. Inflammation is a relevant component of the pathophysiological alterations that define the progression from metabolic syndrome to T2D [2]. Interleukin-1 beta (IL-1β) is a proinflammatory cytokine related to this clinical inflammation in T2D individuals, and is a well-known immune system modulator secreted by activated macrophages that can affect β-cell function and reduce insulin secretion [3]. Currently, one of the most important lines of research in diabetes is precision medicine, with the principal aim of grouping T2D individuals into different clinical subtypes defined by biomarkers. This group identification could be translated into an emerging approach to disease treatment and prevention that considers individual variability in genes, environment and lifestyle [4]. Along these lines, Ahlqvist et al. recently broke down T2D subjects into five distinct subgroups, with improved prediction of disease progression and outcome, by including six variables (age at diagnosis, body mass index [BMI], glycated haemoglobin [HbA1c], Glutamic Acid Decarboxylase Autoantibodies [GADA], estimation of insulin secretion [HOMA-B] and estimation of insulin sensitivity [HOMA-IS]) [5]. The measurement of GADA by Ahlqvist et al. assessed the possible diagnosis of LADA (Latent Autoimmune Diabetes in Adults). The role of precision medicine in diabetes management was recognized by the American Diabetes Association (ADA) in collaboration with the European Association for the Study of Diabetes (EASD), which launched the Precision Medicine in Diabetes Initiative in 2018 [6].
The ultimate goal of precision medicine is the personalized provision of medical care, with better recognition of people at high risk of developing T2D and its complications, and the implementation of personalized treatments at the individual level. In this sense, artificial intelligence could be used to detect clinical subtypes by matching individuals to their combinations of different biomarkers, with techniques such as large-scale prediction models. In recent decades, there have been great developments in methods for gene expression analysis, giving rise to an abundant quantity of data [7]. Since the amount of data grows faster than the ability to understand its implications, methods that allow conclusions to be drawn from gene expression data can be very useful to narrow this gap. The analysis of gene expression data can be very challenging due to limited prior knowledge of the observed phenomenon, heterogeneity, noise in the data and missing observations in the subject data [8]. Therefore, data mining tools that can provide potential relationships among variables can be very useful for a clinical comprehension framework. Longitudinal studies include repeated measures of a variable of interest (usually called a response) in the same subject over time, yielding multiple responses per individual, referred to as a response trajectory. In this work, the response variable is gene expression at a certain time point, and the response trajectory describes the evolution of gene expression for a certain individual over time. It must be pointed out that when there are few time points and mistimed measurements, the mathematical tools that can be used are limited. For example, Fourier transforms, a standard procedure for time series, are no longer suitable with so few measurements. In this setting, the increase or decrease of gene expression between different measurement occasions can be studied [9].
Clustering algorithms aim to group observations according to some measure of similarity or, conversely, to separate observations according to dissimilarity. When quantitative variables are involved, the dissimilarity can be based on distance measures. The selection of these features is closely related to the application area and the research objective. Regarding clustering algorithms, K-Means is the most popular method due to its low computational complexity and its performance on big data. A variation of this method is the Kernel-based K-Means algorithm [10]. The major disadvantage of these algorithms is their susceptibility to outliers and to the random initial group assignment. An alternative is the K-Medoids algorithm [11], which is more robust to outliers and initialization than K-Means. Some works proposed clustering subjects according to the corresponding variation of gene expression, suggesting associations between a certain behavior of gene expression over time and other variables [9], [12]. Many publications used this approach assuming simultaneous measurements to cluster different genes according to the increase or decrease in their expression, defining groups of co-expressed genes, or activating and repressing genes [13-18]. On most occasions, data corresponding to different subjects are not collected simultaneously, and other strategies must be used. Möller-Levet et al. applied a clustering algorithm to the transcriptional program of budding yeast, allowing mistimed measurements [19-20]. In a similar way, similarities in the variation of gene expression can suggest associations with observable clinical features, which can be a starting point for further investigation.
Objectives
The main objective of this work was to cluster subjects in order to find relationships between patterns in Interleukin-1 beta (IL-1β) variation and clinical metabolic variables from a database of T2D patients. A secondary objective was to focus on potential associations with obesity and metabolic syndrome as central clinical phenotypes. In this article, we perform a new analysis of data from a cohort of patients previously studied by our research group [21].
Materials and methods
1. Prospective controlled study database
The database used for the development of the clustering algorithm included the results of a prospective controlled study in which patients with newly diagnosed T2D and hyperglycemia (HbA1c > 8%) were evaluated at baseline and after 6 and 12 months of treatment aimed at metabolic remission (HbA1c < 7%). The treatment was personalized: each participant received the first-line pharmacological treatment, and in all cases lifestyle changes were included through diet and physical exercise. Detailed information on this population can be found in our previously published manuscript [21]. It was the first follow-up study that evaluated IL-1β mRNA expression in hyperglycemic people with T2D after glycemic normalization treatment.
The study was conducted in a group of 30 adults (23.33% female and 76.67% male) with a median age of 46 years (IQR 18.75 years) recruited from the Diabetes Care Unit. All procedures performed in the study were in accordance with the ethical standards of the institutional research committee, the 1964 Helsinki Declaration and its later amendments or comparable ethical standards. The study was approved by the Ethics Committees of the Hospital de Clínicas “José de San Martín” from Ciudad Autónoma de Buenos Aires and all the participants gave their written informed consent. An anonymized database of pre- and post-intervention (6 and 12 months) instances was constructed for the data mining study. All individuals reported their age and gender. Anthropometric measurements (height, weight, and waist circumference), BMI, and systolic and diastolic blood pressure (SBP and DBP, respectively) were determined by standardized protocols. Venous blood samples were drawn from every individual, and high-density lipoprotein cholesterol (HDL-c), triglycerides (TG), fasting blood glucose (FBG) and HbA1c were measured in serum using standardized procedures [21]. Low-density lipoprotein cholesterol (LDL-c) was calculated by the Friedewald equation. Blood anticoagulated with EDTA K2 was used for mRNA extraction and IL-1β mRNA expression analysis.
2. Notation
All subjects considered in the study were required to have the same number of repeated measures over time (time is noted as variable t, at 0, 6 and 12 months from intervention). The subjects were grouped according to the variation over time of the IL-1β gene expression (noted as variable r). With this notation, for example, t(i,j) and r(i,j) represent the time point and the gene expression at measurement number j for subject i, respectively.
3. Clustering methods
We used hard partitioning methods for quantitative features. Clustering was performed after three sequential definitions:
A) the set of variables to be considered;
B) the distance between different observations of the variables defined in item A;
C) the clustering algorithm that groups the observations defined in item A according to the distance function defined in item B.
A) The clustering objective was to group subjects according to the increase or decrease of gene expression; therefore, the algorithm considered two subjects as similar if the corresponding slopes between time points were similar. For each subject i, the variation of gene expression r between time points j–1 and j is given by the following slope value:
$$m(i,j) = \frac{r(i,j) - r(i,j-1)}{t(i,j) - t(i,j-1)}$$
Therefore, if each subject i has a set of J repeated measures noted r(i), the same subject has a corresponding set of slopes m(i) with J-1 values. These sets of slopes will be noted as slope vectors. Thus, the slope vectors m(i) were used instead of using the response vectors r(i) for each subject i. Hence, the automatic grouping relied on the distance between slopes.
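As an illustration of item A, the following minimal Python sketch computes the slope vector of one subject from its measurement times and responses. The function name `slope_vector` and the numeric values are hypothetical and are not taken from the study database; the original analysis was run in R.

```python
def slope_vector(times, responses):
    """Return the J-1 slopes between consecutive measurements of one subject.

    times[j] and responses[j] play the role of t(i,j) and r(i,j) for a fixed
    subject i (indices are 0-based here, unlike the 1-based notation in the text).
    """
    if len(times) != len(responses):
        raise ValueError("times and responses must have the same length")
    return [
        (responses[j] - responses[j - 1]) / (times[j] - times[j - 1])
        for j in range(1, len(times))
    ]


# Hypothetical values, for illustration only.
times = [0, 6, 12]            # months from intervention
responses = [3.0, 0.0, 1.5]   # scaled gene expression
slopes = slope_vector(times, responses)
print(slopes)
```

With J = 3 measurements the subject is described by J-1 = 2 slopes, here a decrease followed by an increase.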
B) Whenever it was applicable, two distance functions were considered:
• the Euclidean distance [22], which adds the squared differences of slopes and applies a square root to the result. For example, for two subjects i and k the distance is computed as:
$$d_E(i,k) = \sqrt{\sum_{j=2}^{J} \left( m(i,j) - m(k,j) \right)^2}$$
• the Manhattan distance [23], which adds the absolute values of the slope differences. For example, for two subjects i and k the distance is calculated as follows:
$$d_M(i,k) = \sum_{j=2}^{J} \left| m(i,j) - m(k,j) \right|$$
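The two distances in item B can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the two slope vectors are hypothetical, and the example simply shows how the same pair of subjects gets a different dissimilarity under each definition.

```python
import math


def euclidean_distance(m_i, m_k):
    """Square root of the summed squared slope differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(m_i, m_k)))


def manhattan_distance(m_i, m_k):
    """Sum of the absolute slope differences."""
    return sum(abs(a - b) for a, b in zip(m_i, m_k))


# Hypothetical slope vectors of two subjects (J - 1 = 2 slopes each).
m_i = [-0.5, 0.25]    # decrease, then increase
m_k = [0.25, -0.75]   # increase, then decrease

d_euclidean = euclidean_distance(m_i, m_k)
d_manhattan = manhattan_distance(m_i, m_k)
print(d_euclidean, d_manhattan)
```

Because the Manhattan distance does not square the differences, a single large slope difference weighs less than under the Euclidean distance, which is one reason it is often paired with K-Medoids when outliers are a concern.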
C) Regarding the clustering methods, three alternatives were applied:
• K-Means (based on the Euclidean distance)
• Gaussian Kernel based K-Means (based on the Euclidean distance)
• K-Medoids (based on the Manhattan distance)
These clustering methods were used to group individuals according to their corresponding set of slopes m(i). Regarding the kernel function required for Kernel-based K-Means, a Gaussian kernel function was applied [10]. It must be pointed out that the K-Means based algorithms use the Euclidean distance exclusively, whereas the K-Medoids algorithm allows the use of other distance functions, such as the Manhattan distance. The algorithms were applied using standard commands of the R software. Details on the clustering algorithms are available in Hastie et al. [11]. In what follows, we will refer to clustering algorithms applied to the individuals’ sets of slopes as shape-based. For example, K-Medoids applied to the slope vectors m(i) can be referred to as Shape-Based K-Medoids. More details on these procedures can be found in Appendix C.
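To make the idea of Shape-Based K-Medoids concrete, the sketch below applies a simple alternating K-Medoids procedure with the Manhattan distance to a set of slope vectors. It is not the R routine used in the study: the function names, the deterministic farthest-first initialization and the nine slope vectors are all assumptions made for illustration.

```python
def manhattan(a, b):
    """Manhattan distance between two slope vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest(point, points, medoids):
    """Position (0..k-1) of the medoid closest to `point`."""
    return min(range(len(medoids)),
               key=lambda c: manhattan(point, points[medoids[c]]))


def k_medoids(points, k, max_iter=100):
    """Alternate assignment and medoid update until the medoids stop changing."""
    # Deterministic farthest-first start (an assumption of this sketch).
    medoids = [0]
    while len(medoids) < k:
        medoids.append(max(
            range(len(points)),
            key=lambda i: min(manhattan(points[i], points[m]) for m in medoids),
        ))
    for _ in range(max_iter):
        labels = [nearest(p, points, medoids) for p in points]
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster empties
                new_medoids.append(medoids[c])
                continue
            # The new medoid minimises the total distance to its cluster members.
            new_medoids.append(min(
                members,
                key=lambda i: sum(manhattan(points[i], points[j]) for j in members),
            ))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    labels = [nearest(p, points, medoids) for p in points]
    return medoids, labels


# Hypothetical slope vectors m(i): three trajectory shapes, three subjects each.
slope_vectors = [
    [-0.50, 0.40], [-0.45, 0.50], [-0.60, 0.45],   # decrease, then increase
    [0.50, -0.40], [0.55, -0.50], [0.45, -0.45],   # increase, then decrease
    [0.02, -0.01], [-0.03, 0.02], [0.01, 0.03],    # roughly stable
]
medoids, labels = k_medoids(slope_vectors, k=3)
print(labels)
```

In this toy example the three shapes are well separated, so each one ends up in its own cluster. Since every medoid is an observed subject, each cluster can be summarised by a real trajectory rather than by an average.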
4. Statistical Inference
Once the data were grouped in clusters, statistical tests were applied to the anthropometric and metabolic variables of the database, searching for group differences in BMI, HDL-c, TG, LDL-c, FBG, HbA1c, waist circumference, age and number of Metabolic Syndrome components [ncMS], according to the Adult Treatment Panel III (ATPIII) guidelines [24]. To assess the statistical significance of differences between and within groups, we performed non-parametric tests due to the small sample size and unverifiable distributional assumptions. The Kruskal-Wallis test was performed to assess the differences between groups [25], and the paired Wilcoxon test to assess the differences within groups [26].
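The between-group comparison can be illustrated with a small, self-contained Python sketch of the Kruskal-Wallis H statistic. It is a simplified version under stated assumptions: no tie correction is applied, the p-value uses the asymptotic chi-square approximation, which for exactly three groups (2 degrees of freedom) reduces to exp(-H/2), and the data are hypothetical. Statistical software typically adds a tie correction, so results may differ slightly when ties are present.

```python
import math


def average_ranks(values):
    """Rank values from 1 to N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            ranks[idx] = mean_rank
        pos = end + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction in this sketch)."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


def p_value_three_groups(h):
    """Asymptotic p-value for exactly three groups (chi-square with 2 df)."""
    return math.exp(-h / 2)


# Hypothetical values of one clinical variable in three clusters.
cluster_1 = [92, 97, 100]
cluster_2 = [101, 104, 109]
cluster_3 = [110, 112, 117]

h = kruskal_wallis_h([cluster_1, cluster_2, cluster_3])
p = p_value_three_groups(h)
print(round(h, 4), round(p, 4))
```

With groups this small the asymptotic p-value is only a rough guide, which is consistent with the cautious reading of p-values adopted in this work.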
Results
The database was analyzed and subjects with at least one missing response value were excluded, since a slope set comparable with those of other subjects could not be obtained. After removal, a total of 26 individuals remained for further analysis. The responses were scaled prior to partitioning: the mean gene expression was subtracted and the result was divided by the corresponding standard deviation [27]. Figure 1 shows the different groups resulting from the applied clustering algorithms. The partitions of the algorithms involving K-Means (upper [a] and central [b] panels of Figure 1) result in groups that tend to mix stable and highly variable gene expression trajectories. This effect can be explained by the lack of robustness of the K-Means algorithm. Inspecting the results, K-Medoids clustering (lower panel [c] of Figure 1) is preferred in this application based on the following observations: subjects in Cluster 1 had an initial decrease and a subsequent increase, subjects in Cluster 2 showed an initial increase and a subsequent decrease, whereas subjects in Cluster 3 had a stable level of IL-1β expression throughout the study, with small increases or decreases over time. Therefore, in the following, the results of K-Medoids are shown, since the requirements of variation similarity are met. Furthermore, although all clustering methods are subject to randomness, the K-Medoids algorithm was robust enough that running the procedure several times yielded the same partition. For the K-Medoids algorithm, it is worth mentioning that one subject in Cluster 2 showed increased gene expression in both time intervals. This subject was assigned to Cluster 2 because of the initial increase, which is not present in the other clusters, so the algorithm placed the subject in the most similar group. Morphologically, this subject could be seen as an outlier, and perhaps should have been classified in a separate group.
However, a single-subject cluster does not allow a correct between-group comparison. Given this clustering, a subsequent analysis was performed on the remaining variables of the database. The main results are given in Table 1. We found significant differences across groups in waist circumference, BMI, HDL-c and TG, and a tendency for LDL-c, but we did not find significant differences in FBG and HbA1c (Table 1). This similarity across groups in the glycemic variables is explained by the main objective of the original study design for the follow-up of T2D individuals: to attain a decrease in HbA1c levels for all participants. Also, since the Kruskal-Wallis test only detects that a difference exists between groups, the values of most variables were inspected further, and they suggest that this difference is mainly observed in subjects from Cluster 1. Table 1 shows that subjects in Cluster 1 presented a decrease in LDL-c and TG and an increase in HDL-c over time, whereas these values were stable for the other clusters. Also, BMI and waist circumference values for subjects in Cluster 1 were smaller than those of the other clusters, also suggesting healthier features for Cluster 1. In addition, the paired Wilcoxon test was applied to all variables, comparing the values at the start and the end of the study. The Wilcoxon test was not performed for ncMS and age since these values do not vary over time. The lowest p-values corresponded to Cluster 1, suggesting greater differences in key variables for subjects in this group. Even though statistical significance was not achieved, these p-values are close to 10%, which suggests considerable differences in the variables, given the small number of subjects and that non-parametric tests generally provide less statistical power. In addition, the p-values for Cluster 1 are considerably lower than the values corresponding to the other clusters, reinforcing the observable difference between the evolutions of people from different clusters.
Although we found differences in age, none of the variables analyzed showed a significant association with age (data not shown).
Figure 1:

IL-1β (2^-ΔCt) expression over time, grouped according to the slopes between time points using the three clustering algorithms described in Section 3.3: (a) K-Means (upper panel), (b) Kernel K-Means (center panel) and (c) K-Medoids (lower panel).
Table 1:
Observed differences in quantitative variables of the dataset, separated by time measurement (at 0, 6 and 12 months). The waist circumference results at 6 months were omitted due to a low proportion of observed data. m: median; IQR: interquartile range; BMI: body mass index; HDL-c: high-density lipoprotein cholesterol; LDL-c: low-density lipoprotein cholesterol; TG: triglycerides; HbA1c: glycated haemoglobin; FBG: fasting blood glucose; ncMS: number of Metabolic Syndrome components.
| Variable | Time | Cluster 1 m (IQR) | Cluster 2 m (IQR) | Cluster 3 m (IQR) | P-value (Kruskal-Wallis) |
|---|---|---|---|---|---|
| Waist circumference (cm) | 0 mo | 100 (92-102) | 104 (100-109) | 112 (100-117) | 0.0149 |
| | 12 mo | 100 (97-103) | 106 (101-114) | 111 (98-119) | |
| p-value (Wilcoxon) | 0/12 mo | 0.8922 | 1.0000 | 0.9056 | |
| BMI (kg/m2) | 0 mo | 31.11 (29.22-31.42) | 32.91 (31.23-38.02) | 34.02 (31.60-37.92) | 0.0106 |
| | 6 mo | 29.68 (29.18-30.11) | 33.57 (29.56-38.22) | 32.50 (31.65-35.75) | |
| | 12 mo | 30.48 (29.91-30.85) | 33.28 (30.35-37.90) | 32.80 (28.19-33.85) | |
| p-value (Wilcoxon) | 0/12 mo | 0.0544 | 0.0852 | 0.1358 | |
| HDL-c (mmol/L) | 0 mo | 1.10 (1.01-1.31) | 1.09 (0.83-1.16) | 1.01 (0.91-1.03) | 0.0470 |
| | 6 mo | 1.22 (1.00-1.47) | 1.06 (0.92-1.11) | 1.05 (0.94-1.14) | |
| | 12 mo | 1.32 (1.34-1.45) | 1.11 (0.98-1.40) | 1.09 (0.98-1.11) | |
| p-value (Wilcoxon) | 0/12 mo | 0.0544 | 0.0852 | 0.1358 | |
| LDL-c (mmol/L) | 0 mo | 3.22 (2.97-3.30) | 3.00 (2.22-3.08) | 3.29 (2.90-3.75) | 0.0718 |
| | 6 mo | 3.19 (2.87-3.44) | 2.56 (2.37-2.97) | 3.11 (2.82-4.40) | |
| | 12 mo | 2.28 (2.22-2.38) | 2.72 (2.57-3.09) | 3.13 (2.81-3.60) | |
| p-value (Wilcoxon) | 0/12 mo | 0.1250 | 0.9453 | 0.8125 | |
| TG (mmol/L) | 0 mo | 1.46 (1.27-1.69) | 2.06 (1.45-2.74) | 1.51 (1.32-2.01) | 0.0047 |
| | 6 mo | 1.64 (0.97-1.88) | 2.42 (2.09-3.20) | 2.27 (1.86-2.94) | |
| | 12 mo | 0.93 (0.90-0.99) | 2.19 (1.67-2.72) | 1.47 (1.32-2.75) | |
| p-value (Wilcoxon) | 0/12 mo | 0.1250 | 1.0000 | 0.7597 | |
| HbA1c (%) | 0 mo | 8.6 (8.0-10.1) | 9.5 (9.0-10.8) | 8.1 (7.9-11.2) | 0.6652 |
| | 6 mo | 6.2 (6.1-6.4) | 6.4 (5.9-6.9) | 6.7 (5.8-7.2) | |
| | 12 mo | 5.9 (5.7-6.1) | 6.1 (5.6-6.8) | 6.2 (5.9-7.0) | |
| p-value (Wilcoxon) | 0/12 mo | 0.0625 | 0.0039 | 0.0029 | |
| FBG (mmol/L) | 0 mo | 8.69 (7.41-15.17) | 8.16 (7.38-15.01) | 8.77 (7.33-12.10) | 0.8086 |
| | 6 mo | 5.91 (5.76-6.52) | 5.94 (5.27-6.97) | 6.33 (5.83-7.89) | |
| | 12 mo | 5.99 (5.83-6.22) | 6.33 (5.61-6.66) | 6.49 (6.27-7.44) | |
| p-value (Wilcoxon) | 0/12 mo | 0.0625 | 0.0078 | 0.0322 | |
| ncMS | | 3 (2-4) | 4 (3-4) | 4 (3-5) | 0.05907 |
| Age (Years) | | 60 (57-62) | 42 (39-52) | 46 (40-58) | 0.00423 |
Discussion
In the current application, the K-Medoids clustering method using the Manhattan distance applied to the slopes attained the best results concerning the main objective, which was to group subjects according to the variation in the response of IL-1β expression and to show differential behaviour in clinical variables. The other clustering algorithms considered in our work [11], when applied to the slopes, yielded heterogeneous groups and therefore did not meet the desired qualities for such clustering. Similar results are shown in Appendix B for another controlled database. The use of the slopes as the key features of the grouping generalizes previous proposals [20]. In this new framework, any traditional clustering method can be applied to group subjects according to variations in the response. Unlike the application of clustering algorithms to the original data r(i), small distances between the slope vectors m(i) indicate similar characteristics in the variation of gene expression. Therefore, the use of the slopes expands the already vast world of clustering methods, since these algorithms can be applied in both settings but yield different results. More details are given in Appendix A. The clustering yielded three distinct groups, clearly differentiable when compared clinically and biochemically in Table 1. There were significant differences in waist circumference and BMI between the different clusters, so it would also be necessary to analyze the contribution of obesity to the IL-1β expression that allowed these groups to be separated. Intra-cluster analysis showed that in Cluster 1, although the proposed metabolic compensation goal was reached, the decrease in FBG and HbA1c did not reach statistical significance. Also, a decreasing trend in BMI and metabolic improvements in HDL-c values were observed. In Clusters 2 and 3, the compensation goal was reached, as shown by a significant decrease in HbA1c and FBG.
In Cluster 2 we also found a downward trend in BMI and HDL-c, but there were no anthropometric or lipid variations in Cluster 3. These results suggest that Cluster 3 showed the worst metabolic profile. In subsequent studies, it would be interesting to evaluate variables related to cardiovascular risk. Usually, non-parametric tests are less powerful (prone to discard real differences as nonsignificant) and the p-values can also be affected by the small sample size [25]. Consequently, the standard significance level of 5% can be too restrictive for this particular application of the statistical tests, and p-values that are higher than but close to 5% were considered for analysis. However, the strength of the obtained results is reinforced by the long-term changes in individual nutrition and physical activity habits, and also by the time-varying nature of the system under study. As future work, it would be necessary to analyze a larger number of individuals to improve the individualized model and to reinforce our conclusions. Most clinical applications of gene clustering algorithms, which can be phenotype-based or gene-based, do not consider the longitudinal evolution of gene expression. To the best of our knowledge, this approach has not yet been addressed as a clinical application in the literature. In the work of Pearson et al. [28], the consideration of longitudinal evolution was focused on phenotype follow-up rather than gene expression, whereas our work considered both gene expression and phenotype over time. Further investigation could profit from the use of all these perspectives to improve algorithm performance. Furthermore, clinical application studies that considered the longitudinal evolution of gene expression used supervised learning algorithms, in which the outcome variable was known and used for further predictions [29-33].
The methodology presented in this work involves unsupervised learning and can be applied when this prior knowledge is absent or limited and new associations are sought. Also, since most available gene expression data comes from countries with strong European ancestry, further research could provide data from other countries that can enrich precision medicine, based on more diverse data sources [34]. Our work used hard partitioning to automatically group individuals from the study. Many other works focused on the use of soft clustering, which is preferred for big data [35]. However, in small studies like ours, with patients undergoing long-term treatments, the groups should be well defined in order to achieve an adequate between-group comparison. A possible extension of this work could be the application of soft clustering to the slopes in longitudinal studies with a large number of subjects, allowing the determination of larger groups according to a strong association. One of the advantages of the proposed data mining procedure is that it does not require measurement times to be equal among all individuals, which is a frequent requirement for similar algorithms. In this study, the measurements were taken with the same protocol for every subject, so differences in measurement times have little impact on the calculations; nevertheless, the algorithm easily adapts to such situations. Furthermore, the algorithm is not restricted to gene expression and performs well in other applications, or in cases in which other methods are not recommended, such as studies with few time points and no prior knowledge regarding the observed phenomenon, which is a frequent issue in clinical case studies. Also, it is important to remark that this lack of prior knowledge allows us to search for associations between variables that were not previously thought to be linked.
However, it must be pointed out that any prior knowledge regarding the application can be used to improve the algorithm, for example by guiding the selection of specific distances between slopes. Since the presented database does not have a massive number of observations, the computational cost of K-Medoids was a drawback without major consequences. However, for massive databases, K-Means or Kernel-based K-Means can be a better option. Another issue worth mentioning is that the proposed method is an exploratory analysis tool and should not be used as a statistical inference tool. Any result obtained with the method should be further tested in a controlled experiment with a bigger sample size in order to attain satisfactory and pertinent inferences.
Conclusions
Our study showed that clustering individuals according to the variation in gene expression enabled us to find important clinical features that could allow the identification of differentially grouped metabolic behaviors not attained by other data analyses. With further studies, this could be translated into improved clinical management of each individual considering the group assignment. The achieved results show that the proposed approach can significantly improve predictive performance and is effective when other established methods are not recommended due to the nature of the data, such as small sample sizes, few timepoints, heterogeneity and abrupt changes in gene expression between timepoints. T2D is a complex and heterogeneous disease. Therefore, identifying clusters with a similar clinical phenotype will allow health professionals to evaluate increased risk, assess clinical evolution and apply specific and personalized treatment to these groups of individuals. Precision medicine can improve the quality of life of people with T2D by helping them improve glycemic control and prevent complications.
Appendix A
Figure 2:



Appendix B
Figure 3:



Appendix C
References
- 1.Bowman P, Flanagan SE, Hattersley AT. Future Roadmaps for Precision Medicine Applied to Diabetes: Rising to the Challenge of Heterogeneity. J Diabetes Res. 2018. Nov 27;2018:3061620. doi: 10.1155/2018/3061620. PMID: 30599002; PMCID: PMC6288579. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Cruz NG, Sousa LP, Sousa MO, Pietrani NT, Fernandes AP, Gomes KB. The linkage between inflammation and Type 2 diabetes mellitus. Diabetes Res Clin Pract. 2013. Feb;99(2):85-92. doi: 10.1016/j.diabres.2012.09.003. Epub 2012 Dec 14. PMID: 23245808. [DOI] [PubMed] [Google Scholar]
- 3.Nackiewicz D, Dan M, He W, Kim R, Salmi A, Rütti S, Westwell-Roper C, Cunningham A, Speck M, Schuster-Klein C, Guardiola B, Maedler K, Ehses JA. TLR2/6 and TLR4-activated macrophages contribute to islet inflammation and impair beta cell insulin gene expression via IL-1 and IL-6. Diabetologia. 2014. Aug;57(8):1645-1654. doi: 10.1007/s00125-014-3249-1. Epub 2014 May 12. PMID: 24816367. [DOI] [PubMed] [Google Scholar]
- 4.Florez JC. Precision Medicine in Diabetes: Is It Time? Diabetes Care. 2016. Jul;39(7):1085-1088. doi: 10.2337/dc16-0586. Epub 2016 Jun 11. PMID: 27289125. [DOI] [PubMed] [Google Scholar]
- 5.Ahlqvist E, Storm P, Käräjämäki A, Martinell M, Dorkhan M, Carlsson A, Vikman P, Prasad RB, Aly DM, Almgren P, Wessman Y, Shaat N, Spégel P, Mulder H, Lindholm E, Melander O, Hansson O, Malmqvist U, Lernmark Å, Lahti K, Forsén T, Tuomi T, Rosengren AH, Groop L. Novel subgroups of adult-onset diabetes and their association with outcomes: a data-driven cluster analysis of six variables. Lancet Diabetes Endocrinol. 2018. May;6(5):361-369. doi: 10.1016/S2213-8587(18)30051-2. Epub 2018 Mar 5. PMID: 29503172. [DOI] [PubMed] [Google Scholar]
- 6.Nolan JJ, Kahkoska AR, Semnani-Azad Z, Hivert MF, Ji L, Mohan V, Eckel RH, Philipson LH, Rich SS, Gruber C, Franks PW. ADA/EASD Precision Medicine in Diabetes Initiative: An International Perspective and Future Vision for Precision Medicine in Diabetes. Diabetes Care. 2022. Feb 1;45(2):261-266. doi: 10.2337/dc21-2216. PMID: 35050364; PMCID: PMC8914425. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Baldi P, Hatfield GW. DNA Microarrays and Gene Expression: From Experiments to Data Analysisand Modeling. Cambridge University Press. 2002. [Google Scholar]
- 8.Altiparmak F, Ferhatosmanoglu H, Erdal S, Trost DC. Information mining over heterogeneous and high-dimensional time-series data in clinical trials databases. IEEE Trans Inf Technol Biomed. 2006. Apr;10(2):254-263. doi: 10.1109/titb.2005.859885. PMID: 16617614. [DOI] [PubMed] [Google Scholar]
- 9.Bar-Joseph Z, Gitter A, Simon I. Studying and modelling dynamic biological processes using time-series gene expression data. Nat Rev Genet. 2012. Jul 18;13(8):552-564. doi: 10.1038/nrg3244. PMID: 22805708. [DOI] [PubMed] [Google Scholar]
- 10.Marin D, Tang M, Ayed IB, Boykov Y. Kernel Clustering: Density Biases and Solutions. IEEE Trans Pattern Anal Mach Intell. 2019. Jan;41(1):136-147. doi: 10.1109/TPAMI.2017.2780166. Epub 2017 Dec 6. PMID: 29990278. [DOI] [PubMed] [Google Scholar]
- 11.Hastie T, Tibshirani R, Friedman J. The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media. 2009. [Google Scholar]
- 12.Ernst J, Nau GJ, Bar-Joseph Z. Clustering short time series gene expression data. Bioinformatics. 2005. Jun;21 Suppl 1:i159-i168. doi: 10.1093/bioinformatics/bti1022. PMID: 15961453. [DOI] [PubMed] [Google Scholar]
- 13.Chira C, Sedano J, Camara M, Prieto C, Villar JR, Corchado E. A cluster merging method for time series microarray with production values. Int J Neural Syst. 2014. Sep;24(6):1450018. doi: 10.1142/S012906571450018X. Epub 2014 Jul 24. PMID: 25081426. [DOI] [PubMed] [Google Scholar]
- 14.Cinar O, Ilk O, Iyigun C. Clustering of short time-course gene expression data with dissimilar replicates. Annals of Operations Research. 2018;263(1-2):405-428.
- 15.Coffey N, Hinde J, Holian E. Clustering longitudinal profiles using p-splines and mixed effects models applied to time-course gene expression data. Computational Statistics & Data Analysis. 2014;71:14-29.
- 16.Futschik ME, Carlisle B. Noise-robust soft clustering of gene expression time-course data. J Bioinform Comput Biol. 2005 Aug;3(4):965-988. doi: 10.1142/s0219720005001375. PMID: 16078370.
- 17.Hestilow TJ, Huang Y. Clustering of gene expression data based on shape similarity. EURASIP J Bioinform Syst Biol. 2009;2009(1):195712. doi: 10.1155/2009/195712. Epub 2009 Apr 23. PMID: 19404484; PMCID: PMC3171421.
- 18.Son YS, Baek J. A modified correlation coefficient based similarity measure for clustering time-course gene expression data. Pattern Recognition Letters. 2008;29(3):232-242. doi: 10.1016/j.patrec.2007.09.015.
- 19.Chechik G, Oh E, Rando O, Weissman J, Regev A, Koller D. Activity motifs reveal principles of timing in transcriptional control of the yeast metabolic network. Nat Biotechnol. 2008 Nov;26(11):1251-1259. doi: 10.1038/nbt.1499. PMID: 18953355; PMCID: PMC2651818.
- 20.Möller-Levet CS, Klawonn F, Cho KH, Yin H, Wolkenhauer O. Clustering of unevenly sampled gene expression. Fuzzy Sets and Systems. 2005;152(1):49-66.
- 21.Iglesias Molli AE, Bergonzi MF, Spalvieri MP, Linari MA, Frechtel GD, Cerrone GE. Relationship between the IL-1β serum concentration, mRNA levels and rs16944 genotype in the hyperglycemic normalization of T2D patients. Sci Rep. 2020 Jun 19;10(1):9985. doi: 10.1038/s41598-020-66751-x. PMID: 32561825; PMCID: PMC7305205.
- 22.Wierzchoń ST, Kłopotek MA. Modern Algorithms of Cluster Analysis. Springer. 2018.
- 23.Yu J, Tian Q, Amores J, Sebe N. Toward Robust Distance Metric Analysis for Similarity Estimation. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 2006;1:316-322.
- 24.Grundy SM, Cleeman JI, Daniels SR, Donato KA, Eckel RH, Franklin BA, Gordon DJ, Krauss RM, Savage PJ, Smith SC Jr, Spertus JA, Costa F; American Heart Association; National Heart, Lung, and Blood Institute. Diagnosis and management of the metabolic syndrome: an American Heart Association/National Heart, Lung, and Blood Institute Scientific Statement. Circulation. 2005 Oct 25;112(17):2735-2752. doi: 10.1161/CIRCULATIONAHA.105.169404. Epub 2005 Sep 12. Erratum in: Circulation. 2005 Oct 25;112(17):e297. Erratum in: Circulation. 2005 Oct 25;112(17):e298. PMID: 16157765.
- 25.Ostertagova E, Ostertag O, Kováč J. Methodology and application of the Kruskal-Wallis test. Applied Mechanics and Materials. 2014;611:115-120.
- 26.Wilcoxon F. Probability tables for individual comparisons by ranking methods. Biometrics. 1947 Sep;3(3):119-122. PMID: 18903631.
- 27.Milligan G, Cooper M. A study of standardization of variables in cluster analysis. Journal of Classification. 1988;5(2):181-204.
- 28.Wesolowska-Andersen A, Brorsson CA, Bizzotto R, Mari A, Tura A, Koivula R, Mahajan A, Vinuela A, Tajes JF, Sharma S, Haid M, Prehn C, Artati A, Hong MG, Musholt PB, Kurbasic A, De Masi F, Tsirigos K, Pedersen HK, Gudmundsdottir V, Thomas CE, Banasik K, Jennison C, Jones A, Kennedy G, Bell J, Thomas L, Frost G, Thomsen H, Allin K, Hansen TH, Vestergaard H, Hansen T, Rutters F, Elders P, t’Hart L, Bonnefond A, Canouil M, Brage S, Kokkola T, Heggie A, McEvoy D, Hattersley A, McDonald T, Teare H, Ridderstrale M, Walker M, Forgie I, Giordano GN, Froguel P, Pavo I, Ruetten H, Pedersen O, Dermitzakis E, Franks PW, Schwenk JM, Adamski J, Pearson E, McCarthy MI, Brunak S; IMI DIRECT Consortium. Four groups of type 2 diabetes contribute to the etiological and clinical heterogeneity in newly diagnosed individuals: An IMI DIRECT study. Cell Rep Med. 2022 Jan 4;3(1):100477. doi: 10.1016/j.xcrm.2021.100477. PMID: 35106505; PMCID: PMC8784706.
- 29.Baranzini SE, Mousavi P, Rio J, Caillier SJ, Stillman A, Villoslada P, Wyatt MM, Comabella M, Greller LD, Somogyi R, Montalban X, Oksenberg JR. Transcription-based prediction of response to IFNbeta using supervised computational methods. PLoS Biol. 2005 Jan;3(1):e2. doi: 10.1371/journal.pbio.0030002. Epub 2004 Dec 28. PMID: 15630474; PMCID: PMC539058.
- 30.Borgwardt KM, Vishwanathan SV, Kriegel HP. Class prediction from time series gene expression profiles using dynamical systems kernels. Pac Symp Biocomput. 2006:547-558. PMID: 17094268.
- 31.Costa IG, Schönhuth A, Hafemeister C, Schliep A. Constrained mixture estimation for analysis and robust classification of clinical time series. Bioinformatics. 2009 Jun 15;25(12):i6-i14. doi: 10.1093/bioinformatics/btp222. PMID: 19478017; PMCID: PMC2687976.
- 32.Calvano SE, Xiao W, Richards DR, Felciano RM, Baker HV, Cho RJ, Chen RO, Brownstein BH, Cobb JP, Tschoeke SK, Miller-Graziano C, Moldawer LL, Mindrinos MN, Davis RW, Tompkins RG, Lowry SF; Inflamm and Host Response to Injury Large Scale Collab. Res. Program. A network-based analysis of systemic inflammation in humans. Nature. 2005 Oct 13;437(7061):1032-1037. doi: 10.1038/nature03985. Epub 2005 Aug 31. Erratum in: Nature. 2005 Dec 1;438(7068):696. PMID: 16136080.
- 33.den Teuling NGP, Pauws SC, van den Heuvel ER. A comparison of methods for clustering longitudinal data with slowly changing trends. Communications in Statistics: Simulation and Computation. 2023;52(3):621-648.
- 34.Aiming for equitable precision medicine in diabetes. Nat Med. 2022 Nov;28(11):2223. doi: 10.1038/s41591-022-02105-6. PMID: 36333401.
- 35.Ben Ayed A, Ben Halima M, Alimi A. Survey on clustering methods: Towards fuzzy clustering for big data. In: International Conference on Computational Intelligence in Security for Information Systems. 2015;9. doi: 10.1007/978-3-319-47364-2_55.
