ABSTRACT
Venous thromboembolism (VTE) comprises provoked and unprovoked forms; accurate classification informs anticoagulation duration and recurrence risk but is limited by clinical phenotyping. We analysed the GSE48000 whole‐blood transcriptomes to identify differentially expressed genes (DEGs) between provoked and unprovoked VTE. DEGs underwent GO and KEGG enrichment analysis. Random forest ranked features, and an artificial neural network (ANN) built on the top 30 genes was evaluated for discrimination using stratified 10‐fold cross‐validation with receiver operating characteristic (ROC) analysis. A 30‐gene signature separated the two subtypes. Most genes showed lower expression in unprovoked VTE; the exceptions were GDF2, LGALS2 and LOC100130229, which were upregulated. Enrichment analyses highlighted immune regulation and vesicle‐transport pathways. The ANN achieved a cross‐validated AUC of 0.799 in this dataset. Transcriptomic profiling coupled with machine learning distinguished provoked from unprovoked VTE with good internal discrimination, supporting the feasibility of artificial intelligence (AI)‐based molecular diagnostics for classification and risk assessment. Prospective validation in larger, independent cohorts is warranted.
Keywords: bioinformatics, cardiovascular system, diseases, learning (artificial intelligence), neural nets
This study mined the GSE48000 blood transcriptome and identified a 30‐gene signature that separates provoked from unprovoked VTE, enriched for immune regulation and vesicle‐transport pathways. An ANN built on these genes achieved an AUC of 0.799 in this dataset, suggesting feasibility for AI‐based molecular diagnostics, though external prospective validation is needed.

1. Introduction
Venous thromboembolism (VTE), comprising deep vein thrombosis (DVT) and pulmonary embolism (PE), is a major global cause of morbidity and mortality [1, 2]. VTE results from abnormal clot formation in the venous system and confers substantial acute and chronic burden, including post‐thrombotic syndrome (PTS) and recurrent events [3, 4, 5]. In clinical practice, events are categorised as provoked or unprovoked: the former occurs in the presence of clear, time‐limited risk factors—such as recent surgery, trauma, immobilisation or cancer—whereas the latter lacks an obvious precipitant [6, 7, 8]. This classification is clinically consequential because unprovoked VTE is associated with a higher risk of recurrence and may warrant extended anticoagulation [9, 10]. Nevertheless, clinical judgement based on exposure history and routine imaging is fallible: transient or occult triggers can be missed and biological heterogeneity is not captured, leading to potential misclassification. These limitations motivate complementary, biology‐based approaches—such as blood transcriptomic profiling—to refine phenotyping beyond bedside assessment.
This diagnostic uncertainty complicates treatment decisions. The duration of anticoagulation therapy often differs depending on whether a VTE event is provoked or unprovoked. Patients with unprovoked VTE are generally at higher risk of recurrence, but the evidence is imprecise because outcomes are highly heterogeneous, suggesting that clinical features alone are insufficient for accurate risk stratification [11, 12, 13]. Currently, no molecular biomarkers are used in clinical practice to distinguish VTE subtypes or to predict the risk of unprovoked VTE. This underscores the importance of objective, biology‐based tools to complement clinical judgement. Molecular profiling, particularly transcriptomic analysis, offers a way to characterise the underlying biology, to observe differences between VTE subtypes and to identify high‐risk patients who may benefit from tailored management. Gene expression profiling has transformed our ability to understand diseases and develop novel biomarkers. It measures gene activity in cells and can detect biologically meaningful changes in response to a range of conditions, including physiological stress, immune activation and disease. In multifactorial diseases such as VTE, transcriptomic patterns may reveal molecular signals that clinical and imaging studies cannot identify. Bioinformatics enables the integration of high‐throughput gene expression data obtained from tissue and blood with machine learning algorithms that detect nonlinear relationships and multidimensional data patterns. Random forest and artificial neural networks have been successfully applied to distinguish disease subtypes and predict outcomes in cardiology, oncology and autoimmune diseases [14].
To our knowledge, only limited work has applied machine learning to VTE outcomes using transcriptomic inputs [15], and we found no prior study explicitly training a gene‐expression classifier for provoked versus unprovoked events. Thus, these tools may provide new insights into spontaneous thrombosis biology and help develop more sophisticated diagnostic tools.
Given diagnostic limitations and the complex biological underpinnings of unprovoked VTE, it is essential to identify molecular markers that objectively differentiate VTE subtypes and aid clinical decision‐making. We therefore hypothesise that unprovoked VTE has a specific gene‐expression pattern reflecting a biological basis that is independent of transient external risk factors. To test this hypothesis, we used whole‐blood transcriptome data from the publicly available GEO database (https://www.ncbi.nlm.nih.gov/geo/) under accession number GSE48000 and processed it through an integrated bioinformatic pipeline. After differential expression analysis, the DEGs were tested for enriched biological processes and pathways, and machine‐learning analysis was then performed, using random forest for feature selection and an artificial neural network (ANN) for classification to further evaluate the selected genes. We aimed to find expression signatures that differentiate provoked from unprovoked VTE and to construct a predictive model. This combined approach integrates statistical and computational methods to identify candidate biomarkers for the personalised treatment of thrombosis.
2. Methods
2.1. Data Acquisition and Differential Expression Analysis
Gene‐expression data were retrieved from GEO (GSE48000) and comprised whole‐blood transcriptomes from patients with VTE. Samples were categorised into provoked and unprovoked VTE groups, which served as the comparison cohorts. The raw read‐count data were processed in R (version 4.5.0), normalised and log2‐transformed to ensure consistency across samples. Differentially expressed genes (DEGs) were identified using a Benjamini–Hochberg adjusted p value < 0.05 and an absolute log2 fold change (log2FC) ≥ 0.5 to retain moderate whole‐blood effects; conclusions focused on pathway‐level coherence rather than large single‐gene amplitudes. DEGs were visualised with a volcano plot and a heatmap. Because sample‐level clinical annotations in GSE48000 are incomplete, we refrained from aggressive batch removal that could inadvertently suppress biology. Potential confounders (age, sex, comorbidities, anticoagulant exposure and acute vs. stable sampling) are therefore acknowledged and considered in interpretation rather than modelled explicitly.
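The DEG filter described above can be illustrated with a minimal sketch. It is written in plain Python for illustration only and is not the R code used in this study; the function names (`benjamini_hochberg`, `select_degs`) and the toy p-values and fold changes are hypothetical and are not taken from GSE48000.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_degs(genes, p_values, log2_fc, alpha=0.05, min_abs_lfc=0.5):
    """Keep genes with adjusted p < alpha and |log2FC| >= min_abs_lfc."""
    adjusted = benjamini_hochberg(p_values)
    return [
        gene
        for gene, q, lfc in zip(genes, adjusted, log2_fc)
        if q < alpha and abs(lfc) >= min_abs_lfc
    ]


# Toy inputs, invented for illustration (not values from GSE48000).
genes = ["geneA", "geneB", "geneC", "geneD"]
p_values = [0.001, 0.01, 0.03, 0.5]
log2_fc = [-0.8, 0.3, 0.6, 1.2]

adjusted = benjamini_hochberg(p_values)
selected = select_degs(genes, p_values, log2_fc)
print(selected)
```

In this toy example geneB fails the fold-change threshold and geneD fails the adjusted p-value threshold, so only geneA and geneC are retained.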
2.2. GO Enrichment Analysis
GO analysis was used to identify the functional classifications of DEGs. The analysis was performed by using the R packages clusterProfiler (4.12.0), enrichplot (1.24.0) and DOSE (3.30.0). The enrichment was conducted according to the three GO domains, including biological process, cellular component and molecular function [16, 17]. Significantly enriched GO terms were identified at p‐value < 0.05. The results were shown as bar plots and bubble plots based on term significance and gene ratio. To explore term‐to‐term relationships, an enrichment map (emapplot) was created, and a circular GO plot was generated to depict term distribution across domains. Additionally, a semantic‐similarity plot was produced to group GO terms based on functional similarity.
2.3. Pathway Enrichment Analysis
To illustrate the involvement of DEGs in known molecular pathways, the Kyoto Encyclopaedia of Genes and Genomes (KEGG) was employed. The analysis was executed by the R package clusterProfiler. DEGs were annotated and mapped to KEGG pathways, and statistical significance was defined at p‐value < 0.05. Significant pathways were retained for visualisation. Visualisation of the top enriched pathways was performed using bar plots and bubble plots, displaying gene counts and enrichment ratios. Functional similarity among enriched pathways was further explored through an enrichment map (emapplot), and a semantic similarity clustering analysis was conducted using a single‐sample gene set enrichment approach (ssGSEA) to group related pathways into biological modules.
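Over‐representation analysis of the kind applied to GO terms and KEGG pathways rests on a one‐sided hypergeometric test. The sketch below shows that calculation in plain Python as an illustration; it is not the clusterProfiler implementation, and the function name and the toy counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """One-sided over-representation p-value P(X >= k).

    N: genes in the background universe
    K: background genes annotated to the term or pathway
    n: genes in the DEG list
    k: DEGs annotated to the term or pathway
    """
    upper = min(n, K)
    total = comb(N, n)
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return favourable / total


# Toy counts, invented for illustration only.
N, K, n, k = 20, 5, 4, 2
p_value = hypergeom_enrichment_p(k, n, K, N)
gene_ratio = k / n  # share of the DEG list annotated to the term
print(round(p_value, 4), gene_ratio)
```

The gene ratio shown in the bubble plots corresponds to `k / n` here; the p-value would then be compared with the 0.05 threshold stated above.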
2.4. Feature Selection via Random Forest
To identify key predictive features distinguishing between provoked and unprovoked VTE, a random forest classification model was constructed using the randomForest (4.7‐1.1) package in R. Differentially expressed genes served as input features.
The model was trained using 500 decision trees, and feature importance was ranked by the mean decrease in the Gini index, which quantifies how much splits on each gene reduce class impurity across the forest. The top 30 genes with the highest importance scores were selected as candidate signature features for downstream modelling and visualisation.
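To make the Gini‐based ranking concrete, the following sketch computes the impurity decrease produced by a single split on one gene. A random forest accumulates such decreases over every split that uses the gene and averages them over trees; this plain‐Python example only illustrates the quantity for one split, and its function names and toy values are hypothetical.

```python
def gini_impurity(labels):
    """Gini impurity 1 - sum(p_c^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_decrease(values, labels, threshold):
    """Impurity decrease from splitting samples at values <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted_child = (
        len(left) / n * gini_impurity(left)
        + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(labels) - weighted_child


# Toy expression values for two hypothetical genes across four samples.
labels = ["provoked", "provoked", "unprovoked", "unprovoked"]
informative_gene = [1.0, 2.0, 3.0, 4.0]    # separates the groups at 2.5
uninformative_gene = [1.0, 3.0, 2.0, 4.0]  # mixes the groups at 2.5

informative_gain = gini_decrease(informative_gene, labels, 2.5)
uninformative_gain = gini_decrease(uninformative_gene, labels, 2.5)
print(informative_gain, uninformative_gain)
```

A gene whose splits repeatedly yield large decreases, like the first toy gene, would rank near the top of the importance list.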
2.5. Expression Visualisation and Statistical Comparison
Heatmaps and box plots were used to examine and visualise the expression patterns of the top 30 signature genes. Heatmaps were prepared with the pheatmap (1.0.12) package, plotting normalised expression levels across samples to give an overview of expression‐pattern similarity within and between groups. Box plots drawn with ggplot2 illustrated the gene‐expression distributions in the provoked and unprovoked VTE groups. Group comparisons were performed using appropriate statistical tests, and p‐values were annotated to indicate statistical significance.
2.6. ANN Modelling and Performance Evaluation
An ANN was developed to assess the classification performance of the selected gene signature, using the expression values of the top 30 genes as input features. The ANN was implemented with the neuralnet (1.44.2) package and had 30 input neurons. Inputs were z‐scored expression values of the 30 signature genes. The ANN used two hidden layers (10 and 5 units; logistic activation) with early stopping. To obtain a less biased estimate of discrimination in this modest‐sized cohort, we evaluated performance using stratified 10‐fold cross‐validation (k = 10). Discriminative performance was quantified by receiver operating characteristic (ROC) analysis and summarised using the area under the ROC curve (AUC). A bootstrap 95% confidence interval (CI) for the AUC was obtained from the cross‐validated predictions.
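Stratified k‐fold cross‐validation assigns samples to folds so that each fold keeps roughly the same class proportions as the full cohort. The sketch below shows one simple way to do this in plain Python, using the group sizes reported in the Results (73 unprovoked, 34 provoked). The function name, the random seed and the round‐robin assignment are assumptions made for illustration, not the procedure used in this study.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Assign each sample a fold index 0..k-1, balancing classes per fold."""
    rng = random.Random(seed)
    indices_by_class = {}
    for i, label in enumerate(labels):
        indices_by_class.setdefault(label, []).append(i)

    fold_of = [None] * len(labels)
    for indices in indices_by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled samples of this class round-robin over folds.
        for position, sample_index in enumerate(indices):
            fold_of[sample_index] = position % k
    return fold_of


# Group sizes as reported for GSE48000 in this paper.
labels = ["unprovoked"] * 73 + ["provoked"] * 34
folds = stratified_folds(labels, k=10, seed=0)

for fold in range(10):
    test_labels = [y for y, f in zip(labels, folds) if f == fold]
    print(fold, test_labels.count("unprovoked"), test_labels.count("provoked"))
```

With these group sizes every held‐out fold contains 7 or 8 unprovoked and 3 or 4 provoked samples, so each fold's AUC is computed on both classes.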
3. Results
3.1. Differential Gene Expression Analysis Between Provoked and Unprovoked VTE
Differential gene expression analysis was conducted in R on normalised RNA sequencing data retrieved from the GSE48000 dataset (73 unprovoked VTE samples and 34 provoked VTE samples). Genes were considered differentially expressed at an adjusted p‐value < 0.05 and an absolute log2 fold change ≥ 0.5. This yielded a set of differentially expressed genes (31378 DEGs) separating the provoked and unprovoked VTE phenotypes. Hierarchical clustering on this gene set, shown in the heatmap in Figure 1A, separated provoked and unprovoked VTE patients by their gene expression profiles. The clear gap between the two clusters indicates a substantial transcriptomic signal and provides the basis for the subsequent modelling of unprovoked VTE from molecular data. The volcano plot in Figure 1B shows the distribution of upregulated and downregulated transcripts in unprovoked VTE. Genes on the right side of the plot have significant positive logFC values, indicating higher expression in unprovoked samples, whereas genes on the left side have negative logFC values, indicating lower expression. Several genes differed significantly between the groups, including FOS, NAMPT and PTGS2, which may be linked to the regulation of thrombosis or inflammatory pathways.
FIGURE 1.

Differential expression between provoked and unprovoked VTE. (A) Heatmap of DEGs from GSE48000. Red indicates upregulation, and blue indicates downregulation. Samples cluster clearly by VTE type. (B) Volcano plot showing DEGs with logFC on the x‐axis and –log10 (adjusted p‐value) on the y‐axis. Red and blue dots mark significant up‐ or downregulated genes; grey indicates nonsignificant.
3.2. GO Enrichment Reveals Autophagy and Endosomal Transport Pathways in Unprovoked VTE
To investigate the biological significance of differentially expressed genes between provoked and unprovoked VTE, Gene Ontology (GO) enrichment analysis was conducted. The results, illustrated through multiple visualisations (Figure 2A–E), highlight key pathways involved in intracellular trafficking, membrane dynamics and autophagy.
FIGURE 2.

GO enrichment analysis of differentially expressed genes between VTE groups. (A) Bar plot of top enriched GO terms across biological process (BP), cellular component (CC) and molecular function (MF) categories. (B) Bubble plot representing GO terms by gene ratio and p‐value. (C) Enrichment map showing clustered GO terms with functional overlap. (D) Circular GO plot summarising term significance and distribution across ontology domains. (E) Simplified semantic similarity plot (ssGSEA) highlighting key functional modules.
Among the top enriched biological process (BP) terms were ‘macroautophagy,’ ‘positive regulation of autophagy’ and ‘endosome to lysosome transport via multivesicular body sorting pathway’ (Figure 2A). In the molecular function (MF) category, terms such as ‘phosphatidylinositol 3‐kinase binding’ and ‘GTPase activity’ were significantly represented, indicating roles in vesicle transport and signalling (Figure 2B). The cellular component (CC) analysis revealed enrichment in structures like the ‘ESCRT complex’ and ‘amphisome membrane’, both integral to vesicle maturation and autophagic degradation.
Functional clustering using an enrichment map (Figure 2C) and semantic similarity analysis (Figure 2E) grouped these terms into major biological themes, including autophagy, centrosome regulation and viral budding—pathways that may share mechanistic overlap with thrombosis. The circular GO plot (Figure 2D) summarised term distribution and statistical relevance across all three GO domains, providing a global view of their involvement in unprovoked VTE.
3.3. KEGG Pathway Analysis Highlights Immune, Synaptic and Apoptotic Signalling in Unprovoked VTE
To further explore the functional implications of differentially expressed genes, KEGG (Kyoto Encyclopaedia of Genes and Genomes) enrichment analysis was performed. The top enriched pathways, visualised across several graphical formats (Figure 3A–D), revealed associations with immune regulation, cell death and neuroactive ligand–receptor interactions.
FIGURE 3.

KEGG pathway enrichment of differentially expressed genes in unprovoked versus provoked VTE. (A) Bar plot showing top enriched KEGG pathways ranked by gene count and p‐value. (B) Bubble plot visualising pathway enrichment based on gene ratio and significance. (C) Enrichment map clustering‐related pathways by gene overlap and semantic similarity. (D) Semantic similarity (ssGSEA) plot grouping functionally related pathways into biological modules.
The bar plot of enriched KEGG terms (Figure 3A) showed prominent involvement in pathways such as ‘Platelet activation’, ‘PD‐L1 expression and PD‐1 checkpoint pathway in cancer’, ‘TNF signalling pathway’ and ‘Ferroptosis’. These pathways underscore potential links between immune checkpoint signalling and coagulation abnormalities in unprovoked VTE.
Complementing this, the bubble plot (Figure 3B) emphasised terms with high gene ratios and strong statistical significance, including ‘cAMP signalling’, ‘Chemokine signalling’ and ‘Endocytosis’. The functional similarity network (Figure 3C) clustered these terms into biologically cohesive modules, suggesting integrated regulation of vesicle trafficking, cell death and inflammatory response.
The semantic similarity plot (ssGSEA; Figure 3D) grouped enriched KEGG terms into distinct clusters such as immune checkpoint and cancer‐related pathways, metabolic stress (e.g., ferroptosis) and neurotransmission‐linked pathways, which may reflect systemic or endothelial stress responses contributing to unprovoked thromboembolism.
3.4. Random Forest Analysis Identifies Top Discriminative Genes for VTE Classification
To identify the most predictive features differentiating provoked from unprovoked VTE, random forest analysis was conducted using R. Model stability was achieved across 500 decision trees, with the error rate plateauing as the tree count increased (Figure 4A), supporting the robustness of the classifier.
FIGURE 4.

Random forest analysis for VTE classification. (A) Error rate plot showing model stability across an increasing number of decision trees. (B) Variable importance plot ranked by the mean decrease in the Gini index, identifying the top 30 most predictive genes for distinguishing unprovoked from provoked VTE.
Feature ranking based on the mean decrease in the Gini index revealed the top 30 most informative genes (Figure 4B). The highest‐ranking genes included WASPIP, SAMSN1, LOC100132918 and TMEM33, indicating strong discriminatory power in classifying VTE subtypes. These genes may serve as key biomarkers or mechanistic hubs relevant to thrombotic risk, particularly in unprovoked events. These genes were prioritised by random forest and should be considered hypothesis‐generating pending external validation.
3.5. Signature Gene Expression Patterns Distinguish VTE Subtypes
To clarify the expression dynamics of the 30 signature genes, we visualised their patterns using a heatmap and boxplots. As shown in the heatmap (Figure 5A), gene expression clustered samples into provoked and unprovoked VTE groups, revealing consistent transcriptional distinctions.
FIGURE 5.

Expression profiles of the top 30 VTE‐associated genes. (A) Heatmap showing gene expression across provoked (control) and unprovoked (treat) VTE groups. Red indicates upregulation, and blue indicates downregulation. (B) Boxplots of normalised expression levels for each gene, stratified by group. Asterisks denote statistical significance (**p < 0.01, ***p < 0.001).
Boxplot analysis (Figure 5B) indicated that most genes were significantly downregulated in unprovoked VTE (red boxes lower than blue), including DR1, MEX3C, SIRT1, F2RL1 and CD164. These patterns may reflect reduced immune signalling or endothelial stress in unprovoked thrombosis. In contrast, only three genes—LOC100130229, LGALS2 and GDF2—showed higher expression in unprovoked VTE, suggesting a possible compensatory or regulatory role.
These findings emphasise that the unprovoked VTE subtype is characterised by the broad transcriptional suppression of key regulatory genes, with only a few select genes showing upregulation.
3.6. Artificial Neural Network Demonstrates Good Discrimination Under Cross‐Validation
To further evaluate the diagnostic power of the 30 signature genes, an artificial neural network (ANN) model was constructed using R. Each gene was assigned a score based on its expression contribution, and the scores were used as inputs to the ANN. The model architecture consisted of 30 input neurons (one per gene), multiple hidden layers and two output neurons corresponding to the provoked and unprovoked VTE groups (Figure 6A).
FIGURE 6.

Artificial neural network (ANN) modelling and performance. (A) Neural network structure using 30 input genes to classify provoked and unprovoked VTE. Hidden layers (H1–H5) and output nodes (O1–O2) are shown. (B) Cross‐validated ROC curve of the ANN classifier. The model achieved an AUC of 0.799, indicating good discrimination.
Discrimination was assessed using stratified 10‐fold cross‐validation (k = 10). The cross‐validated ROC curve is shown in Figure 6B, yielding an area under the ROC curve (AUC) of 0.799 with a bootstrap 95% confidence interval (CI) of 0.708–0.880. These results indicate good internal discrimination while avoiding optimistic in‐sample performance estimates.
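For readers unfamiliar with how an AUC and its bootstrap interval are derived from pooled cross‐validated predictions, a minimal plain‐Python sketch follows. It computes the AUC as the probability that a positive sample outscores a negative one and uses a percentile bootstrap; the percentile method, the function names and the toy scores are assumptions for illustration, since the bootstrap variant used here is not specified.

```python
import random


def roc_auc(scores, labels):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in positives
        for q in negatives
    )
    return wins / (len(positives) * len(negatives))


def bootstrap_auc_ci(scores, labels, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    rng = random.Random(seed)
    n = len(scores)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        resampled_labels = [labels[i] for i in idx]
        # Skip resamples that contain only one class (AUC undefined).
        if 0 < sum(resampled_labels) < n:
            aucs.append(roc_auc([scores[i] for i in idx], resampled_labels))
    aucs.sort()
    lower = aucs[int((alpha / 2) * n_boot)]
    upper = aucs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy held-out predictions, invented for illustration (1 = unprovoked).
scores = [0.9, 0.8, 0.4, 0.35, 0.7, 0.2]
labels = [1, 1, 0, 1, 0, 0]

auc = roc_auc(scores, labels)
ci_low, ci_high = bootstrap_auc_ci(scores, labels)
print(round(auc, 3), ci_low, ci_high)
```

In practice the `scores` would be the predictions collected from the ten held‐out folds, so that no sample is scored by a model that was trained on it.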
4. Discussion
Our study sought to determine whether provoked and unprovoked VTE could be differentiated from transcriptomic data using machine learning. We first identified DEGs between the two VTE subtypes in the GSE48000 dataset. We then used random forest to prioritise genes by their importance for classification, selected the top 30 genes by mean decrease in the Gini index and visualised their expression. The ability of these features to predict the correct class was tested using an ANN model. Under stratified 10‐fold cross‐validation, the ANN achieved an area under the ROC curve (AUC) of 0.799, suggesting good internal discrimination while reducing the risk of optimistic in‐sample estimation. The analysis also shows that unprovoked VTE has a distinct transcriptional signature compared with provoked VTE, which may indicate differences in mechanisms or aetiology. Overall, the findings suggest that VTE subtypes can be classified from their gene expression profiles with good internal discrimination and that the signature genes could represent leads for biomarker‐based diagnostic development.
Most of the 30 signature genes were downregulated in patients with unprovoked VTE compared with those with provoked VTE. In particular, several genes, including DR1, SIRT1, CD164 and F2RL1, which are involved in transcriptional regulation, inflammation and endothelial function, were notably underexpressed in the unprovoked group. This may indicate reduced immunological or vascular activation in these patients, consistent with the absence of an apparent trigger of thrombosis [18, 19]. By contrast, a small cluster of genes, namely LGALS2, GDF2 and LOC100130229, was overexpressed in patients with unprovoked VTE. GDF2, also known as BMP9, has recently been associated with vascular integrity and endothelial quiescence, whereas LGALS2 encodes galectin‐2, a mediator of immune‐cell interactions [20, 21]. The overexpression of these markers may signify a compensatory or defensive response of the vasculature in the absence of external thrombogenic factors. These findings indicate that, from a molecular standpoint, unprovoked VTE is not simply a variant of provoked VTE: it also features particular pathways, especially those related to vascular homoeostasis and immune status [22, 23]. Functional enrichment analysis also outlined biological processes and pathways related to the differentially expressed genes. GO analysis demonstrated strong enrichment in intracellular vesicle‐mediated transport, membrane‐related processes and autophagy. The cellular component analysis highlighted the ESCRT complex and endosomal membranes, both of which participate in endo‐ and exocytosis and vesicle sorting. Autophagy and endosomal transport intersect with thrombosis‐relevant biology.
In platelets, autophagy supports activation, granule release and thrombus formation, and in neutrophils, it facilitates neutrophil extracellular trap (NET) formation—both with prothrombotic potential [24]. In endothelial cells, Weibel–Palade body (WPB) biogenesis and regulated exocytosis deliver von Willebrand factor and other pro‐haemostatic cargo, linking endomembrane trafficking to coagulation; exocyst/BLOC‐2–dependent endosomal input is required for proper WPB maturation [25]. In parallel, extracellular vesicles (EVs) generated via ESCRT‐dependent or plasma‐membrane budding provide phosphatidylserine‐rich catalytic surfaces and, in specific contexts, tissue factor, thereby amplifying thrombin generation and promoting thrombosis [26]. These mechanisms offer a coherent biological rationale for the enrichment of autophagy and endosomal transport terms among DEGs in unprovoked VTE.
In addition to the GO analysis, KEGG pathway analysis shed light on signalling cascades related to immune regulation, apoptosis and vascular tone. The TNF signalling pathway, platelet activation and ferroptosis were among the enriched pathways. Biologically, these cascades contribute to thrombosis by promoting inflammation, oxidative stress and endothelial dysfunction [27, 28, 29]. These enrichment results suggest that unprovoked VTE may be characterised by disruption of pathways responsible for immune surveillance, cell death and vascular remodelling. The convergence of GO and KEGG findings supports the view that unprovoked VTE is not solely the result of exogenous triggers but is also shaped by endogenous molecular processes; intrinsic cellular processes may therefore play a substantial role in its aetiology. Machine learning was then used to assess the diagnostic value of the identified gene signature. Random forest analysis selected the top 30 genes according to their classification importance, and these genes were used to train an artificial neural network that achieved an AUC of 0.799 under stratified 10‐fold cross‐validation. If confirmed in independent cohorts, such a model might help to identify the cause of thrombosis in cases of clinical ambiguity and could support the early detection and prevention of unprovoked VTE. Whole‐blood signals may arise from both per‐cell transcriptional changes and shifts in leucocyte composition. In future validation, we plan to incorporate immune‐cell deconvolution to quantify cell‐fraction differences and to test whether signature genes retain discriminatory power after adjusting for major inferred fractions.
Mass spectrometry (MS)‐based plasma/serum proteomics offers an orthogonal and clinically proximal readout that could test whether our transcript‐level pathways have circulating protein correlates. Such blood proteomic profiling—ranging from untargeted discovery to targeted assays—may help prioritise markers for translation and inform assay design for clinical risk stratification. Recent reviews summarise advances in MS workflows for blood biomarker discovery and diagnostic development [30].
This study identified gene signatures with potential clinical utility for risk stratification and diagnostic classification of VTE. One of the most notable of these candidate biomarkers is GDF2, which encodes BMP9 and is involved in maintaining vascular integrity and anti‐thrombotic endothelial signalling; it could therefore help identify thrombotic events that lack classical pro‐thrombotic triggers [21, 31, 32]. LGALS2, which encodes galectin‐2 and is involved in vascular inflammation and immunomodulation, was more highly expressed in patients with unprovoked VTE, suggesting that more subtle immune or endothelial dysfunction could contribute to VTE development [33, 34, 35]. Genes with reduced expression in unprovoked VTE included CD164, SIRT1 and F2RL1, which encode proteins involved in cell adhesion, transcriptional control and platelet activation, respectively. Their reduced expression might reflect reduced thrombo‐inflammatory activation, as clear clinical provocations are absent [36, 37, 38]. These results demonstrate the potential of developing blood‐based transcriptomic panels to assist in VTE categorisation. Such biomarkers could aid in personalised management decisions. Beyond diagnostics, integrating transcriptomics with other omics layers and artificial intelligence can prioritise druggable pathways and enable rational repurposing. Recent overviews highlight fast‐evolving multi‐omics workflows for target identification and drug discovery [39, 40]. Our signature, which is enriched in immune and vesicle‐transport programmes, offers testable hypotheses for such pipelines.
Although this study has several strengths, such as rigorous transcriptomic analysis and machine learning, it also has important limitations. Most importantly, model performance was not verified in an independent external cohort, and the current evaluation is therefore limited to internal validation within a single public dataset. To mitigate optimistic in‐sample estimation, we assessed discrimination using stratified 10‐fold cross‐validation; nevertheless, the reported AUC of 0.799 (95% confidence interval, 0.708–0.880) should still be interpreted with caution given the modest sample size and the potential for residual model optimism in small‐cohort settings. Our findings therefore require confirmation in independent cohorts, and performance could differ substantially in other VTE datasets. Residual confounding from technical and clinical heterogeneity cannot be excluded due to incomplete metadata in GSE48000; future work will prespecify covariate‐adjusted linear models and empirical Bayes harmonisation on cohorts with richer annotations. No germline data are available in GSE48000, precluding genetic‐risk stratification. Given the modest sample size and lack of an external test set, we avoid reporting formal calibration or decision‐curve analyses to prevent over‐interpretation. Further restrictions arise from the strictly retrospective nature of our study, which does not incorporate clinical data. The identified DEGs have also not been validated experimentally and remain speculative. Larger prospective cohorts and functional experiments are therefore needed to verify the diagnostic and biological utility of our findings.
5. Conclusion
In summary, we have shown that transcriptomic profiling in combination with machine learning can discriminate between provoked and unprovoked VTE. Using an artificial neural network to evaluate a 30‐gene panel identified by random forest, we attained good internal discrimination (cross‐validated AUC of 0.799). Functional enrichment analysis linked these genes to pathways associated with immune responses, endothelial function and vesicle trafficking. Given the modest sample size and single‐cohort setting, we report the model as hypothesis‐generating and defer external validation to future work. Nevertheless, this work raises the possibility that gene expression‐based models could be instrumental in the diagnosis of unprovoked VTE and in exploring its molecular pathobiology. Such insights lay the foundation for subsequent biomarker design and treatment customisation to stratify individuals according to their risk of thrombosis.
Author Contributions
Yajing Li: conceptualization, data curation, resources, software, writing – original draft. Hongru Deng: supervision, validation. Yongquan Gu: investigation, writing – review and editing.
Funding
This study was supported by the National Key Research and Development Programme of China [2021YFC2500500].
Conflicts of Interest
The authors declare no conflicts of interest.
Data Availability Statement
The dataset analysed in this study is publicly available in the Gene Expression Omnibus (GEO) repository under accession number GSE48000: https://www.ncbi.nlm.nih.gov/geo/.
References
- 1. Barco S., Valerio L., Gallo A., et al., “Global Reporting of Pulmonary Embolism‐Related Deaths in the World Health Organization Mortality Database: Vital Registration Data From 123 Countries,” Research and Practice in Thrombosis and Haemostasis 5, no. 5 (2021): e12520, 10.1002/rth2.12520.
- 2. Speth J., “Guidelines in Practice: Prevention of Venous Thromboembolism,” AORN Journal 118, no. 5 (2023): 321–328, 10.1002/aorn.14019.
- 3. Ashrani A. A. and Heit J. A., “Incidence and Cost Burden of Post‐Thrombotic Syndrome,” Journal of Thrombosis and Thrombolysis 28, no. 4 (2009): 465–476, 10.1007/s11239-009-0309-3.
- 4. Bouman A. C., McPherson H., Cheung Y. W., et al., “Clot Structure and Fibrinolytic Potential in Patients With Post Thrombotic Syndrome,” Thrombosis Research 137 (2016): 85–91, 10.1016/j.thromres.2015.11.013.
- 5. Thiyagarajah K., Ellingwood L., Endres K., et al., “Post‐Thrombotic Syndrome and Recurrent Thromboembolism in Patients With Upper Extremity Deep Vein Thrombosis: A Systematic Review and Meta‐Analysis,” Thrombosis Research 174 (2019): 34–39, 10.1016/j.thromres.2018.12.012.
- 6. Ageno W., Farjat A., Haas S., et al., “Provoked Versus Unprovoked Venous Thromboembolism: Findings From GARFIELD‐VTE,” Research and Practice in Thrombosis and Haemostasis 5, no. 2 (2021): 326–341, 10.1002/rth2.12482.
- 7. Li M., Guo Q., and Hu W., “Incidence, Risk Factors, and Outcomes of Venous Thromboembolism After Oncologic Surgery: A Systematic Review and Meta‐Analysis,” Thrombosis Research 173 (2019): 48–56, 10.1016/j.thromres.2018.11.012.
- 8. Prandoni P., Milan M., Sarolo L., Zanon E., and Bilora F., “Optimal Duration of Anticoagulation in Patients With Unprovoked Venous Thromboembolism: The Impact of Novel Anticoagulants,” International Angiology 36, no. 5 (2017): 395–401, 10.23736/s0392-9590.16.03785-8.
- 9. Couturaud F., Schmidt J., Sanchez O., et al., “Extended Treatment of Venous Thromboembolism With Reduced‐Dose Versus Full‐Dose Direct Oral Anticoagulants in Patients at High Risk of Recurrence: A Non‐Inferiority, Multicentre, Randomised, Open‐Label, Blinded Endpoint Trial,” Lancet 405, no. 10480 (2025): 725–735, 10.1016/s0140-6736(24)02842-3.
- 10. Khan F., Tritschler T., Kimpton M., et al., “Long‐Term Risk of Recurrent Venous Thromboembolism Among Patients Receiving Extended Oral Anticoagulant Therapy for First Unprovoked Venous Thromboembolism: A Systematic Review and Meta‐Analysis,” Journal of Thrombosis and Haemostasis 19, no. 11 (2021): 2801–2813, 10.1111/jth.15491.
- 11. Eichinger S., “Anticoagulation After Venous Thromboembolism. Deciding on the Optimal Duration,” Hämostaseologie 33, no. 3 (2013): 211–217, 10.5482/hamo-13-03-0015.
- 12. Xu K. and Chan N. C., “Refining Risk Prediction for Recurrent Venous Thromboembolism: Can We Do Better?,” Thrombosis and Haemostasis 120, no. 5 (2020): 725–727, 10.1055/s-0040-1709686.
- 13. Yamashita Y., Morimoto T., Amano H., et al., “Anticoagulation Therapy for Venous Thromboembolism in the Real World ‐ From the COMMAND VTE Registry,” Circulation Journal 82, no. 5 (2018): 1262–1270, 10.1253/circj.CJ-17-1128.
- 14. Al‐Droubi S. S., Jahangir E., Kochendorfer K. M., et al., “Artificial Intelligence Modelling to Assess the Risk of Cardiovascular Disease in Oncology Patients,” European Heart Journal - Digital Health 4, no. 4 (2023): 302–315, 10.1093/ehjdh/ztad031.
- 15. El‐Bouri W. K., Sanders A., and Lip G. Y. H., “Predicting Acute and Long‐Term Mortality in a Cohort of Pulmonary Embolism Patients Using Machine Learning,” European Journal of Internal Medicine 118 (2023): 42–48, 10.1016/j.ejim.2023.07.012.
- 16. Yu G., Wang L. G., Han Y., and He Q. Y., “clusterProfiler: An R Package for Comparing Biological Themes Among Gene Clusters,” OMICS 16, no. 5 (2012): 284–287, 10.1089/omi.2011.0118.
- 17. Yu G., Wang L. G., Yan G. R., and He Q. Y., “DOSE: An R/Bioconductor Package for Disease Ontology Semantic and Enrichment Analysis,” Bioinformatics 31, no. 4 (2015): 608–609, 10.1093/bioinformatics/btu684.
- 18. Bettiol A., Urban M. L., Emmi G., et al., “SIRT1 and Thrombosis,” Frontiers in Molecular Biosciences 10 (2023): 1325002, 10.3389/fmolb.2023.1325002.
- 19. Stark R. J., Koch S. R., Stothers C. L., et al., “Loss of Sirtuin 1 (SIRT1) Potentiates Endothelial Dysfunction Via Impaired Glycolysis During Infectious Challenge,” Clinical and Translational Medicine 12, no. 9 (2022): e1054, 10.1002/ctm2.1054.
- 20. Desroches‐Castan A., Koca D., Liu H., et al., “BMP9 is a Key Player in Endothelial Identity and its Loss is Sufficient to Induce Arteriovenous Malformations,” Cardiovascular Research 120, no. 7 (2024): 782–795, 10.1093/cvr/cvae052.
- 21. Desroches‐Castan A., Tillet E., Bouvard C., and Bailly S., “BMP9 and BMP10: Two Close Vascular Quiescence Partners That Stand out,” Developmental Dynamics 251, no. 1 (2022): 178–197, 10.1002/dvdy.395.
- 22. Klavina P. A., Leon G., Curtis A. M., and Preston R. J. S., “Dysregulated Haemostasis in Thrombo‐Inflammatory Disease,” Clinical Science 136, no. 24 (2022): 1809–1829, 10.1042/cs20220208.
- 23. Long A., Kleiner A., and Looney R. J., “Immune Dysregulation,” Journal of Allergy and Clinical Immunology 151, no. 1 (2023): 70–80, 10.1016/j.jaci.2022.11.001.
- 24. Ouseph M. M., Huang Y., Banerjee M., et al., “Autophagy is Induced Upon Platelet Activation and is Essential for Hemostasis and Thrombosis,” Blood 126, no. 10 (2015): 1224–1233, 10.1182/blood-2014-09-598722.
- 25. Sharda A. V., Barr A. M., Harrison J. A., et al., “VWF Maturation and Release Are Controlled by 2 Regulators of Weibel‐Palade Body Biogenesis: Exocyst and BLOC‐2,” Blood 136, no. 24 (2020): 2824–2837, 10.1182/blood.2020005300.
- 26. Tripisciano C., Weiss R., Eichhorn T., et al., “Different Potential of Extracellular Vesicles to Support Thrombin Generation: Contributions of Phosphatidylserine, Tissue Factor, and Cellular Origin,” Scientific Reports 7, no. 1 (2017): 6522, 10.1038/s41598-017-03262-2.
- 27. Schäfer A. and Bauersachs J., “Endothelial Dysfunction, Impaired Endogenous Platelet Inhibition and Platelet Activation in Diabetes and Atherosclerosis,” Current Vascular Pharmacology 6, no. 1 (2008): 52–60, 10.2174/157016108783331295.
- 28. Shaito A., Aramouni K., Assaf R., et al., “Oxidative Stress‐Induced Endothelial Dysfunction in Cardiovascular Diseases,” Frontiers in Bioscience (Landmark Edition) 27, no. 3 (2022): 105, 10.31083/j.fbl2703105.
- 29. Yuan W., Xia H., Xu Y., et al., “The Role of Ferroptosis in Endothelial Cell Dysfunction,” Cell Cycle 21, no. 18 (2022): 1897–1914, 10.1080/15384101.2022.2079054.
- 30. Zhu Y., “Plasma/Serum Proteomics Based on Mass Spectrometry,” Protein and Peptide Letters 31, no. 3 (2024): 192–208, 10.2174/0109298665286952240212053723.
- 31. Hodgson J., Ruiz‐Llorente L., McDonald J., et al., “Homozygous GDF2 Nonsense Mutations Result in a Loss of Circulating BMP9 and BMP10 and Are Associated With Either PAH or an “HHT‐Like” Syndrome in Children,” Molecular Genetics & Genomic Medicine 9, no. 12 (2021): e1685, 10.1002/mgg3.1685.
- 32. Nikolic I., Yung L. M., Yang P., et al., “Bone Morphogenetic Protein 9 is a Mechanistic Biomarker of Portopulmonary Hypertension,” American Journal of Respiratory and Critical Care Medicine 199, no. 7 (2019): 891–902, 10.1164/rccm.201807-1236OC.
- 33. Kruk L., Braun A., Cosset E., Gudermann T., and Mammadova‐Bach E., “Galectin Functions in Cancer‐Associated Inflammation and Thrombosis,” Frontiers in Cardiovascular Medicine 10 (2023): 1052959, 10.3389/fcvm.2023.1052959.
- 34. Lightfoot A., McGettrick H. M., and Iqbal A. J., “Vascular Endothelial Galectins in Leukocyte Trafficking,” Frontiers in Immunology 12 (2021): 687711, 10.3389/fimmu.2021.687711.
- 35. Negedu M. N., Duckworth C. A., and Yu L. G., “Galectin‐2 in Health and Diseases,” International Journal of Molecular Sciences 24, no. 1 (2022): 341, 10.3390/ijms24010341.
- 36. Li Z., Delaney M. K., O'Brien K. A., and Du X., “Signaling During Platelet Adhesion and Activation,” Arteriosclerosis, Thrombosis, and Vascular Biology 30, no. 12 (2010): 2341–2349, 10.1161/atvbaha.110.207522.
- 37. Lou Z., Zhu J., Li X., et al., “LncRNA Sirt1‐AS Upregulates Sirt1 to Attenuate Aging Related Deep Venous Thrombosis,” Aging (Albany NY) 13, no. 5 (2021): 6918–6935, 10.18632/aging.202550.
- 38. Watt S. M. and Chan J. Y., “CD164‐‐a Novel Sialomucin on CD34+ Cells,” Leukemia and Lymphoma 37, no. 1–2 (2000): 1–25, 10.3109/10428190009057625.
- 39. Wang Z., Zhao Y., and Zhang L., “Emerging Trends and Hot Topics in the Application of Multi‐Omics in Drug Discovery: A Bibliometric and Visualized Study,” Current Pharmaceutical Analysis 21, no. 1 (2024): 20–32, 10.1016/j.cpan.2024.12.001.
- 40. Du P., Fan R., Zhang N., Wu C., and Zhang Y., “Advances in Integrated Multi‐Omics Analysis for Drug‐Target Identification,” Biomolecules 14, no. 6 (2024): 692, 10.3390/biom14060692.
