Abstract
Schizophrenia spectrum disorder (SSD) is a clinically and genetically heterogeneous condition, yet few studies have integrated real-world clinical data with both common and rare genetic variation to explore this complexity. In this study, we analyzed real-world data from 22,092 individuals in the Danish iPSYCH cohort (11,046 SSD cases and 11,046 matched population controls) leveraging nationwide registry data on diagnoses, hospitalizations, and parental history. Using a variational autoencoder (VAE), we compressed these features into a latent space and identified ten clinically distinct SSD subgroups that varied in comorbidity, parental diagnoses, hospital burden, and early-life adversity. Polygenic scores (PGSs) for five psychiatric disorders showed subgroup-specific enrichment, highlighting potential links between complex clinical profiles and common variant liability. In a subset with exome data (N=5,969), we assessed rare deleterious variant burden across SCZ-informed gene sets and Protein-Protein Interaction (PPI) networks, observing suggestive network-specific trends. This framework for integrating real world-based stratification with genetic evidence is scalable and transferable across cohorts, offering a path toward biologically informed patient classification.
INTRODUCTION
Schizophrenia spectrum disorder (SSD) is a genetically complex and highly heritable condition characterized by considerable phenotypic variation and clinical heterogeneity.1,2 Although genome-wide association studies (GWAS) have identified many common risk loci,3–5 they explain only part of the estimated heritability, which twin studies place at around 80% for Schizophrenia (SCZ) and about 73% for SSD.6 Rare variants have also been associated with SSD, with ultra-rare protein-truncating variants in specific genes conferring large increases in SCZ risk (with odds ratios from ~3 to >20), and may implicate high-impact etiological pathways.7–9 Beyond genetics, many environmental and developmental risk factors have been linked to SSD, including prenatal/perinatal exposures,10 childhood adversity, cannabis use, migration, urbanicity,11 as well as immune-related factors such as infections and autoimmune conditions.12,13 This complexity motivates the need to move beyond one-size-fits-all case-control analyses towards exploring clinically meaningful subgroups that may differ in pathophysiology and genetic architecture.14 In this scenario, real-world data (RWD), including registries, electronic health records (EHRs), and biobanks, can enable comprehensive clinical and demographic characterization of psychiatric patients and have the potential to advance precision psychiatry.15–17
Early biomedical informatics efforts, such as the “Deep Patient” model, showed that unsupervised learning on EHR data could uncover latent patient representations, underscoring the power of high-dimensional clinical data for patient stratification.18 More recently, other unsupervised learning approaches have been used to stratify individuals with psychiatric disorders into clinically meaningful subgroups. Multi-Dimensional Scaling (MDS)-based clustering on registry data has revealed distinct diagnostic trajectories in schizophrenia,19 and deep learning (DL) has been used to integrate registry and genetic data to identify distinct subgroups across people with schizophrenia and depression.20 Additional work has uncovered psychiatric transdiagnostic clusters based on symptom trajectories,21 clinical-genetic profiles,22 cognitive phenotypes,23 and brain connectivity patterns,24 illustrating the potential of data-driven stratification across diagnostic boundaries.
Integrating patient stratification with genetic evidence is particularly valuable, as genetically supported drug targets are more than twice as likely to achieve regulatory approval.25,26 Large-scale exome sequencing efforts, such as the SCHEMA consortium, have uncovered ultra-rare coding variants in specific genes that confer substantial risk for schizophrenia and converge on synaptic and neuronal processes.9,27 When mapped onto brain cell-type-specific protein-protein interaction (PPI) networks derived from human induced neurons, these genetic signals can be interpreted in a functionally grounded, mechanistically rich framework, arguably the closest approximation we have to human neurobiology in disease-relevant contexts.28–30 However, it remains unclear whether genetic signals captured in PPI networks can be meaningfully mapped onto clinical subgroups of patients.
Here, we applied unsupervised DL to registry data from 22,092 individuals in the iPSYCH cohort (11,046 SSD cases and 11,046 matched population controls).31,32 We identified ten SSD subgroups, capturing differences in comorbidity, hospitalization, developmental profiles, and familial risk. We then tested whether these subgroups differed in their genome-wide polygenic scores (PGSs) for five psychiatric disorders. Additionally, in the subset of individuals with exome sequencing data available (N=5,969; 3,116 cases and 2,853 population controls), we also evaluated differences in the burden of rare, potentially deleterious coding variants across curated schizophrenia gene sets and neuronal PPI modules. We found that specific SSD subgroups exhibited distinct patterns of common genetic liability, with several clusters showing significantly elevated or reduced PGSs for different mental disorders. While rare-variant analyses were underpowered for definitive inference after multiple testing correction, we observed suggestive and directionally consistent patterns in specific clusters. By linking RWD-derived clinical subgroups with both common and, especially, rare genetic signals anchored in strong prior human genetic evidence, we offer a data-driven framework for exploring the genetic underpinnings of clinical heterogeneity. This approach can be extended to other disorders and datasets to uncover biologically grounded dimensions of disease and move toward more refined, genetically informed patient stratification.
RESULTS
Study overview
We analyzed data from 22,092 individuals in the iPSYCH cohort,31,32 including 11,046 individuals diagnosed with SSD and 11,046 matched population controls (see Table 1). We compressed RWD from national registries, including psychiatric diagnoses, hospital contacts, familial risk, and other somatic conditions, into a latent space using a VAE (Figure 1A). K-means clustering of these latent representations revealed clinically distinct SSD subgroups. These also differed in their genetic profiles, as reflected by cluster-specific patterns of PGSs and burden of rare predicted deleterious variants across SCZ-related PPI networks, prioritized using Combined Annotation Dependent Depletion (CADD) scores > 15 (see Methods). These findings support a link between clinical heterogeneity and both common and rare genetic variation. To illustrate the rare variant signal underlying these analyses, Figure 1B shows the number of CADD variants across the 32 protein coding genes passing FDR correction in the SCHEMA exome sequencing study9, together with the total number of alternate alleles observed in SSD cases. More extensive summary statistics aggregated per protein, per network, per dataset and per cluster are available in Table S1–7 and in Figure S1.
Table 1.
Descriptive statistics for all 22,092 individuals, stratified by SSD cases and population controls.
| Variable | Overall, N = 22,092 | Cases, N = 11,046 | Controls, N = 11,046 |
|---|---|---|---|
| Has exome data | 5,969 (27%) | 3,116 (28%) | 2,853 (26%) |
| iPSYCH1 | 12,035 (54%) | 4,941 (45%) | 7,094 (64%) |
| iPSYCH2 | 10,023 (45%) | 6,075 (55%) | 3,948 (36%) |
| ADHD (attention-deficit hyperactivity disorder) | 1,603 (7.3%) | 1,401 (13%) | 202 (1.8%) |
| ASD (autism spectrum disorder) | 1,358 (6.1%) | 1,216 (11%) | 142 (1.3%) |
| BIP (bipolar disorder) | 715 (3.2%) | 671 (6.1%) | 44 (0.4%) |
| MDD (major depressive disorder) | 4,157 (19%) | 3,760 (34%) | 397 (3.6%) |
| Age in 2016 | |||
| Median (IQR) | 27.1 (23.3, 30.9) | 27.1 (23.3, 30.9) | 27.1 (23.3, 30.9) |
| Range | 11.0, 35.7 | 11.0, 35.7 | 11.0, 35.7 |
| Sex assigned at birth | |||
| F | 10,637 (48%) | 5,259 (48%) | 5,378 (49%) |
| M | 11,455 (52%) | 5,787 (52%) | 5,668 (51%) |
Figure 1. Study overview and distribution of rare variant burden in schizophrenia-associated genes.
A) Schematic representation of the analytical workflow. We used a VAE on RWD from national health registries, integrating psychiatric and non-psychiatric diagnoses, hospital contacts, and clinical family history on 22,092 individuals with and without a diagnosis of SSD. We then clustered the latent space using k-means to define data-driven subgroups. On top of these clusters, we mapped genetic signal from both common variants (PGSs for psychiatric disorders) and from rare variants (burden of rare variants with CADD score > 15). The rare variant burdens were aggregated across different gene sets, including biologically informed PPI networks. B) Scatterplot showing the number of rare CADD > 15 variants per protein (y-axis) and the sum of alternate alleles (x-axis) across all 3,116 individuals diagnosed with SSD and having exome data, focusing on genes from the SCHEMA_32 list. Each dot represents one protein-coding gene, with schizophrenia-implicated genes labeled in red. Proteins such as HERC1, SRRM2, and SETD1A showed higher cumulative variant burden, highlighting genes of potential interest for subtype-specific rare variant analyses.
Two distinct clinical profiles emerged from clustering SSD cases and population controls
To assess whether we could differentiate individuals with SSD from unaffected population controls, we applied a VAE to integrate 131 registry-derived features from 22,092 individuals (11,046 SSD cases and 11,046 matched population controls), generating a 40-dimensional latent space after hyperparameter tuning (Methods). Feature importance analysis indicated that parental mental disorders and substance use, followed by individual hospital contacts and suicide attempt history, contributed most strongly to the latent structure (Figure S2A, Table S8). We then applied k-means clustering on the latent space representations of the individuals in order to stratify them. Clustering validity indices converged on a two-cluster solution as optimal, with the resulting two clusters exhibiting a clear separation (Figure 2A): Cluster A was population control-enriched (96.3% of population controls; 10,632/11,046), whereas Cluster B was case-enriched (69.8% of SSD cases; 7,710/11,046), with minimal population control overlap (3.7%). Notably, 30.2% of SSD cases (3,336/11,046) clustered with the population control-enriched group Cluster A, suggesting a subset with milder or less burdened clinical profiles. The case-enriched Cluster B showed a markedly higher clinical burden than the population control-enriched Cluster, with nearly fourfold higher psychiatric diagnosis rates (mean range: 0.012–0.56 vs 0.002–0.17), a 20-fold increase in psychiatric hospital contacts (31.7 vs 1.6 per individual), and substantially higher suicide attempt rates (39% vs 0.01%) (Figure 2A, Table S9). Except for Apgar scores, paternal age, and number of previous pregnancies, all differences were statistically significant (Bonferroni corrected p < 0.001 after linear, logistic, and Poisson/negative binomial regressions). Importantly, most features were included as inputs in the VAE model, except smoking during pregnancy, infections and other medical conditions. These were excluded due to limited availability across cohorts, or because information such as birth-related traits may not be consistently available in broader real-world clinical datasets, limiting their applicability for generalizable modeling. The cluster structure hence primarily reflected variation in the modeled clinical features. The clusters also differed on several variables that were not used as input in the VAE model, including higher parental and individual infection burden and more adverse birth-related characteristics for Cluster B. Together, these results show that VAE-derived latent embeddings can capture clinically meaningful structure in real-world registry data, separating a high-burden SSD-enriched subgroup from a lower-burden, population-control enriched group.
Figure 2. Cluster profiles of individuals with SSD reveal ten distinct clinical subgroups.

A) Heatmap visualization of the cluster profiles for the 22,092 SSD cases and population controls, divided in clusters A (population controls-enriched) and B (cases-enriched). B) Cluster profiles for ten clusters of the 11,046 SSD cases. Mental disorders are grouped by ICD-10 chapter (F1, F4–9) except for the more granular diagnoses of mood disorders: F30-Manic episode, F31-BIP, F32-MDD-single, and F33-MDD-recurrent. Hospital contacts were computed as the sum of all hospital contacts per individual for each main psychiatric diagnosis, i.e. ADHD, ASD, BIP, MDD, SCZ, SSD. The same approach was followed for the remaining count variables, from parental mental disorders to individual other medical conditions. The continuous variables from the medical birth registry were z-score normalized. Color scales reflect: Odds Ratios (ORs) from logistic regressions comparing binary outcomes across each cluster versus all others, Rate Ratios (RRs) from Poisson or negative binomial regressions applied to count data, Standardized averages (Std. Avg.) from linear regressions applied to continuous variables. Unlike beta coefficients, which may obscure subtle trends, standardized averages more clearly highlight cross-cluster patterns. All tests were adjusted for sex assigned at birth, age, iPSYCH1 membership, and the first ten principal components (to account for potential residual population stratification). The significance levels are defined as: p-value < 0.05: *, p-value < 0.01: **, p-value < 0.001: *** (all after Bonferroni correction). The variables with a grey bar next to their labels were not used as input features for the VAE model.
SSD cases stratify into ten subgroups, seven of which show severe comorbidities
We then proceeded to identify more fine-grained structure in the SSD cases. To this end, we applied k-means clustering to the latent representations from a VAE retrained on the 11,046 SSD cases. This latent space was primarily driven by parental mental disorders, particularly substance abuse, in addition to individual hospital contacts (Figure S2B, Table S10), and revealed ten clinically distinct groups (Figure 2B and Table S11 for all test p-values and effect sizes). Clusters 1 through 7 captured more severe and comorbid subgroups, though with differing clinical signatures. Cluster 1 was marked by elevated rates of substance use diagnoses, along with high SCZ-related hospital contacts and the highest hospitalization rate for SSD (Rate Ratio 6.42; average count = 114; p < 0.001, Poisson regression), but lower number of recurrent MDD diagnoses. Cluster 2 was the most comorbid, spanning multiple mood, anxiety, and developmental disorders, in addition to having the highest rates of recurrent MDD (OR 9.335, avg 0.495, p-value < 0.001, logistic regression), and the greatest hospital contact burden for recurrent MDD, ASD, and BIP. Clusters 4 and 6 ranked next in overall comorbidity after Cluster 2, with Cluster 4 showing enrichment for intellectual disability, developmental disorders, bipolar disorder, and manic episodes, and Cluster 6 marked by elevated rates of major depressive disorder and neurotic, personality, and behavioral disorders.
Other clusters showed more specific profiles, including high rates of intellectual disability but low suicidal behavior (Cluster 3), and elevated substance use and suicidal behavior alongside more non-psychiatric medical conditions (Cluster 5). Cluster 7 stood out for having higher substance use, behavioral disorders including ADHD, and ADHD hospital contacts, along with increased exposure to early-life adversities such as maternal infections and low birth weight, in addition to having younger parents. Together, these findings illustrate how unsupervised clustering of real-world registry data can reveal clinically meaningful subgroups within SSD, underscoring its clinical heterogeneity, and sets the stage for exploring their distinct genetic underpinnings.
Three SSD subgroups showed reduced clinical severity and may reflect stable or focused psychotic profiles
Clusters 8–10 reflected less severe clinical profiles of SSD cases when compared to clusters 1–7. These groups had markedly lower psychiatric comorbidity, hospital contacts, maternal mental illness, and suicidality. More specifically, Cluster 8 showed reduced hospital contacts across most disorders, including SCZ, but maintained higher SSD-related hospital use. This discrepancy suggests that psychotic presentations in Cluster 8 are less frequently classified as schizophrenia (F20) and more often fall within other psychotic-spectrum diagnoses (F21–F29). Cluster 9 had the lowest rates of psychiatric diagnoses and hospital contacts across all conditions. These individuals may represent stable or less severely affected cases, or potentially people whose diagnosis may have changed or improved over time. Cluster 10 also had lower rates of co-occurring psychiatric disorders, but average rates for substance use disorders and behavioral syndromes. However, it had higher numbers of hospital contacts for SCZ, suggesting a form of SSD that was more narrowly focused on psychotic symptoms without other major mental health issues.
Strong PGS enrichments mirror clinical severity and neurodevelopmental factors
To explore whether clinical heterogeneity within SSD reflected underlying differences in common genetic liability, we examined how the clinically defined clusters differed in terms of polygenic scores (PGSs) for five psychiatric traits: attention-deficit hyperactivity disorder (ADHD), autism spectrum disorder (ASD), bipolar disorder (BIP), major depressive disorder (MDD), and schizophrenia (SCZ). As expected, the case-enriched Cluster B had higher PGSs across all five psychiatric traits compared to the population control-enriched Cluster A (Figure 3A left, Table S12). Conversely, the PGS profiles across the ten SSD clusters revealed heterogeneous patterns.
Figure 3. Distinct genetic profiles across clinically derived SSD subgroups.
A) Heatmaps showing the results of the association analyses for genome wide PGSs of five psychiatric disorders, ADHD, MDD, ASD, BIP, and SCZ, across the initial population control vs. SSD case clustering (left) and the ten SSD clinical subgroups (right). Beta coefficients are derived from linear models comparing each cluster to all others, adjusted for sex assigned at birth, age, first ten principal components, and iPSYCH1 membership. Significant enrichments are indicated by black asterisks (after Bonferroni correction) and white asterisks (after FDR correction): p-value < 0.05: *, p-value < 0.01: **, p-value < 0.001: ***. B) Heatmaps showing rate ratios (RRs) of rare CADD >15 variant burden across curated schizophrenia gene sets and protein–protein interaction (PPI) networks. Left panel shows comparison between SSD cases and population controls; right panel displays burden across the ten SSD subgroups.
Cluster 1, defined by high SSD and SCZ related hospital contacts and high substance use, showed the highest levels of PGSs for both SCZ and BIP (with betas of 0.158 and 0.135, and standardized averages of 0.135 and 0.131, respectively, both with Bonferroni p-value < 0.001) (Figure 3A left, black asterisks, Table S13). This supported a potential link between SSD symptoms severity and common variant burden for SCZ and BIP. Cluster 7, which had higher behavioral disorders diagnoses, ADHD hospital contacts, and early-life adversity, showed significantly elevated PGSs for ADHD and MDD, pointing to shared common polygenic liability across neurodevelopmental traits and behavioral dysregulation. Individuals in Cluster 9, characterized by the lowest comorbidity and hospital contacts across all disorders, showed significantly lower PGSs for BIP and SCZ, reaching values of standardized averages of −0.132 for BIP (Beta −0.154, p-value < 0.001) and of −0.092 for SCZ (Beta −0.114 p-value < 0.05). This could potentially indicate a less genetically loaded presentation of SSD.
Modest PGS enrichments hint at alternative or non-genetic risk in specific SSD groups
Other clusters showed more modest patterns, reaching significance only after FDR correction. Cluster 3, defined by developmental delay and low overall hospital burden, was the only cluster with consistently lower PGSs across traits, reaching significance for ASD and BIP (Figure 3A right, white asterisks, Table S13). This suggested a potentially greater contribution of non-polygenic factors, such as environmental influences or rare variants. Cluster 8, which had low overall comorbidity but maintained SSD-related hospital contacts, showed higher ASD PGS and lower BIP PGS, potentially reflecting a distinct liability profile in which core neurodevelopmental genetic risk manifests without broader mood or behavioral comorbidity. Cluster 10 was the second cluster, after Cluster 1, to have significantly higher PGS for SCZ (after FDR correction, Beta 0.102, std avg 0.105, q-value 0.0089). Given its high SCZ-related hospital contact burden and limited comorbidity, this pattern is consistent with an SSD presentation predominantly centered on psychotic features and enriched for schizophrenia-specific polygenic liability. Overall, these findings show that PGSs for psychiatric disorders were unevenly enriched across SSD subgroups, reflecting specific clinical dimensions such as psychosis severity, neurodevelopmental features, and comorbidity, highlighting the value of stratified analyses to capture SSD heterogeneity.
Case-enriched cluster shows reduced rare variant burden in SCHEMA_10, unlike broader SCZ gene lists
Next, we assessed whether clinical subgroups differed in their number of rare CADD variants aggregated across SCZ related gene lists and PPI networks (see Methods). Although we could not detect any statistically significant signal, the distribution of rare CADD variant burden across gene sets revealed suggestive trends. The first pattern that stood out was related to the SCHEMA_10 gene set, comprising the ten genes most strongly associated with schizophrenia via ultra-rare coding variants in the SCHEMA study.9 In our case vs population control-enriched analysis, this gene set showed a trend in the opposite direction: individuals in the case-enriched Cluster B had a slightly lower burden of CADD variants in these genes compared to the population control-enriched Cluster A (RR 0.915, avg 0.271 vs 0.294) (Figure 3B left, Table S14). A similar pattern was observed for the HCN4_ExN protein network.
Notably, all other protein lists and particularly SCHEMA_32, a broader gene set that contains all SCHEMA_10 genes, showed a reversed direction of effect, with slightly higher CADD variants burden in the case-enriched Cluster B. This divergence may reflect statistical noise due to the small number of genes in SCHEMA_10 and HCN4_ExN (10 and 68, respectively).
CADD variant burden from SCZ-informed PPI networks show suggestive patterns across SSD clusters
This further motivated a deeper examination of the CADD burden in the within-case stratification. Here we observed that the burden of CADD variants in the SCHEMA_10 was lowest in Cluster 1, which had the most severe hospital contact profile for SCZ and SSD (Figure 3B right, Table S15). Conversely, the highest burdens for SCHEMA_10 were found in clusters 3 and 4, characterized by developmental or mixed comorbidity profiles, and not by SSD severity per se. The pattern of CADD burden across the 10 SSD clusters of the protein networks HCN4_ExN and SCZ_Module3, PPI modules identified in stem-cell derived neurons from SCZ associated genes, most closely resembled that of SCHEMA_10 (Pearson correlation coefficient of 0.52 and 0.75, respectively), with opposite direction effects for Cluster 5 and 9 (positive in SCHEMA_10, negative in the others). GRIA3_ExN, a PPI network centered on GRIA3, was the gene list that differed the most from the SCHEMA_10, HCN4_ExN, and SCZ_Module3, with correlations coefficients of −0.42, −0.36, −0.39, respectively. It was also the only protein network with higher CADD variants burden in Cluster 1, in addition to Cluster 2 and 4, while it had a lower burden in Cluster 3 and 5 compared to the rest of the clusters. Interestingly, Cluster 10, one of the least comorbid clusters, but characterized by higher hospital contacts related to SCZ diagnosis, was the only one having a lower CADD burden across all 6 gene lists. Finally, as expected, the WholeProteome background set, a list of proteins expressed in stem cell derived neurons, detected by mass spectrometry, showed the least amount of variation across both the first clustering (Figure 3B left), and across the 10 SSD clusters (Figure 3B right), with average number of CADD variants per cluster ranging from 115 to 117, compared to the overall average of 116. Together, these findings underscore the importance of integrating clinical heterogeneity when interpreting genetic signals, even from high-confidence gene sets. While preliminary and not statistically significant, these patterns suggest that rare variant load may not be uniformly elevated across all SSD presentations and may instead concentrate in more narrowly defined subgroups.
DISCUSSION
Connecting molecular risk signals with the clinical heterogeneity in schizophrenia spectrum disorders (SSDs) remains a central translational challenge in psychiatric genetics. Although GWAS and exome sequencing have identified many risk loci and rare variants for schizophrenia,5,9,33 these findings have yet to be effectively integrated with real-world clinical data to inform patient stratification. Here, we stratified 22,092 individuals with and without SSD from the iPSYCH cohort using unsupervised clustering of clinical registry data integrated through a DL model. We then mapped genome-wide polygenic scores for psychiatric disorders and rare coding variant burden from biologically informed gene sets and PPI networks across the resulting subgroups. This extends our previous work20 to more individuals with SSD and to focus on mapping of rare genetic signal.
We identified ten clinically distinct SSD subgroups with varying comorbidity, early-life adversity, hospitalization profiles, and family history. Clusters 2 and 4 were both highly comorbid, but only Cluster 2 showed elevated hospital contacts, possibly reflecting differences in access to care or disease course. Clusters 8–10 showed the lowest overall burden, while Cluster 7 stood out for its neurodevelopmental profile: lower parental age, low birth weight, high ADHD comorbidity, and familial psychiatric history, suggesting a distinct early-life risk profile. This stratification aligns with prior evidence that schizophrenia is not a homogeneous disorder but includes subgroups defined by differences in cognitive trajectories and premorbid functioning, including neurodevelopmental and affective profiles.34,35
These clinical differences were mirrored by distinct patterns of common genetic risk. Cluster 1, marked by substance use and high hospitalizations, had the highest PGSs for SCZ and BIP, possibly suggesting that their severity may, in part, be associated with or potentially driven by an elevated burden of common risk variants. Cluster 7, with neurodevelopmental traits, had elevated PGSs for ADHD and MDD. Conversely, Cluster 3, defined by developmental delay and low clinical burden, had uniformly low PGSs, pointing to alternative etiologies, including rare or environmental factors. These results are consistent with earlier research suggesting that PGSs can differentiate SSD subgroups by cognitive course, affective comorbidity, and age at onset,36,37 and that common genetic variation maps onto clinically meaningful dimensions such as cognition, externalizing behavior, and mood instability.38 Although PGSs have limited predictive power for psychiatric disorders,39 they may still capture clinically meaningful variation across symptom dimensions, potentially helping in understanding the etiological differentiation of clinical subtypes.
A key contribution of our study is the mapping of rare coding variant burden (filtered for CADD score > 15 to prioritize variants with likely functional impact) from biologically informed SSD-relevant gene sets and PPI networks. These networks represent curated modules derived from orthogonal molecular and in silico experiments spanning multiple omics layers and offer a biologically grounded scaffold for grouping rare variants with potentially shared functional consequences towards schizophrenia.30 While no comparisons reached statistical significance after multiple testing correction, suggestive patterns emerged. SCHEMA_10 and HCN4_ExN showed lower rare variant burden in the case enriched Cluster B than in the population control-enriched Cluster A, unlike broader gene sets such as SCHEMA_32 (Figure 3B). While this may reflect limited power due to small gene sets and sample size, it might also be consistent with Simpson’s paradox, where aggregate effects mask or even reverse trends present within subgroups. Indeed, within the SSD cases, the SCHEMA_10 and HCN4_ExN gene sets showed lower burden in Clusters 1 and 10 compared to other gene sets. These subgroup-specific depletions may have disproportionately influenced the first clustering (Cluster A vs B), creating an apparent depletion in Cluster B that does not reflect the broader distribution within the cases. Cluster 3, notable for its low PGSs and developmental profile, had a higher burden for the SCHEMA_10 and HCN4_ExN gene sets. This cluster may represent a SSD subgroup more influenced by deleterious variants in these gene sets, consistent with evidence that schizophrenia and neurodevelopmental disorders share pathogenic coding variants, suggesting that certain schizophrenia subtypes might lie on a continuum of neurodevelopmental impairment.40 This cluster, however, had lower rare variant burden for SCHEMA_32. Such seemingly counter-intuitive findings reinforce the need to account for clinical heterogeneity when interpreting genetic burden, even in gene sets with strong prior associations. In contrast, Clusters 1 and 10 had low rare variant burden across almost all gene sets, but high SCZ PGSs, aligning with the notion that rare and common variant contributions may inversely correlate. GRIA3_ExN burden was elevated in highly comorbid Clusters 2 and 4, suggesting that different biological modules may track distinct aspects of clinical complexity. Together, these results suggest that rare variant burden within SCZ-informed PPI networks is unevenly distributed across SSD subgroups may help link distinct genetic signals to clinical phenotypes. Prior systems genomics, multi-omics, and stem-cell studies have shown that schizophrenia risk genes converge on synaptic and neurodevelopmental pathways, supporting the biological relevance of these network-based analyses7,41,42
While exploratory, our results underscore the potential value of integrating rare genetic data with real world-based patient stratification to help improve our understanding of psychiatric disease mechanisms. Larger samples and improved annotations will be key to validating these early signals. More broadly, our approach supports the view that distinct routes to schizophrenia may exist, some shaped by polygenic risk, others by rare variants or early adversity.16 Recognizing compensatory or divergent genetic architectures across subgroups may help resolve inconsistencies across studies or patient cohorts and could ultimately inform more precise interventions.
Our findings are situated at the intersection of two major research currents in psychiatry: (1) the movement toward biologically informed nosology, and (2) the rise of multi-modal and network-based approaches to elucidate disease mechanisms. The need to move beyond DSM-based diagnostic categories has been repeatedly emphasized,43,44 and newer efforts have sought to redefine mental disorders using dimensional, data-driven approaches.45–47 Our latent clustering of registry data contributes to this goal by revealing clinically coherent subgroups that cut across traditional diagnostic boundaries, with early indications of validity supported by their alignment with genetic risk profiles. Finally, by using real-world registry data, researchers can leverage information that is increasingly available in many healthcare systems and biobanks. Combined with genetic approaches such as PGSs and network-informed rare variant mapping, our approach could be applied to other complex traits with known rare variant contribution, including mood disorders, autism, or non-psychiatric diseases marked by phenotypic and genetic heterogeneity, such as diabetes, obesity, and cardiovascular disease.48–51 In particular, many nonpsychiatric disorders are studied in far larger cohorts and therefore offer substantially greater power for rare variant analyses.
LIMITATIONS
This study has several limitations. First, although national health registries contain rich longitudinal data, our chance to model this complexity was constrained by the low number of variables available in our setting, i.e. 131 input variables. Second, while unsupervised DL enables discovery of latent clinical structure, it poses challenges for interpretability and replication. The subgroups we identified were data-driven and post hoc, requiring external validation before firm conclusions can be drawn. A further limitation concerns the size and composition of the genetic dataset. While the full clinical cohort was relatively large, the exome sequencing data was available only for a smaller subset, reducing the power of our rare variant analyses. Another limitation concerns temporal changes in diagnostic practices. During the study period (1980–2008), recognition of conditions such as autism and ADHD increased substantially, and thus individuals born earlier in the cohort were less likely to receive these diagnoses, which in turn may influence diagnosis-based features and the clinical patterns captured by the model. In addition, some gene sets used, such as SCHEMA_10, SCHEMA_32, and HCN4_ExN, were small and prone to high variance. Finally, the SCHEMA-derived gene sets and PPI networks we used in this study were derived using datasets that included individuals from our own iPSYCH cohort, potentially introducing circularity and an inflation of associations. However, given the exploratory nature of our analyses and the lack of statistically significant rare variant findings, our results do not constitute new genetic associations but rather illustrate the potential of combining clinical subtyping with biologically informed gene sets as a framework for future studies.
MATERIALS AND METHODS
Dataset
We used the iPSYCH2015 dataset from the Danish iPSYCH consortium,31,32 covering all individuals born in Denmark between 1981–2008 (N = 1,657,499). From this population, 88,831 psychiatric cases and 48,420 population-based controls were genotyped, yielding a case-cohort sample of 137,251 individuals. The cohort was assembled in two stages: iPSYCH1 (known as iPSYCH2012, with 57,377 cases and 30,000 population controls), and iPSYCH2 (known as iPSYCH2015i, with 36,741 cases, 19,982 population controls). DNA was collected at birth from dried blood spots stored in the Danish Neonatal Screening Biobank.52 Genotyping used the Infinium PsychChip v1.0 for iPSYCH1, and the Global Screening Array v2 for iPSYCH2. For both arrays, stringent quality control and imputation pipelines were applied as described in Schork et al. 2019.53 Among the 22,092 SSD cases and population controls included in this study, 5,969 individuals (belonging to the iPSYCH1 wave) had exome data that was generated in Singh et al. 20229 and that was processed following Satterstrom et al. 2019.54 Only unrelated participants of European ancestry were retained for the analysis. Diagnoses were retrieved through the Danish Psychiatric Central Research Register55 and the Danish National Patient Register.56 These registries capture all specialist-led psychiatric assessments conducted in both inpatient and outpatient settings and did not include data for primary care contacts. Diagnoses were recorded under the International Classification of Diseases, Eight Revision or Tenth Revision (ICD-8, ICD-10), and included attention-deficit/hyperactivity disorder (ADHD: F90.0 ICD-10 code), autism spectrum disorders (ASD: F84.0, F84.1, F84.5, F84.8, F84.9), bipolar disorders (BIP: F30–31), major depressive disorder (MDD: F32, F33), schizophrenia spectrum disorder (SSD: F20-F29). For this study, we included all SSD individuals as cases, totaling 11,046. We then extracted an age-matched sample of 11,046 individuals from the random population, not having been diagnosed with SSD, to use as population control. The identities of all participants were pseudonymized in compliance with data protection regulations. The study protocol was approved by the Danish Data Protection Agency, and, under Danish law, individual informed consent was not required. For the VAE analysis, we compiled a dataset of 131 variables derived from registry data, which captured both the clinical and familial aspects of patient health. These variables included age-at-diagnosis information for all psychiatric diagnoses recorded in the registry available (ICD-10 codes from F00 to F99), as well as some non-psychiatric diagnoses that were available. Additionally, we included severity measures such as attempted suicide and number of hospital contacts related to psychiatric diagnoses. Finally, we included binarized parental psychiatric diagnoses and some non-psychiatric related diagnoses as a family history indicator. Other variables were not included in the input data. For example, although birth registry information was available, this was excluded from the input data as we deemed it to be not actionable within the bigger scope of this proof-of-concept study. Moreover, we didn’t include infections and autoimmune diagnoses as these were generally available only for individuals in the iPSYCH1 cohort. Nevertheless, we incorporated these external variables into the cluster profile analysis to provide additional insight into the identified subgroups. The complete list of all variables is available in Table S16.
Preprocessing of input data to VAE model
Continuous variables, including age at diagnosis and the number of hospital contacts, were standardized using z-score normalization after training and test split, resulting in a mean of zero and unit variance for each feature. Missing values in these continuous measures were replaced with the feature’s mean value (zero after normalization). Categorical variables, such as family history and suicide attempts, were one-hot encoded and subsequently flattened for model input, with absent values represented by an all-zero vector. To ensure data robustness, only variables observed in at least 1% of cases were retained for analysis. After this filtering step, 131 variables were included as input data to the VAE model. See Table S16 for an overview of all variables.
VAE implementation and model
We integrated the multimodal data using a β-Variational Autoencoder (VAE) model, adapted from the MOVE framework.57 The model consisted of fully connected encoder and decoder networks with LeakyReLU activations,58 batch normalization, and dropout rate of 0.2, mapping inputs to a Gaussian latent space via the reparameterization trick. Reconstruction losses were computed separately for continuous and categorical variables using cross-entropy and mean squared error, respectively. The overall loss combined these reconstruction terms, which were normalized by dataset size and batch size, with a Kullback-Leibler divergence (KLD) term that regularized the latent space towards a standard normal distribution, weighted by a tunable parameter β. We defined the loss as:
Where Ecat and Econ are vectors of the reconstruction error for the categorical and continuous datasets, β is the weight applied to the KL-divergence KLD. Following Kingma et al, 2013,59 this was defined as:
where N is the batch size, and the summation is performed over the J latent dimensions for each sample. Missing continuous data were encoded as feature-wise mean values (zero after z-score normalization) and excluded from back-propagation, while missing categorical data were represented as zero vectors and handled via an ignore index in the loss computation to not be backpropagated. The VAE was optimized using the Adam optimizer,60 and the latent representation for each individual was derived by extracting the mean vector from the encoder after training. During training, we implemented a dynamic schedule to improve convergence. In particular, the batch size was dilated at predefined epochs 50, 100, and 150 by a factor of 1.5 to gradually incorporate more samples per mini-batch. We set the final number of training epochs to 300 to allow the model to train for a considerable amount of time. Additionally, we ran the training using a KLD-warmup strategy, in which the KLD weight β was set to 0 for the first epochs, and then incrementally raised from epoch 80 through epoch 150 until it reached its final chosen value of 0.001 (see Hyperparameter tuning). This approach allowed for a gradual regularization of the latent space while the model adjusted to increasing batch sizes over the course of training.
Hyperparameter tuning
We randomly selected 11,000 individuals from the full dataset (N = 22,092) and split them into 80% training and 20% test sets using stratified sampling. Continuous variables were standardized after the split to prevent data leakage. We performed a grid search to optimize the variational autoencoder (VAE), varying the number of hidden layers (1, 2, or 3), the batch size (32, 64, or 128), the number of latent dimensions (2, 10, 20, 30, or 40), and the β coefficient of the Kullback–Leibler divergence (KLD) term (0 to 10 across 14 values, see Figure S3, Table S17). KLD warm-up schedules were applied, gradually increasing β during initial epochs. We evaluated models using reconstruction error (cosine similarity for continuous variables, top-probability match for categorical ones) and test log-likelihood, excluding missing data. Based on these results, we selected 0.001 as the best β value, representing the point just before train and test reconstruction performances started to degrade (Figure S4). Among all models trained with β = 0.001, we ranked the top 10 by aggregate performance across reconstruction (train/test) and test log-likelihood (Figure S5B, Table S18). To assess stability, each model was retrained 10 times and compared via changes in cosine similarity across latent projections. The model with the smallest average difference was deemed most stable (Figure S6, Table S19). The final model used one hidden layer of 600 neurons, a batch size of 64, 40 latent dimensions, and β=0.001, and was used to generate latent representations for both the full dataset (N = 22,092) and the SSD cases-only subset (N = 11,046).
Feature importance on latent space
We quantified how individual input features influenced the latent space learned by the VAE by employing a perturbation-based feature importance method, as implemented in MOVE.57 For each categorical and continuous feature, we generated perturbed datasets in which the target feature was set to missing, represented by an all-zero vector for categorical features and a value of zero for continuous ones, while keeping all other inputs unchanged. Each perturbed dataset was passed through the trained model to obtain new latent embeddings. We then measured the per-sample difference between the original and perturbed embeddings by summing the absolute changes across latent dimensions. The absolute differences in the latent representations were then aggregated across samples to yield an overall impact score per feature (Figure S2, Table S8 and S10).
Clustering
To assess whether the VAE-derived latent space contained meaningful clustering structure, we computed the Hopkins statistic, a measure for assessing the spatial randomness of multivariate data, using the hopkins function from the clusterend R package v1.7.61 The result (Hopkins of 0.0515 for the SSD cases and population control dataset, and of 0.1235 for the SSD cases only) indicated strong deviation from randomness and the presence of underlying cluster structure. For the combined dataset of SSD cases and population controls, we used the R package NbClust v3.0.1 with 26 clustering indices to estimate the optimal number of clusters.62 As expected, the algorithm supported number of clusters k = 2, reflecting diagnostic separation between cases and population controls (Figure S7A, Table S20). Clustering the cases-only dataset posed a greater challenge, because no clear ground truth was available and clustering algorithms may identify spurious structure in high dimensional data. We thus we adopted a more robust two-step procedure combining clustering indices from NbClust and Monte Carlo Reference-based Consensus Clustering (M3C) to assess their statistical significance by comparing them to a null reference distribution generated.63 NbClust supported k = 2, 6, and 10 (Figure S7B, Table S20), but only k = 10 yielded a statistically significant M3C p-value (p < 0.05; Figure S7C–D, Table S21). Thus, we selected k = 10 as the final number of clusters for downstream analyses.
Consensus clustering and clusters stability
To ensure robust clusters, we ran k-means clustering 100 times on the VAE-derived latent space using the kmeans function from base R with identical parameters. We then built a consensus matrix capturing how often each pair of individuals was clustered together, reducing sensitivity to random initialization. Final clusters were derived by reapplying k-means to this consensus matrix. To assess clustering stability, we computed Adjusted Rand Index (ARI) values across all pairs of runs using the cl_agreement function from the clue package v0.3.65.64,65 The Rand index quantifies the agreement between two clusterings by measuring the proportion of sample pairs consistently grouped together (or apart) across runs. The ARI, ranging from 0 (no agreement) to 1 (perfect agreement), had a median of 0.46 for the full SSD cases and population control dataset and 0.56 for the SSD-only latent space (Figure S8, Table S22).
Data preprocessing for cluster profile analyses
To perform the cluster profile analysis (Figure 2), we aggregated binary, count-based and continuous variables from registry data (see Methods: Dataset). For each individual, we grouped psychiatric diagnoses by ICD-10 chapters F1 and F4–F9, with each chapter collapsed into a binary variable (1 = any diagnosis present, 0 = absent). Due to their prevalence and clinical relevance, we retained the ICD-10 diagnoses F30 (manic episode), F31 (bipolar disorder), F32 (single depressive episode), and F33 (recurrent depressive disorder) as separate binary variables. Attempted suicide and maternal smoking during pregnancy (the latter not used as input in the VAE model) were encoded as binary variables. We aggregated hospital contact variables, which included number of contacts and total days admitted, for each person and for each key psychiatric diagnostic category: ADHD, ASD, BIP, SCZ, SSD, and MDD (both single and recurrent episodes).
Parental psychiatric diagnoses were summarized by counting ICD-10 F-chapter diagnoses per parent. Thirteen somatic conditions (e.g., asthma, epilepsy, congenital anomalies) present in ≥1% of cases were summed into a single “Other medical conditions in input” variable to provide a cumulative indicator of somatic comorbidity beyond psychiatric diagnoses. For infections, we applied the same approach by summing the number of infection-related diagnoses for each individual and their parents. These variables were excluded from the VAE model input, as they were predominantly available only for individuals in the iPSYCH1 cohort. We aggregated a set of 30 non-psychiatric autoimmune and inflammatory conditions (Table S16), including, for example, type 1 diabetes, ulcerative colitis, rheumatoid arthritis, multiple sclerosis, and celiac disease, and counted how many of these diagnoses were present for each individual and for each of their parent. Again, these were not used as input to the VAE model. Finally, continuous variables from the birth registry (not used in VAE input) were z-score normalized before cluster analysis.
PGS computation
To generate the PGSs, we obtained the latest external GWAS summary statistics for ASD,66 ADHD,67 SCZ,5 MDD,68 and BIP69 provided by the Psychiatric Genomics Consortium (PGC)70 in which the iPSYCH2015 individuals were excluded from the discovery sample of GWASs. After intersecting the SNPs in the target sample (i.e., iPSYCH2015) with the SNPs in HapMap3 as the reference, 1,041,920 SNPs were retained for PGS computation. The p-value threshold was set at 0.9 to determine SNP inclusion. To rescale the SNP effect sizes in each summary statistic, we used SBayesR71, which applies Bayesian multiple regression. The analysis was performed with default parameters for the number of mixture components (π) and the scaling factor (γ), specified as --pi 0.95,0.02,0.02,0.02,0.02,0.01 and --gamma 0,0.01,0.1,1, respectively. To compute the PGSs, we applied the --score module in PLINK v2.00a2.3LM to the rescaled SNPs’ effect sizes derived from SBayesR output files. The scores were subsequently normalized by subtracting the mean score of the control population (i.e., the iPSYCH2015 random cohort) and dividing by the standard deviation of scores within the control group.
Exome data processing
All variant calls (VCF files) were originally aligned to the hg19 (GRCh37) reference genome. For consistency with more recent assemblies, these were lifted over to the GRCh38 reference using bcftools (v1.20)72 with the +liftover plugin73 to each chromosome-level VCF while supplying the hg19ToHg38.over.chain.gz chain file downloaded from the UCSC Genome Browser.74 The VCF files were then sorted with bcftools sort and normalized through bcftools norm (with the -m -any option) to split multiallelic records and left-align INDELs. This pipeline step also removed any duplicate entries, yielding one final VCF file per chromosome in compressed and indexed format. All lifted and normalized files were subsequently annotated with Ensembl Variant Effect Predictor (VEP; version 112),75 which provided information on Combined Annotation–Dependent Depletion (CADD) scores to prioritize potentially deleterious variants.76 CADD scores integrate multiple annotation metrics into a single value, with higher scores indicating an increased likelihood that a variant is deleterious and adversely affects protein function.76 We filtered the exome data to retain only rare variants (< 1% AF) with CADD score > 15, a widely used cutoff for identifying variants likely to impair protein function and which coincides with the median score for canonical splice site and missense variants.77 For each individual, we counted the number of alternate alleles located in the coding regions of each protein-coding gene. We then summed these allele counts across genes belonging to predefined gene and protein sets, defined the following section. We used the resulting value, the sum of CADD alternate alleles per gene set per individual, to perform enrichment testing for each cluster against the rest (see Method: Statistical tests).
Derivation of gene sets
To investigate rare variant burden within biologically meaningful contexts, we analyzed curated gene sets derived from large-scale genetic association studies and neuronal protein–protein interaction (PPI) networks linked to schizophrenia. From the SCHEMA exome sequencing study,9 we included two gene sets: SCHEMA_10 (10 genes with exome-wide significance, p < 2.5×10−6), and SCHEMA_32 (32 FDR-significant, q < 0.05), capturing the strongest rare variant associations to schizophrenia to date. We also analyzed three PPI modules derived from excitatory neuron-specific (ExN) networks in human induced neurons, prioritized based on multi-layer enrichment across rare and common variant associations, transcriptomic, proteomic, and phosphoproteomic data.28 Two modules were centered on HCN4 and GRIA3 (HCN4_ExN, GRIA3_ExN), while the third (SCZ_Module3) merged networks anchored on five schizophrenia-linked genes (SRRM2, SETD1A, TRIO, RB1CC1, AKAP11). As a control background, we included the WholeProteome set (~8,000 proteins), representing baseline protein expression in neurons, which serves as a neuronal proteome background against which enrichment of genetic signal in PPI modules is typically tested.30 Together, these gene sets span the strongest rare variant evidence for schizophrenia as well as genetically anchored synaptic networks, allowing us to probe whether clinical subtypes show differing rare variant burdens across mechanistically relevant biological pathways.
Computation of principal components
To account for population structure, we computed principal components (PCs) from a quality-controlled subset of the merged genetic dataset of iPSYCH, containing the iPSYCH1 and iPSYCH2 cohorts. We used imputed genotype data from 107,766 unrelated individuals of European ancestry. SNP-level quality control steps included the removal of variants with minor allele frequency (MAF) < 0.01, imputation origin p-values < 0.0001, and duplicate variant identifiers. Regions of extended linkage disequilibrium (LD) were removed, resulting in 208,928 SNPs. Additional filtering was applied thresholds of MAF ≥ 0.05, missingness rate (--geno) ≤ 0.02, and Hardy-Weinberg equilibrium (--hwe) p-value ≥ 1×10−6, leading to 185,372 SNPs. Linkage disequilibrium pruning was performed using a 500 kb window and an r2 threshold of 0.5 (--indep-pairwise 500kb 0.5), reducing the set to 99,456 approximately independent SNPs. Principal component analysis was then conducted using PLINK v2.0 with the --pca approx flag to compute the first 10 principal components.
Statistical tests
To test whether clinical subgroups were enriched for specific variables (cluster profiles, PGSs, or CADD variant counts), we used regression models matched to each outcome type. Binary outcomes (i.e. mental disorders diagnoses, attempted suicide and smoking during pregnancy) were analyzed using logistic regression; continuous outcomes (i.e birth variables and PGSs) with linear regression; and count outcomes (i.e. hospital contacts, family history and medical conditions, and number of CADD alternate alleles) with Poisson regression. If overdispersion was detected (p < 0.05 via dispersion test), we used negative binomial models (NB) instead. Zero-inflated models were not needed, as no outcome suffered from a significant excess of zeros. For each test, we created a binary indicator for cluster membership and adjusted for sex, age, iPSYCH1 membership, and the first 10 genetic PCs. Comparisons were made between each cluster and the rest combined. We excluded comparisons with <5 events for logistic and count models. Effect sizes were reported as odds ratios (ORs) for logistic models, regression coefficients (beta) for linear models, and rate ratios (RRs) for Poisson models. Multiple testing correction was applied separately for clinical, PGS, and rare variant analyses using Bonferroni and FDR. Unless otherwise noted, all plots indicate statistical significance based on Bonferroni correction within the respective analysis. For the PGS results, we additionally reported associations that were significant after FDR adjustment but not Bonferroni, to capture potentially meaningful signals that may be missed by more conservative thresholds (see results and Fig 3A). All tests were run in R v4.3.178 using the functions glm (base R) or glm.nb (MASS v7.3–60) for NB models.79
Supplementary Material
SUPPLEMENTARY DATA
Supplementary figures and tables are available in the attached files.
Table 2.
Descriptive statistics for the 5,969 individuals, stratified by SSD cases and population controls, who had exome data available.
| Variable | Overall, N = 5,969 | Cases, N = 3,116 | Controls, N = 2,853 |
|---|---|---|---|
| iPSYCH1 | 5,537 (93%) | 2,685 (86%) | 2,852 (100%) |
| iPSYCH2 | 427 (7.2%) | 427 (14%) | 0 (0%) |
| ADHD | 746 (12%) | 644 (21%) | 102 (3.6%) |
| ASD | 768 (13%) | 664 (21%) | 104 (3.6%) |
| BIP | 304 (5.1%) | 283 (9.1%) | 21 (0.7%) |
| MDD | 1,054 (18%) | 997 (32%) | 57 (2.0%) |
| Age in 2016 | |||
| Median (IQR) | 26.3 (23.5, 29.6) | 27.1 (24.2, 30.0) | 25.5 (22.7, 28.7) |
| Range | 11.8, 35.7 | 11.8, 35.7 | 13.3, 35.7 |
| Sex assigned at birth | |||
| F | 2,363 (40%) | 1,335 (43%) | 1,028 (36%) |
| M | 3,606 (60%) | 1,781 (57%) | 1,825 (64%) |
FUNDING
L.C. was supported by an unrestricted research grant from Sidera Bio Aps. M.P.A., H.W., R.H.M, and S.R. were supported by the Novo Nordisk Foundation (NNF14CC0001, NNF23SA0084103 and NNF21SA0072102). K.L.G.H was supported by Lundbeckfonden (R335-2019-2318) and the National Institute of Mental Health (R01MH130581). Y.H.H., G.P., and K.L. were supported by grants from the Stanley Center for Psychiatric Research, the US National Institute of Mental Health (R01 MH109903 and U01 MH121499), the Simons Foundation Autism Research Initiative (awards 515064 and 735604), the Lundbeck Foundation (R223-2016-721 and R350-2020-963), and the Novo Nordisk Foundation (NNF21SA0072102 and NNF21CC0073729). G.P. was supported by Simons Foundation Autism Research Initiative Bridge to Independence Award 00002804.
CONFLICT OF INTEREST
L.C. was supported by an unrestricted research grant from Sidera Bio Aps. S.R. is the founder and owner of the Danish company BioAI and has performed consulting for Sidera Bio ApS and QuantumCell ApS. K.L. is equity holder at Sidera Bio ApS, QuantumCell ApS and consultant at Sidera Bio ApS and ZS Associates. Y.H.H. and G.P are consultants at QuantumCell ApS.
Funding Statement
L.C. was supported by an unrestricted research grant from Sidera Bio Aps. M.P.A., H.W., R.H.M, and S.R. were supported by the Novo Nordisk Foundation (NNF14CC0001, NNF23SA0084103 and NNF21SA0072102). K.L.G.H was supported by Lundbeckfonden (R335-2019-2318) and the National Institute of Mental Health (R01MH130581). Y.H.H., G.P., and K.L. were supported by grants from the Stanley Center for Psychiatric Research, the US National Institute of Mental Health (R01 MH109903 and U01 MH121499), the Simons Foundation Autism Research Initiative (awards 515064 and 735604), the Lundbeck Foundation (R223-2016-721 and R350-2020-963), and the Novo Nordisk Foundation (NNF21SA0072102 and NNF21CC0073729). G.P. was supported by Simons Foundation Autism Research Initiative Bridge to Independence Award 00002804.
DATA AVAILABILITY
The iPSYCH data were securely stored and analyzed on the Danish national high-performance computing infrastructure (https://genome.au.dk/). All personal identifiers were anonymized prior to research use. The study was approved by the Danish Data Protection Agency, and, under Danish law, individual informed consent was not required. Access to iPSYCH data is governed by Danish regulations, and the iPSYCH consortium facilitates data access for qualified researchers in accordance with these legal frameworks. The code used in this project is available at https://github.com/leocob/SSD_RWE_paper.
REFERENCES
- 1.Jauhar S., Johnstone M., and McKenna P.J. (2022). Schizophrenia. The Lancet 399, 473–486. 10.1016/S0140-6736(21)01730-X. [DOI] [Google Scholar]
- 2.Owen M.J., Legge S.E., Rees E., Walters J.T.R., and O’Donovan M.C. (2023). Genomic findings in schizophrenia and their implications. Mol Psychiatry, 1–10. 10.1038/s41380-023-02293-8. [DOI] [PubMed] [Google Scholar]
- 3.Pardiñas A.F., Holmans P., Pocklington A.J., Escott-Price V., Ripke S., Carrera N., Legge S.E., Bishop S., Cameron D., Hamshere M.L., et al. (2018). Common schizophrenia alleles are enriched in mutation-intolerant genes and in regions under strong background selection. Nat Genet 50, 381–389. 10.1038/s41588-018-0059-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Purcell S.M., Wray N.R., Stone J.L., Visscher P.M., O’Donovan M.C., Sullivan P.F., Sklar P., Purcell (Leader) S.M., Stone J.L., Sullivan P.F., et al. (2009). Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460, 748–752. 10.1038/nature08185. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Trubetskoy V., Pardiñas A.F., Qi T., Panagiotaropoulou G., Awasthi S., Bigdeli T.B., Bryois J., Chen C.-Y., Dennison C.A., Hall L.S., et al. (2022). Mapping genomic loci implicates genes and synaptic biology in schizophrenia. Nature 604, 502–508. 10.1038/s41586-022-04434-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Hilker R., Helenius D., Fagerlund B., Skytthe A., Christensen K., Werge T.M., Nordentoft M., and Glenthøj B. (2018). Heritability of Schizophrenia and Schizophrenia Spectrum Based on the Nationwide Danish Twin Register. Biological Psychiatry 83, 492–498. 10.1016/j.biopsych.2017.08.017. [DOI] [PubMed] [Google Scholar]
- 7.Fromer M., Pocklington A.J., Kavanagh D.H., Williams H.J., Dwyer S., Gormley P., Georgieva L., Rees E., Palta P., Ruderfer D.M., et al. (2014). De novo mutations in schizophrenia implicate synaptic networks. Nature 506, 179–184. 10.1038/nature12929. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Purcell S.M., Moran J.L., Fromer M., Ruderfer D., Solovieff N., Roussos P., O’Dushlaine C., Chambert K., Bergen S.E., Kähler A., et al. (2014). A polygenic burden of rare disruptive mutations in schizophrenia. Nature 506, 185–190. 10.1038/nature12975. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Singh T., Poterba T., Curtis D., Akil H., Al Eissa M., Barchas J.D., Bass N., Bigdeli T.B., Breen G., Bromet E.J., et al. (2022). Rare coding variants in ten genes confer substantial risk for schizophrenia. Nature 604, 509–516. 10.1038/s41586-022-04556-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Davies C., Segre G., Estradé A., Radua J., De Micheli A., Provenzani U., Oliver D., Salazar De Pablo G., Ramella-Cravaro V., Besozzi M., et al. (2020). Prenatal and perinatal risk and protective factors for psychosis: a systematic review and meta-analysis. The Lancet Psychiatry 7, 399–410. 10.1016/S2215-0366(20)30057-2. [DOI] [PubMed] [Google Scholar]
- 11.Arango C., Dragioti E., Solmi M., Cortese S., Domschke K., Murray R.M., Jones P.B., Uher R., Carvalho A.F., Reichenberg A., et al. (2021). Risk and protective factors for mental disorders beyond genetics: an evidence-based atlas. World Psychiatry 20, 417–436. 10.1002/wps.20894. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Benros M.E., Mortensen P.B., and Eaton W.W. (2012). Autoimmune diseases and infections as risk factors for schizophrenia. Annals of the New York Academy of Sciences 1262, 56–66. 10.1111/j.1749-6632.2012.06638.x. [DOI] [PubMed] [Google Scholar]
- 13.Khandaker G.M., Cousins L., Deakin J., Lennox B.R., Yolken R., and Jones P.B. (2015). Inflammation and immunity in schizophrenia: implications for pathophysiology and treatment. The Lancet Psychiatry 2, 258–270. 10.1016/S2215-0366(14)00122-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Jablensky A. (2006). Subtyping schizophrenia: implications for genetic research. Mol Psychiatry 11, 815–836. 10.1038/sj.mp.4001857. [DOI] [PubMed] [Google Scholar]
- 15.Andreassen O.A. (2018). eHealth provides a novel opportunity to exploit the advantages of the Nordic countries in psychiatric genetic research, building on the public health care system, biobanks, and registries. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics 177, 625–629. 10.1002/ajmg.b.32561. [DOI] [Google Scholar]
- 16.Koch E., Pardiñas A.F., O’Connell K.S., Selvaggi P., Camacho Collados J., Babic A., Marshall S.E., Van Der Eycken E., Angulo C., Lu Y., et al. (2024). How Real-World Data Can Facilitate the Development of Precision Medicine Treatment in Psychiatry. Biological Psychiatry 96, 543–551. 10.1016/j.biopsych.2024.01.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Robinson J.R., Wei W.-Q., Roden D.M., and Denny J.C. (2018). Defining Phenotypes from Clinical Data to Drive Genomic Research. Annu. Rev. Biomed. Data Sci. 1, 69–92. 10.1146/annurev-biodatasci-080917-013335. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Miotto R., Li L., Kidd B.A., and Dudley J.T. (2016). Deep Patient: An Unsupervised Representation to Predict the Future of Patients from the Electronic Health Records. Sci Rep 6, 26094. 10.1038/srep26094. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Krebs M.D., Themudo G.E., Benros M.E., Mors O., Børglum A.D., Hougaard D., Mortensen P.B., Nordentoft M., Gandal M.J., Fan C.C., et al. (2021). Associations between patterns in comorbid diagnostic trajectories of individuals with schizophrenia and etiological factors. Nat Commun 12, 6617. 10.1038/s41467-021-26903-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Allesøe R.L., Nudel R., Thompson W.K., Wang Y., Nordentoft M., Børglum A.D., Hougaard D.M., Werge T., Rasmussen S., and Benros M.E. (2022). Deep learning–based integration of genetics with registry data for stratification of schizophrenia and depression. Science Advances 8, eabi7293. 10.1126/sciadv.abi7293. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Schulte E.C., Kondofersky I., Budde M., Papiol S., Senner F., Schaupp S.K., Reich-Erkelenz D., Klöhn-Saghatolislam F., Kalman J.L., Gade K., et al. (2022). A novel longitudinal clustering approach to psychopathology across diagnostic entities in the hospital-based PsyCourse study. Schizophrenia Research 244, 29–38. 10.1016/j.schres.2022.05.001. [DOI] [PubMed] [Google Scholar]
- 22.Pelin H., Ising M., Stein F., Meinert S., Meller T., Brosch K., Winter N.R., Krug A., Leenings R., Lemke H., et al. (2021). Identification of transdiagnostic psychiatric disorder subtypes using unsupervised learning. Neuropsychopharmacol. 46, 1895–1905. 10.1038/s41386-021-01051-0. [DOI] [Google Scholar]
- 23.Green M.J., Girshkin L., Kremerskothen K., Watkeys O., and Quidé Y. (2020). A Systematic Review of Studies Reporting Data-Driven Cognitive Subtypes across the Psychosis Spectrum. Neuropsychol Rev 30, 446–460. 10.1007/s11065-019-09422-7. [DOI] [PubMed] [Google Scholar]
- 24.Tokuda T., Yamashita O., Sakai Y., and Yoshimoto J. (2021). Clustering of Multiple Psychiatric Disorders Using Functional Connectivity in the Data-Driven Brain Subnetwork. Front. Psychiatry 12, 683280. 10.3389/fpsyt.2021.683280. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Minikel E.V., Painter J.L., Dong C.C., and Nelson M.R. (2024). Refining the impact of genetic evidence on clinical success. Nature 629, 624–629. 10.1038/s41586-024-07316-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Nelson M.R., Tipney H., Painter J.L., Shen J., Nicoletti P., Shen Y., Floratos A., Sham P.C., Li M.J., Wang J., et al. (2015). The support of human genetic evidence for approved drug indications. Nat Genet 47, 856–860. 10.1038/ng.3314. [DOI] [PubMed] [Google Scholar]
- 27.Chick S.L., Holmans P., Cameron D., Grozeva D., Sims R., Williams J., Bray N.J., Owen M.J., O’Donovan M.C., Walters J.T.R., et al. (2025). Whole-exome sequencing analysis identifies risk genes for schizophrenia. Nat Commun 16, 7102. 10.1038/s41467-025-62429-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Hsu Y.-H.H., Pintacuda G., Liu R., Nacu E., Kim A., Tsafou K., Petrossian N., Crotty W., Suh J.M., Riseman J., et al. (2023). Using brain cell-type-specific protein interactomes to interpret neurodevelopmental genetic signals in schizophrenia. iScience 26, 106701. 10.1016/j.isci.2023.106701. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Pintacuda G., Hsu Y.-H.H., Tsafou K., Li K.W., Martín J.M., Riseman J., Biagini J.C., Ching J.K.T., Mena D., Gonzalez-Lozano M.A., et al. (2023). Protein interaction studies in human induced neurons indicate convergent biology underlying autism spectrum disorders. Cell Genomics 3, 100250. 10.1016/j.xgen.2022.100250. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Pintacuda G., Hsu Y.-H.H., and Petra Páleníková K.L., Dubonyte Ugne, Fornelos Nadine, Chen Miao, Mena Daya, Biagini Julia C., Botts Travis, Martorana Makayla, Rebelo Danzel, Ching Joshua K.T., Crouse Ethan, Gebre Hilena, Adiconis Xian, Haywood Nathan, Simmons Sean, Weïwer Michel, Hawes Derek, Pietilainen Olli, Werge Thomas, Li Ka Wan, Smit August B., Kirkeby Agnete, Levin Joshua Z., Nehme Ralda (2025). A foundational neuronal protein network model unifying multimodal genetic, transcriptional, and proteomic perturbations in schizophrenia. preprint.
- 31.Bybjerg-Grauholm J., Pedersen C.B., Bækvad-Hansen M., Pedersen M.G., Adamsen D., Hansen C.S., Agerbo E., Grove J., Als T.D., Schork A.J., et al. (2020). The iPSYCH2015 Case-Cohort sample: updated directions for unravelling genetic and environmental architectures of severe mental disorders 10.1101/2020.11.30.20237768. [DOI] [Google Scholar]
- 32.Pedersen C.B., Bybjerg-Grauholm J., Pedersen M.G., Grove J., Agerbo E., Bækvad-Hansen M., Poulsen J.B., Hansen C.S., McGrath J.J., Als T.D., et al. (2018). The iPSYCH2012 case–cohort sample: new directions for unravelling genetic and environmental architectures of severe mental disorders. Mol Psychiatry 23, 6–14. 10.1038/mp.2017.196. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Ripke S., Neale B.M., Corvin A., Walters J.T.R., Farh K.-H., Holmans P.A., Lee P., Bulik-Sullivan B., Collier D.A., Huang H., et al. (2014). Biological insights from 108 schizophrenia-associated genetic loci. Nature 511, 421–427. 10.1038/nature13595. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Amoretti S., Rabelo-da-Ponte F.D., Rosa A.R., Mezquida G., Sánchez-Torres A.M., Fraguas D., Cabrera B., Lobo A., González-Pinto A., Pina-Camacho L., et al. (2021). Cognitive clusters in first-episode psychosis. Schizophrenia Research 237, 31–39. 10.1016/j.schres.2021.08.021. [DOI] [PubMed] [Google Scholar]
- 35.Ferraro L., Quattrone D., La Barbera D., La Cascia C., Morgan C., Kirkbride J.B., Cardno A.G., Sham P., Tripoli G., Sideli L., et al. (2023). First-Episode Psychosis Patients Who Deteriorated in the Premorbid Period Do Not Have Higher Polygenic Risk Scores Than Others: A Cluster Analysis of EU-GEI Data. Schizophrenia Bulletin 49, 218–227. 10.1093/schbul/sbac100. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Dickinson D., Zaidman S.R., Giangrande E.J., Eisenberg D.P., Gregory M.D., and Berman K.F. (2020). Distinct Polygenic Score Profiles in Schizophrenia Subgroups With Different Trajectories of Cognitive Development. AJP 177, 298–307. 10.1176/appi.ajp.2019.19050527. [DOI] [Google Scholar]
- 37.Lee P.H., Anttila V., Won H., Feng Y.-C.A., Rosenthal J., Zhu Z., Tucker-Drob E.M., Nivard M.G., Grotzinger A.D., Posthuma D., et al. (2019). Genomic Relationships, Novel Loci, and Pleiotropic Mechanisms across Eight Psychiatric Disorders. Cell 179, 1469–1482.e11. 10.1016/j.cell.2019.11.020. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Deans P.M., Townsley K.G., Li A., Seah C., Johnson J., Gonzalez J.G., Cao E., Schrode N., Yu A., Cartwright S., et al. (2025). Convergent and non-additive impact of schizophrenia risk genes in human neurons. Preprint at bioRxiv, https://doi.org/10.1101/2022.03.29.486286 10.1101/2022.03.29.486286. [DOI] [Google Scholar]
- 39.Andreassen O.A., Hindley G.F.L., Frei O., and Smeland O.B. (2023). New insights from the last decade of research in psychiatric genetics: discoveries, challenges and clinical implications. World Psychiatry 22, 4–24. 10.1002/wps.21034. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Rees E., Creeth H.D.J., Hwu H.-G., Chen W.J., Tsuang M., Glatt S.J., Rey R., Kirov G., Walters J.T.R., Holmans P., et al. (2021). Schizophrenia, autism spectrum disorders and developmental disorders share specific disruptive coding mutations. Nat Commun 12, 5353. 10.1038/s41467-021-25532-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Garrett L., Trümbach D., Lee D., Mandillo S., Samaco R., Flenniken A.M., Stewart M., Aguilar-Pimental J.A., Amarie O.V., Becker L., et al. (2024). Co-expression of prepulse inhibition and Schizophrenia genes in the mouse and human brain. Neuroscience Applied 3, 104075. 10.1016/j.nsa.2024.104075. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Michael Deans P., Retallick-Townsley K.G., Li A., Seah C., Johnson J., Garcia Gonzalez J., Cao E., Schrode N., Yu A., Cartwright S., et al. (2026). Functional implications of polygenic risk for schizophrenia in human neurons. Nat Commun. 10.1038/s41467-025-67959-z. [DOI] [Google Scholar]
- 43.Insel T.R., and Cuthbert B.N. (2015). Brain disorders? Precisely. Science 348, 499–500. 10.1126/science.aab2358. [DOI] [PubMed] [Google Scholar]
- 44.Tanaka M. (2024). Beyond the boundaries: Transitioning from categorical to dimensional paradigms in mental health diagnostics. Adv Clin Exp Med 33, 1295–1301. 10.17219/acem/197425. [DOI] [PubMed] [Google Scholar]
- 45.Insel T., Cuthbert B., Garvey M., Heinssen R., Pine D.S., Quinn K., Sanislow C., and Wang P. (2010). Research Domain Criteria (RDoC): Toward a New Classification Framework for Research on Mental Disorders. AJP 167, 748–751. 10.1176/appi.ajp.2010.09091379. [DOI] [Google Scholar]
- 46.Kotov R., Krueger R.F., Watson D., Achenbach T.M., Althoff R.R., Bagby R.M., Brown T.A., Carpenter W.T., Caspi A., Clark L.A., et al. (2017). The Hierarchical Taxonomy of Psychopathology (HiTOP): A dimensional alternative to traditional nosologies. Journal of Abnormal Psychology 126, 454–477. 10.1037/abn0000258. [DOI] [PubMed] [Google Scholar]
- 47.Michelini G., Palumbo I.M., DeYoung C.G., Latzman R.D., and Kotov R. (2021). Linking RDoC and HiTOP: A new interface for advancing psychiatric nosology and neuroscience. Clinical Psychology Review 86, 102025. 10.1016/j.cpr.2021.102025. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Dornbos P., Koesterer R., Ruttenburg A., Nguyen T., Cole J.B., Leong A., Meigs J.B., Florez J.C., Rotter J.I., Udler M.S., et al. (2022). A combined polygenic score of 21,293 rare and 22 common variants improves diabetes diagnosis based on hemoglobin A1C levels. Nat Genet 54, 1609–1614. 10.1038/s41588-022-01200-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Smail C., Ferraro N.M., Hui Q., Durrant M.G., Aguirre M., Tanigawa Y., Keever-Keigher M.R., Rao A.S., Justesen J.M., Li X., et al. (2022). Integration of rare expression outlier-associated variants improves polygenic risk prediction. The American Journal of Human Genetics 109, 1055–1064. 10.1016/j.ajhg.2022.04.015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Walsh R., Jurgens S.J., Erdmann J., and Bezzina C.R. (2023). Genome-wide association studies of cardiovascular disease. Physiological Reviews 103, 2039–2055. 10.1152/physrev.00024.2022. [DOI] [PubMed] [Google Scholar]
- 51.Weiner D.J., Nadig A., Jagadeesh K.A., Dey K.K., Neale B.M., Robinson E.B., Karczewski K.J., and O’Connor L.J. (2023). Polygenic architecture of rare coding variation across 394,783 exomes. Nature 614, 492–499. 10.1038/s41586-022-05684-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Nørgaard-Pedersen B., and Hougaard D.M. (2007). Storage policies and use of the Danish Newborn Screening Biobank. J Inherit Metab Dis 30, 530–536. 10.1007/s10545-007-0631-x. [DOI] [PubMed] [Google Scholar]
- 53.Schork A.J., Won H., Appadurai V., Nudel R., Gandal M., Delaneau O., Revsbech Christiansen M., Hougaard D.M., Bækved-Hansen M., Bybjerg-Grauholm J., et al. (2019). A genome-wide association study of shared risk across psychiatric disorders implicates gene regulation during fetal neurodevelopment. Nat Neurosci 22, 353–361. 10.1038/s41593-018-0320-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Satterstrom F.K., Walters R.K., Singh T., Wigdor E.M., Lescai F., Demontis D., Kosmicki J.A., Grove J., Stevens C., Bybjerg-Grauholm J., et al. (2019). Autism spectrum disorder and attention deficit hyperactivity disorder have a similar burden of rare protein-truncating variants. Nat Neurosci 22, 1961–1965. 10.1038/s41593-019-0527-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Mors O., Perto G.P., and Mortensen P.B. (2011). The Danish Psychiatric Central Research Register. Scand J Public Health 39, 54–57. 10.1177/1403494810395825. [DOI] [PubMed] [Google Scholar]
- 56.Lynge E., Sandegaard J.L., and Rebolj M. (2011). The Danish National Patient Register. Scand J Public Health 39, 30–33. 10.1177/1403494811401482. [DOI] [PubMed] [Google Scholar]
- 57.Allesøe R.L., Lundgaard A.T., Hernández Medina R., Aguayo-Orozco A., Johansen J., Nissen J.N., Brorsson C., Mazzoni G., Niu L., Biel J.H., et al. (2023). Discovery of drug–omics associations in type 2 diabetes with generative deep-learning models. Nat Biotechnol 41, 399–408. 10.1038/s41587-022-01520-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58.Maas A.L., Hannun A.Y., and Ng A.Y. (2013). Rectifier Nonlinearities Improve Neural Network Acoustic Models.
- 59.Kingma D.P., and Welling M. (2013). Auto-Encoding Variational Bayes. Preprint at arXiv, https://doi.org/10.48550/arXiv.1312.6114 10.48550/arXiv.1312.6114. [DOI] [Google Scholar]
- 60.Kingma D.P., and Ba J. (2017). Adam: A Method for Stochastic Optimization. Preprint at arXiv, https://doi.org/10.48550/arXiv.1412.6980 10.48550/arXiv.1412.6980. [DOI] [Google Scholar]
- 61.Wright K., YiLan L., and RuTong Z. (2023). clustertend: Check the Clustering Tendency.
- 62.Charrad M., Ghazzali N., Boiteau V., and Niknafs A. (2014). NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set. Journal of Statistical Software 61, 1–36. 10.18637/jss.v061.i06. [DOI] [Google Scholar]
- 63.John C.R., Watson D., Russ D., Goldmann K., Ehrenstein M., Pitzalis C., Lewis M., and Barnes M. (2020). M3C: Monte Carlo reference-based consensus clustering. Sci Rep 10, 1816. 10.1038/s41598-020-58766-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64.Hornik K. (2005). A CLUE for CLUster Ensembles. J. Stat. Soft. 14. 10.18637/jss.v014.i12. [DOI] [Google Scholar]
- 65.Hornik K. (2024). clue: Cluster Ensembles.
- 66.Grove J., Ripke S., Als T.D., Mattheisen M., Walters R.K., Won H., Pallesen J., Agerbo E., Andreassen O.A., Anney R., et al. (2019). Identification of common genetic risk variants for autism spectrum disorder. Nat Genet 51, 431–444. 10.1038/s41588-019-0344-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 67.Demontis D., Walters G.B., Athanasiadis G., Walters R., Therrien K., Nielsen T.T., Farajzadeh L., Voloudakis G., Bendl J., Zeng B., et al. (2023). Genome-wide analyses of ADHD identify 27 risk loci, refine the genetic architecture and implicate several cognitive domains. Nat Genet, 1–11. 10.1038/s41588-022-01285-8. [DOI] [Google Scholar]
- 68.Howard D.M., Adams M.J., Clarke T.-K., Hafferty J.D., Gibson J., Shirali M., Coleman J.R.I., Hagenaars S.P., Ward J., Wigmore E.M., et al. (2019). Genome-wide meta-analysis of depression identifies 102 independent variants and highlights the importance of the prefrontal brain regions. Nat Neurosci 22, 343–352. 10.1038/s41593-018-0326-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 69.Mullins N., Forstner A.J., O’Connell K.S., Coombes B., Coleman J.R.I., Qiao Z., Als T.D., Bigdeli T.B., Børte S., Bryois J., et al. (2021). Genome-wide association study of more than 40,000 bipolar disorder cases provides new insights into the underlying biology. Nat Genet 53, 817–829. 10.1038/s41588-021-00857-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 70.The Psychiatric GWAS Consortium Steering Committee (2009). A framework for interpreting genome-wide association studies of psychiatric disorders. Mol Psychiatry 14, 10–17. 10.1038/mp.2008.126. [DOI] [PubMed] [Google Scholar]
- 71.Lloyd-Jones L.R., Zeng J., Sidorenko J., Yengo L., Moser G., Kemper K.E., Wang H., Zheng Z., Magi R., Esko T., et al. (2019). Improved polygenic prediction by Bayesian multiple regression on summary statistics. Nat Commun 10, 5086. 10.1038/s41467-019-12653-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 72.Danecek P., Bonfield J.K., Liddle J., Marshall J., Ohan V., Pollard M.O., Whitwham A., Keane T., McCarthy S.A., Davies R.M., et al. (2021). Twelve years of SAMtools and BCFtools. Gigascience 10, giab008. 10.1093/gigascience/giab008. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 73.Genovese G., Rockweiler N.B., Gorman B.R., Bigdeli T.B., Pato M.T., Pato C.N., Ichihara K., and McCarroll S.A. (2024). BCFtools/liftover: an accurate and comprehensive tool to convert genetic variants across genome assemblies. Bioinformatics 40, btae038. 10.1093/bioinformatics/btae038. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 74.Perez G., Barber G.P., Benet-Pages A., Casper J., Clawson H., Diekhans M., Fischer C., Gonzalez J.N., Hinrichs A.S., Lee C.M., et al. (2025). The UCSC Genome Browser database: 2025 update. Nucleic Acids Research 53, D1243–D1249. 10.1093/nar/gkae974. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 75.McLaren W., Gil L., Hunt S.E., Riat H.S., Ritchie G.R.S., Thormann A., Flicek P., and Cunningham F. (2016). The Ensembl Variant Effect Predictor. Genome Biol 17, 122. 10.1186/s13059-016-0974-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 76.Kircher M., Witten D.M., Jain P., O’Roak B.J., Cooper G.M., and Shendure J. (2014). A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet 46, 310–315. 10.1038/ng.2892. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 77.Castel S.E., Cervera A., Mohammadi P., Aguet F., Reverter F., Wolman A., Guigo R., Iossifov I., Vasileva A., and Lappalainen T. (2018). Modified penetrance of coding variants by cis-regulatory variation contributes to disease risk. Nat Genet 50, 1327–1334. 10.1038/s41588-018-0192-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 78.R Core Team (2024). R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing; ). [Google Scholar]
- 79.Venables W.N., and Ripley B.D. (2002). Modern Applied Statistics with S Fourth. (Springer; ). [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The iPSYCH data were securely stored and analyzed on the Danish national high-performance computing infrastructure (https://genome.au.dk/). All personal identifiers were anonymized prior to research use. The study was approved by the Danish Data Protection Agency, and, under Danish law, individual informed consent was not required. Access to iPSYCH data is governed by Danish regulations, and the iPSYCH consortium facilitates data access for qualified researchers in accordance with these legal frameworks. The code used in this project is available at https://github.com/leocob/SSD_RWE_paper.


