PLOS Genetics. 2022 Jun 21;18(6):e1010278. doi: 10.1371/journal.pgen.1010278

Missense variants causing Wiedemann-Steiner syndrome preferentially occur in the KMT2A-CXXC domain and are accurately classified using AlphaFold2

Tinna Reynisdottir 1, Kimberley Jade Anderson 1, Leandros Boukas 2,3,*, Hans Tomas Bjornsson 1,2,4,5,*
Editor: John M Greally 6
PMCID: PMC9249231  PMID: 35727845

Abstract

Wiedemann-Steiner syndrome (WDSTS) is a neurodevelopmental disorder caused by de novo variants in KMT2A, which encodes a multi-domain histone methyltransferase. To gain insight into the currently unknown pathogenesis of WDSTS, we examined the spatial distribution of likely WDSTS-causing variants across the 15 different domains of KMT2A. Compared to variants in healthy controls, WDSTS variants exhibit a 61.9-fold overrepresentation within the CXXC domain–which mediates binding to unmethylated CpGs–suggesting a major role for this domain in mediating the phenotype. In contrast, we find no significant overrepresentation within the catalytic SET domain. Corroborating these results, we find that hippocampal neurons from Kmt2a-deficient mice demonstrate disrupted histone methylation (H3K4me1 and H3K4me3) preferentially at CpG-rich regions, but this has no systematic impact on gene expression. Motivated by these results, we combine accurate prediction of the CXXC domain structure by AlphaFold2 with prior biological knowledge to develop a classification scheme for missense variants in the CXXC domain. Our classifier achieved 92.6% positive and 92.9% negative predictive value on a hold-out test set. This classification performance enabled us to subsequently perform an in silico saturation mutagenesis and classify a total of 445 variants according to their functional effects. Our results yield a novel insight into the mechanistic basis of WDSTS and provide an example of how AlphaFold2 can contribute to the in silico characterization of variant effects with very high accuracy, suggesting a paradigm potentially applicable to many other Mendelian disorders.

Author summary

Wiedemann-Steiner syndrome (WDSTS) is a neurodevelopmental pediatric disorder caused by the genetic disruption of the histone methyltransferase KMT2A. Since KMT2A has many different domains that perform different functions, we reasoned that by identifying the domains most enriched for WDSTS-causing genetic variants we would gain insights into the incompletely understood molecular pathogenesis of WDSTS. We discovered that the CXXC domain—which binds unmethylated CpGs—shows by far the greatest enrichment, suggesting that loss of the CpG-binding ability of KMT2A plays a central role in WDSTS. Next, to understand specific rules underlying the genetic disruption of the CXXC domain, we combined prior knowledge about the function/structure of the domain with 3D structure prediction by AlphaFold2 to develop an effect classifier for CXXC missense variants. We found that this classifier exhibits accurate performance, and we therefore applied it to provide classifications for any such variant that can possibly arise, in order to aid in the interpretation of such variants in the clinic. Our work provides novel insights into WDSTS and suggests a strategy for missense variant classification that can potentially be applied to many other pediatric genetic disorders.

Introduction

Wiedemann-Steiner syndrome (WDSTS, OMIM: 605130) is a Mendelian disorder of the epigenetic machinery. Its phenotypic features include intellectual disability, postnatal growth deficiency, hypertrichosis, and characteristic facial features. WDSTS is typically caused by heterozygous de novo variants in the gene encoding the histone-lysine N-methyltransferase 2A (KMT2A) [1]. KMT2A, also known as MLL/MLL1, is post-translationally cleaved into an N-terminal and a C-terminal fragment, which subsequently heterodimerize. Each of these fragments contains several annotated protein domains. The larger N-terminal fragment contains three AT-hooks, a cysteine-rich CXXC domain, four plant homeodomain (PHD) fingers, a bromodomain, and a FYR-N domain. The smaller C-terminal fragment contains a transactivation domain (TAD), a Win motif, a FYR-C domain, a SET domain, and a post-SET domain [2].

In this study, we address two questions. First, how important is the role of the different KMT2A domains in the pathogenesis of WDSTS? The answer would provide important clues into the molecular basis of the disorder, since the different domains mediate different functions. Second, what are the rules that determine how the genetic disruption of the most important domain(s) causes WDSTS? The answer would enable a systematic characterization of WDSTS-causing variants, informing basic biology as well as future clinical decision-making.

To answer the first question, we adopt an unbiased genetic approach. We examine the spatial distribution of likely WDSTS-causing missense variants across the different domains, reasoning that such variants will be enriched in the domains most critical for WDSTS pathogenesis when compared against variants in healthy individuals. To address the second question, we focus on the CXXC domain responsible for binding unmethylated CpGs [3], which we find shows by far the greatest enrichment. By leveraging the recent breakthrough in protein structure prediction from primary amino acid sequence by the deep neural network-based AlphaFold2 [4], we combine accurate 3D structure prediction with existing biological data in order to create an effect classification scheme for missense variants in the CXXC domain. After evaluating our classifier, we deploy it to perform an in silico saturation mutagenesis, examining a total of 445 variants.

Results

Preferential occurrence of missense variants likely pathogenic for Wiedemann-Steiner syndrome in the CXXC domain of KMT2A

If certain domains of a protein are critical for its function, disease-causing missense variants will tend to preferentially occur in these domains. We thus set out to explore the distribution of KMT2A missense variants (MVs) across its 15 different domains. We started by identifying 68 MVs that are likely pathogenic for WDSTS (Methods; Fig 1A). As a control, we used 1403 MVs which are present in gnomAD and are therefore not expected to cause WDSTS (Methods; Fig 1A). We tested each KMT2A domain for enrichment of WDSTS MVs relative to gnomAD MVs. We discovered a 61.9-fold enrichment in the CXXC domain (Fig 1B; Fisher’s exact test, Bonferroni-adjusted p = 1.56e-18). This far exceeded the enrichment observed at other domains, with the second most enriched domain being the fourth PHD finger (Fig 1B; 16-fold enrichment; Fisher’s exact test, Bonferroni-adjusted p = 0.043), and the first PHD finger also showing significant enrichment (Fig 1B; 6.8-fold enrichment; Fisher’s exact test, Bonferroni-adjusted p = 0.031).

Fig 1. The distribution of likely pathogenic Wiedemann-Steiner syndrome missense variants across the different domains of KMT2A.

(A) KMT2A missense variants in gnomAD (top) and WDSTS (bottom). See Methods for filtering criteria. (B) The percentage of missense variants from gnomAD (grey dots) and WDSTS (red dots) that fall in each of the different domains of KMT2A. (C) The percentage of missense variants in gnomAD (grey dots) and likely pathogenic variants (blue dots) that fall in the CXXC domain of different epigenetic regulators. (D) Multiple sequence alignment of the amino-acid sequence of the CXXC domain of KMT2A in eight eukaryotic species. Residues known to be important for DNA binding are marked with red asterisks at the top (see Methods for details). The eight zinc ion-binding cysteines are marked with red asterisks at the bottom.

To assess the robustness of our result, we repeated our analysis using a control set of 1788 KMT2A somatic variants obtained from sequencing of tumor samples (Methods). While these variants are likely a mix of driver and passenger variants, the latter are expected to comprise the majority, justifying the use of this set of variants as controls alternative to gnomAD. We recapitulated our result, with the rank ordering of the different domains with respect to their enrichment of WDSTS MVs remaining unchanged (S1A–S1C Fig); this is consistent with the pool of somatic variants containing mostly benign passenger variants. However, across all domains, the enrichment estimates are attenuated compared to the gnomAD-vs-WDSTS comparison (S1C Fig). This suggests that the same domains that contribute to WDSTS pathogenesis contribute to the tumorigenic role of KMT2A as well. With respect to this tumorigenic role, however, our result should be interpreted with caution; gain-of-function mechanisms that are likely not captured by our analysis are probably also involved, as evidenced from KMT2A translocations that act as drivers in certain types of leukemias.

Notably, the catalytic SET domain of KMT2A does not show significant enrichment for WDSTS MVs (Fig 1B; Fisher’s exact test, p = 0.899). In contrast, it shows significant, albeit weak, enrichment, when the somatic cancer variants are compared against the gnomAD controls (odds ratio = 2.87, Fisher’s exact test, p = 8.43e-05). We note here that, while the enrichment of WDSTS-causing MVs within the CXXC domain is consistent with the high conservation of its sequence (Figs 1D and S2), conservation alone cannot explain the lack of enrichment in the SET domain, since it is as conserved as the CXXC domain (S2 Fig).

The CXXC domains of other epigenetic regulators do not show enrichment for disease-causing missense variants

We next asked if the preferential occurrence of disease-causing missense variants in the CXXC domain is unique to KMT2A, or whether this is a general phenomenon across epigenetic regulators that have this domain. Apart from KMT2A, three other CXXC-domain-containing epigenetic regulators have been linked to Mendelian diseases: KMT2B (DYS28; OMIM:617284), DNMT1 (HSN1E; OMIM:614116), and TET3 (BEFAHRS; OMIM:618798). However, we found that none of these genes exhibits significant enrichment of disease-causing MVs in the CXXC domain (Fig 1C; Methods; Fisher’s exact test, DNMT1 p = 1, KMT2B p = 1, TET3 p = 1), and verified that this is not due to inadequate power (Methods).

The disruption of histone methylation in Kmt2a-deficient mice preferentially occurs at CpG-rich regions but has no systematic effect on gene expression

Since the CXXC domain mediates binding to clusters of unmethylated CpG dinucleotides [3], our domain enrichment results imply that KMT2A exerts its most important function at CpG-rich regions. Our results also imply that its most important function is not its catalytic activity as a histone methyltransferase, given the lack of MV enrichment in the SET domain. We sought to test these implications, by comparing genome-wide histone methylation (H3K4me1 and H3K4me3) and gene expression patterns in hippocampal CA (Cornu Ammonis) neurons from mice with Kmt2a knockout in excitatory neurons (Kmt2a cKO) as well as wild-type mice (using ChIP-seq and RNA-seq data; Methods) [5].

First, we found a strong relationship between the CpG-richness of a region and the probability of H3K4me1/3 disruption. CpG-rich peaks are much more likely to exhibit disruption upon Kmt2a cKO compared to CpG-poor peaks (Fig 2A; 5 wild-type vs 3 cKO mice; p<2.2e-16; Methods). Reassuringly, there is strong concordance between the changes of the two histone marks in the mutant mice; at the vast majority of promoter regions (+/-1kb from TSS) which bear both H3K4me1 and H3K4me3 peaks, both marks show decreased intensity in the mutant mice (Figs 2B and S3A). However, although regions with the strongest evidence for histone methylation disruption (1st p-value decile) most frequently correspond to promoters (39% and 86% for H3K4me1 and H3K4me3, respectively; 3.3-fold and 1.3-fold enrichment compared to regions within the 10th p-value decile, respectively; Fisher’s exact test, p<2.2e-16; S3B Fig), we found no evidence for a systematic impact of promoter H3K4me1 or H3K4me3 disruption on gene expression (5 wild-type vs 6 cKO mice; Methods). Specifically, genes whose promoters have disrupted H3K4me1 or H3K4me3 are not significantly more likely to be differentially expressed compared to genes without histone methylation disruption at their promoter (Fig 2C and 2D; p = 0.91 and p = 0.13 for H3K4me1 and H3K4me3, respectively, when testing for a shift in p-value distribution between the 1st and 10th deciles with the one-tailed Wilcoxon rank-sum test; Methods). Taken together, these results show that, at least in adult excitatory CA hippocampal neurons: a) KMT2A preferentially acts at high-CpG-density regions, and b) its catalytic activity has little effect on gene expression, suggesting that it may also not affect higher-level phenotypes.

Fig 2. The relationship between disrupted H3K4me1/3, regional observed-to-expected CpG ratio, and gene expression in Kmt2a-deficient mice.

(A) The percentage of disrupted H3K4me1 peaks (left) and H3K4me3 peaks (right), stratified based on the observed-to-expected CpG ratio of the underlying peak sequence. (B) Scatterplot of the log2 fold change of H3K4me1 peaks against the log2 fold change of H3K4me3 peaks at promoters (+/-1kb from the TSS) that harbor peaks for both marks. (C) The percentage of differentially expressed genes, stratified based on the p-value of associated promoter peaks (+/- 1kb from the TSS) from the differential H3K4me1 analysis (left) and H3K4me3 (right) analysis. (D) Scatterplot of the log2 fold change of H3K4me1 peaks (left) and H3K4me3 (right) against the log2 fold change of gene expression of the downstream gene. Each point corresponds to a gene-promoter pair. In cases where multiple peaks were present at the same promoter, the average log2 fold change was computed.

An AlphaFold2-based scheme classifies missense variants in the CXXC domain of KMT2A with high accuracy

Given our evidence for a central role of the KMT2A-CXXC domain in WDSTS, we sought to develop a variant classification scheme that would: a) label any possible variant in the CXXC domain as pathogenic or not, and b) in the case of pathogenic variants, provide a characterization of their functional effect. We reasoned that such a classifier should take into account the effect of variants on the secondary structures of the domain, the mean Coulombic electrostatic potential of the mutant domain structure, and the ability of the mutated domain to form hydrogen bonds with the DNA backbone.

To assess the feasibility of our approach, we first examined if AlphaFold2 (AF2)—which has recently enabled the determination of 3D-protein structures with experimental-level accuracy—accurately predicts the structure of the CXXC domain. We observed a highly confident prediction (Fig 3A; pLDDT>70 for 96.5% of residues and pLDDT>90 for 82.5% of residues; Methods). In addition, all features previously identified in solution and crystal structures of the domain (ID: 2J2S, 2JYI, 4NW3, 2KKF) are present in the predicted structure: a crescent overall shape, two antiparallel beta sheets at the N and C terminals, and four alpha helices (Fig 3B). While no single experimentally derived structure contains all these features, this can be explained by the domain existing in multiple conformational states. We then proceeded to derive a classification scheme, using a training set of 14 MVs with experimentally determined effects and prior biological knowledge about the domain (Figs 3C and S6A, S3 Table; Methods) [6]. The first decision rules of the scheme pertain to residues whose substitution effects are not captured by AF2 (Methods): the eight cysteines responsible for zinc ion binding, the residues that form direct hydrogen bonds with the DNA, and the residues forming the salt bridge. The rest of the scheme is then divided into two cascades, based on the AF2-predicted secondary structure of the mutant domain and its mean Coulombic electrostatic potential (Methods). Variants receive different classifications depending on their effect on secondary structure/electrostatic potential, as well as their position.

Fig 3. An AlphaFold2-based variant effect classification scheme for the CXXC domain of KMT2A.

(A) Predicted LDDT values for the CXXC domain of KMT2A. (B) The AlphaFold2-predicted and experimentally determined structures of the CXXC domain of KMT2A. (C) The variant effect classification scheme. See Methods for details on the derivation of the scheme. (D) The positive and negative predictive value that the classifier shown in (C) attained on a hold-out test set consisting of 41 missense variants (see main text and Methods for details). (E) The position of two different missense variants in the 3D structure of the CXXC domain, in conjunction with a 3D representation of the engaged DNA backbone. For comparison, the same representation is shown for the normal protein as well (PDB ID: 4NW3). The surface of the domain is color-coded based on the electrostatic potential, with red indicating a negative charge and blue a positive charge. Both variants lead to a decreased electrostatic potential in a residue important for DNA binding.

To evaluate the performance of our classifier, we used a hold-out test set consisting of 41 MVs. Of these, 18 are MVs with strong evidence for pathogenicity for WDSTS (Methods), 13 are MVs with experimentally determined effects [6,7], and 10 are MVs seen in gnomAD/TOPMed, and are thus expected to be benign (S6B Fig, S3 Table). On this test set, our classifier attained a 92.6% positive and a 92.9% negative predictive value (Figs 3D, S6B and S6C). As an example, in Fig 3E we depict two WDSTS pathogenic variants which our classifier correctly labels as compromising DNA binding because they decrease the electrostatic potential at residues important for the formation of contacts with the DNA backbone. Notably, out of the 18 WDSTS MVs present in the CXXC domain of KMT2A, nine are positioned at cysteine residues responsible for zinc ion binding compared to none of the control variants (Fisher’s exact test, p = 0.009816).
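The cysteine comparison above can be checked with a small, dependency-free sketch of a two-sided Fisher's exact test. This is an illustration, not the authors' code. The 2x2 table assumes that the control group consists of the 10 gnomAD/TOPMed variants in the test set, which the text does not state explicitly. The function name `fisher_exact_two_sided` is ours.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that x of the col1 first-column counts fall in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # A small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))


# Rows: WDSTS variants vs control variants (assumed n = 10).
# Columns: at a zinc-binding cysteine vs at any other residue.
p_value = fisher_exact_two_sided(9, 9, 0, 10)
print(f"p = {p_value:.6f}")  # p = 0.009816
```

Under this assumed table, the sketch reproduces the p-value reported in the text.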

In silico saturation mutagenesis classifies 445 variants in the CXXC domain of KMT2A

Clinical geneticists seeking to establish a diagnosis for patients are often confronted with missense variants that have not been seen before; this can make it hard to assess if they are pathogenic or not. The accuracy with which our classifier performs motivated us to perform an in silico saturation mutagenesis for the CXXC domain of KMT2A. Our goal was to create a resource that will enable rapid classification of any newly encountered missense variant in this domain, so that these classifications can be used as supporting evidence in the clinical setting. We focused on the 50/57 residues that have pLDDT>70 and high MSA coverage. In total, we assessed 450 variants (Fig 4A).

Fig 4. An in silico saturation mutagenesis of the CXXC domain of KMT2A.

(A) Heatmap depicting the predicted effect of each nucleotide substitution within the CXXC domain. (B) The percentage of variants for each type of predicted effect. (C) The distribution of the phyloP score of the nucleotides coding for the CXXC domain, stratified according to the number of substitutions predicted to have a damaging effect (unfolding, compromised DNA binding, stop-gain).

Out of these 450 variants, 92 (20.4%) are synonymous, and 27 variants (6%) lead to premature stop codons. The remaining 331 variants (331/450, 73.6%) were classified based on our variant classification scheme. 169 variants (169/331, 51.1%) were predicted to have no effect, 90 (90/331, 27.2%) variants were classified as causing compromised DNA binding, and 67 variants (67/331, 20.2%) were classified as causing unfolding of the domain. For 5 variants (1.5%), we provide no prediction (Fig 4B; S4 Table). To obtain orthogonal validation for our classifications, we examined the conservation of individual nucleotides coding for the CXXC domain stratified according to the number of substitutions predicted to be damaging, and found strong concordance; sites with a greater number of predicted damaging substitutions are more conserved (Fig 4C).
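The enumeration step of such a saturation mutagenesis can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: it only lists every single-nucleotide substitution in a coding sequence and labels it as synonymous, stop-gain, or missense using the standard genetic code. The structure-based classification of the missense variants is not reproduced. The function name `enumerate_snvs` and the toy sequence are hypothetical, and the input is assumed to be an in-frame sequence without stop codons. Note that 50 codons yield 50 × 3 × 3 = 450 substitutions, matching the total above.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def enumerate_snvs(cds):
    """Yield (position, ref, alt, consequence) for every possible SNV in cds."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    for pos, ref in enumerate(cds):
        offset = pos % 3
        codon = cds[pos - offset : pos - offset + 3]
        for alt in BASES:
            if alt == ref:
                continue
            mutant = codon[:offset] + alt + codon[offset + 1 :]
            ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[mutant]
            if alt_aa == ref_aa:
                consequence = "synonymous"
            elif alt_aa == "*":
                consequence = "stop_gain"
            else:
                consequence = "missense"
            yield pos, ref, alt, consequence


# Toy two-codon sequence (Cys-Trp), not the real KMT2A sequence.
counts = Counter(consequence for *_, consequence in enumerate_snvs("TGTTGG"))
print(dict(counts))
```

Only the variants labeled as missense would then be passed on to a structure-based classification scheme.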

Discussion

Our contribution in this work is twofold. First, we provide strong genetic evidence that the domains most important for mediating the causal role of KMT2A in Wiedemann-Steiner syndrome are the CXXC domain and–to a lesser extent–the fourth and first PHD fingers. It is noteworthy that these PHD fingers have been shown to be important for stabilizing the interaction between the N- and the C-terminal KMT2A fragments [8]; their enrichment for WDSTS-causing missense variants may thus not be attributable to their histone-binding function. We emphasize here that our domain fold-enrichment estimates are based on a relatively low number of WDSTS variants and may change once more variants are reported. However, we anticipate the rank-ordering of the different domains, and the finding that the CXXC is by far the most enriched domain, to remain unchanged.

Our ChIP- and RNA-seq results suggest that lack of KMT2A recruitment to CpG-rich locations–either by missense variants in the CXXC domain or by loss-of-function variants–is central to WDSTS pathogenesis. However, they also suggest that the phenotype is not mediated via ensuing systematic defects in histone methylation-dependent gene expression, but rather that alternative mechanisms might be at play. Such mechanisms may involve defects in polymerase loading, which has previously been shown to depend on the presence, but not the catalytic activity, of some epigenetic regulators [9]. Alternatively, KMT2A may serve as a recruiter of other regulatory factors. Our results emphasize that more research aimed at elucidating such alternative pathways is warranted, and has the potential to contribute to our understanding of WDSTS. This is also consistent with the lack of overrepresentation of WDSTS missense variants within the enzymatic SET domain of KMT2A, and stands in contrast to what was previously observed in Kabuki syndrome, a Mendelian disorder of the epigenetic machinery with considerable phenotypic overlap with WDSTS caused by variants in KMT2D, which is structurally similar to KMT2A [10]. We emphasize here, however, that our findings are based only on data from adult excitatory neurons from the CA region of the hippocampus, whereas gene expression and histone marks are known to be dynamic during development and vary between different cell types. Consequently, future work is needed to assess whether our conclusions generalize to other cell types (neuronal or not) involved in WDSTS pathogenesis, and to earlier developmental stages, when the disease process most likely initiates.

Our second contribution is the demonstration that the recent breakthrough in protein structure prediction by AlphaFold2 can be leveraged in order to classify the effects of missense variants with high accuracy. We highlight that here we use AF2 to directly predict the structure of the mutant proteins; based on these mutant structures, we then assess the effect of variants on secondary structural features and electrostatic potential. This is in contrast to recent work that only uses the wild-type structures as input to algorithms that predict the effect of variants on biophysical attributes like ΔΔG [11]. Prior to our study, it was unclear whether AlphaFold2 is capable of predicting mutant structures accurately, partly because of its use of the multiple sequence alignment. While we do not provide direct evidence, our results indirectly suggest that variant effects on local structural features can be reliably predicted. This conclusion is supported by the high concordance between our predictions and experimentally validated effects, as well as by the severe drop in performance when we use our classifier with inaccurate structure predictions as input (31% negative predictive value; S7 Fig) [12]. We do not, however, expect AlphaFold2-based classifications to be able to capture global effects of variants, such as destabilization of the entire domain.

There were three variants in our hold-out test set that were classified erroneously. One of them (from the WDSTS cohort) is in fact annotated as a variant of uncertain significance, suggesting that our classification is not necessarily inaccurate in that case, but rather the “true” label may be incorrect. The other two erroneously classified variants may yield insights into possible limitations of our method. The TOPMed variant misclassified as resulting in compromised DNA binding may reflect the inability of our classifier to place effects along a gradient of severity; in other words, this variant may indeed affect DNA binding as we predict, but only to a mild extent not enough to cause severe disease, and is thus still present among healthy individuals. On the other hand, the pathogenic variant misclassified as benign is located in the structurally important KFGG site in the distal loop of the domain. This raises the possibility that our classifier could be improved by including an earlier decision rule assessing whether a variant affects the KFGG site, as we currently only utilize this information in later steps.

Within the context of the ACMG variant interpretation guidelines, our classifier can be used as an in silico tool that provides supporting evidence of benign impact (BP4 evidence class) or of pathogenicity (PP3 evidence class). We highlight that it is the first such tool that incorporates accurate 3D protein structure prediction by AlphaFold2, and does not directly use inputs such as conservation and population frequency, which are used by most other tools. Thus, we believe the use of our classifier in combination with other tools may prove particularly powerful for predicting the effect of variants in the CXXC domain of KMT2A.

In summary, our work yields insights into the pathogenesis of Wiedemann-Steiner syndrome and presents a strategy for characterizing variant effects using AlphaFold2 that we anticipate will be broadly applicable to many other disease-relevant proteins.

Materials and methods

Missense variants

Missense variants (MVs) present in the general population were obtained from the Genome Aggregation Database (gnomAD; versions 2.1.1 and 3.1.1), which does not include individuals with severe pediatric disorders like WDSTS [13]. Somatic MVs were obtained from the Catalogue of Somatic Mutations in Cancer (COSMIC; version 94) [14]. MVs present in individuals with disease phenotypes (Wiedemann-Steiner syndrome [KMT2A], Hereditary sensory neuropathy type 1E [DNMT1], Childhood-onset dystonia [KMT2B], and Beck-Fahrner syndrome [TET3]) were obtained from ClinVar [15]. To increase our power, we chose to include ClinVar variants labeled as Pathogenic, Likely Pathogenic, or Variants of Uncertain Significance (VUS), and filtered them using the phred-like CADD scores, acquired from Ensembl Variant Effect Predictor (VEP) [16]. Specifically, we only retained MVs with a phred-like CADD score above 20 [17]. In the case of WDSTS, we obtained 11 additional variants from the Human Gene Mutation Database (HGMD) with CADD score above 20, as well as 16 MVs from Lebrun et al, Baer et al, Miyake et al, and Jones et al (all studies of individuals with a WDSTS clinical phenotype; S1 Table) [18–22]. Disease MVs that are also present in gnomAD were excluded from subsequent analyses.

To ensure that our domain enrichment estimates are not artifacts driven by the inclusion of VUS’s, we also performed the domain enrichment analysis after excluding VUS’s and obtained very similar results (S1D Fig). Without VUS’s, the enrichment in the CXXC domain is even greater, but the confidence interval around the point estimate is wider (S1D Fig; confidence interval = 49.2–442.8 without VUS’s vs 24.3–173.4 when including VUS’s). The same is true for the 1st and 4th PHD finger, which again show significant enrichment as well (S1D Fig; For PHD finger 4; confidence interval = 1.9–163.0 without VUS’s vs 2.3–97.0 when including VUS’s and for PHD finger 1; confidence interval = 1.6–35.0 without VUS’s vs 1.9–20.4 when including VUS’s). Together, these results are consistent with the notion that, by excluding VUS’s, we are increasing our signal-to-noise ratio, since we are not including any variants with uncertain pathogenicity. However, this comes at the expense of less power (reflected in the greater uncertainty around the enrichment point estimates), since this analysis is inevitably conducted using fewer variants.

Finally, since it is possible that very rare MVs in gnomAD may in fact be pathogenic, we assessed whether our result still holds when excluding low-frequency variants, and found this to be true. Using only variants with MAF greater than 10e-5 (452 variants), we found that the CXXC domain still shows the greatest enrichment (odds ratio = 159.4, Fisher’s exact test, p = 3.66e-15), followed by PHD fingers 4 and 1 (for PHD finger 4, odds ratio = inf, Fisher’s exact test, p = 0.0322; for PHD finger 1, odds ratio = 5.8, Fisher’s exact test, p = 0.122), while the SET domain shows no significant enrichment (Fisher’s exact test, p = 0.762; S1E Fig).

For the enrichment analysis in the CXXC domain of KMT2B and TET3 (Fig 1C), we tested if the observed lack of enrichment can be attributed to the low number of total counts in these genes (22 and 16, respectively). In both cases, we observe 0 MVs in the CXXC domain. In contrast, out of the MVs in gnomAD, 0.7% and 2.5% fall within the CXXC domain of KMT2B and TET3, respectively. Using these gnomAD percentages and the formula for the probability mass function of the binomial distribution, we calculated that, even if the true ratio of the percentage of disease variants falling in the CXXC domain to the corresponding percentage of gnomAD variants is three times lower than the ratio for KMT2A, the probability of observing at least one MV in the CXXC domain is 84.5% for KMT2B and 99.9% for TET3. These estimates make inadequate power an unlikely explanation for the lack of MV enrichment in the CXXC domain of KMT2B and TET3. The coordinates of protein domains were obtained using InterPro [23] (S5 Table).
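The logic of this power calculation can be sketched with the binomial distribution: if each disease variant independently falls in the domain with probability p, the chance of seeing at least one among n variants is 1 − (1 − p)^n. The sketch below is illustrative only. The function name `prob_at_least_one` and the example inputs are hypothetical and are not the values used in the paper. In particular, the enrichment ratio passed in is an assumed number.

```python
def prob_at_least_one(n_disease_variants, control_fraction, enrichment_ratio):
    """Probability of observing at least one disease variant in a domain.

    control_fraction: fraction of control (e.g. gnomAD) variants in the domain.
    enrichment_ratio: assumed ratio of the disease fraction to the control fraction.
    """
    # Per-variant probability of landing in the domain, capped at 1.
    p_in_domain = min(1.0, control_fraction * enrichment_ratio)
    # Complement of the binomial probability of zero hits in n draws.
    return 1.0 - (1.0 - p_in_domain) ** n_disease_variants


# Hypothetical inputs: 20 disease variants, 1% of control variants in the
# domain, and an assumed 10-fold enrichment among disease variants.
print(round(prob_at_least_one(20, 0.01, 10.0), 4))
```

A value close to 1 means that observing zero variants in the domain would be unlikely if the assumed enrichment were real.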

For Figs 1A and S1A, variants in KMT2A were plotted using the Mutation Mapper tool from cBioPortal [24].

Evolutionary conservation

The amino acid sequences of the CXXC domain of KMT2A orthologs in the eukaryotic species shown in Fig 1D were obtained using the STRING Database [25]. The alignment of the KMT2A orthologous proteins was performed using the Clustal Omega multiple sequence alignment tool from EMBL-EBI [26], with default parameters. The phyloP conservation scores of the SET and CXXC domain nucleotides across 100 vertebrates were obtained using the GenomicScores R package [27].

ChIP-seq analysis

Raw ChIP-seq sequencing data (fastq files containing unpaired reads) were downloaded from GSE99250 [5]. The reads were aligned to the mouse mm10 (GRCm38) reference genome with Bowtie2 using the default settings with the -U option for aligning unpaired reads, generating a sam file output for each fastq file [28]. The sam files were converted to bam files using Samtools view with the -b option, then sorted using Samtools sort with the -O BAM option for outputting bam files [29]. Peaks were called using MACS2, using a q-value threshold of 0.1 (-q option) for significant peaks, and the options “--nomodel --extsize 200” [30]. The sorted bam files and the lists of significant peaks (xls files from MACS2) were used as input to the DiffBind package in R [31]. DiffBind was then run with default settings, except for the options “fragmentSize = 0, RemoveDuplicates = TRUE, filterFun = mean, score = DBA_SCORE_READS” in the dba.count command. Differential peaks obtained in DiffBind were given genomic annotations and converted to GRanges objects using the ChIPseeker package in R [32]. Sequences of peaks were obtained using the getSeq() function from the BSgenome.Mmusculus.UCSC.mm10 package in R [33]. The observed-to-expected CpG ratio was calculated using the following formula [34]:

$$\mathrm{OE\ CpG\ ratio} = \frac{p(\mathrm{CpG})}{p(\mathrm{C})\,p(\mathrm{G})}$$

, where p(CpG) represents the proportion of CpGs in a given region (similarly for p(C) and p(G)).
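As an illustration, the ratio can be computed for a single peak sequence with the short Python sketch below. The function name is hypothetical, and the sketch assumes that all three proportions are taken relative to the sequence length, which makes it equivalent to (number of CpGs × length) / (number of Cs × number of Gs); the normalization used in the original analysis is not specified here.

```python
def oe_cpg_ratio(seq: str) -> float:
    """Observed-to-expected CpG ratio, p(CpG) / (p(C) * p(G)), for one sequence.

    Assumption: all proportions are computed relative to the sequence length.
    Returns 0.0 when the sequence has no C or no G, since no CpG can occur.
    """
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two bases")
    n_c = seq.count("C")
    n_g = seq.count("G")
    if n_c == 0 or n_g == 0:
        return 0.0
    n_cpg = seq.count("CG")  # the dinucleotide "CG" cannot overlap itself
    p_cpg = n_cpg / n
    p_c = n_c / n
    p_g = n_g / n
    return p_cpg / (p_c * p_g)


if __name__ == "__main__":
    for s in ("CGCG", "CCGG", "AAAA"):
        print(s, oe_cpg_ratio(s))
```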

RNA-seq analysis

Raw RNA-seq data (fastq files) were downloaded from GSE99250 [5]. Reads were pseudo-aligned to the mouse mm10 (GRCm38) reference transcriptome using Kallisto [35] (kallisto quant command) with the options --single for single-end reads, -l 150 as an approximation of fragment length, -s 20 as an approximation of fragment length standard deviation, and -b 100 for running 100 bootstraps. The Kallisto output was imported into R using the tximport package [36], with transcripts mapped to genes using the biomaRt package in R [37]. The tximport software was run with the options type = “Kallisto” and ignoreTxVersion = TRUE. The data were filtered to exclude genes with fewer than 10 counts. Differential expression analysis was performed using DESeq2 with default settings [38].

To evaluate the robustness and reliability of our differential expression analysis, we performed the following quality control checks. First, we examined the distribution of the resulting p-values (S4A Fig), which is consistent with a two-component mixture: one component corresponding to non-differential genes (p-values distributed uniformly between 0 and 1), and one component corresponding to differential genes (p-values concentrated close to 0). Such a distribution indicates a well-calibrated differential expression test, and the existence of true differential hits. Second, we performed a principal component analysis using the expression matrix after applying a regularized log transformation, as implemented in the rlog() function in DESeq2 with the setting “blind = TRUE”; this revealed that the mutants are separated from the wild-type in PCA space (S4B Fig). Finally, an MA plot (S4C Fig) indicates no obvious systematic biases in the expression data.

For Fig 2A and 2C, the percentage of differentially marked H3K4me1/3 peaks and differentially expressed genes were estimated from the corresponding p-value distributions using Storey’s method [39,40], as implemented in the qvalue R package. Specifically, we used the pi0est() function, with the “pi0.method” parameter set to “bootstrap”.
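The idea behind this estimate can be sketched in Python as follows. This is a simplified fixed-lambda version of Storey's estimator, not the bootstrap procedure of the qvalue package (which chooses among a grid of lambda values); the function names and the default lambda of 0.5 are assumptions made for illustration.

```python
def pi0_fixed_lambda(pvalues, lam=0.5):
    """Storey-type estimate of the proportion of true nulls at a fixed lambda.

    Null p-values are uniform on [0, 1], so the share of p-values above lam,
    rescaled by 1 / (1 - lam), estimates the null proportion pi0.
    """
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must be in [0, 1)")
    m = len(pvalues)
    if m == 0:
        raise ValueError("pvalues must not be empty")
    n_above = sum(1 for p in pvalues if p > lam)
    return min(1.0, n_above / (m * (1.0 - lam)))


def fraction_differential(pvalues, lam=0.5):
    """Estimated fraction of non-null (differential) features: 1 - pi0."""
    return 1.0 - pi0_fixed_lambda(pvalues, lam)


if __name__ == "__main__":
    toy_pvalues = [0.01, 0.02, 0.03, 0.9]  # toy values, not study data
    print(fraction_differential(toy_pvalues))
```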

Derivation of the variant effect classifier

AF2 structure predictions

Structure predictions from AF2 were generated using AF2 Colaboratory (v2.0), which does not use templates (which would be expected to bias the mutant structure towards the structure of the wild-type domain [4]). We do not provide a prediction for variants which result in a structure where: a) the pLDDT value of the beta sheets, positioned at the distal ends, drops below 70, or b) the pLDDT value drops to 85 or lower for more than two residues that are not positioned at the distal ends of the domain, or at residues that fall within a secondary structure that is predicted to be absent.

Electrostatic potential calculations

Secondary structure visualization and computation of mean Coulombic values were performed using UCSF ChimeraX (version 1.2.5) [41], with the AF2 predicted structure as input. MVs were classified as causing a change in electrostatic potential (ESP) if the mean electrostatic potential value of the mutant structure deviated by more than 0.2 (in either direction) from the wild-type domain, whose mean value is equal to 6.28.
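The ESP criterion amounts to a simple threshold test, sketched below in Python. The constant and function names are hypothetical; the wild-type mean (6.28) and the threshold (0.2) are the values stated above, and the mutant means would come from the ChimeraX calculation.

```python
WT_MEAN_ESP = 6.28          # mean Coulombic value of the wild-type domain
ESP_CHANGE_THRESHOLD = 0.2  # deviation in either direction counted as a change


def esp_changed(mutant_mean_esp: float,
                wt_mean_esp: float = WT_MEAN_ESP,
                threshold: float = ESP_CHANGE_THRESHOLD) -> bool:
    """True if the mutant mean ESP deviates from wild-type by more than threshold."""
    return abs(mutant_mean_esp - wt_mean_esp) > threshold


if __name__ == "__main__":
    # Toy mutant means, for illustration only.
    for value in (6.28, 6.40, 6.59, 5.78):
        print(value, esp_changed(value))
```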

Derivation of the decision rules defining the variant effect classification scheme

We obtained the experimentally determined effects of MVs in the CXXC domain of KMT2A from Allen et al and Cierpicki et al [6,7]. We then used 14 of these variants to form our training set. These variants were chosen to ensure a representation of different types of disruption (unfolding of the domain/defective DNA binding) within the training set. Within a group of variants causing the same type of disruption, a random subset was selected for inclusion in the training set (4 out of 8 variants causing unfolding, 6 out of 11 variants compromising DNA binding), whereas the rest were included in the test set used for evaluating the performance of our classifier (see section below). Based on the training set, as well as existing biological knowledge about the structure and function of the domain [6,7], we derived the decision rules defining our classifier as follows.

Our initial decision rule classifies variants which lead to substitutions of the zinc-ion binding cysteine residues as causing unfolding of the domain, since this was the case for all such variants in the training set (C1161A, C1173A, C1189A, D1166A, R1192A). Our next decision rule then classifies variants which lead to substitutions of amino acids responsible for forming direct hydrogen bonds with the DNA (S2 Table; Cierpicki et al [7]) to amino acids that are incapable of forming hydrogen bonds (Ala, Cys, Gly, Ile, Leu, Met, Phe, Pro, Val) as resulting in compromised DNA binding [42]; this was the case with two such variants in the training set (K1186A and Q1187A) and is consistent with the known chemistry underlying protein-DNA contacts.

We subsequently divide the classification scheme into two cascades, based on whether the variant to be classified affects the secondary structure of the domain. For the first cascade, pertaining to variants that do not cause a change in CXXC structure (as determined by AlphaFold2; see previous section), we use the change in ESP (computed using ChimeraX; see previous section) and the position of the variant to determine its effect. There were two variants (N1172A, C1188A) in our training set that were experimentally shown to have no effect on the folding of the domain or on DNA binding. We found that these variants caused neither a change in secondary structure, nor a change in ESP. Therefore, we included a decision rule classifying variants that affect neither the secondary structure nor the ESP as benign. There was also one variant (R1153A) that was experimentally shown to have no effect on folding or DNA binding, which we found caused an ESP change without a concomitant secondary structure change, but the ESP change was at a functionally non-important residue (facing away from the DNA backbone and thus unlikely to impact DNA binding). We thus decided to label variants causing no change in secondary structure and changes in ESP at functionally non-important residues as benign. By contrast, there were two variants (K1176A and R1151A) experimentally shown to compromise DNA binding, which we found caused no change in secondary structure, but caused an ESP change at a functionally important residue (that is, a residue that has been experimentally implicated in electrostatic interaction with the DNA backbone, or a residue forming hydrogen bonds with the DNA, or a residue at the structurally important KFGG site (S2 Table) [7,43,44]). Therefore, we included a decision rule classifying variants which cause no change in secondary structure but cause ESP changes at these sites as resulting in compromised DNA binding.

Finally, for the second cascade, variants that demonstrate a change in secondary structure as well as a change in ESP are classified as compromising DNA binding based on the experimentally determined effect of variants R1154A (loss of beta sheets) and D1175A (loss of an alpha helix). Additionally, similar to the prior cascade, we included a decision rule labeling variants that cause a secondary structure change, do not affect ESP, but affect functionally important residues, as resulting in compromised DNA binding, based on prior knowledge about the function of the domain. Finally, variants that cause secondary structure changes but have no impact on ESP and are not positioned at functionally important residues are classified as having no effect.
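The decision rules described in this section can be summarized as the following Python sketch. It is an illustrative restatement rather than the authors' implementation: the class, field, and label names are hypothetical, the structural and ESP features are taken as precomputed booleans, and the "benign" and "no effect" outcomes are merged into a single label.

```python
from dataclasses import dataclass

UNFOLDING = "unfolding"
DNA_BINDING = "compromised DNA binding"
NO_EFFECT = "no effect (benign)"


@dataclass(frozen=True)
class VariantFeatures:
    at_zinc_cysteine: bool             # substitutes a zinc-ion binding cysteine
    at_hbond_residue: bool             # residue forms direct hydrogen bonds with DNA
    new_residue_can_hbond: bool        # substituted amino acid can form hydrogen bonds
    secondary_structure_changed: bool  # as judged from the AF2 prediction
    esp_changed: bool                  # mean ESP deviates from wild-type
    at_functional_residue: bool        # functionally important residue (S2 Table)


def classify(v: VariantFeatures) -> str:
    # Rule 1: loss of a zinc-binding cysteine is classified as unfolding.
    if v.at_zinc_cysteine:
        return UNFOLDING
    # Rule 2: loss of a direct hydrogen bond with the DNA.
    if v.at_hbond_residue and not v.new_residue_can_hbond:
        return DNA_BINDING
    # First cascade: secondary structure unchanged.
    if not v.secondary_structure_changed:
        if v.esp_changed and v.at_functional_residue:
            return DNA_BINDING
        return NO_EFFECT
    # Second cascade: secondary structure changed.
    if v.esp_changed or v.at_functional_residue:
        return DNA_BINDING
    return NO_EFFECT


if __name__ == "__main__":
    example = VariantFeatures(False, False, True, False, True, True)
    print(classify(example))
```

Taking booleans as input keeps the rule logic separate from the upstream AlphaFold2 and ChimeraX steps, which produce the structural and ESP features.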

Test set

As described above, we included a random subset of variants from each potential type of disruption (unfolding/compromised DNA binding). In addition, we included 18 variants seen in patients with phenotypic features of WDSTS [15,18,19,21,22], and 7 and 3 variants from gnomAD and TOPMed [13,45], respectively, which are expected to be benign. With regard to the WDSTS variants, we know that they are pathogenic, but we do not know their precise damaging effect. Therefore, for these variants we are not able to assess whether the precise label that our classifier assigns (unfolding/compromised DNA binding) is true. However, we can still assess the positive predictive value and true positive rate by asking if known damaging variants (pathogenic WDSTS variants or variants with an experimentally determined detrimental effect) are correctly classified as causing some type of damage (either unfolding or disruption of DNA binding).
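For reference, the performance metrics mentioned here follow from the counts of a binary confusion matrix (damaging versus not damaging), as in the Python sketch below. The function name is hypothetical and the counts in the usage example are toy values, not the study's test-set results.

```python
def predictive_values(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Positive/negative predictive value and true positive rate.

    tp: damaging variants classified as damaging
    fp: benign variants classified as damaging
    tn: benign variants classified as having no effect
    fn: damaging variants classified as having no effect
    """
    def ratio(numerator: int, denominator: int) -> float:
        if denominator == 0:
            raise ValueError("metric undefined: denominator is zero")
        return numerator / denominator

    return {
        "ppv": ratio(tp, tp + fp),  # fraction of 'damaging' calls that are correct
        "npv": ratio(tn, tn + fn),  # fraction of 'no effect' calls that are correct
        "tpr": ratio(tp, tp + fn),  # fraction of damaging variants detected
    }


if __name__ == "__main__":
    # Toy counts for illustration only.
    print(predictive_values(tp=9, fp=1, tn=8, fn=2))
```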

Saturation mutagenesis

In our saturation mutagenesis, we chose not to provide a prediction for the single variant (R1151P) which causes major steric hindrance in the DNA binding face of the domain, since we do not have data to determine the potential impact of this variant. With respect to the MVs that were classified as disrupting the electrostatic potential, the majority caused changes approximately equal to 0.5, with the minimum being 0.31.

Supporting information

S1 Fig

(A) KMT2A missense variants in COSMIC (top) and ClinVar (bottom). (B) The percentage of missense variants in COSMIC (grey dots) and WDSTS patients (red dots) that fall in each of the different domains of KMT2A. (C) Correlation between the fold-enrichment (odds ratio) of WDSTS MVs compared to COSMIC MVs and gnomAD MVs. (D) The percentage of missense variants in gnomAD (grey dots) and WDSTS patients (red dots) that fall in each of the different domains of KMT2A, after excluding variants of uncertain significance. (E) The percentage of missense variants in gnomAD (grey dots) and WDSTS patients (red dots) that fall in each of the different domains of KMT2A, after excluding gnomAD variants with MAF<10e-5.

(PDF)

S2 Fig. PhyloP score distribution for the SET and CXXC domain nucleotides of KMT2A.

(PDF)

S3 Fig

(A) Venn diagram depicting the overlap between promoters (+/- 1kb from TSS) harboring H3K4me1 peaks and those harboring H3K4me3 peaks. (B) Genomic annotation of peaks within the 1st (yellow dots) and 10th p-value decile (purple dots) for H3K4me1 peaks (left) and H3K4me3 peaks (right).

(PDF)

S4 Fig

(A) The histogram of p-values from the differential expression RNA-seq analysis. (B) PCA plot based on the expression matrix, after a variance stabilizing transformation (see Methods). (C) MA plot of the log2 fold-change against the mean of normalized counts from the differential expression analysis. Differentially expressed genes are colored in blue.

(PDF)

S5 Fig

(A) Multiple sequence alignment depth plot and (B) predicted alignment error of KMT2A CXXC wild-type domain prediction from AlphaFold2.

(PDF)

S6 Fig

Variant classification scheme from the (A) training set and (B) test set. Variants that do not fall under the correct classification according to the scheme are underlined. (C) Confusion matrix from test set results.

(PNG)

S7 Fig. Predicted structure of the CXXC domain of KMT2A using ColabFold without using the multiple sequence alignment or templates as input.

(PDF)

S1 Table. WDSTS variant information.

(CSV)

S2 Table. Functionally important residues of the CXXC domain of KMT2A.

(PDF)

S3 Table. Variant information from training and test set.

(TXT)

S4 Table. Saturation mutagenesis results.

(CSV)

S5 Table. Amino acid coordinates of the protein domains of KMT2A.

(PDF)

Data Availability

Variant data are available in the manuscript and supplementary files. ChIP-seq and RNA-seq are available on GEO GSE99250.

Funding Statement

This work was supported by a grant from the Wiedemann-Steiner Foundation to HTB (salary coverage of TR). HTB is also supported by the Louma G. Foundation, the Icelandic Research Fund (#217988, #195835, #206806) and the Icelandic Technology Development Fund (#2010588). Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health (#R01GM121459 to LB). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Jones Wendy D., et al., De Novo Mutations in MLL Cause Wiedemann-Steiner Syndrome. The American Journal of Human Genetics, 2012. 91(2): p. 358–364. doi: 10.1016/j.ajhg.2012.06.008
  • 2. Slany R.K., The molecular biology of mixed lineage leukemia. Haematologica, 2009. 94(7): p. 984–993. doi: 10.3324/haematol.2008.002436
  • 3. Voo K.S., et al., Cloning of a mammalian transcriptional activator that binds unmethylated CpG motifs and shares a CXXC domain with DNA methyltransferase, human trithorax, and methyl-CpG binding domain protein 1. Mol Cell Biol, 2000. 20(6): p. 2108–21. doi: 10.1128/MCB.20.6.2108-2121.2000
  • 4. Jumper J., et al., Highly accurate protein structure prediction with AlphaFold. Nature, 2021. 596(7873): p. 583–589. doi: 10.1038/s41586-021-03819-2
  • 5. Kerimoglu C., et al., KMT2A and KMT2B Mediate Memory Function by Affecting Distinct Genomic Regions. Cell Reports, 2017. 20(3): p. 538–548. doi: 10.1016/j.celrep.2017.06.072
  • 6. Allen M.D., et al., Solution structure of the nonmethyl-CpG-binding CXXC domain of the leukaemia-associated MLL histone methyltransferase. The EMBO Journal, 2006. 25(19): p. 4503–4512. doi: 10.1038/sj.emboj.7601340
  • 7. Cierpicki T., et al., Structure of the MLL CXXC domain-DNA complex and its functional role in MLL-AF9 leukemia. Nature structural & molecular biology, 2010. 17(1): p. 62–68. doi: 10.1038/nsmb.1714
  • 8. Yokoyama A., et al., Proteolytically cleaved MLL subunits are susceptible to distinct degradation pathways. Journal of Cell Science, 2011. 124(13): p. 2208–2219. doi: 10.1242/jcs.080523
  • 9. Dorighi K.M., et al., Mll3 and Mll4 Facilitate Enhancer RNA Synthesis and Transcription from Promoters Independently of H3K4 Monomethylation. Mol Cell, 2017. 66(4): p. 568–576.e4. doi: 10.1016/j.molcel.2017.04.018
  • 10. Faundes V., et al., A comparative analysis of KMT2D missense variants in Kabuki syndrome, cancers and the general population. Journal of Human Genetics, 2019. 64(2): p. 161–170. doi: 10.1038/s10038-018-0536-6
  • 11. Akdel M., et al., A structural biology community assessment of AlphaFold 2 applications. bioRxiv, 2021: p. 2021.09.26.461876.
  • 12. Mirdita M., et al., ColabFold—Making protein folding accessible to all. bioRxiv, 2022: p. 2021.08.15.456425.
  • 13. Karczewski K.J., et al., The mutational constraint spectrum quantified from variation in 141,456 humans. Nature, 2020. 581(7809): p. 434–443. doi: 10.1038/s41586-020-2308-7
  • 14. Tate J.G., et al., COSMIC: the Catalogue Of Somatic Mutations In Cancer. Nucleic Acids Research, 2019. 47(D1): p. D941–D947. doi: 10.1093/nar/gky1015
  • 15. Landrum M.J., et al., ClinVar: improving access to variant interpretations and supporting evidence. Nucleic acids research, 2018. 46(D1): p. D1062–D1067. doi: 10.1093/nar/gkx1153
  • 16. McLaren W., et al., Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics (Oxford, England), 2010. 26(16): p. 2069–2070. doi: 10.1093/bioinformatics/btq330
  • 17. Kircher M., et al., A general framework for estimating the relative pathogenicity of human genetic variants. Nature Genetics, 2014. 46(3): p. 310–315. doi: 10.1038/ng.2892
  • 18. Lebrun N., et al., Molecular and cellular issues of KMT2A variants involved in Wiedemann-Steiner syndrome. Eur J Hum Genet, 2018. 26(1): p. 107–116. doi: 10.1038/s41431-017-0033-y
  • 19. Baer S., et al., Wiedemann-Steiner syndrome as a major cause of syndromic intellectual disability: A study of 33 French cases. Clin Genet, 2018. 94(1): p. 141–152. doi: 10.1111/cge.13254
  • 20. Miyake N., et al., Delineation of clinical features in Wiedemann–Steiner syndrome caused by KMT2A mutations. Clinical Genetics, 2016. 89(1): p. 115–119. doi: 10.1111/cge.12586
  • 21. Jones W.D., Genetic and phenotypic investigations into developmental disorders, in Wellcome Trust Sanger Institute. 2017, University of Cambridge: Newnham College.
  • 22. Stenson P.D., et al., Human Gene Mutation Database (HGMD): 2003 update. Hum Mutat, 2003. 21(6): p. 577–81. doi: 10.1002/humu.10212
  • 23. Blum M., et al., The InterPro protein families and domains database: 20 years on. Nucleic acids research, 2021. 49(D1): p. D344–D354. doi: 10.1093/nar/gkaa977
  • 24. Cerami E., et al., The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer discovery, 2012. 2(5): p. 401–404. doi: 10.1158/2159-8290.CD-12-0095
  • 25. Jensen L.J., et al., STRING 8—a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Res, 2009. 37(Database issue): p. D412–6. doi: 10.1093/nar/gkn760
  • 26. Madeira F., et al., The EMBL-EBI search and sequence analysis tools APIs in 2019. Nucleic acids research, 2019. 47(W1): p. W636–W641. doi: 10.1093/nar/gkz268
  • 27. Puigdevall P., Castelo R., GenomicScores: seamless access to genomewide position-specific scores from R and Bioconductor. Bioinformatics, 2018. 34(18): p. 3208–3210. doi: 10.1093/bioinformatics/bty311
  • 28. Langmead B., Salzberg S.L., Fast gapped-read alignment with Bowtie 2. Nature Methods, 2012. 9(4): p. 357–359. doi: 10.1038/nmeth.1923
  • 29. Li H., et al., The Sequence Alignment/Map format and SAMtools. Bioinformatics, 2009. 25(16): p. 2078–9. doi: 10.1093/bioinformatics/btp352
  • 30. Zhang Y., et al., Model-based Analysis of ChIP-Seq (MACS). Genome Biology, 2008. 9(9): p. R137. doi: 10.1186/gb-2008-9-9-r137
  • 31. Stark R., Brown G., DiffBind: differential binding analysis of ChIP-Seq peak data. 2011.
  • 32. Yu G., Wang L.-G., He Q.-Y., ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization. Bioinformatics, 2015. 31(14): p. 2382–2383. doi: 10.1093/bioinformatics/btv145
  • 33. TD T., BSgenome.Mmusculus.UCSC.mm10: Full genome sequences for Mus musculus (UCSC version mm10, based on GRCm38.p6), in R package version 1.4.3. 2021.
  • 34. Gardiner-Garden M., Frommer M., CpG islands in vertebrate genomes. J Mol Biol, 1987. 196(2): p. 261–82. doi: 10.1016/0022-2836(87)90689-9
  • 35. Bray N.L., et al., Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology, 2016. 34(5): p. 525–527. doi: 10.1038/nbt.3519
  • 36. Soneson C., Love M.I., Robinson M.D., Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Res, 2015. 4: p. 1521. doi: 10.12688/f1000research.7563.2
  • 37. Durinck S., et al., Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nature Protocols, 2009. 4(8): p. 1184–1191. doi: 10.1038/nprot.2009.97
  • 38. Love M.I., Huber W., Anders S., Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 2014. 15(12): p. 550. doi: 10.1186/s13059-014-0550-8
  • 39. Storey J.D., A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2002. 64(3): p. 479–498.
  • 40. Storey J.D., Tibshirani R., Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences, 2003. 100(16): p. 9440–9445. doi: 10.1073/pnas.1530509100
  • 41. Pettersen E.F., et al., UCSF ChimeraX: Structure visualization for researchers, educators, and developers. Protein science: a publication of the Protein Society, 2021. 30(1): p. 70–82.
  • 42. Hubbard R.E., Kamran Haider M., Hydrogen Bonds in Proteins: Role and Strength. 2010.
  • 43. Frauer C., et al., Different Binding Properties and Function of CXXC Zinc Finger Domains in Dnmt1 and Tet1. PLOS ONE, 2011. 6(2): p. e16627. doi: 10.1371/journal.pone.0016627
  • 44. Ayton P.M., Chen E.H., Cleary M.L., Binding to nonmethylated CpG DNA is essential for target recognition, transactivation, and myeloid transformation by an MLL oncoprotein. Molecular and cellular biology, 2004. 24(23): p. 10470–10478. doi: 10.1128/MCB.24.23.10470-10478.2004
  • 45. Taliun D., et al., Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature, 2021. 590(7845): p. 290–299. doi: 10.1038/s41586-021-03205-y

Decision Letter 0

John M Greally

15 Mar 2022

Dear Dr Bjornsson,

Thank you very much for submitting your Research Article entitled 'Missense variants causing Wiedemann-Steiner syndrome preferentially occur in the KMT2A-CXXC domain and are accurately classified using AlphaFold2' to PLOS Genetics.

The manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the current manuscript. Based on the reviews, we will not be able to accept this version of the manuscript, but we would be willing to review a much-revised version. We cannot, of course, promise publication at that time.

Should you decide to revise the manuscript for further consideration here, your revisions should address the specific points made by each reviewer. We will also require a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

If you decide to revise the manuscript for further consideration at PLOS Genetics, please aim to resubmit within the next 60 days, unless it will take extra time to address the concerns of the reviewers, in which case we would appreciate an expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments are included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool.  PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process.

To resubmit, use the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

[LINK]

We are sorry that we cannot be more positive about your manuscript at this stage. Please do not hesitate to contact us if you have any concerns or questions.

Yours sincerely,

John M. Greally, D.Med., Ph.D.

Section Editor: Epigenetics

PLOS Genetics

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The work is well designed and it gives evidence from different viewpoints on the relevance of the CXXC domain on the Wiedemann Steiner Syndrome (WSS) pathogenesis. However, it needs some further explanations in order to be published:

Major comments:

- COSMIC variants should be used cautiously as evidence of the enrichment of missense variants (MV). As the authors may know, the translocation of KMT2A, which is part of the pathogenesis of certain leukemias, causes a kind of gain-of-function in KMT2A. It is likely that some of these somatic MVs have a similar mechanism and are therefore not applicable to the WSS mechanism.

- Albeit ClinVar is a good source of MVs detected in clinical settings, few variants have been extensively curated in terms of interpretation. Considering this, the exclusion of Human Gene Mutation Database (HGMD) as source of confident pathogenic variants must be explained.

- The inclusion and exclusion of VUSes in the initial analyses seems a bit confusing. It seems logical to use them when AlphaFold is applied for reinterpretation, not for generating the evidence of the relevance of the CXXC domain in WSS.

- Considering recent evidence of the incidence of WSS in newborns (~1/8600 according to Brain. 2020 Apr; 143(4): 1099–1105) and the cutoffs proposed by Whiffin et al. 2017, the MAF cutoff for controls seems too low. Also, it has been demonstrated that genes involved in epigenetic machinery (like KMT2A) are susceptible to clonal hematopoiesis of indeterminate potential (CHIP) and therefore, very-rare control MV may still be pathogenic. These facts are especially crucial to consider, especially when SET domain was found not to be enriched in this work, which may change if another cutoff is considered.

- Although the AlphaFold predicted the best structure of CXXC domain, it would be very graphical and reader-friendly to depict some pathogenic MV detected in silico into the 4nw3 structure, and how they affect the interaction with DNA (which was also determined in that structure).

Minor Comments:

- Please in supp table S4 describe the exact variants following the HGVS nomenclature and their two interpretations (pre- and post- AlphaFold analysis).

- In the Methods section, "Missense variants" subsection, write the genes involved in each of the phenotypes "(Wiedemann-Steiner syndrome, Hereditary sensory neuropathy type 1E, Childhood-onset dystonia, and Beck-Fahrner syndrome)"

- In Figure 3B, please mention the pdb codes of the different structures.

Reviewer #2: Summary

This study examined variants in KMT2A related to Wiedemann-Steiner syndrome to seek insight into the pathogenesis of the syndrome. First, they identified the locations of pathogenic variants in KMT2A and found them overrepresented in the CXXC domain. This pattern of variant locations was found to be unique to KMT2A, as compared to three other genes which cause Mendelian diseases and did not have pathogenic variants overrepresented in their CXXC domains. To identify the biological impact of KMT2A mutations the authors analyzed previously published data to determine the effect of KMT2A loss on H3K4me3 binding (ChIP-seq) and gene expression (RNA-seq) in WT and KMT2A conditional KO mouse hippocampal neurons. They found that, while H3K4me3 was most disrupted in CpG-rich regions (matching the known function of KMT2A of binding unmethylated CpGs), including at promoters, the expression of nearby genes was not affected by the H3K4me3 loss. Lastly, they used the AlphaFold2 software in conjunction with variants with experimentally verified or in silico predicted pathogenicity to develop an AlphaFold2-based variant classification system with over 90% accuracy, then applied it to most variants in the CXXC domain of KMT2A.

Strengths

• This study has clear formulations of its goals: how important is the role of the different KMT2A domains in WSS pathogenesis and what are rules that determine how the variants – genetic disruption – in these domains causes WSS. To achieve the stated goals it uses simple, quite clear albeit novel methodology in the second goal by linking a genetic change to the protein function through the in-silico modeling of the protein folding.

• By comparing disease causing variant fraction to the control non-pathogenic variant fraction in the KMT2A domains the study identifies the CXXC domain as having the most significant role among other domains in WSS pathogenesis.

• Using KMT2A-deficient mice to show that H3K4me1 disturbances preferentially occur at CpG-rich regions, but has no systematic effect on gene expression (via ChIP and RNA-seq)

Major Weaknesses

• It is recommended to characterize the conservation properties of the domain across species in the orthologous genes. That would provide stronger evidence of the importance of CXXC domain and its intolerance to perturbations. This is missing in the study as well as the genomic coordinates and identities of the transcripts (or a single major transcript) for which the missense variants were collected.

• In the methods part the validation of the proposed classification scheme is poorly described. How exactly the training of the rule based system was performed using the 14 variants for training to obtain the rules shown in Figure 3 C ? How many splits of the data were performed in hold-out cross-validation? Just one or more? The authors should provide a crosstabulation – a confusion matrix- for validation and for the training data along with the accuracy estimates in Figure 3 D. That would help to understand the class balance in the subset of variants used for derivation of the rules.

• What is the rationale of the in silico saturation mutagenesis of 445 variants? It is not clear whether these variants were created artificially and then classified or whether they were taken from existing resources. What is the rationale of this analysis, and how is it useful for KMT2A variant interpretation?

• The ChIP-seq and RNA-seq data were downloaded from paper that uses a mouse model that deletes KMT2A conditionally in adult excitatory forebrain neurons and hippocampal CA. Therefore, it is hard to make firm conclusions around the impact of loss of KMT2A on gene expression in WDSTS as it is a neurodevelopmental disorder and therefore KMT2A may have different targets and effects on chromatin and gene expression during development rather than in the adult brain that has already developed. Conclusions around ChIP-seq and RNA-seq data regarding deposition of histone methylation and gene expression are generalized; authors should address temporal limitation of their mouse model. Additionally, the mouse model used is only KO in neurons. This should also be considered as a limitation as gene expression is different in different brain cell types and loss of KMT2A in these other cell types may also contribute to pathology (either cell autonomously or non- autonomously, meaning loss in other cells impacts neurons as well)

• The source of the ChIP-seq and RNA-seq data (reference 4) includes ChIP-seq for H3K4me3 along with H3K4me1. Is there a particular reason only the H3K4me1 data was used?

• The authors focused their RNA-seq analysis on genes in relation to H3K4me1 and found no systematic impact of H3K4me1 disruption on gene expression. As a positive control for the analytical approach used in this study, the authors should first confirm that their analytical pipeline can replicate (within reason, expecting some variation due to differences in analytical pipelines) the previously published data (reference 4), which found “471 genes down- and 225 genes upregulated in Kmt2a cKO”.
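
To make concrete the question raised above about whether the saturation mutagenesis variants were "created artificially", the sketch below shows what an in silico saturation mutagenesis generally involves: every possible single amino acid substitution is enumerated over a sequence before being classified. This is an illustration under stated assumptions only. The function name, the toy sequence, and the start position are hypothetical, and the sketch does not attempt to reproduce how the authors arrived at their 445 variants.

```python
# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def saturation_missense(sequence, start=1):
    """List every single amino acid substitution across a protein segment.

    sequence: one-letter amino acid string for the segment (hypothetical input).
    start: protein coordinate of the first residue in `sequence`.
    """
    variants = []
    for offset, ref in enumerate(sequence):
        position = start + offset
        for alt in AMINO_ACIDS:
            if alt != ref:  # skip the synonymous "substitution"
                variants.append(f"p.{ref}{position}{alt}")
    return variants


# Toy three-residue segment; NOT the KMT2A CXXC domain sequence.
toy = saturation_missense("CKC", start=10)
print(len(toy), toy[:3])
```

Each residue yields 19 substitutions, so the list grows linearly with segment length; restricting the list to substitutions reachable by a single nucleotide change would give a smaller set.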

Minor Weaknesses / Corrections

• It is not clear what Figure 4 represents. It would be clearer if the CXXC domain were represented as a track in the genome browser with the variant classifications overlaid.

• It would be helpful if numerical summaries of the features used to derive the classification scheme were provided along with the training and testing data.

• It would be helpful if the proposed scheme were discussed in view of the ACMG variant interpretation guidelines. Which evidence class could this method potentially support?

• Suggest adding a sentence describing AlphaFold2 in the introduction, or a reference to guide the reader.

• The OMIM abbreviation of the syndrome is WDSTS

• From the hold-out test set on the classifier, can you elaborate on the variants that were classified incorrectly? Is there rationale to explain the misclassifications or can you comment on how these misclassified variants may highlight a limitation of this classification scheme?

• The authors should expand on their discussion at lines 187-191 to explain how their results suggest that lack of KMT2A is both “central to WSS pathogenesis” but also that “the phenotype is not mediated via ensuing defects in the deposition of the histone methylation”, in the context of what “alternative mechanisms might be at play”.

• The first reference for AlphaFold2 (ref 36) is given at line 289 in the Methods but should be given earlier such as at line 135 in the Results.

• In Supplemental Table 4, add two columns to indicate the nucleotide position and amino acid position for each row/variant. In the text or table legend, specify the transcript and protein accession numbers used for the analysis.

• The authors mention positive and negative prediction values for the classifier. This is confusing and can be misleading, as the term “prediction value” is also often used to mean the predicted values themselves, as in a regression model for instance. I assume they mean the true positive and true negative rates (sensitivity and specificity) based on the classifier test results. It would be better if they could clarify this.
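
The terminology question in the last point can be made concrete from a confusion matrix. Positive and negative predictive value are computed over the classifier's calls, whereas sensitivity and specificity are computed over the true classes, so the two pairs are in general not interchangeable. The sketch below is illustrative only: the function name and the counts are hypothetical and are not the manuscript's test-set results.

```python
def classifier_metrics(tp, fp, tn, fn):
    """Summary metrics from the four cells of a 2x2 confusion matrix.

    tp/fp: variants called pathogenic that truly are / are not pathogenic.
    tn/fn: variants called benign that truly are / are not benign.
    """
    return {
        # Conditioned on the classifier's call:
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
        # Conditioned on the true class:
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


# Hypothetical counts chosen only to show that the pairs differ.
for name, value in classifier_metrics(tp=8, fp=2, tn=9, fn=1).items():
    print(f"{name}: {value:.3f}")
```

Reporting the full confusion matrix alongside any of these summary values lets readers derive whichever metric they need and see the class balance directly.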

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: No: There is no full list of variants tested in silico, which would be really useful for diagnostic laboratories.

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Victor Faundes

Reviewer #2: No

Decision Letter 1

John M Greally

27 May 2022

Dear Dr Bjornsson,

We are pleased to inform you that your manuscript entitled "Missense variants causing Wiedemann-Steiner syndrome preferentially occur in the KMT2A-CXXC domain and are accurately classified using AlphaFold2" has been editorially accepted for publication in PLOS Genetics. Congratulations!

Before your submission can be formally accepted and sent to production you will need to complete our formatting changes, which you will receive in a follow up email. Please be aware that it may take several days for you to receive this email; during this time no action is required by you. Please note: the accept date on your published article will reflect the date of this provisional acceptance, but your manuscript will not be scheduled for publication until the required changes have been made.

Once your paper is formally accepted, an uncorrected proof of your manuscript will be published online ahead of the final version, unless you’ve already opted out via the online submission form. If, for any reason, you do not want an earlier version of your manuscript published online or are unsure if you have already indicated as such, please let the journal staff know immediately at plosgenetics@plos.org.

In the meantime, please log into Editorial Manager at https://www.editorialmanager.com/pgenetics/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production and billing process. Note that PLOS requires an ORCID iD for all corresponding authors. Therefore, please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field.  This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager.

If you have a press-related query, or would like to know about making your underlying data available (as you will be aware, this is required for publication), please see the end of this email. If your institution or institutions have a press office, please notify them about your upcoming article at this point, to enable them to help maximise its impact. Inform journal staff as soon as possible if you are preparing a press release for your article and need a publication date.

Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Genetics!

Yours sincerely,

John M. Greally, D.Med., Ph.D.

Section Editor: Epigenetics

PLOS Genetics

www.plosgenetics.org

Twitter: @PLOSGenetics

----------------------------------------------------

Comments from the reviewers (if applicable):

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: In my opinion, the issues raised were addressed accordingly.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Victor Faundes

----------------------------------------------------

Data Deposition

If you have submitted a Research Article or Front Matter that has associated data that are not suitable for deposition in a subject-specific public repository (such as GenBank or ArrayExpress), one way to make that data available is to deposit it in the Dryad Digital Repository. As you may recall, we ask all authors to agree to make data available; this is one way to achieve that. A full list of recommended repositories can be found on our website.

The following link will take you to the Dryad record for your article, so you won't have to re‐enter its bibliographic information, and can upload your files directly: 

http://datadryad.org/submit?journalID=pgenetics&manu=PGENETICS-D-22-00064R1

More information about depositing data in Dryad is available at http://www.datadryad.org/depositing. If you experience any difficulties in submitting your data, please contact help@datadryad.org for support.

Additionally, please be aware that our data availability policy requires that all numerical data underlying display items are included with the submission, and you will need to provide this before we can formally accept your manuscript, if not already present.

----------------------------------------------------

Press Queries

If you or your institution will be preparing press materials for this manuscript, or if you need to know your paper's publication date for media purposes, please inform the journal staff as soon as possible so that your submission can be scheduled accordingly. Your manuscript will remain under a strict press embargo until the publication date and time. This means an early version of your manuscript will not be published ahead of your final version. PLOS Genetics may also choose to issue a press release for your article. If there's anything the journal should know or you'd like more information, please get in touch via plosgenetics@plos.org.

Acceptance letter

John M Greally

16 Jun 2022

PGENETICS-D-22-00064R1

Missense variants causing Wiedemann-Steiner syndrome preferentially occur in the KMT2A-CXXC domain and are accurately classified using AlphaFold2

Dear Dr Bjornsson,

We are pleased to inform you that your manuscript entitled "Missense variants causing Wiedemann-Steiner syndrome preferentially occur in the KMT2A-CXXC domain and are accurately classified using AlphaFold2" has been formally accepted for publication in PLOS Genetics! Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out or your manuscript is a front-matter piece, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Genetics and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Agnes Pap

PLOS Genetics

On behalf of:

The PLOS Genetics Team

Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom

plosgenetics@plos.org | +44 (0) 1223-442823

plosgenetics.org | Twitter: @PLOSGenetics

Associated Data

    Supplementary Materials

    S1 Fig

    (A) KMT2A missense variants in COSMIC (top) and ClinVar (bottom). (B) The percentage of missense variants in COSMIC (grey dots) and WDSTS patients (red dots) that fall in each of the different domains of KMT2A. (C) Correlation between the fold-enrichment (odds ratio) of WDSTS MVs compared to COSMIC MVs and gnomAD MVs. (D) The percentage of missense variants in gnomAD (grey dots) and WDSTS patients (red dots) that fall in each of the different domains of KMT2A, after excluding variants of uncertain significance. (E) The percentage of missense variants in gnomAD (grey dots) and WDSTS patients (red dots) that fall in each of the different domains of KMT2A, after excluding gnomAD variants with MAF<10e-5.

    (PDF)

    S2 Fig. PhyloP score distribution for the SET and CXXC domain nucleotides of KMT2A.

    (PDF)

    S3 Fig

    (A) Venn diagram depicting the overlap between promoters (+/- 1kb from TSS) harboring H3K4me1 peaks and those harboring H3K4me3 peaks. (B) Genomic annotation of peaks within the 1st (yellow dots) and 10th p-value decile (purple dots) for H3K4me1 peaks (left) and H3K4me3 peaks (right).

    (PDF)

    S4 Fig

    (A) The histogram of p-values from the differential expression RNA-seq analysis. (B) PCA plot based on the expression matrix, after a variance stabilizing transformation (see Methods). (C) MA plot of the log2 fold-change against the mean of normalized counts from the differential expression analysis. Differentially expressed genes are colored in blue.

    (PDF)

    S5 Fig

    (A) Multiple sequence alignment depth plot and (B) predicted alignment error of KMT2A CXXC wild-type domain prediction from AlphaFold2.

    (PDF)

    S6 Fig

    Variant classification scheme from the (A) training set and (B) test set. Variants that do not fall under the correct classification according to the scheme are underlined. (C) Confusion matrix from test set results.

    (PNG)

    S7 Fig. Predicted structure of the CXXC domain of KMT2A using ColabFold without using the multiple sequence alignment or templates as input.

    (PDF)

    S1 Table. WDSTS variant information.

    (CSV)

    S2 Table. Functionally important residues of the CXXC domain of KMT2A.

    (PDF)

    S3 Table. Variant information from training and test set.

    (TXT)

    S4 Table. Saturation mutagenesis results.

    (CSV)

    S5 Table. Amino acid coordinates of the protein domains of KMT2A.

    (PDF)

    Attachment

    Submitted filename: PLOS_Responses_.docx

    Data Availability Statement

    Variant data are available in the manuscript and supplementary files. ChIP-seq and RNA-seq are available on GEO GSE99250.


    Articles from PLoS Genetics are provided here courtesy of PLOS