Keywords: population variability, disease-related variants, constrained coding regions, protein functional features, liquid-liquid phase separation
Highlights
• CCRs are based on human conservation and complement inter-species conservation.
• CCRs assist in variant interpretation; here we mapped them onto protein sites.
• The most constrained coding sites correspond to protein sites in interactions.
• These interactions include those with DNA/RNA, proteins and in catalytic active sites.
• Those driving LLPS, in LIPs and in disorder–order transitions are also highly constrained.
Abstract
Constrained Coding Regions (CCRs) in the human genome have been derived from DNA sequencing data of large cohorts of healthy control populations, available in the Genome Aggregation Database (gnomAD) [1]. They identify regions depleted of protein-changing variants and thus identify segments of the genome that have been constrained during human evolution. By mapping these DNA-defined regions from genomic coordinates onto the corresponding protein positions and combining this information with protein annotations, we have explored the distribution of CCRs and compared their co-occurrence with different protein functional features, previously annotated at the amino acid level in public databases.
As expected, our results reveal that functional amino acids involved in interactions with DNA/RNA, protein–protein contacts and catalytic sites are the protein features most likely to be highly constrained for variation in the control population. More surprisingly, we also found that linear motifs, linear interacting peptides (LIPs), disorder–order transitions upon binding with other protein partners and liquid–liquid phase separating (LLPS) regions are also strongly associated with high constraint for variability. We also compared intra-species constraints in the human CCRs with inter-species conservation and functional residues to explore how such CCRs may contribute to the analysis of protein variants. As has been previously observed, CCRs are only weakly correlated with conservation, suggesting that intraspecies constraints complement interspecies conservation and can provide more information to interpret variant effects.
Introduction
Predicting the impact of variants on protein function has traditionally been based on combining information derived from protein sequences, inter-species conservation and knowledge of the structure and function of the protein. With the emergence of multiple human genome sequences from different populations, observed variation patterns in humans provide an orthogonal source of information for assessing the impact of variants. The comprehensive catalogues of genetic variations, compiled from many human population sequencing projects, have fuelled the development of different metrics that measure the general tolerance of genes to variation. Metrics like the probability of being loss-of-function intolerant (pLI) and missense Z-scores are extensively used to prioritise genes in genome interpretation of individuals2. However, it is well established that ‘all parts of a protein are not equal’. The modular presence of different domains and folds can endow different regions with different functions and structural constraints3, 4. Hence, one would expect that regions which are extremely important for the function of a protein would be depleted of protein changing variants in healthy individuals.
In 2019 Havrilla et al. defined the Constrained Coding Regions (CCRs) as regions in the human coding genome where the following protein-changing variants were depleted: missense, stop gained, stop lost, start lost, frameshift variant, initiator codon variant, rare amino acid variant, protein altering variant, inframe insertion, inframe deletion, and splice donor variant or splice acceptor variant when affecting the protein sequence. These regions were identified in whole exome and genome sequencing data from large cohorts of healthy control populations from the Genome Aggregation Database (gnomAD2.0.1), which is currently the largest and most widely used publicly available collection of data on population variation from harmonised sequencing data2, 5. The premise of Havrilla et al. was that the data in gnomAD2.0.1 came from individuals who were either healthy or did not have early onset developmental abnormalities (‘healthy control populations’) and therefore their variant loci could be considered as ‘tolerated’.
In the CCRs model, each of the variant-depleted (constrained) regions is weighted based on i) its length in base pairs, and ii) the fraction of individuals (above 50% of total individuals) having at least a 10x sequencing coverage at each bp of the region. A linear regression is then calculated comparing the weights and the CpG density of the regions, as an indicator of the region mutability upon spontaneous deamination of methylated cytosines. Regions with a greater weighted distance between protein-changing variants than expected based upon their CpG density (residual from the linear regression), are predicted to be under the greatest constraint. The residuals of the regression are ranked in CCRs percentiles (CCRpct) from 0 to 100, with 0 signifying unconstrained (i.e. having ‘tolerated’ variants in gnomAD) and 100 being the most highly constrained regions. Put simply, the longer a constrained region and the larger its CpG content, in general, the higher its CCRpct will be. Havrilla et al. observed that only ∼1% of the highly constrained regions were found to be enriched with known pathogenic variants and associated with developmental disorders, and that 72% of the genes harbouring a CCR in the 99th percentile or higher were not linked yet to any disease, suggesting that CCRs could be used to reveal regions of protein coding genes that are likely to be under potentially purifying selection.
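The scoring idea described above can be illustrated with a toy calculation. This is a minimal sketch and not the pipeline of Havrilla et al.: it assumes, purely for illustration, that the weight of a region is the product of its length and its coverage fraction, fits an ordinary least-squares line of weight against CpG density, and ranks the residuals into percentiles. The function names and the example regions are hypothetical.

```python
# Toy illustration of the CCR scoring idea; NOT the Havrilla et al. pipeline.
# Assumption: weight = length_bp * coverage_frac (a hypothetical combination).

def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def ccr_percentiles(regions):
    """Rank regions by regression residual and scale the ranks to 0-100.

    regions: list of dicts with 'length_bp', 'coverage_frac', 'cpg_density'.
    Ties are not handled; in the real model percentile 0 is reserved for
    sites that carry variants, which this sketch does not reproduce.
    """
    weights = [r["length_bp"] * r["coverage_frac"] for r in regions]
    cpg = [r["cpg_density"] for r in regions]
    a, b = fit_line(cpg, weights)
    residuals = [w - (a + b * x) for w, x in zip(weights, cpg)]
    order = sorted(range(len(regions)), key=lambda i: residuals[i])
    pct = [0.0] * len(regions)
    for rank, i in enumerate(order):
        pct[i] = 100.0 * rank / (len(regions) - 1)
    return pct

# Hypothetical regions: the long, well-covered one should rank highest.
toy_regions = [
    {"length_bp": 30, "coverage_frac": 1.0, "cpg_density": 0.02},
    {"length_bp": 300, "coverage_frac": 0.9, "cpg_density": 0.02},
    {"length_bp": 60, "coverage_frac": 0.8, "cpg_density": 0.10},
    {"length_bp": 90, "coverage_frac": 1.0, "cpg_density": 0.05},
]
toy_pct = ccr_percentiles(toy_regions)
```

In this toy set, the 300 bp region is much longer than expected for its CpG density, so it has the largest residual and receives the top percentile.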
Given their relevance, it has been proposed that analysing the presence of regions like these can complement the classical procedures of phylogenetic conservation, amino acid substitution scores, and three-dimensional protein structural characterization and aid in the process of variant interpretation1, 6. Although recent studies have used CCRpct as an extra score for assessing the pathogenicity of variants7, 8, 9, 10, 11, 12, 13, there has been no large-scale attempt to map the distribution of these constrained genomic regions to amino acids and to analyse their co-occurrence with different protein functional features.
The human proteome is a continuum where proteins can be fully ordered, intrinsically disordered (ID) or flexible/mobile, have a mixture of folded and ID regions or even undergo transitions between both states upon binding with other proteins14, 15. These ID proteins and regions can perform important and diverse functions in the cell, from displaying sites for post-translational modifications (PTMs) to assembling molecular complexes that promote liquid–liquid phase separation (LLPS) and the formation of membraneless organelles, amongst others16, 17, 18. Disease-causing mutations can occur in both ordered and ID regions19, 20, and recently the focus has turned towards variants predisposing to disease in LLPS regions, mostly related to autism spectrum disorders (ASD), cancer, neurodegeneration, and infectious diseases21, 22, 23. However, ID regions are usually not well conserved and lack a stable three-dimensional structure, their sequence alignments have poor accuracy, and most studies and tools focus on ordered regions, making it a challenge to interpret the molecular mechanism behind disease-related variants in these regions. Hence, observing CCRs in these regions may provide some insight into their constraints during human evolution.
Herein we map the CCRs onto their corresponding protein sequences and 3D structure, by assigning the CCRpct to each amino acid site (residue) spanned by each CCR. This process has the potential to highlight key functional amino acids in both ordered and disordered proteins, lying in regions of the protein which are strongly constrained. We explore the distribution of these regions across human proteins and compare their co-occurrence with different protein functional features annotated at the amino acid level. We then perform an enrichment test of Gene Ontology (GO) terms to explore which protein classes and cellular pathways are more frequently associated with genes harbouring regions with high CCRpct.
Results
Mapping the CCRs to amino acids
Our aim was to explore how CCRs are distributed across the human canonical protein sequences as defined by UniProtKB/Swiss-Prot24. For this purpose, we first ran the CCRs model pipeline (available in the repository accompanying the work of Havrilla et al., 2019: https://quinlan-lab.github.io/ccr/examples/updates) to obtain the genomic coordinates of the CCRs, but using the gnomAD3.0 dataset of variants, which aggregates 76,156 whole genomes using coordinates from the human GRCh38 genome assembly. The resulting file with the genomic coordinates of the CCRs can be obtained from our GitHub repository (https://github.com/marciaah/CCRStoAAC/blob/main/data/rawCCRs/gnomad3_0/vep101/sort_weightedresiduals-cpg-synonymous-novariant.txt.gz). Then, we mapped the genomic coordinates of the CCRs to amino acids in UniProtKB proteins, via the Ensembl transcripts in the GENCODE basic set (see Figure 9(A) for further details). The corresponding code of the pipeline is available in our repository (https://github.com/marciaah/CCRStoAAC), and the output of the mapping to amino acids can be found at https://github.com/marciaah/CCRStoAAC-output.
Figure 9.
(A) Flowchart showing the different databases and tools employed for mapping the CCRs in genomic coordinates to the amino acid coordinates in UniProtKB protein sequences. (B) Flowchart presenting the different resources and databases employed for aggregating 30 general protein feature annotations and conservation score (blue-green boxes), clinically interpreted variants (yellow boxes), CCRpct (red boxes) and disorder/mobile related protein feature annotations (blue boxes). UniProtKB/SP = UniProtKB/SwissProt.
The use of GRCh38 gives a more accurate cross-mapping of genes, transcripts and proteins in the Ensembl25 and UniProtKB databases, while using the Swiss-Prot canonical set of proteins ensures the availability of functional annotations for further analysis.
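The core of the mapping step, going from a CCR in genomic coordinates to the residues it spans, can be sketched as follows. This is a simplified illustration and not the CCRStoAAC code: it assumes a forward-strand transcript whose coding segments are already known, and the helper names, the toy coordinates and the rule of keeping the highest percentile when two CCRs touch the same codon are all hypothetical choices.

```python
# Hypothetical helpers for forward-strand transcripts only;
# this is not the CCRStoAAC implementation.

def ccr_to_residues(cds_exons, ccr_start, ccr_end):
    """Map a genomic interval onto 1-based amino acid positions.

    cds_exons: list of (start, end) coding segments, 1-based inclusive,
               in transcript order on the forward strand.
    ccr_start, ccr_end: 1-based inclusive genomic interval of the CCR.
    Returns the sorted residue positions whose codons overlap the CCR.
    """
    residues = set()
    cds_offset = 0  # number of coding bases before the current exon
    for exon_start, exon_end in cds_exons:
        lo = max(exon_start, ccr_start)
        hi = min(exon_end, ccr_end)
        for pos in range(lo, hi + 1):
            cds_index = cds_offset + (pos - exon_start)  # 0-based CDS index
            residues.add(cds_index // 3 + 1)
        cds_offset += exon_end - exon_start + 1
    return sorted(residues)

def assign_ccrpct(cds_exons, ccrs):
    """ccrs: list of (start, end, ccrpct). Returns {residue: ccrpct}.

    If several CCRs touch one codon, the highest percentile is kept
    (an arbitrary choice made for this sketch).
    """
    per_residue = {}
    for start, end, pct in ccrs:
        for res in ccr_to_residues(cds_exons, start, end):
            per_residue[res] = max(pct, per_residue.get(res, pct))
    return per_residue

# Toy transcript: two coding exons of 6 and 9 bases, i.e. 5 residues.
toy_exons = [(100, 105), (200, 208)]
toy_map = assign_ccrpct(toy_exons, [(100, 103, 20.0), (104, 202, 95.0)])
```

A real mapping also has to handle reverse-strand transcripts, codons split across exon junctions and mismatches between Ensembl and UniProtKB sequences, none of which is covered here.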
From a total of 18,583 human UniProtKB/Swiss-Prot canonical proteins that matched the Ensembl protein sequences, we were able to map CCRpct to at least one region in 17,366 of them (Figure 1(A)). 6,608 of the 17,366 proteins had partial coverage of CCRpct for their amino acid sites (residues) and 1,217 (from the expected 18,583) completely lacked CCRpct, as a consequence of low quality conflicted genomic regions (see Methods) that prevented the identification of CCRs. In total, about 9.8 million amino acid sites (Figure 1(B)) are represented with CCRpct, out of an expected 10.7 million from the total 18,583 sequences.
Figure 1.
68.8% of the 9.8 million mapped amino acid sites correspond to constrained regions, ranked by different constraint percentiles. The moderately, highly, and most highly constrained positions ([90,100] CCRpct bin) represent only 7.6% of these positions. Charts show: (A) the coverage of UniProtKB/SwissProt canonical proteins with the mapping of CCRpct, (B) the number of residues covered by the different percentiles, grouped into 5 categories, and the proportion of residues without CCRpct (not covered), and (C) the distribution of sites among the different percentiles grouped by tens, with a call-out showing the number of positions in the [90,100] CCRpct. Interval boundary numbers should be interpreted as follows: [ ] = included, () = excluded, or combinations of both.
68.8% of the 9.8 million amino acid sites that we could map are in constrained regions, i.e. no protein-changing variants are reported in gnomAD3.0. The remaining 3.06 million (31.2% of the mapped residues) contain at least one variant with a minimum allele count of one (Figure 1(B)).
We categorised the CCRpct into different groups, as shown in Figure 1(B) and (C), based on the original considerations proposed by Havrilla et al., 2019, i.e., percentiles in the top 1% [99,100] for the most highly constrained regions down to 0 for unconstrained (i.e. sites having variants in gnomAD3.0). As expected, mapping from CCRs to amino acid sites gives approximately 10% of residues in each group (Figure 1(C)). The exception is the 0–10% group, which is underpopulated at the residue level, since these CCRs are the shortest regions, with an average length of only 1.05 amino acids without variants.
To summarise, we were able to assign CCRs to 93.5% of human UniProtKB/Swiss-Prot canonical proteins, equating to 91.6% of the expected residues; about two thirds of the mapped residues are constrained (i.e. without protein-changing variants in gnomAD3.0); and only 0.24% of the mapped residues are assigned to the top [99,100] CCRpct bin (i.e. most highly constrained), all of them from 839 regions in 751 proteins.
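The coverage figures quoted in this section follow directly from the reported counts, as the short check below shows. The variable names are ours, the residue totals are the rounded values given in the text, and the 23.6 K count of top-bin residues is the one quoted in the summary of the protein feature analysis.

```python
# Counts as quoted in the text (residue totals are rounded values).
proteins_expected = 18_583
proteins_mapped = 17_366
residues_expected = 10.7e6
residues_mapped = 9.8e6
top_bin_residues = 23.6e3  # residues in the [99,100] CCRpct bin

proteins_unmapped = proteins_expected - proteins_mapped
protein_coverage = 100 * proteins_mapped / proteins_expected
residue_coverage = 100 * residues_mapped / residues_expected
unconstrained_millions = 0.312 * residues_mapped / 1e6
top_bin_percent = 100 * top_bin_residues / residues_mapped

print(f"proteins without CCRpct: {proteins_unmapped}")
print(f"protein coverage: {protein_coverage:.1f}%")
print(f"residue coverage: {residue_coverage:.1f}%")
print(f"unconstrained residues: {unconstrained_millions:.2f} million")
print(f"top-bin residues: {top_bin_percent:.2f}% of mapped residues")
```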
Comparing CCRs percentiles with other whole gene scores (pLI and missense OEUF)
In clinical genomics and population genetics a number of metrics for assessing the overall intolerance to variability for a given gene or protein have become popular. The latest version of these scores is based on gnomAD2.1.1 [2, 26]. One, the pLI score, represents the probability of a gene being intolerant to heterozygous putative loss of function (pLoF) variants: nonsense (stop-gained), frameshift, splice acceptor, and splice donor variants. A pLI ≥ 0.9 has been used to highlight “essential” transcripts/genes/proteins. For missense and synonymous variants, Z-scores are used, measuring how far from the mean a gene is in terms of observed/expected (o/e) missense or synonymous variants. Accompanying these metrics, the authors recommend the use of the upper bound fraction of the 90% confidence intervals around the o/e ratios (OEUF) for the different types of variants. For further details, please refer to Supplementary Methods.
Here, we assigned pLI and OEUF scores to UniProtKB/Swiss-Prot canonical proteins with calculated CCRs, via their Ensembl transcript identifier. The mapping table, based on Ensembl (version 101) and UniProtKB (version 10–2020) can be downloaded from our repository (https://github.com/marciaah/CCRStoAAC/blob/main/data/mapping_tables/ensembl_uniprot_MANE_metrics_07102020.tsv.gz). For our analysis, we used the recommended thresholds: pLI ≥ 0.9 and missense OEUF ≤ 0.35, as a simple way to define a gene/protein as highly constrained for pLoF and missense variants, respectively.
We observed 2,916 ‘essential’ proteins with pLI ≥ 0.9, and 75% of these have at least one region that scores with very high CCRpct in the range [95,100]. Also, only 113 proteins presented missense OEUF ≤ 0.35, and 93% of them have CCRpct in the [95,100] group. However, when we looked at proteins with pLI < 0.9 or OEUF > 0.35, we found that about a quarter or more of these more “variant-tolerant” proteins also include highly constrained regions, which are distributed across the proteins with all values of missense OEUF and pLI (Supplementary Figure 1). These observations highlight the importance of looking at local constraint scores, using CCRpct, in order to understand more deeply the impact of variants in protein coding genes.
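The contrast between gene-level and local constraint can be made concrete with a small sketch. It uses the thresholds quoted above (pLI ≥ 0.9, missense OEUF ≤ 0.35, CCRpct in [95,100]); the function name and the three toy proteins are hypothetical and only illustrate how a gene that looks variant-tolerant overall may still contain a highly constrained region.

```python
# Thresholds quoted in the text; the toy proteins below are hypothetical.
PLI_MIN = 0.9
OEUF_MAX = 0.35
HIGH_CCRPCT = 95.0

def constraint_profile(pli, mis_oeuf, residue_ccrpct):
    """Return (gene_level, local) flags for one protein.

    gene_level: True if pLI >= 0.9 or missense OEUF <= 0.35.
    local: True if any residue lies in a CCR with percentile in [95,100].
    """
    gene_level = pli >= PLI_MIN or mis_oeuf <= OEUF_MAX
    local = any(p >= HIGH_CCRPCT for p in residue_ccrpct)
    return gene_level, local

toy_proteins = {
    "PROT_A": (0.99, 0.30, [0, 12.5, 97.0]),  # constrained at both levels
    "PROT_B": (0.10, 0.95, [0, 40.0, 99.2]),  # tolerant gene, constrained region
    "PROT_C": (0.00, 1.10, [0, 0, 22.0]),     # tolerant at both levels
}

# Proteins that whole-gene metrics alone would not flag.
locally_constrained_only = [
    name for name, args in toy_proteins.items()
    if constraint_profile(*args) == (False, True)
]
print(locally_constrained_only)
```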
The correlation between CCRs percentiles, interspecies conservation and length of the regions
We next explored how the CCRpct (based on intra-human variation) correlates with interspecies amino acid conservation for each amino acid position in the human proteome. The average interspecies conservation increases with increasing CCRs percentile (Supplementary Figure 2(A)); however, there is a surprisingly large variability within each percentile category (Supplementary Figure 2(A) and (B)) and the overall correlation is very low (Pearson r = 0.11).
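A rising group average can coexist with a weak overall correlation when the spread within each group is large. The toy example below, with invented values, shows this using a plain implementation of the Pearson coefficient; it is an illustration of the statistic only and does not reproduce the analysis behind the reported value.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented per-residue values: the mean conservation rises with CCRpct
# (0.55, 0.60, 0.65) but each percentile holds both low and high scores.
toy_ccrpct = [0, 0, 50, 50, 100, 100]
toy_conservation = [0.2, 0.9, 0.3, 0.9, 0.4, 0.9]
toy_r = pearson(toy_ccrpct, toy_conservation)
print(f"Pearson r = {toy_r:.2f}")
```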
In a similar way, for all amino acid positions we compared the length of the CCRs (in number of amino acids) against their percentiles (Supplementary Figure 2(C) and (D)). The average length of regions increases with CCRs percentile, which is expected given that CCRs are prioritised by region length. Nevertheless, there is a high variability within each CCRs percentile category. The most highly constrained regions (percentiles [99,100]) are only present in proteins of at least 100 amino acids in length (Supplementary Figure 3).
Figure 2.
Distribution of counts of amino acid sites in different bins of CCRpct and conservation scores. The 3D heat map shows counts for constrained amino acid sites, while the histogram shows the unconstrained ones.
To explore the numbers of amino acid sites having different combinations of CCRpct and conservation scores, we stratified both measures into ten groups and built a 2D matrix counting the numbers of residues in each cell. The majority of both constrained and unconstrained positions have conservation scores > 0.4 and distribute evenly in a plateau up to a conservation of 0.95 (Figure 2). Above this level, the counts increase, in particular for the two extremes of more constrained (percentiles [90,100]) and unconstrained sites (percentile 0). The 3D surface shown in Figure 2 highlights the disparity between these CCRpct and conservation scores and the high frequency (importance) of residues which are completely conserved (ScoreCons = 1)28. CCRpct are able to differentiate between such residues, according to observed variation and length of conserved regions, providing a valuable score for analysis.
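The stratification into a 2D matrix can be sketched as below. The exact bin boundaries are assumptions made for this illustration: constrained sites are binned by CCRpct into (0,10], (10,20], ..., (90,100], conservation scores into [0,0.1), ..., [0.9,1.0], and unconstrained sites (CCRpct = 0) are counted in a separate histogram, mirroring the heat map and histogram of Figure 2.

```python
import math

def ccr_bin(ccrpct):
    """Assumed bins (0,10], (10,20], ..., (90,100] -> indices 0..9."""
    return min(9, max(0, math.ceil(ccrpct / 10) - 1))

def cons_bin(score):
    """Assumed bins [0,0.1), ..., [0.9,1.0] -> indices 0..9."""
    return min(9, int(score * 10))

def count_sites(sites):
    """sites: iterable of (ccrpct, conservation) pairs, one per residue.

    Returns (matrix, histogram): a 10x10 count matrix for constrained
    sites (rows = CCRpct bin, columns = conservation bin) and a 10-bin
    conservation histogram for unconstrained sites (CCRpct == 0).
    """
    matrix = [[0] * 10 for _ in range(10)]
    histogram = [0] * 10
    for ccrpct, score in sites:
        if ccrpct == 0:
            histogram[cons_bin(score)] += 1
        else:
            matrix[ccr_bin(ccrpct)][cons_bin(score)] += 1
    return matrix, histogram

# Invented residues for illustration.
toy_sites = [(0, 0.97), (99.5, 1.0), (99.5, 0.2), (45.0, 0.55)]
toy_matrix, toy_histogram = count_sites(toy_sites)
```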
Protein features and CCRs percentiles
Protein annotations from UniProtKB/Swiss-Prot24, PDBe29, VarSite30, M-CSA31, BioLip32, MobiDB33, ELM34, Ensembl35 and ClinVar36 databases were obtained and aggregated for each amino acid site in the human canonical and annotated proteins of UniProtKB. 9.8 million sites were annotated in this way, assigning CCRpct and the 30 protein annotations listed in Figure 9 (Methods).
Figure 3(A) and (B) present a broad overview of the distribution of total sites annotated with different protein features, with the corresponding distributions of conservation scores and length of regions for such sites. For simplicity, we only present this information for the two extremes of CCRpct: the more constrained sites in percentiles [90,100] and the variable or unconstrained sites with percentile 0.
Figure 3.
Distribution of amino acid sites corresponding to A) highly constrained regions with percentiles in the interval [90,100] and B) unconstrained regions harbouring tolerated variants in gnomAD3.0, and in coincidence with the different protein features as listed in the first column. “All residues” panels at the bottom correspond to counting all the protein sites without distinction of protein features. "Average percent" = average of all the "% from Total # of residues".
In order to investigate in detail how different protein features are constrained across human populations, we calculated odds ratios (OR) to measure the enrichment of residues in CCRpct categorised in 7 groups for each of the 30 protein feature annotations, compared to a random distribution (see in Methods, Odds ratios tests for enrichment: I. CCRpct and presence of each one of the 30 protein features). Figure 4 presents forest plots for comparing the resulting OR (listed in Supplementary Table 1).
Figure 4.
Propensity of co-occurrence of amino acid positions in specific protein features or functional sites with the different CCRpct. Odds ratios (OR) forest plots with 95% confidence intervals (CI) based on two-tailed Fisher’s exact test for amino acid sites in (A) domains, globular, non-globular and compositionally-biased protein regions, (B) interactions and catalysis, (C) disordered/mobile residues, structural transitions between disorder/flexibility and order upon binding and regions driving LLPS, (D) different signalling regions, (E) post translationally modified sites, (F) coding DNA sequence (CDS) junctions and other functionally relevant regions annotated in UniProtKB, and (G) residues lacking any of the features or functional annotation considered in this work. The vertical lines through the boxes illustrate the length of the CI. The line at OR = 1 is the line of no clear difference, boxes and intervals above this represent co-occurrences more likely to happen, while boxes and intervals under the line represent the contrary. A cross (X) above a box depicts an OR where the association is statistically not significant (i.e. p-value > 0.05 or the 95% CI crosses over OR = 1). See Supplementary Table 1 for all the p-values corresponding to the two and one tailed Fisher’s exact tests.
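For readers less familiar with these tests, the following minimal sketch shows how an odds ratio, an approximate 95% confidence interval and a two-tailed Fisher's exact p-value can be obtained from a 2×2 table. The table layout is an assumption for illustration (a = sites with the feature in a given CCRpct group, b = sites with the feature in the other groups, c and d = the same for sites without the feature). The interval uses the simple Woolf log-normal approximation, which is swapped in here for brevity and is not necessarily the interval reported in Figure 4.

```python
from math import comb, exp, log, sqrt

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Sample odds ratio with a Woolf (log-normal) CI for [[a, b], [c, d]].

    All four cells must be non-zero for this approximation.
    """
    odds_ratio = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * se)
    upper = exp(log(odds_ratio) + z * se)
    return odds_ratio, lower, upper

def fisher_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact p-value for [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    x_min = max(0, col1 - row2)
    x_max = min(row1, col1)
    p_value = sum(
        prob(x) for x in range(x_min, x_max + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )
    return min(1.0, p_value)

# Small invented table; real counts would be numbers of residues.
toy_or, toy_low, toy_high = odds_ratio_ci(3, 1, 1, 3)
toy_p = fisher_two_tailed(3, 1, 1, 3)
```

In this toy table the interval spans OR = 1 and the p-value exceeds 0.05, so the association would be marked as not significant under the convention described in the caption.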
Additionally, we performed OR tests for assessing the overall enrichment of sites with the different protein features and their co-occurrence with inter-species conservation scores and CCRpct. We did this by defining 6 different groups, as described in Methods (Odds ratios tests for enrichment: II. CCRpct and conservation with the presence of protein features). Put simply, we divided the cells of the heatmap in Figure 2 into 4 quadrants and the histogram for unconstrained sites into 2 halves, counted the numbers of residues and calculated the ORs. Supplementary Table 2 shows the resulting OR and Table 1 summarises the features enriched in each group combining CCRpct and conservation score.
Table 1.
List of features that are more enriched in the different combinations of CCRpct (row-wise) and conservation (column-wise), sorted by OR in each cell (only showing OR ≥ 1.1). OR and p-values are in parentheses, with stars depicting significant Fisher p-values: ‘*’ ≤ 0.05, ‘**’ ≤ 0.01, ‘***’ ≤ 0.001. The full list of features and values is in Supplementary Table 2. In bold we highlight the most enriched features, with OR ≥ 1.5.
| | Lower conservation (ScoreCons ≤ 0.5) | Higher conservation (ScoreCons > 0.5) |
|---|---|---|
| Higher CCRpct (50,100] | **D_to_D** (1.90***), **PROPEP** (1.50***), DISORDER_MOBILE (1.46***), CONTEXT_DEP (1.45*), D_to_O (1.44***), PHOSPHO (1.25***), NO_feature (1.15***), LLPS (1.13***) | **CATALYTIC** (1.90***), **CROSSLINK** (1.63***), **LLPS** (1.50***), DNA/RNA (1.46***), DOMAIN (1.42***), DISULPHIDE (1.35***), MOTIF (1.34***), LIP (1.33***), PROTEIN (1.32***), LIGAND (1.31***), D_to_O (1.30***), METAL (1.28***), REPEAT (1.25***), CONTEXT_DEP (1.25**), SITE (1.22***), TRANSMEM (1.20***), LIPID (1.19**), CDSjunction (1.16***), PTM_OTHER (1.12***), REGION (1.11***), COILED (1.10***) |
| Lower CCRpct (0,50] | **TRANSIT** (1.83***), **PROPEP** (1.79***), **D_to_D** (1.71***), **DISORDER_MOBILE** (1.62***), LOW_COMPLEXITY (1.39***), PHOSPHO (1.22***), NO_feature (1.17***), SIGNAL (1.11***) | **DISULPHIDE** (1.74***), **LIPID** (1.56***), SIGNAL (1.27***), TRANSIT (1.16***), CARBOHYD (1.15***), PTM_OTHER (1.14***), CDSjunction (1.10***) |
| Unconstrained CCRpct [0] | **TRANSIT** (1.86***), **PROPEP** (1.81***), **D_to_D** (1.69***), **DISORDER_MOBILE** (1.67***), **LOW_COMPLEXITY** (1.61***), SIGNAL (1.23***), NO_feature (1.10***) | **TRANSIT** (1.58***), SIGNAL (1.24***), LOW_COMPLEXITY (1.14***), CARBOHYD (1.11***) |
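The six groups that define the rows and columns of Table 1 can be expressed as a simple classification rule, using the thresholds given in the table headers (ScoreCons ≤ 0.5 versus > 0.5; CCRpct of 0, (0,50] and (50,100]). The function name, the group labels and the toy residues below are hypothetical.

```python
from collections import Counter

def table1_group(ccrpct, scorecons):
    """Assign one residue to one of the six CCRpct/conservation groups."""
    conservation = "lower_cons" if scorecons <= 0.5 else "higher_cons"
    if ccrpct == 0:
        constraint = "unconstrained"
    elif ccrpct <= 50:
        constraint = "lower_ccrpct"
    else:
        constraint = "higher_ccrpct"
    return f"{constraint}/{conservation}"

# Invented residues as (CCRpct, ScoreCons) pairs.
toy_residues = [(0, 0.3), (0, 0.9), (50, 0.5), (50.1, 0.51), (99.0, 1.0)]
toy_counts = Counter(table1_group(p, s) for p, s in toy_residues)
```

Counting residues with and without a given feature in each of these groups yields the 2×2 tables from which the ORs in Table 1 would be computed.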
For capturing the strongest associations, we set a threshold of OR ≥ 1.5 for our analysis and grouped the different features for discussing their enrichments with CCRpct and with CCRpct and conservation, as is summarised below:
I. Domains and compositionally-biased protein regions. This group includes large features (e.g. structural domains) and biased sequences (e.g. coiled-coils) (Figure 4(A)). The most striking OR distributions occur for domain regions (i.e. those regions of a protein classified as being in a domain, according to Pfam or CATH) compared to those lying outside such a domain. There is a clear indication that amino acid sites within domains are more constrained than those outside. The repeated domains show a similar tendency. Surprisingly, the transmembrane regions showed little if any enrichment for highly constrained regions, perhaps reflecting the lipid environment where substitutions between hydrophobic amino acids are common. The residues in low complexity and coiled-coil regions are preferentially unconstrained and rarely show the highest levels of constraint.
II. Interactions and catalytic residues. Residues involved in catalytic sites, binding to metals and/or ligands, protein–protein interactions, protein–protein cross-linking, interactions with DNA/RNA, linear motifs, and linear interacting peptides (LIPs) are all more likely associated with medium to high percentiles of constraint (in the range [60,100]) (Figure 4(B)). Most of these residues are also less likely to be associated with unconstrained regions. Catalytic sites, in particular, presented the highest odds of having high CCRpct and high conservation (OR = 1.9, in Table 1), and the average residue conservation is consistently high in combination with all the CCRpct, including the unconstrained sites where gnomAD tolerated variants are located (Supplementary Figure 5). Disulphide bonds are an interesting exception: they show no preference to lie in a highly constrained region, but they are also rarely unconstrained. In particular, catalytic sites, disulphide bonds and cross-linking covalent linkages have ORs that suggest they are of the order of 0.5–0.6 times as likely to have tolerated variants, while for sites involved in other interactions the ORs are between 0.7 and 0.91.
III. Disorder related features. The 1.8 M amino acids that were annotated in our dataset as intrinsically disordered (ID) or mobile and with CCRpct assigned (about 18.5% from the total of 9.8 M residues) showed two contrasting tendencies. Although with weaker OR, these sites were more likely to coincide with unconstrained and very lowly constrained sites (percentiles [0,30)) and also with the top most highly constrained percentiles [99,100] (Figure 4(C)). ID regions/proteins tend to evolve faster than structured proteins at the sequence level37, 38; this is reflected in the enrichment of lower mean interspecies amino acid conservation across all levels of constraint, even for the high CCRpct residues (Table 1).
High percentiles of constraint ([95,100]) were strongly associated with residues in disorder to order (D-to-O), disorder to disorder (D-to-D) and context dependent transitions upon binding with other protein partners, with higher OR values for those undergoing D-to-O and context dependent transitions.
Residues in regions driving LLPS present the strongest association we observed, with odds 9.26 times higher (OR = 9.26) of being in the most highly constrained regions (percentiles [99,100]) and with the longest average region length of 75 amino acids (Figure 3). Furthermore, 29 (53%) of the 54 LLPS proteins in our dataset have LLPS driving amino acid sites with percentiles in [95,100] (Supplementary Table 5).
When considering gnomAD2.1.1 per gene variant intolerance metrics, 34 out of these 54 (63%) LLPS driving proteins have pLI ≥ 0.9 (i.e. are ‘essential’ genes, extremely intolerant to pLoF variants in heterozygosity), while only 7 of the 54 are highly intolerant to missense changes (missense OEUF ≤ 0.35) (Supplementary Table 5).
IV. Signalling regions. Residues in propeptides, signal peptides and transit signalling regions tend to be in regions more likely to have unconstrained to medium constrained percentiles in the interval [0,60] (Figure 4(D)). The very few highly constrained sites in propeptides and signal peptides show much lower conservation and shorter region length (Figure 3 and Table 1).
V. Post translationally modified positions (PTMs). Although with weak OR, the overall tendency is for sites that are post translationally modified to be more likely in constrained regions at lower percentiles (Figure 4(E)). Given that these are specific sites with, in general, very short flanking motifs, it is expected that shorter regions, and therefore lower CCRpct, are associated with these positions. Also, some PTMs may not be functionally relevant and represent false positives39.
Glycosylated sites tend to be more associated with being unconstrained or at low constraint (percentiles [0,60]). The number of lipidated sites is very low, therefore statistically not significant for most of the CCRs percentile categories. However, they are less likely to be unconstrained (OR = 0.62, 95% CI: 0.53–0.72).
Other types of PTMs are more likely to be associated with highly constrained regions (percentiles [95,100]). Of the 482 highly constrained sites, 302 (62%) correspond to N-acetylations, and 79% of these are N-acetyl lysine. This is not surprising given the abundance of the latter modification. When comparing conservation and constraint for these sites, the behaviour is mixed and ORs are overall weak. Phosphorylation sites tend to be less conserved in general for all constraint levels (Table 1), with a slightly higher prevalence of sites with high constraint and low conservation. Lipidated sites have a stronger association with being more conserved and with low percentiles. Glycosylation and other PTMs are more associated with high conservation for all constraints.
VI. Other relevant sites and regions. The localisation of “other sites” and “other regions” was obtained from UniProt. This category corresponds to regions/positions of functional relevance for proteins, identified mostly from experimental evidence, that cannot be described by other feature annotations of UniProt. We also recorded coding DNA sequence (CDS) junctions, by translating the genomic coordinates of these sites obtained from Ensembl onto the corresponding amino acids in the UniProt sequences.
As suspected, given their relevance, the protein sites in these three categories are more likely to be constrained at high percentiles (Figure 4(F)) and also mostly related to high percentiles and high conservation, although with a weak OR (Table 1).
VII. Residues without annotations. 1.8 M amino acids did not have any of the functional sites, regions or domains that we aggregated in the present study. These were very weakly associated with unconstrained and low constraint regions and particularly less associated with higher percentiles (Figure 4(G)). They were slightly associated with all combinations of constraints and low conservation (Table 1), and slightly more with low CCRpct and low conservation. The 96.4 K and 658.7 K residues that are in CCRpct [50,100] with low and high conservation, respectively, could be explained by functional features that still remain to be discovered and annotated for some proteins, or by some domains that were difficult to delimit, creating fuzziness at their boundaries.
In summary, when bringing together the classifications regarding CCRpct and protein functional features, for the 9.8 M positions that can be assigned to a CCRs percentile, it is possible to observe how the co-occurrence with certain functional features becomes more evident at higher percentiles (Figure 3(A) vs (B) and Figure 4). The 23.6 K most highly constrained residues in the human genome (CCRpct in [99,100], 0.24% of the total mapped residues) correspond to, on average, the longest linear stretches depleted of tolerated variability, and strongly highlight positions involved in DNA/RNA binding, protein binding, catalytic sites and in driving LLPS.
Amino acid sites with clinically interpreted variants, their CCR percentiles and co-occurrence with protein features
We next investigated how residues with different types of clinically interpreted variants correlate with the different percentiles of constraint (Figure 5). For this purpose, we employed variants from the ClinVar database (https://www.ncbi.nlm.nih.gov/clinvar), and classified the amino acid positions in our dataset as pathogenic (including pathogenic/likely_pathogenic), benign (including benign/likely_benign) and/or VUS/conflicting (including variants of uncertain significance or with conflicting interpretations of pathogenicity), and followed a similar methodology for performing OR tests (Supplementary Figures 6 and 7).
Figure 5.
Propensity of co-occurrence of amino acid positions having clinically interpreted missense variants with the different CCRpct. The ORs with 95% CI, based on two-tailed Fisher’s exact test, represent the odds that amino acid sites with a particular type of variant will co-occur in combination with one of the CCRs percentile categories, compared to the odds of having such a type of variant but with any of the other percentile categories. The vertical lines through the boxes give the length of the CI. The line at OR = 1 is the line of no clear difference; boxes and intervals above this represent more likely co-occurrences, while boxes and intervals under the line represent the converse. See Supplementary Table 3 for all the p-values corresponding to the two- and one-tailed Fisher’s exact tests.
Pathogenic missense variants account for only 28,398 protein sites, with the majority of missense variants in ClinVar corresponding to VUS/conflicting interpretations, affecting 194,508 residues. Residue sites with pathogenic missense variants were observed as strongly associated with higher percentiles of constraint, in the intervals between [90,100]. Surprisingly, these types of variants were also associated with unconstrained regions (percentile [0,0]), although with weaker odds. Sites with benign and VUS/conflicting missense variants had higher odds of being in unconstrained regions, although a few benign variants, affecting 476 amino acid sites, were also present in the top most highly constrained regions.
When comparing the distribution of residues with the three groups of variants with CCRpct and conservation scores, we observed that they are spread across all categories of conservation and CCRpct (Supplementary Table 4 and Supplementary Figure 4), but only the amino acid sites affected by pathogenic/likely_pathogenic variants were more likely (OR = 1.63) to combine high conservation with high CCRpct.
Additionally, we assessed the co-occurrence of ClinVar variants with the 30 protein feature annotations (Supplementary Figure 7). Sites with missense pathogenic variants are mostly associated with being in domains, transmembrane regions, catalytic sites, metal binding, ligand binding, protein binding, disulphide bonds, DNA/RNA binding, linear motifs, LIPs, D-to-O, lipidation, other regions, and CDS junctions. Sites with missense benign variants were more likely to be disordered/mobile, low complexity, LLPS, signal, propeptide and transit peptides, and sites without any protein features. Sites with missense VUS/conflicting variants were mostly associated with those regions generally more difficult to characterise: repeats, coiled-coils, LIPs, D-to-O and D-to-D transitions, phosphorylation sites, other regions of biological relevance annotated in UniProt, and sites without any feature.
Over-representation of regions with high percentiles in GO protein classes and Reactome pathways
We investigated whether there was an enrichment of specific types of proteins and biological pathways in the proteins having regions with high CCRpct. For this purpose, we submitted a ‘query list’ of 6,402 protein identifiers of sequences with percentiles in the interval [95,100] to the PANTHER Classification System40 to perform a Gene Ontology (GO) Over-representation Test for two categories of terms: ‘protein class’ and ‘Reactome pathways’. We used as the ‘reference list’ the 17,366 genes/proteins for which we have CCRs estimates; however, PANTHER was able to assign GO terms to only 17,022 of them.
Genes with [95,100] CCRpct were enriched in 14 protein classes out of a total of 196 that were assigned to our lists of proteins. Genes with [95,100] CCRpct were enriched in 70 Reactome pathways, out of a total of 2,482. Figure 6(A) and (B) list the statistically relevant over- and under-represented protein classes and Reactome Pathways, respectively.
Figure 6.
Gene Ontology enrichment test for (A) protein class, and (B) pathways annotated in Reactome, for proteins harbouring residues in highly constrained regions with percentiles in [95,100]. The over-representation tests are based on multiple Fisher tests with Bonferroni correction, and only significant (p-values < 0.05) terms are listed and ordered by fold of enrichment. Bars in different shades of blue correspond to over-represented terms (>1-fold of enrichment). Darker blues highlight ⩾1.5 and ⩾2.0 folds of enrichment.
Liquid-Liquid phase separation: Biological processes, variability constraints and related diseases
LLPS driving regions presented a low number of sites with Pathogenic variants, while being highly associated with high CCRpct (see Figures 3(A) and 4(E) and Supplementary Figure 7). This motivated us to further investigate the distribution of Pathogenic variants in the corresponding proteins, the types of diseases they associate with and the biological processes where such proteins are involved. The Supplementary Table 5 presents this information, and Supplementary Table 6 summarises the list of clinical conditions and number of LLPS driving genes/proteins associated with them.
We observed the majority of LLPS driving genes (35 out of the 54, 63%) act in key biological processes that facilitate DNA damage repair, epigenetic gene repression and RNA metabolism (transcription, splicing, polyadenylation, transport and translation, see column ‘Main biological process groups’ and the corresponding genes in Supplementary Table 5). The remaining 19 genes are involved in many different processes, including neuron cell growth, adhesion, axonogenesis and synaptogenesis, development, synaptic plasticity and regulation of neurotransmitter vesicles release; signal transduction pathways for cell survival, migration, proliferation, differentiation and apoptosis; protein degradation/recycling; cell cycle regulation; immune responses; nuclear transport; elasticity of organs and tissues; muscle structure and function and glomerular filtration in kidney.
34 (64%) of the 54 proteins driving LLPS had pLI ⩾ 0.9 (i.e. essential genes highly intolerant to loss of function in heterozygosity); this is particularly the case for proteins involved in RNA metabolism. 43 of the 54 LLPS proteins (79.6%) have amino acid positions which drive phase separation and are highly constrained for variation (CCRpct in [90,100]).
30 LLPS proteins (56%) had variants related to at least one disease: seventeen (31.5%) of the 54 genes were associated with severe early onset developmental disorders, including different organ malformations, oestrogen resistance with absence of sexual maturation or severe early onset immunodeficiency, with 10 of them in particular linked to neurodevelopmental disorders (Supplementary Table 5, column ‘Disease groups’). 10 genes (18.5% of the 54) were associated with later onset diseases, mostly neurodegenerative but also affecting muscles and bone and triggering earlier menopause. 5 genes (9%) were associated with different cancers of pancreas, breast, uterus and prostate, lung and leukaemia. There was also a high incidence of associations with conditions not yet described (see Supplementary Table 6, ‘not provided’ or ‘not specified’).
The remaining 24 LLPS driving proteins (44% of the total 54) had not yet been associated with any disease through protein changing variants at the time we consulted ClinVar. Fourteen of these 24 (58%) have pLI ⩾ 0.9 (i.e. extremely intolerant to LoF) and are associated with RNA metabolism (10 genes), DNA damage repair (1 gene), immune response (1 gene) and signal transduction for cell proliferation and differentiation (1 gene), all of them presenting multiple regions of high constraint (CCRpct [90,100]).
In summary, the LLPS driving regions are clearly biologically important, often related to disease and very constrained for variability in the human genome.
Exploring some examples: U2AF2, the splicing factor U2AF 65 kDa subunit, and SLC12A2, the solute carrier family 12 member 2
Aggregating different protein annotations such as inter-species conservation, human variability and constraint, functional features, 3D structure and presence of clinically interpreted variants can help understand why variants have different propensities in different contexts. We have chosen two examples of proteins for illustration, the first of which illustrates a protein with highly disordered/mobile regions which have nevertheless been constrained during human evolution, although the conservation across species is patchy. The second example is a transmembrane protein in which the functional ion channel residues are highlighted as highly constrained by the CCRs and also a disordered region of 20 amino acids which are variable across species but depleted of tolerated variants and include a cluster of pathogenic variants.
The first example is U2AF2, the 65 kDa subunit of the U2 auxiliary splicing factor U2AF (Figure 7), a protein that is highly constrained against variability. This is an essential splicing factor that recognizes the polypyrimidine-tract (Py) 3′ splice-site signal in pre-mRNA and initiates spliceosome assembly in the nucleus41. U2AF2 is highly constrained for missense and LoF variability in gnomAD2.1 (missense OEUF = 0.31, missense Z-score = 4.21, pLI = 1, LOEUF = 0.133), suggesting essentiality for humans. There is partial structural data for this protein, derived from 3 separate PDB files, each containing a different part of the protein. The protein contains a low complexity arginine-serine rich motif (RS) at positions 27–62, which has been proposed to initiate liquid–liquid phase separation (LLPS) to form nuclear speckle drops, bringing together pre-mRNAs and the proteins of the spliceosome42. Further along the sequence, the region 85–112 is a UHM ligand motif (ULM) that has been shown to be responsible for the interactions with U2AF1 (also known as U2AF35), the 35 kDa subunit of the splicing factor U2AF43. The two central RNA recognition motifs (RRM) are shown in the Pfam domains panel and central dashed box of Figure 7, with protein structures on top. These regions bind to the Py-tract signal in the pre-mRNA, are highly mobile (light grey shading regions in the Pfam domains panel)44, 45, and are connected by a flexible/disordered linker region (231–258) that modulates the binding specificity for the proper Py-tracts in pre-mRNA46. The third RRM domain, known as U2AF Homology Motif, UHM, is atypical and has lost its RNA-binding ability, but interacts with splicing factor 1 (SF1) (right dashed box with protein structures on top)46, 47.
Figure 7.
An example of a protein with long regions highly constrained for variability: U2AF2, the 65 kDa subunit of the U2 auxiliary splicing factor U2AF (also known as U2AF65, UniProtKB: P26368, Ensembl: ENST00000308924). From middle to bottom the different panels represent, by amino acid positions, gnomAD3.0 allele counts (AC), CCRpct, species conservation score, Pfam domain with disorder/mobility and low-complexity regions, post-translationally modified sites (PTMs), other protein features, as listed in the Figure. Pfam domains correspond to: ‘RRM_1'=RNA recognition motif, RNP-1. On top of these panels, the dashed rectangles call out regions of the protein with available PDB structures. The lollipops above the Pfam domains depict positions with de novo variants reported in the literature as associated with developmental disorders (N12del, P138P, R149W, R150C, P157L, T252I and G265G with black circles)48, acute myeloid leukaemia (N196K), colon adenocarcinoma and castration-resistant prostate carcinoma (G301D) (black triangles)49. The unfilled triangle represents a variant of uncertain significance reported in ClinVar (G264E). By-residue ScoreCons conservation scores were obtained from VarSite. Low-complexity, mobility, disorder, LLPS, and LIP annotations were obtained from the MobiDB database. PTMs and other interacting regions were obtained from UniProtKB. Interacting proteins in the PDB structures are shown in pale blue, with their interacting side chain shown in stick representation.
The protein has long, moderate to highly constrained regions (CCRs panels) that co-localize with the U2AF2-U2AF1 binding interface, the three RRMs, the flexible linker and also with regions involved in LLPS and/or linear interacting peptides (pink and violet rectangles in the “Other protein features” panel). A few variants have been reported in the literature as associated with a developmental disorder and different types of cancer (represented with lollipops above Pfam domains in Figure 7) and only a VUS is reported in ClinVar. All the pathological missense variants represent drastic physicochemical changes, and affect highly constrained and highly conserved sites.
U2AF2 illustrates how CCRpct and conservation both highlight that this protein is not only essential, but also has functional protein sites along its length. The clinical data supports this hypothesis, with variants associated with developmental disorders and cancers.
The second example is shown in Figure 8, which illustrates how CCR data, combined with protein function information, can highlight regions with potential functions that have yet to be determined. The gene is SLC12A2, which encodes the solute carrier family 12 member 2 protein, a Na+, K+ and 2Cl− cotransporter 1 (NKCC1), which plays a critical role in the homeostasis of K+ enriched endolymph in the membranous labyrinth of the inner ear. NKCC1 subunits are ion channels that work as homodimers, and each protein monomer comprises a transmembrane domain (TMD), cytosolic N- and C-terminal domains (NTD and CTD, respectively) and extracellular flexible loops stabilised by disulphide bonds. The dimeric interface involves interactions between TMDs and CTDs50, 51.
Figure 8.
An example where CCRs highlight regions important for protein function: the pore lining helices and dimer interface in the transmembrane region of SLC12A2 (NKCC1) solute carrier family 12 member 2 (UniProtKB: P55011-1, Ensembl: ENST00000262461) and a peculiar disordered region of this protein encoded by its exon 21 and with a cluster of pathogenic/likely_pathogenic variants related to deafness and hearing loss53. From middle to bottom the plots in the horizontal panels represent, by amino acid position, gnomAD3.0 allele counts (AC), CCRpct, species conservation scores, sites with ClinVar missense variants, Pfam domains with disorder/mobility, low-complexity and transmembrane regions and post-translationally modified sites (PTMs). Pfam domains correspond to: ‘AA permease N’=Amino acid permease N-terminal, ‘AA permease’= Amino acid permease, ‘SLC12'=Solute carrier family 12. Over these domains, the dashed rectangles call out the transmembrane region of the protein characterised in the 6PZT PDB structure as a homodimer. This structure is coloured by CCRpct and by conservation and displayed on the top panels. The location in the structure of the pathogenic/likely_pathogenic ClinVar variants A327V and N376I is depicted with triangles and squares, respectively. The small dashed rectangle fully encloses exon 21 (amino acids 977–993). Residue conservation scores, as calculated by ScoreCons, were obtained from VarSite, domains from Pfam, low-complexity, mobility and disorder from MobiDB, transmembrane regions and PTMs from UniProtKB.
SLC12A2 is overall highly constrained for missense and LoF variability in gnomAD2.1 (missense OEUF = 0.78, missense Z-score = 2.4, pLI = 0.96, LOEUF = 0.31), suggesting essentiality for humans. The protein structures (Figure 8) reveal that the highest CCRpct in this protein (in the range [90,99)) corresponds to functionally important regions: residues in pore-lining helices (involved in the ion flow through the channel), the dimer interface, and also a ‘not so obviously relevant’ disordered and lowly conserved (scores < 0.4) region in the C-terminal domain. Intriguingly, this last region, comprising amino acids 977 to 993, is fully encoded by exon 21 of SLC12A2 and coincides with the boundaries of a CCR ranked with a moderate percentile of 94 (small dashed rectangle in Figure 8). It has previously been noted that this region, whose functionality still remains unclear, is unique to SLC12A2 and is not shared with the other proteins in the SLC12 family52, suggesting that it might confer a specific functional characteristic to this protein.
Seven pathogenic missense variants have been reported in ClinVar for this protein. Two of them, associated with Delpire-McNeill neurodevelopmental syndrome, are in the TMD in highly conserved and low-medium constrained residues: N376I lining the pore (conservation = 1, CCRs pct = 37.68) and A327V adjacent to a pore-lining helix (conservation = 0.72, CCRs pct = 74.95). The other five: E979K, E980K, D981Y, P988T and P988S cause deafness and sensorineural hearing loss, and cluster in the moderately constrained region encoded by exon 21. Furthermore, functional assays in cultured cells showed that applying the variants E979K, D981Y and P988T, or skipping exon 21, significantly decreases chloride influx mediated by the SLC12A2 protein53. All this evidence and the moderately high CCRpct for this disordered and lowly conserved region, highlight its putative relevance for the function of this protein.
Both the U2AF2 and SLC12A2 proteins described above also serve to exemplify different scenarios for amino acid sites where high CCRs percentiles go hand in hand with high conservation, and the converse where high CCRs percentiles go with low conservation, and vice versa.
Discussion
In the present work we extended the characterisation of constrained coding regions in the human genome, by accurately fine-mapping these regions and their level of constraint from the Human Build 38 genomic coordinates to protein sequence coordinates in 17,366 human UniProt canonical sequences, totalling about 9.8 million amino acid positions. Furthermore, aggregating protein functional annotations, available for these positions, allowed us to analyse the distribution and correlation of the different levels of constraint and inter-species conservation with different protein features.
Overall, our results agreed with the previous observations of Havrilla et al. 1 that the correlation between the CCRpct and the average nucleotide GERP++ conservation scores27 for the regions is very low and hence the intra-species conservation in humans complements the interspecies conservation.
For the catalytic sites and interactions with different partners (small molecules, proteins, DNA/RNA, metals), we observed the expected associations between high percentiles of constraint and high conservation scores. This is in concordance with the observations of Havrilla et al.1 that domains enriched with the most highly constrained regions were involved in ion transport and in different DNA/RNA interactions (like zinc fingers, helicases and translation factors). Additionally, we observed that the unconstrained (i.e. with gnomAD3.0 variants) or lowly constrained (i.e. average shorter regions depleted of variants) regions were mostly associated with signalling regions (signal, propeptides and transit peptides), low complexity, glycosylation sites, and with more mixed inter-species conservation scores.
Surprisingly, the transmembrane regions showed little if any enrichment for highly constrained regions, but slightly higher enrichment for medium constraint (CCRpct [60,90)), i.e. on average shorter regions. Perhaps, this reflects the lipid environment where variants between the hydrophobic amino acids are common.
Among the unexpected results, we observed that disulphide bond cysteines were more prone to lie within regions with low to medium percentiles of constraint (CCRpct in (0,90)). Disulphide bonds are covalent tertiary interactions important for stabilising protein folds and/or performing physiologically relevant redox activity and hence highly conserved in evolution54. We hypothesise that the association with lower-medium CCRpct (i.e. average shorter regions depleted of variants, with mean length = 20 amino acids) reflects the fact that the formation of such bonds requires only the presence of short motifs involving only the cysteines and their immediate flanking residues55.
Perhaps the most unexpected results we observed were related to disordered and mobile regions in proteins, which showed dual enrichment for unconstrained/lowly constrained and also for highly constrained percentiles, mostly in sites with low conservation. This might relate to the multiplicity of functions, or “flavours”, of disorder that such regions can present, which depend on their length, composition and location in proteins14, 56. Disordered proteins and regions are able to fulfil a variety of tasks: they can serve as flexible linkers between structured regions or flexible binding sites for ligands, they can undergo disorder–order transitions upon binding to other proteins through specific molecular recognition features (MoRFs) within longer disordered regions, they can also have short linear motifs that work as targets for post-translational modifications or cell signalling, or longer regions which promote molecular recognition and protein–protein interactions. The characterisation of the dynamics of IDPs/IDRs has led to the identification of their plausible role in regulating enzymatic activity57 and has also been useful to investigate ligand selection for developing drugs58. This motivates the need to characterise disordered proteins and regions, in order to discover their functions and the mechanisms in which they are involved.
In the present work, in particular, residues involved in D-to-O, context dependent transitions and in driving LLPS showed association with high constraints, for conserved and also unconserved sites. In the case of residues in order–disorder transitions, our observations align with what has been proposed in terms of the binding mechanisms. D-to-O are defined by a single, well-defined, fully ordered binding configuration, mediated by a unique well-defined contact pattern that excludes ambiguities and is determined by the presence of binding motifs. Context dependent transitions involve alternative binding configurations, which change with the cellular conditions and different partners. Conversely, D-to-D transitions are defined by many different binding configurations, including alternative contact patterns, often with weak or redundant motifs15. For amino acids driving LLPS, our results suggest a strong association with constrained regions ranked with the highest percentiles ([95,100]) and that such regions are, on average, the longest stretches (75 amino acids in length) depleted of protein changing variants across the human coding genome.
The scientific community is beginning to untangle the complexity of interactions and regulations involved in LLPS, and evidence shows that these protein condensates do not follow classical rules of molecular recognition. LLPS regions are generally mobile/disordered and involve long sequence stretches that orchestrate multiple and multivalent interactions with proteins and RNA for the formation of membrane-less organelles in the cell. They are important for organising and regulating key cellular processes such as transcription, splicing, translation, chromosome condensation, synapsis and downstream signalling, all essential for tightly regulating the differential expression of genes, ensuring cell survival, correct differentiation into different tissues, and for the development and function of the neuronal and immune systems59, 60, 61, 62, 63. However, it was remarkable that amino acid positions in regions driving LLPS were not observed as significantly associated with Pathogenic protein altering variants (Supplementary Figure 7), considering that previous works have observed these genes frequently related to cancer, autism spectrum disorders, neurodegeneration, and infectious diseases21, 22, 23. Here, apart from these associations, we noticed that 31.5% of these proteins were involved in diseases with profound impact on normal human postnatal and early development, with a high prevalence of neurodevelopmental disorders. In addition, 64% of the LLPS driving genes were highly constrained for loss-of-function in heterozygosity. Also, 44% of the 54 genes were not associated with any disease, and 14 of them, mostly associated with RNA metabolism, were highly constrained for loss-of-function variation. We hypothesise that variants in such genes could possibly cause severe phenotypes and affect embryonic viability.
Most of the significantly enriched GO terms for proteins with highly constrained regions (CCRpct in [95,100]) were related to RNA-processing, DNA binding, protein–protein interactions, and enzymatic activities. This is in concordance with our observations that amino acid sites functionally annotated as binding DNA/RNA and/or proteins, in catalytic sites and in LIPs, and/or driving LLPS, are among the ones with the greatest odds of being highly constrained, i.e. in the, on average, longest regions in the human genome intolerant to variability.
Our results also complement and extend what the authors of the CCRs model1 have derived before by analysing the co-occurrence with Pfam domains and observing that the highly constrained regions are involved in ion transport and in different DNA/RNA interactions (like zinc fingers, helicases and translation factors), but also that about 30% of these highly constrained regions did not correspond to any protein Pfam3 domain.
Undoubtedly, the sequencing of genetically more diverse human populations will refine some CCRs further, but the data presented here has significant clinical utility. The key challenge for clinical genomics is interpreting the pathogenicity of rare variants. Identifying whether a rare variant lies within a defined constrained region of the protein facilitates consequence interpretation especially for novel variants absent from existing genomic databases.
We emphasise that combining interspecies and intraspecies (human population) conservation can help to highlight regions of individual genes that have appeared more recently in evolution or confer some degree of uniqueness/specificity to an individual paralogue. This data has the potential to facilitate the discovery of new associations between genes/variants with previously unknown phenotypes. CCRs also highlight many highly constrained regions currently not linked to any Mendelian disease. This may indicate mutations in these regions are lethal to humans or are sufficiently rare that they have not yet been identified.1
The current tools that attempt to predict the clinical relevance of a specific sequence variant have been developed mostly based on the characteristics of folded protein regions64 making it difficult to understand the effect of variants affecting intrinsically disordered/mobile and liquid–liquid phase separating regions. Furthermore, many protein functional sites still remain to be characterised and currently lack sufficient functional annotations, in particular the difficult cases, where flexible linkers/disordered regions are poorly characterised and/or can be poorly conserved across species while being constrained, at different levels, in human populations. Our mapping of CCRs to amino acids helps to define these regions in proteins more accurately and could contribute to the further annotation of these challenging regions.
Our approach, however, is limited to the analysis of single features. We are aware of the possibility that multiple features can co-occur for the same protein site or that other confounding biological factors can be present. Furthermore, because the CCRs were defined from genomic sequences, i.e. in a uni-dimensional or linear space, and the weighting of the regions takes into consideration their length, the model does not consider the possibility of having short lowly constrained regions coming together in the protein three-dimensional space to define a larger structural cluster constrained for variability. Looking forward, we believe that all these current limitations will open new avenues for further research and refinement.
Methods
Generation of the CCRs based on gnomAD3
To obtain the CCRs we ran the pipeline developed by Havrilla et al.1 (https://quinlan-lab.github.io/ccr/examples/updates) but employing the dataset of gnomAD26 version 3.0 and the corresponding files for the coordinates of the human genome in version GRCh3865 (https://www.ncbi.nlm.nih.gov/grc). We used the Variant Effect Predictor (VEP)66 of Ensembl25 version 101. We also followed the recommendations of the authors to only consider genetic variants from autosomes and chromosome X, and to avoid those in conflicting genomic regions, i.e. where there are segmental duplications and/or high identity with other genomic regions (⩾90% identity) or with low sequencing coverage. In line with the same recommendations, we ran the weighting of the regions for autosomes and X chromosome separately, but merged both output files into one for performing the mapping of the coordinates of the regions to protein amino acids.
Mapping the CCRs to protein amino acids
We developed an in-house pipeline in R that uses the ‘ensembldb’ Bioconductor R package67 to map the genomic coordinates of CCRs boundaries, and all the coding bases in between, to the Ensembl v101 transcripts which are part of the GENCODE68 basic set version 35. This was to ensure we were including complete and well annotated relevant transcripts. For those amino acid sites where the corresponding codon had constrained and unconstrained bases, we assigned such amino acid sites as unconstrained. We then obtained the sequence identifiers that crosslink Ensembl transcripts and proteins and UniProtKB proteins24 by querying the APIs (application programming interfaces) of both databases. Finally, the CCRpct were accurately transferred to the amino acids in UniProtKB sequences by downloading the corresponding protein sequences from Ensembl and UniProtKB and performing Blastp local alignments69 requesting 100% sequence identity (perfect match). The workflow is summarised in Figure 9(A) and explained in more detail in Supplementary Methods, and the corresponding scripts of the pipeline are available in this repository: https://github.com/marciaah/CCRStoAAC.
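The codon rule stated above (a codon with both constrained and unconstrained bases yields an unconstrained residue) can be illustrated with a short Python sketch. It is not taken from the published R pipeline: the function name and the list-based input are hypothetical, and the handling of a codon whose bases carry two different non-zero percentiles (taking the lower one) is an assumption made only for this example, since the text does not specify that case.

```python
def residue_percentiles(base_pct):
    """Collapse per-base CCR percentiles (in CDS order) to per-residue values.

    base_pct holds one percentile per coding base; 0 means unconstrained.
    """
    if len(base_pct) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    residues = []
    for start in range(0, len(base_pct), 3):
        codon = base_pct[start:start + 3]
        if any(p == 0 for p in codon):
            # Rule from the text: any unconstrained base in the codon
            # makes the whole amino acid site unconstrained.
            residues.append(0.0)
        else:
            # ASSUMPTION (not from the source): if a codon spans two CCRs
            # with different percentiles, keep the lower percentile.
            residues.append(min(codon))
    return residues


# Four codons: fully constrained, mixed, unconstrained, spanning two CCRs
print(residue_percentiles([95.0, 95.0, 95.0,
                           95.0, 0.0, 0.0,
                           0.0, 0.0, 0.0,
                           40.0, 40.0, 60.0]))
```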
Aggregation of protein features annotations and clinically interpreted variants
We developed our own pipeline in R for fetching different protein features and functional annotations from multiple resources. For this purpose, we captured annotations and based our analysis only on the 9.8 million protein sites in UniProtKB/SwissProt canonical sequences because such sequences are the main references for annotations in the databases we employed. An overview of the databases and features that we included is presented in Figure 9(B), and they were obtained as described in more detail in Supplementary Methods.
Gene Ontology enrichment tests
The GO statistical over-representation tests were performed using the PANTHER classification system40 (https://www.pantherdb.org/tools/index.jsp, PANTHER version 17.0 release 22-02-2022, with Reactome version 65), submitting the list of genes of interest (e.g. those presenting regions with percentiles in [95,100]) and using as ‘reference list’ only those genes for which we were able to map CCRpct.
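As a rough illustration of the quantities behind an over-representation test, the sketch below computes the fold of enrichment of a term as observed over expected counts and applies a Bonferroni correction to a list of p-values. It does not reproduce PANTHER's implementation: the function names are hypothetical, and the per-term Fisher p-values are assumed to be supplied from elsewhere.

```python
def fold_enrichment(k, n, K, N):
    """Observed / expected count of a term in the query list.

    k: query genes annotated with the term
    n: size of the query list
    K: reference genes annotated with the term
    N: size of the reference list
    """
    expected = n * K / N
    return k / expected


def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


# Toy numbers (not from the paper): 10 of 100 query genes carry a term
# that annotates 50 of 1,000 reference genes, so 5 were expected.
print(fold_enrichment(10, 100, 50, 1000))
print(bonferroni([0.01, 0.04, 0.5]))
```

A fold above 1 corresponds to an over-represented term and a fold below 1 to an under-represented one, matching the convention used in Figure 6.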
Odds ratios tests for enrichment
We performed four different Odds ratio (OR) test analyses to measure the enrichment of amino acid sites presenting different combinations of CCRpct, conservation, protein features and ClinVar variants:
I. CCRpct and presence of each of the 30 protein features: we assigned, in binary form, whether or not an amino acid site had any of the protein features (see Figure 9(B) for the full list) and a CCRpct in any of the 7 bins: unconstrained = [0]; low-medium constraint = (0,30), [30,60) and [60,90); moderately constrained = [90,95); highly constrained = [95,99); and most highly constrained = [99,100].
II. CCRpct and conservation with the presence of protein features: we classified the amino acid sites, in binary form, as having or not having any of the 30 protein features and any of the 6 combinations: a) CCRs unconstrained (0 pct) and conservation score ≤ 0.5, b) CCRs unconstrained (0 pct) and conservation score > 0.5, c) CCRs in (0,50] pct and conservation score ≤ 0.5, d) CCRs in (0,50] pct and conservation score > 0.5, e) CCRs in (50,100] pct and conservation score ≤ 0.5, f) CCRs in (50,100] pct and conservation score > 0.5.
III. CCRpct and presence of ClinVar variants: amino acid sites were assigned, in binary form, according to whether or not they had “pathogenic/likely_pathogenic”, “benign/likely_benign” or “VUS/conflicting interpretations of pathogenicity” variants and a CCRpct in any of the 7 bins mentioned in (I).
IV. CCRpct and conservation with the presence of ClinVar variants: we classified residues according to whether or not they had any of the 3 groups of variants described in (III) and any of the 6 combinations of CCRpct and conservation described in (II).
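The binning used in analyses (I) to (IV) can be written down explicitly. The following Python sketch encodes the 7 CCRpct bins of (I) and the 6 CCRpct/conservation combinations of (II), paying attention to which interval ends are open or closed. The function names `ccr_bin` and `ccr_conservation_group` are hypothetical and the code is an illustration only, not the authors' implementation.

```python
def ccr_bin(pct: float) -> str:
    """Return the label of the CCRpct bin from analysis (I) that contains pct."""
    if not 0 <= pct <= 100:
        raise ValueError("CCR percentile must lie in [0, 100]")
    if pct == 0:
        return "[0]"          # unconstrained
    if pct < 30:
        return "(0,30)"       # low-medium constraint
    if pct < 60:
        return "[30,60)"      # low-medium constraint
    if pct < 90:
        return "[60,90)"      # low-medium constraint
    if pct < 95:
        return "[90,95)"      # moderately constrained
    if pct < 99:
        return "[95,99)"      # highly constrained
    return "[99,100]"         # most highly constrained


def ccr_conservation_group(pct: float, conservation: float) -> str:
    """Return the letter (a-f) of the combination from analysis (II)."""
    if not 0 <= pct <= 100:
        raise ValueError("CCR percentile must lie in [0, 100]")
    low_conservation = conservation <= 0.5
    if pct == 0:
        return "a" if low_conservation else "b"
    if pct <= 50:
        return "c" if low_conservation else "d"
    return "e" if low_conservation else "f"


print(ccr_bin(30), ccr_bin(98.9), ccr_bin(99))  # [30,60) [95,99) [99,100]
print(ccr_conservation_group(50, 0.6))          # d
```

Note that the two schemes place their boundaries differently: the bins of (I) are closed on the left, whereas the percentile ranges of (II) are closed on the right, so a site at exactly 50 pct falls in (0,50].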
For the four enrichment analyses, contingency tables were constructed by counting amino acid sites with the different classifications (see Supplementary Methods: OR tests for enrichment for further details), and the ORs were calculated using two-tailed and one-tailed Fisher’s exact tests [70] to obtain the corresponding P-values and 95% confidence intervals (CI 95%). Note that when counting residues we did not require exclusivity in the intersections; i.e. a residue with a given CCRpct can simultaneously be in DOMAIN, DOSORDER_MOBILE and DNA-RNA_BIND and hence will contribute to the cells of the three corresponding contingency tables.
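To make the counting scheme concrete, the sketch below builds one 2×2 contingency table per protein feature from a toy list of sites, without requiring exclusivity, and computes the sample odds ratio together with two-sided and one-sided (enrichment) Fisher's exact P-values from the hypergeometric distribution. It is a standard-library Python illustration, not the authors' R code: the helper names and the toy data are hypothetical, the confidence intervals mentioned in the text are not computed, and exact binomial coefficients would be impractical at the scale of millions of sites.

```python
from math import comb


def contingency(sites, feature):
    """Count sites into (a, b, c, d) = (in bin & feature, in bin & no feature,
    not in bin & feature, not in bin & no feature)."""
    a = b = c = d = 0
    for in_bin, features in sites:
        has_feature = feature in features
        if in_bin and has_feature:
            a += 1
        elif in_bin:
            b += 1
        elif has_feature:
            c += 1
        else:
            d += 1
    return a, b, c, d


def odds_ratio(a, b, c, d):
    """Sample odds ratio (a*d)/(b*c); infinite or undefined when b*c == 0."""
    if b * c == 0:
        return float("inf") if a * d > 0 else float("nan")
    return (a * d) / (b * c)


def fisher_exact(a, b, c, d):
    """Return (two-sided P, one-sided 'greater' P) for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x sites in the top-left cell
        # given fixed row and column totals.
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    p_obs = prob(a)
    # Two-sided: sum over all tables no more probable than the observed one.
    two_sided = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))
    # One-sided: tables with at least as many sites in the top-left cell.
    greater = sum(prob(x) for x in range(a, hi + 1))
    return min(1.0, two_sided), min(1.0, greater)


# Toy data: (site is in the CCRpct bin of interest, set of features at the site).
sites = [
    (True, {"DOMAIN", "DNA-RNA_BIND"}),  # contributes to both feature tables
    (True, {"DOMAIN"}), (True, {"DOMAIN"}), (True, set()),
    (False, {"DOMAIN"}), (False, set()), (False, set()),
    (False, {"DNA-RNA_BIND"}),
]
for feature in ("DOMAIN", "DNA-RNA_BIND"):
    table = contingency(sites, feature)
    print(feature, table, odds_ratio(*table), fisher_exact(*table))
```

Because each feature gets its own table, the first toy site is counted in the top-left cell of both the DOMAIN and the DNA-RNA_BIND table, which mirrors the non-exclusive counting described above.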
CRediT authorship contribution statement
Marcia A. Hasenahuer: Conceptualization, Methodology, Software, Investigation, Data curation, Formal analysis, Writing – original draft. Alba Sanchis-Juan: Conceptualization, Software, Writing – review & editing. Roman A. Laskowski: Resources, Writing – original draft. James A. Baker: Writing – review & editing. James D. Stephenson: Writing – review & editing. Christine A. Orengo: Writing – review & editing, Funding acquisition. F. Lucy Raymond: Conceptualization, Writing – original draft, Supervision, Funding acquisition. Janet M. Thornton: Conceptualization, Supervision, Funding acquisition, Methodology, Writing – original draft, Writing – review & editing.
Acknowledgments
This work was funded by the Cambridge NIHR Biomedical Research Centre (BRC-1215-200014) and the 3D-FunSites Project (Wellcome Trust Ref. number 221327/Z/20/Z). The Wellcome Trust, the European Bioinformatics Institute (EMBL-EBI) and the Medical Research Council have also funded research infrastructure. We thank Prof. Raymond’s and Prof. Thornton’s group members for their valuable comments and suggestions. We also thank Dr. Gabriela A. Merino for her valuable comments on the statistical analysis.
The code and data used for the present analysis are provided in GitHub repositories, as mentioned in the Methods section. The authors welcome requests for additional information regarding the material presented in this paper.
Declaration of Competing Interest
The authors declare that they have no competing interests.
Edited by Rita Casadio
Footnotes
Supplementary data to this article can be found online at https://doi.org/10.1016/j.jmb.2022.167892.
Appendix A. Supplementary Data
The following are the Supplementary data to this article:
Supplementary Figure 1: Comparison of pLI and missense OEUF and the maximum CCRs percentile by protein.
Supplementary Figure 2: The higher the constrained percentile the larger the mean length of the regions and the higher the mean conservation of the amino acids within them. However, each percentile category exhibits a high variability.
Supplementary Figure 3: Comparison of full length of proteins and the number of regions in each CCRs percentile that they have.
Supplementary Figure 4: Distribution of the raw count of amino acid positions grouped by different categories of CCRs percentiles and conservation scores.
Supplementary Figure 5: Comparison of OR tests, conservation and length of regions by protein feature.
Supplementary Figure 6: Comparison of OR tests, conservation and length of regions by type of ClinVar variants.
Supplementary Figure 7: Amino acid positions affected with clinically interpreted variants assessed by their location in the different protein features or functional sites.
OR and Fisher Exact Tests assessing the associations between protein features and the CCRpct groups.
OR and Fisher Exact Tests assessing the associations between protein features and each one of the 6 groups combining different CCRpct and inter-species conservation.
OR and Fisher Exact Tests assessing the co-occurrence of amino acid positions affected by ClinVar missense variants with the different categories of CCRs percentiles.
Odds ratios (OR) of co-occurrence of amino acid positions affected by ClinVar missense variants with the different categories of CCRpct and inter-species conservation.
List of human proteins that harbour LLPS driving regions, and that have CCRpct assigned and their association with different clinical conditions.
List of human clinical conditions (ClinVar) and associated proteins which are involved in driving LLPS.
Data availability
The code and data used for the present analysis are provided in GitHub repositories, and the authors welcome requests for additional information.
References
- 1. Havrilla J.M., Pedersen B.S., Layer R.M., Quinlan A.R. A map of constrained coding regions in the human genome. Nat. Genet. 2019;51:88–95. doi: 10.1038/s41588-018-0294-6.
- 2. Lek M., Karczewski K.J., Minikel E.V., Samocha K.E., Banks E., Fennell T., O’Donnell-Luria A.H., Ware J.S., et al. Exome Aggregation Consortium, Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016;536:285–291. doi: 10.1038/nature19057.
- 3. Mistry J., Chuguransky S., Williams L., Qureshi M., Salazar G.A., Sonnhammer E.L.L., Tosatto S.C.E., Paladin L., et al. Pfam: The protein families database in 2021. Nucleic Acids Res. 2021;49:D412–D419. doi: 10.1093/nar/gkaa913.
- 4. Sillitoe I., Bordin N., Dawson N., Waman V.P., Ashford P., Scholes H.M., Pang C.S.M., Woodridge L., et al. CATH: increased structural coverage of functional space. Nucleic Acids Res. 2021;49:D266–D273. doi: 10.1093/nar/gkaa1079.
- 5. Gudmundsson S., Singer-Berk M., Watts N.A., Phu W., Goodrich J.K., Solomonson M., Genome Aggregation Database Consortium, Rehm H.L., et al. Variant interpretation using population databases: Lessons from gnomAD. Hum. Mutat. 2021;12 doi: 10.1002/humu.24309.
- 6. Samocha K.E., Kosmicki J.A., Karczewski K.J., O’Donnell-Luria A.H., Pierce-Hoffman E., MacArthur D.G., Neale B.M., Daly M.J. Regional missense constraint improves variant deleteriousness prediction. (n.d.). doi: 10.1101/148353.
- 7. Huang Y.-F. Unified inference of missense variant effects and gene constraints in the human genome. PLoS Genet. 2020;16:e1008922. doi: 10.1371/journal.pgen.1008922.
- 8. Zhao M., Havrilla J.M., Fang L., Chen Y., Peng J., Liu C., Wu C., Sarmady M., et al. Phen2Gene: rapid phenotype-driven gene prioritization for rare diseases. NAR Genom Bioinform. 2020;2:lqaa032. doi: 10.1093/nargab/lqaa032.
- 9. Šimčíková D., Heneberg P. Refinement of evolutionary medicine predictions based on clinical evidence for the manifestations of Mendelian diseases. Sci. Rep. 2019;9:18577. doi: 10.1038/s41598-019-54976-4.
- 10. Evans P., Wu C., Lindy A., McKnight D.A., Lebo M., Sarmady M., Abou Tayoun A.N. Genetic variant pathogenicity prediction trained using disease-specific clinical sequencing data sets. Genome Res. 2019;29:1144–1151. doi: 10.1101/gr.240994.118.
- 11. Satterstrom F.K., Kosmicki J.A., Wang J., Breen M.S., De Rubeis S., An J.-Y., Peng M., Collins R., et al. Large-scale exome sequencing study implicates both developmental and functional changes in the neurobiology of autism. Cell. 2020;180:568–584.e23. doi: 10.1016/j.cell.2019.12.036.
- 12. Sanchis-Juan A., Hasenahuer M.A., Baker J.A., McTague A., Barwick K., Kurian M.A., Duarte S.T., BioResource N.I.H.R., et al. Structural analysis of pathogenic missense mutations in GABRA2 and identification of a novel de novo variant in the desensitization gate. Mol. Genet. Genomic Med. 2020;8:e1106. doi: 10.1002/mgg3.1106.
- 13. Rodger C., Flex E., Allison R.J., Sanchis-Juan A., Hasenahuer M.A., Cecchetti S., French C.E., Edgar J.R., et al. De Novo VPS4A Mutations Cause Multisystem Disease with Abnormal Neurodevelopment. Am. J. Hum. Genet. 2020;107:1129–1148. doi: 10.1016/j.ajhg.2020.10.012.
- 14. van der Lee R., Buljan M., Lang B., Weatheritt R.J., Daughdrill G.W., Dunker A.K., Fuxreiter M., Gough J., et al. Classification of intrinsically disordered regions and proteins. Chem. Rev. 2014;114:6589–6631. doi: 10.1021/cr400525m.
- 15. Fuxreiter M. Classifying the binding modes of disordered proteins. Int. J. Mol. Sci. 2020;21 doi: 10.3390/ijms21228615.
- 16. Brocca S., Grandori R., Longhi S., Uversky V. Liquid-Liquid phase separation by intrinsically disordered protein regions of viruses: roles in viral life cycle and control of virus-host interactions. Int. J. Mol. Sci. 2020;21 doi: 10.3390/ijms21239045.
- 17. Wright P.E., Dyson H.J. Intrinsically disordered proteins in cellular signalling and regulation. Nat. Rev. Mol. Cell Biol. 2015;16:18–29. doi: 10.1038/nrm3920.
- 18. Fusco G., Gianni S. Function, regulation, and dysfunction of intrinsically disordered proteins. Life. 2021;11:140. doi: 10.3390/life11020140.
- 19. Vacic V., Iakoucheva L.M. Disease mutations in disordered regions–exception to the rule? Mol. Biosyst. 2012;8:27–32. doi: 10.1039/c1mb05251a.
- 20. Uversky V.N., Oldfield C.J., Dunker A.K. Intrinsically disordered proteins in human diseases: introducing the D2 concept. Annu. Rev. Biophys. 2008;37:215–246. doi: 10.1146/annurev.biophys.37.032807.125924.
- 21. Tsang B., Pritišanac I., Scherer S.W., Moses A.M., Forman-Kay J.D. Phase separation as a missing mechanism for interpretation of disease mutations. Cell. 2020;183:1742–1756. doi: 10.1016/j.cell.2020.11.050.
- 22. Li J., Zhang Y., Chen X., Ma L., Li P., Yu H. Protein phase separation and its role in chromatin organization and diseases. Biomed. Pharmacother. 2021;138:111520. doi: 10.1016/j.biopha.2021.111520.
- 23. Wang B., Zhang L., Dai T., Qin Z., Lu H., Zhang L., Zhou F. Liquid-liquid phase separation in human health and diseases. Signal Transduct Target Ther. 2021;6:290. doi: 10.1038/s41392-021-00678-1.
- 24. UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 2021;49:D480–D489. doi: 10.1093/nar/gkaa1100.
- 25. Howe K.L., Achuthan P., Allen J., Allen J., Alvarez-Jarreta J., Amode M.R., Armean I.M., Azov A.G., et al. Ensembl 2021. Nucleic Acids Res. 2021;49:D884–D891. doi: 10.1093/nar/gkaa942.
- 26. Karczewski K.J., Francioli L.C., Tiao G., Cummings B.B., Alföldi J., Wang Q., Collins R.L., Laricchia K.M., et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581:434–443. doi: 10.1038/s41586-020-2308-7.
- 27. Davydov E.V., Goode D.L., Sirota M., Cooper G.M., Sidow A., Batzoglou S. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput. Biol. 2010;6:e1001025. doi: 10.1371/journal.pcbi.1001025.
- 28. Valdar W.S.J. Scoring residue conservation. Proteins. 2002;48:227–241. doi: 10.1002/prot.10146.
- 29. Velankar S., Alhroub Y., Alili A., Best C., Boutselakis H.C., Caboche S., Conroy M.J., Dana J.M., et al. PDBe: protein data bank in Europe. Nucleic Acids Res. 2011;39:D402–D410. doi: 10.1093/nar/gkq985.
- 30. Laskowski R.A., Stephenson J.D., Sillitoe I., Orengo C.A., Thornton J.M. VarSite: Disease variants and protein structure. Protein Sci. 2020;29:111–119. doi: 10.1002/pro.3746.
- 31. Ribeiro A.J.M., Holliday G.L., Furnham N., Tyzack J.D., Ferris K., Thornton J.M. Mechanism and Catalytic Site Atlas (M-CSA): a database of enzyme reaction mechanisms and active sites. Nucleic Acids Res. 2018;46:D618–D623. doi: 10.1093/nar/gkx1012.
- 32. Yang J., Roy A., Zhang Y. BioLiP: a semi-manually curated database for biologically relevant ligand-protein interactions. Nucleic Acids Res. 2013;41:D1096–D1103. doi: 10.1093/nar/gks966.
- 33. Piovesan D., Necci M., Escobedo N., Monzon A.M., Hatos A., Mičetić I., Quaglia F., Paladin L., et al. MobiDB: intrinsically disordered proteins in 2021. Nucleic Acids Res. 2021;49:D361–D367. doi: 10.1093/nar/gkaa1058.
- 34. Kumar M., Michael S., Alvarado-Valverde J., Mészáros B., Sámano-Sánchez H., Zeke A., Dobson L., Lazar T., et al. The Eukaryotic Linear Motif resource: 2022 release. Nucleic Acids Res. 2022;50:D497–D508. doi: 10.1093/nar/gkab975.
- 35. Cunningham F., Allen J.E., Allen J., Alvarez-Jarreta J., Amode M.R., Armean I.M., Austine-Orimoloye O., Azov A.G., et al. Ensembl 2022. Nucleic Acids Res. 2022;50:D988–D995. doi: 10.1093/nar/gkab1049.
- 36. Landrum M.J., Lee J.M., Benson M., Brown G.R., Chao C., Chitipiralla S., Gu B., Hart J., et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 2018;46:D1062–D1067. doi: 10.1093/nar/gkx1153.
- 37. Brown C.J., Takayama S., Campen A.M., Vise P., Marshall T.W., Oldfield C.J., Williams C.J., Dunker A.K. Evolutionary rate heterogeneity in proteins with long disordered regions. J. Mol. Evol. 2002;55:104–110. doi: 10.1007/s00239-001-2309-6.
- 38. Chen J.W., Romero P., Uversky V.N., Dunker A.K. Conservation of intrinsic disorder in protein domains and families: II. functions of conserved disorder. J. Proteome Res. 2006;5:888–898. doi: 10.1021/pr060049p.
- 39. Beltrao P., Bork P., Krogan N.J., van Noort V. Evolution and functional cross-talk of protein post-translational modifications. Mol. Syst. Biol. 2013;9:714. doi: 10.1002/msb.201304521.
- 40. Thomas P.D., Campbell M.J., Kejariwal A., Mi H., Karlak B., Daverman R., Diemer K., Muruganujan A., et al. PANTHER: a library of protein families and subfamilies indexed by function. Genome Res. 2003;13:2129–2141. doi: 10.1101/gr.772403.
- 41. Singh R., Banerjee H., Green M.R. Differential recognition of the polypyrimidine-tract by the general splicing factor U2AF65 and the splicing repressor sex-lethal. RNA. 2000;6:901–911. doi: 10.1017/s1355838200000376.
- 42. Tari M., Manceau V., de Matha Salone J., Kobayashi A., Pastré D., Maucuer A. U2AF assemblies drive sequence-specific splice site recognition. EMBO Rep. 2019;20:e47604. doi: 10.15252/embr.201847604.
- 43. Kielkopf C.L., Rodionova N.A., Green M.R., Burley S.K. A novel peptide recognition mode revealed by the X-ray structure of a core U2AF35/U2AF65 heterodimer. Cell. 2001;106:595–605. doi: 10.1016/s0092-8674(01)00480-9.
- 44. Jenkins J.L., Laird K.M., Kielkopf C.L. A Broad range of conformations contribute to the solution ensemble of the essential splicing factor U2AF(65). Biochemistry. 2012;51:5223–5225. doi: 10.1021/bi300277t.
- 45. Huang J.-R., Warner L.R., Sanchez C., Gabel F., Madl T., Mackereth C.D., Sattler M., Blackledge M. Transient electrostatic interactions dominate the conformational equilibrium sampled by multidomain splicing factor U2AF65: a combined NMR and SAXS study. J. Am. Chem. Soc. 2014;136:7068–7076. doi: 10.1021/ja502030n.
- 46. Kang H.-S., Sánchez-Rico C., Ebersberger S., Sutandy F.X.R., Busch A., Welte T., Stehle R., Hipp C., et al. An autoinhibitory intramolecular interaction proof-reads RNA recognition by the essential splicing factor U2AF2. Proc. Natl. Acad. Sci. U. S. A. 2020;117:7140–7149. doi: 10.1073/pnas.1913483117.
- 47. Wang W., Maucuer A., Gupta A., Manceau V., Thickman K.R., Bauer W.J., Kennedy S.D., Wedekind J.E., et al. Structure of phosphorylated SF1 bound to U2AF65 in an essential splicing factor complex. Structure. 2013;21:197–208. doi: 10.1016/j.str.2012.10.020.
- 48. Kaplanis J., Samocha K.E., Wiel L., Zhang Z., Arvai K.J., Eberhardt R.Y., Gallone G., Lelieveld S.H., et al. Evidence for 28 genetic disorders discovered by combining healthcare and research data. Nature. 2020;586:757–762. doi: 10.1038/s41586-020-2832-5.
- 49. Maji D., Glasser E., Henderson S., Galardi J., Pulvino M.J., Jenkins J.L., Kielkopf C.L. Representative cancer-associated U2AF2 mutations alter RNA interactions and splicing. J. Biol. Chem. 2020;295:17148–17157. doi: 10.1074/jbc.RA120.015339.
- 50. Chew T.A., Orlando B.J., Zhang J., Latorraca N.R., Wang A., Hollingsworth S.A., Chen D.-H., Dror R.O., et al. Structure and mechanism of the cation-chloride cotransporter NKCC1. Nature. 2019;572:488–492. doi: 10.1038/s41586-019-1438-2.
- 51. Yang X., Wang Q., Cao E. Structure of the human cation-chloride cotransporter NKCC1 determined by single-particle electron cryo-microscopy. Nat. Commun. 2020;11:1016. doi: 10.1038/s41467-020-14790-3.
- 52. Gagnon K.B., Delpire E. Physiology of SLC12 transporters: lessons from inherited human genetic mutations and genetically engineered mouse knockouts. Am. J. Physiol. Cell Physiol. 2013;304:C693–C714. doi: 10.1152/ajpcell.00350.2012.
- 53. Mutai H., Wasano K., Momozawa Y., Kamatani Y., Miya F., Masuda S., Morimoto N., Nara K., et al. Variants encoding a restricted carboxy-terminal domain of SLC12A2 cause hereditary hearing loss in humans. PLoS Genet. 2020;16:e1008643. doi: 10.1371/journal.pgen.1008643.
- 54. Bošnjak I., Bojović V., Šegvić-Bubić T., Bielen A. Occurrence of protein disulfide bonds in different domains of life: a comparison of proteins from the Protein Data Bank. Protein Eng. Des. Sel. 2014;27:65–72. doi: 10.1093/protein/gzt063.
- 55. Ferrè F., Clote P. DiANNA 1.1: an extension of the DiANNA web server for ternary cysteine classification. Nucleic Acids Res. 2006;34:W182–W185. doi: 10.1093/nar/gkl189.
- 56. Necci M., Piovesan D., Tosatto S.C.E. Large-scale analysis of intrinsic disorder flavors and associated functions in the protein sequence universe. Protein Sci. 2016;25:2164–2174. doi: 10.1002/pro.3041.
- 57. Palombo M., Bonucci A., Etienne E., Ciurli S., Uversky V.N., Guigliarelli B., Belle V., Mileo E., et al. The relationship between folding and activity in UreG, an intrinsically disordered enzyme. Sci. Rep. 2017;7:5977. doi: 10.1038/s41598-017-06330-9.
- 58. Maity B.K., Vishvakarma V., Surendran D., Rawat A., Das A., Pramanik S., Arfin N., Maiti S. Spontaneous fluctuations can guide drug design strategies for structurally disordered proteins. Biochemistry. 2018;57:4206–4213. doi: 10.1021/acs.biochem.8b00504.
- 59. Gueroussov S., Weatheritt R.J., O’Hanlon D., Lin Z.-Y., Narula A., Gingras A.-C., Blencowe B.J. Regulatory expansion in mammals of multivalent hnRNP assemblies that globally control alternative splicing. Cell. 2017;170:324–339.e23. doi: 10.1016/j.cell.2017.06.037.
- 60. Hnisz D., Shrinivas K., Young R.A., Chakraborty A.K., Sharp P.A. A phase separation model for transcriptional control. Cell. 2017;169:13–23. doi: 10.1016/j.cell.2017.02.007.
- 61. Su X., Ditlev J.A., Hui E., Xing W., Banjade S., Okrut J., King D.S., Taunton J., et al. Phase separation of signaling molecules promotes T cell receptor signal transduction. Science. 2016;352:595–599. doi: 10.1126/science.aad9964.
- 62. Tsang B., Arsenault J., Vernon R.M., Lin H., Sonenberg N., Wang L.-Y., Bah A., Forman-Kay J.D. Phosphoregulated FMRP phase separation models activity-dependent translation through bidirectional control of mRNA granule formation. Proc. Natl. Acad. Sci. U. S. A. 2019;116:4218–4227. doi: 10.1073/pnas.1814385116.
- 63. Reichheld S.E., Muiznieks L.D., Keeley F.W., Sharpe S. Direct observation of structure and dynamics during phase separation of an elastomeric protein. Proc. Natl. Acad. Sci. U. S. A. 2017;114:E4408–E4415. doi: 10.1073/pnas.1701877114.
- 64. Stefl S., Nishi H., Petukh M., Panchenko A.R., Alexov E. Molecular mechanisms of disease-causing missense mutations. J. Mol. Biol. 2013;425:3919–3936. doi: 10.1016/j.jmb.2013.07.014.
- 65. Schneider V.A., Graves-Lindsay T., Howe K., Bouk N., Chen H.-C., Kitts P.A., Murphy T.D., Pruitt K.D., et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 2017;27:849–864. doi: 10.1101/gr.213611.116.
- 66. McLaren W., Gil L., Hunt S.E., Riat H.S., Ritchie G.R.S., Thormann A., Flicek P., Cunningham F. The Ensembl Variant Effect Predictor. Genome Biol. 2016;17:122. doi: 10.1186/s13059-016-0974-4.
- 67. Rainer J., Gatto L., Weichenberger C.X. ensembldb: an R package to create and use Ensembl-based annotation resources. Bioinformatics. 2019;35:3151–3153. doi: 10.1093/bioinformatics/btz031.
- 68. Frankish A., Diekhans M., Jungreis I., Lagarde J., Loveland J.E., Mudge J.M., Sisu C., Wright J.C., et al. GENCODE 2021. Nucleic Acids Res. 2021;49:D916–D923. doi: 10.1093/nar/gkaa1087.
- 69. Altschul S.F., Gish W., Miller W., Myers E.W., Lipman D.J. Basic local alignment search tool. J. Mol. Biol. 1990;215:403–410. doi: 10.1016/S0022-2836(05)80360-2.
- 70. Fisher S.R.A. Statistical Methods for Research Workers. Oliver and Boyd; 1970.