Skip to main content
Communications Biology logoLink to Communications Biology
. 2024 Mar 13;7:320. doi: 10.1038/s42003-024-05970-8

Classification of likely functional class for ligand binding sites identified from fragment screening

Javier S Utgés 1, Stuart A MacGowan 1, Callum M Ives 1,2, Geoffrey J Barton 1,
PMCID: PMC10937669  PMID: 38480979

Abstract

Fragment screening is used to identify binding sites and leads in drug discovery, but it is often unclear which binding sites are functionally important. Here, data from 37 experiments, and 1309 protein structures binding to 1601 ligands were analysed. A method to group ligands by binding sites is introduced and sites clustered according to profiles of relative solvent accessibility. This identified 293 unique ligand binding sites, grouped into four clusters (C1-4). C1 includes larger, buried, conserved, and population missense-depleted sites, enriched in known functional sites. C4 comprises smaller, accessible, divergent, missense-enriched sites, depleted in functional sites. A site in C1 is 28 times more likely to be functional than one in C4. Seventeen sites, which to the best of our knowledge are novel, in 13 proteins are identified as likely to be functionally important with examples from human tenascin and 5-aminolevulinate synthase highlighted. A multi-layer perceptron, and K-nearest neighbours model are presented to predict cluster labels for ligand binding sites with an accuracy of 96% and 100%, respectively, so allowing functional classification of sites for proteins not in this set. Our findings will be of interest to those studying protein-ligand interactions and developing new drugs or function modulators.

Subject terms: Machine learning, Target identification, Protein analysis


The authors classify 293 ligand binding sites from 37 proteins into four groups by their solvent accessibility profile. The groups differ in evolutionary conservation and functional enrichment, and can be predicted by a machine learning model.

Introduction

Fragment-based drug discovery or fragment screening, is widely used to identify lead compounds against a specific protein target1. Fragment screening typically uses X-ray crystallography to provide detailed information on the binding mode of small molecule fragments that bind to a target protein. Fragments can then be linked or grown to form more potent leads24. A typical fragment screening experiment will generate a collection of three-dimensional structures with fragments bound to different regions of the protein. While many fragments group around well understood catalytic or binding sites and so provide a scaffold for drug discovery, other fragments are also observed bound to regions of the protein where the functional significance is unclear. Such sites may be functionally irrelevant or could identify previously unknown allosteric or other functionally important sites worthy of experimental investigation.

In this paper, we describe a strategy to identify which fragment binding sites are most likely to be of functional importance and so prioritise sites for further investigation. The first step is to identify binding sites from the fragment data. We are not predicting ligand binding sites, as P2Rank5, Fpocket6, or molecular dynamics-based methods such as MixMD7,8, MDmix9, or SILCS10 do. Instead, from a set of experimentally determined three-dimensional structures of protein–ligand complexes, we define which ligands bind to the same site, based on their protein–ligand interactions.

In most previous studies, the focus has been on clustering ligands by root-mean-square deviation (RMSD)11 or Euclidean distances12 after ligand superposition. Ligand site prediction resources such as 3DLigandSite13,14 also define sites based on ligand structure superposition and RMSD. Here, we describe an algorithm that defines ligand binding sites from analysis of ligand interaction residues on the protein. The method allows the extent of a fragment binding site to be described without the need for superposition. We then apply unsupervised methods to group the defined sites into four robust clusters according to their relative solvent accessibility profiles and show which clusters are enriched in functionally characterised sites. Our analysis suggests which sites in a set of 39 fragment screening experiments are most likely to be of functional significance through further stratification by evolutionary conservation and human population missense-depletion15,16. We then develop a machine learning method that takes a set of interacting residues in an experimentally determined structure or a predicted ligand binding site and identifies which of the four classes best represents the site.

The work in this paper is likely to be of interest to groups focusing on fragment screening studies but wider applications to ligand site classification from experimentally determined or predicted structures are also discussed.

Results

Defined binding sites

The focus here is on human proteins to allow the additional information from human population variation data to be explored. For this reason, two of the 39 protein domains (products of the Replicase polyprotein 1ab from SARS-CoV-2 (P0DTD1)) were removed since they did not include any human homologues. The remaining 37 protein domains accounted for 1309 three-dimensional structures that included interactions with 1601 ligands of interest, of which 998 were unique. 293 ligand binding sites were defined across these domains, formed by 2664 unique ligand binding residues. The total number of binding sites per domain ranges from 1–24, with 33/37 domains presenting more than one defined binding site. The median number of sites per domain is seven.

Figure 1 illustrates three examples of the 37 domains for which ligand binding sites were defined by the algorithm presented in this work. The grouping of the ligands into the defined sites reflects the similarity between the interaction fingerprints of the different ligands with the target protein domain.

Fig. 1. Ligand clusters defined by the binding site definition algorithm.

Fig. 1

For simplicity, only one protein chain ribbon is shown in white for each example. Ligands are coloured according to the site they bind to. Identifiers are from UniProt. a There were 110 structures depicting human tyrosine-protein phosphatase non-receptor type 1 (PTPN1), P18031, binding 143 ligand molecules, 104 of which were unique. 18 binding sites were defined. b The 68 ligands, 30 unique, found across 50 structures of the chestnut blight fungus endothiapepsin (EAPA), P11838, were classified in 12 distinct binding sites. c For mouse mitogen-activated protein kinase 14 (Mapk14), P47811, 52 structures portrayed the interaction with 53 ligand molecules, 50 unique, which clustered in 10 ligand binding sites.

Figure 2 shows the 293 defined binding sites are diverse in size (number of amino acids), solvent accessibility, evolutionary divergence, and missense depletion. Binding site size ranges from 2–40 residues with a median of 9, while median site RSA ranges from 4–80%, with a median of 30%. For evolutionary divergence, the average site NShenkin spread from 0–80, peaking at 40. Lastly, MES spans −0.75 to 1.0, peaking at neutrality (MES 0).

Fig. 2. Variation in binding site features.

Fig. 2

Distribution of a size, b median RSA, c NShenkin and d MES across the 293 binding sites defined from our dataset. Black dashed lines indicate the median of each distribution.

Despite the diversity among sites, some general trends can be observed. Figure 3a shows that larger binding sites tend to be less accessible to solvent r=0.4,p0. Figure 3b illustrates that larger sites are less divergent across homologues r=0.21,p=104 while Fig. 3c presents how larger sites show lower enrichment in neutral missense variants within the human population, i.e., are on average more depleted in missense variants than sites of a smaller size (r=0.15,p=0.008). Correlations between MES and NShenkin, and RSA and NShenkin were not significant, i.e., 95% CI r ⊂ 0.

Fig. 3. Relation between different binding site properties.

Fig. 3

A regression line is fitted to all data points previous to binning, (N = 293 binding sites), Pearson’s correlation coefficient r98, associated p-value and 95% CI of r99. Data points are grouped into bins according to different binding site size intervals, represented by box and swarm plots. a Median site RSA % vs binding site size, in amino acids. b Average NShenkin vs binding site size. c Average site MES vs site size. Boxes represent the IQR, and whiskers extend to 1.5×IQR.

RSA-based binding site clustering

Figure 4a depicts the four clusters defined by our method and the RSA profiles of the sites within them while Fig. 4b illustrates six binding sites from each cluster to highlight the range of binding site size. Cluster 1 includes 46 sites, whereas 127 sites are found on C2, 91 in C3 and 29 in C4. The proportion of residues with an RSA < 25% in Fig. 4a follows a different profile in each cluster, which is confirmed in Fig. 5a. C1 is the most buried with a proportion of residues with RSA < 25% of 0.68, (pRSA<25%0.68), followed by C2 with pRSA<25%0.47, then C3, pRSA<25%0.30, and lastly C4 with pRSA<25%0.10. Figure 5b displays the difference in binding site size between the clusters. There is variation within clusters in site size, but certain patterns are still apparent. C1 includes the largest sites, with an average size of s¯=15 residues, followed by C2 with s¯=11, then C3 with s¯=8, and finally C4 with s¯=5. Figure 5c shows the two-dimensional MDS representation of the binding sites. C1 and C4 are the most distinct amongst the clusters while there is some overlap between clusters. Sites near the cluster borders are those that switch groups depending on the random initialisation of the clustering. To summarise, C1 includes on average the largest, most buried sites, whereas C4 includes the smallest and most accessible. C2 and C3 are not as different as C1 and C4, but still differ in size and burial proportion with C2 including larger and overall, less accessible sites than C3.

Fig. 4. RSA-based binding site clusters and examples.

Fig. 4

a RSA profiles of the 293 binding sites that were grouped in four, C1-C4, clusters by K-means based on the difference between their RSA profiles (UD). Each binding site is represented by a vector, plotted as a bar here. The elements of the vector represent the residues that form the binding site and are sorted according to their RSA, so buried residues are at the beginning of the vector (bottom), and more accessible residues towards the end (top). Each element of the vector, or section of the bar, is coloured according to RSA, using the matplotlib cividis colour palette. Within each cluster, binding sites are sorted based on the number of amino acids. Over each cluster, a line is drawn at RSA = 25%. b Six examples of binding sites are shown in structure for each cluster. Examples were selected to represent the range of binding site sizes within each cluster. IDs are UniProt accession codes. Binding site residues are coloured according to their RSA, using the cividis colour scheme. The rest of the protein is coloured in white. Ligands binding to the site in question are coloured in red.

Fig. 5. Binding site cluster features.

Fig. 5

a Box plot of the proportion of residues with RSA < 25% per binding site across the four clusters defined by K-means clustering. b Box plot of the binding site size, in amino acids, across clusters. Pairwise Mann–Whitney–Wilcoxon tests were performed to assess the differences between the clusters. Boxes represent the IQR, and whiskers extend to 1.5×IQR. p-value annotation legend: ns:p>0.05, *:0.01<p0.05, **:102<p103, ***:104<p103, ****:p104. c MDS representation of the 293 binding sites on 2 dimensions. Data points represent binding sites and are coloured based on the cluster they group in. d Histogram of RSA % of the residues found within the ligand binding sites in each cluster. e Histogram of NShenkin within cluster residues. f MES histogram plots for the four clusters defined.

These results support UD as a metric that effectively quantifies the difference between the solvent exposure and size properties of different binding sites with the four clusters encapsulating differences in RSA and binding site size. This effect might be explained by the negative correlation between solvent accessibility and binding site size shown in Fig. 3a.

Figure 5d shows the RSA distribution of all residues forming the binding sites within the defined clusters. This definition agrees, as expected, with Figs. 4 and 5a. C1 presents a distribution clearly different to the rest of clusters, peaking at RSA ≈ 5%, indicating a high density of buried residues. C2 still presents an excess of buried residues relative to clusters 3–4, though not as high as C1. C4 presents the most different distribution to C1, peaking around RSA ≈ 50–70%.

To further characterise the defined clusters, the distributions of the normalised Shenkin divergence score (NShenkin) and Missense Enrichment Score (MES) of the residues found in the clusters were analysed (Fig. 5e, f). Regarding evolutionary divergence (Fig. 5e), C1 also presents a different distribution to the rest of the clusters, with a peak at NShenkin≈ 5, i.e., most of the residues conforming the sites within this cluster are highly conserved. The other clusters present flatter distributions with increasing proportion of divergent residues (NShenkin > 25) pC2 = 0.55, pC3 = 0.67, and pC4 = 0.69. NShenkin is a divergence score ranging from 0 to 100, therefore residues with NShenkin < 25, pC1 = 0.58, pC2 = 0.45, pC3 = 0.33, and pC4 = 0.31, represent stronger residue conservation, or lower divergence, than NShenkin > 25. This agrees with the pattern observed on the RSA distributions (Fig. 5d), as buried residues tend to be evolutionarily conserved17,18. In terms of missense-depletion (Fig. 5f), the distribution of C1 is slightly shifted to the left, towards more negative values, i.e., more missense-depleted residues, with MES¯C1=0.17. The distributions of C2-4 are not statistically different, but present increasing average missense enrichment scores: MES¯C2=0.07, MES¯C3=0.02, and MES¯C4=+0.06. Once again, this pattern agrees with the ones observed with site size, solvent accessibility, and evolutionary divergence. Sites that are more buried tend to be bigger in size, more conserved across homologues, as well as depleted in missense variation in human.

Clusters predict differential functional enrichment

A key goal of this work is to identify which sites from a fragment screening experiment are most likely to be functional and so worth investigating further. Figure 6 shows the relative enrichment in functional sites across the four defined clusters. C1 is the most enriched in functional sites, with 17/46 sites being classed as of known function, (OR=4.46,p0). C2 was next with 21/127 (OR=1.15,p=0.75). C3 with 6/91 is depleted relative to the other clusters, (OR=0.33,p=0.01), and finally C4 with 0/29, (OR=0.16,p=0.04). RSA-based defined clusters are differentially enriched in functional sites. Based on their enrichment, a binding site found in C1 is 4, 14, and 28-fold more likely to be functional than a site in C2, C3, and C4, respectively.

Fig. 6. Binding site cluster enrichment in known functional sites.

Fig. 6

This enrichment score is an odds ratio (OR). Error bars indicate 95% CI of the OR. Y axis is in log10 scale. A pseudo-count of 1 was added to each cell of the contingency table to calculate the score.

Functional definitions in UniProt tend to lag behind the literature. A literature search found support for 12 sites in C1 that are without UniProt annotations with two examples discussed below. We found no literature support for the remaining seventeen sites in C1 suggesting they may be novel, functionally important sites. Supplementary Table 2 shows the full list of C1 sites that are predicted to be functionally important with 2/17 examples discussed below.

Example C1 site functional predictions supported by literature but not annotated in UniProt

NS3 protein from Zika virusQ32ZE1

The Zika virus (ZIKV) genome polyprotein (Q32ZE1) is 3419 amino acids long and codes for three structural proteins: capsid (C), envelope (E), and membrane (M) as well as seven non-structural proteins: (NS1, NS2A, NS2B, NS3, NS4A, NS4B, and NS5). NS3 is a critical serine proteinase for viral polyprotein processing and genomic regulation. It includes a protease domain at the N-terminus, and a helicase domain on the C-terminus. The helicase is responsible for RNA unwinding during replication, and thus makes an interesting drug target against ZIKV19.

There are 10 sites in NS3 identified from 17 structures with 17 unique ligands and all are functionally unannotated in UniProt. The analysis here shows binding site 7 (BS7) to lie in Cluster 1 and so is most likely to be functional.

The site is located between domains I–III, involving residues from η2, α3 on domain I, and α10, α11 on domain III as defined in Tian et al.20 (Fig. 7a). Mottin et al.21 predicted four RNA binding sites on NS3. One of them, the RNA exit crevice is located between domains I–III, and involves α3, α10 residues. Raubenolt et al.22 probed four different allosteric sites on this protein. One of them, D3, was manually curated and included α11, α12, and overlapped with BS7. Later, Durgam and Guruprasad23 stated that four of the ten residues forming this site: Ala264, Thr265, Lys537 and Asp540 bind to RNA when this is in complex with NS3. These results strongly suggest that this region plays an important role in RNA binding to NS3 and so is a site to target to modulate function. Moreover, the site is on average missense-depleted: MES = −0.28. A264 (NShenkin = 18, MES = −0.79), T267 (NShenkin = 53, MES = −0.55), and S293 (NShenkin = 72, MES = −0.48) are the three key positions out of the 10 forming this binding site, as they are all constrained within the human orthologues of this protein. A264 is conserved across homologues, whereas T267 and S293 are divergent while missense-depleted so could be important for binding specificity.

Fig. 7. Examples of C1 sites of interest.

Fig. 7

a Non-structural protein NS3 of Zika virus (Q32ZE1) binding to N-(2-methoxy-5-methylphenyl)glycinamide, NY7 in BS7 (PDB: 5RHG) (Godoy AS, Mesquita NCMR, Oliva G). Domains I, II, and III are coloured in pink, blue, and green, respectively. Binding site 7 which is in Cluster 1 is highlighted, the other 9 binding sites which fall in C2 (3), C3 (3) and C4 (3) are hidden. Ligand binding residues in red, and NY7 in yellow. Protein–ligand interactions are represented by black lines. b Non-structural protein NSP13 of SARS-CoV-2 (P0DTD1) binding to three ligands in BS6 + 16 (Ribbon PDB: 5RMH)31. 1A, 1B, 2A, stalk and zinc domains are coloured in yellow, pink, green, brown, and grey, respectively. Ligand binding residues in red, and ligands in yellow. Interactions are not shown here for simplicity. c Human tenascin, TN, (P24821) binding to 8 ligands in BS0. (Ribbon PDB: 5R60) (Coker JA, Bezerra GA, von Delft F, Arrowsmith CH, Bountra C, Edwards AM, Yue WW, Marsden BD). A, B, and P subdomains as defined by Yee et al.100 are coloured in blue, grey, and green, respectively. d Human erythroid-specific mitochondrial 5-aminolevulinate synthase, ALAS-E, (P22557) binding to 7 ligands in BS1. (Ribbon PDB: 5QR0) (Bezerra GA, Foster W, Bailey H, Shrestha L, Krojer T, Talon R, Brandao-Neto J, Douangamath A, Nicola BB, von Delft F, Arrowsmith CH, Edwards A, Bountra C, Brennan PE, Yue WW). Subunits A, B, C-terminal extensions A, B, as well as PLP cofactors are coloured in grey, beige, green, orange, and purple, respectively. Ligand binding residues in red, and ligands in yellow.

NSP13 protein from SARS-CoV-2P0DTD1

The Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) replicase polyprotein 1ab (P0DTD1) is 7096 amino acids long and codes for 16 non-structural proteins24. NSP13 is a helicase that unwinds dsRNA in the 5’−3’ direction to provide a single-stranded template for viral RNA amplification25. NSP13 also has NTPase activity, which provides the energy for the RNA unwinding26. NSP13 plays a fundamental role in the replication and transcription of the SARS-CoV-2 genome and is thought to be a good drug target against SARS-CoV-2 virus infection27. NSP13 has five domains. Two “RecA like” subdomains 1A and 2A, in charge of nucleotide binding and hydrolysis, as well as three other domains: an N-terminal zinc-binding domain, the helical “stalk” domain, and a beta-barrel 1B domain28. It is the most conserved protein across coronaviruses, with sequence identity >99%29.

Twenty-four sites are defined on the surface of NSP13. Our method identifies two binding sites: BS6, and BS16 as C1 (Fig. 7b). Visual inspection shows the two sites to be adjacent with a total of 16 residues. Three fragments bind to the site, which is located in the nucleotide and RNA binding interface of NSP13 between the 1B and 2 A domains. This is the region where the 5’ end of the RNA binds30. This pocket is determined to be highly druggable, and drugs binding to it might be effective against other coronaviruses, due to the pocket’s high amino acid conservation31. This agrees with our results, as this site has an average NShenkin = 32, and MES = −0.18. Of the 16 positions in this site, four show high conservation across homologues and missense depletion in human: P514 (NShenkin = 30, MES = −0.56), D534 (NShenkin = 9, MES = −0.56), T552 (NShenkin = 48, MES = −1.87), and H554 (NShenkin = 36, MES = −0.85). T552 shows highest conservation across species and lowest missense enrichment (−1.87) and so is most likely to have a key function in this protein family.

Examples of potentially novel C1 cluster functional predictions

Human tenascin (TN)—P24821

Human tenascin, is a hexameric extracellular matrix glycoprotein implicated in a variety of functions, including cell migration, cell attachment, matrix assembly and proinflammatory cytokine synthesis32. TN is known to interact with viruses and play a role in viral infections, e.g., HIV-1, and has been reported as a biomarker for disease severity33. It also plays a key role in wound healing34, and is involved in diverse cardiovascular diseases35, as well as in breast cancer36. For these reasons, there is considerable effort put into understanding better the function of TN and targeting it for therapeutic effect.

The data includes 11 structures with 11 unique ligands binding to TN, grouped in four binding sites, none annotated as functional in UniProt. One of the four binding sites is in C1, and so predicted to be of functional importance. The site is found on the Fibrinogen C-terminal domain of the protein, which functions as a molecular recognition unit that interacts with either proteins or carbohydrates (Fig. 7c). This site shows high conservation across species (NShenkin = 15), and is missense-depleted in human (MES = −0.33). Accordingly, we suggest that this site in TN is likely to be of key importance to function. Within the 15 positions forming the site, V2012 (NShenkin = 5, MES = −1.0), G2046 (NShenkin = 0, MES = −0.67), F2047 (NShenkin = 0, MES = −0.67), W2055 (NShenkin = 0, MES = −0.54), and G2057 (NShenkin = 0, MES = −0.83), are the most critical interacting residues, and extremely conserved across homologues.

5-aminolevulinate synthase (ALAS-E)—P22557

ALAS2 is a gene located on the X chromosome that codes for the human mitochondrial erythroid-specific 5-aminolevulinate synthase. This dimeric enzyme carries out the first and rate-limiting step of the haem synthesis pathway: the pyridoxal 5’-phosphate (PLP)-dependent condensation of succinyl-CoA and glycine to form aminolaevulinic acid37. Across eukaryotes, these enzymes have developed extensions surrounding the catalytic core on both the N and C-termini38. The N-terminal extensions include the mitochondrial targeting sequence39, whereas the C-terminal extension (C-ext) plays an autoinhibitory role by regulating substrate binding and product release40. Mutations affecting C-ext can result in gain-of-function, such as X-linked protoporphyria41, as well as loss-of-function disorders, e.g., X-linked sideroblastic anaemia42. Accordingly, ALAS-E is a potential therapeutic target for the treatment of such diseases.

We considered 25 structures with 33 unique ligands binding to ALAS-E, grouped in ten binding sites, only one of which is annotated as functional in UniProt. We classify three sites as C1. Two are known to be on the interface between subunits, form key interactions to maintain the assembly and are close to the PLP binding site40. However, one (BS1) is not mentioned in the literature. This site is located on a deep pocket at the N-terminal region of the protein structure (Fig. 7d). Amino acids in this site are strongly conserved as well as depleted in missense variation: NShenkin = 29, MES = −0.13. Together, this suggests the site has a functional role in the protein, perhaps as an allosteric regulator, or through interaction with a partner such as succinate-CoA ligase, SCS-α43. Out of the 16 residues forming the site, K381 (NShenkin = 38, MES = −0.94) is the most missense-depleted position in the site and should be considered for lead optimisation of a fragment binding to this site.

Discussion

In this paper, we have presented a method to identify binding sites from fragment screening data and group the sites into four robust clusters by an RSA profile metric. 29/46 sites in Cluster 1 have functional support from the literature (UniProt 17/46; Our search 12/46 — Supplementary Table 2). 17 further sites have similar profiles, but we could not find evidence in the literature of functional significance. We show two examples from this set that have compelling support from conservation and missense-depletion scores for functional significance and we list all sites in Supplementary Table 2 as a resource for further experimentation on these proteins.

As a case study, we applied the method to the SARS-CoV-2 main protease, MPro (P0DTD1). Twenty-five sites were defined from 511 structures, from which 8 were classed as C1, 12 as C2, 3 as C3 and only 2 as C4. Of the 8 C1 sites, one corresponds to the active site and three to allosteric sites 1, 2, and 3, as defined by DasGupta et al.44, respectively. A further C1 site is at the dimer interface and known to be a potential allosteric site45 (see Supplementary Fig. 4). The remaining three C1 sites may be important, but each binds only a single ligand and their function is currently unclear.

Here, we have focused on a small set of proteins heavily studied by fragment screening methods. However, our method can be applied to classify any ligand binding site or predicted site. Accordingly, future work will seek to classify all known ligand binding sites in the PDB and provide tools to predict likely functional sites predicted sites by tools such as P2Rank5, or GRaSP46 from Alphafold247,48 or other models.

It is natural to focus on sites that are most likely to be of functional significance and so possible targets to modulate function. However, binding sites identified here that are predicted to be least likely to have function may also be interesting as good locations for tagging proteins for degradation49, phosphorylation50, dephosphorylation51, or other modulation52,53.

Methods

Structure dataset

The Pan-Dataset Density Analysis (PanDDA) algorithm characterises a set of related crystallographic datasets of the same crystal form and identifies binding events by isomorphous difference maps54. Initially, 3021 three-dimensional structures determined by X-ray crystallography were selected by querying the PDBe55 for entries containing the string “PanDDA” in their title. In total, 1542 of the structures included bound ligands for 39 different proteins. Four proteins which were in multi-protein complexes including additional ligands were excluded to leave protein–ligand complexes coming from 35 different proteins and a total of 1450 three-dimensional structures. The structures presented resolutions from 0.9 to 3.3 Å, with a mean resolution of ≈1.5 Å. The preferred biological assemblies, as defined by PISA56, were downloaded from the PDBe via ProIntVar57.

Binding site definition

Ligand binding site definition or prediction approaches are usually based on the spatial superposition and clustering of the atomic coordinates of ligands according to Euclidean distances or RMSD1114. These methods rely on structural superposition but can be computationally expensive when dealing with large numbers of structures. Here, we define sites from protein–ligand interactions without the need for superposition (Fig. 8). Only non-ion ligands of interest were used for the binding site definition. These do not include water molecules, nor other by-products of the experimental conditions. Ligand contacts were determined with Arpeggio58. For a given ligand, a binding fingerprint is defined as the UniProt residue numbers the ligand interacts with. For a pair of ligands LA and LB, with their interaction fingerprints A and B, their relative intersection, Irel, is defined (Eq. 1) by dividing the intersection of sets A and B by the maximum possible intersection between the two sets, given by the minimum fingerprint length (Eq. 2). Irel ranges from 0–1.

Irel=ABABmax 1
ABmax=minlenA,lenB 2

Fig. 8. Ligand binding site definition algorithm.

Fig. 8

The method defines ligand binding sites from a set of three-dimensional structures portraying the complex of a protein of interest bound to ligands. a Protein–ligand complex (P18031). b Ligand binding fingerprint, comprised by protein residue numbers interacting with ligand. c Formula of the similarity metric: relative intersection, Irel. d Hierarchical clustering tree resulting from the similarity matrix, cut at threshold to determine distinct clusters of ligands. e Three-dimensional structure of all ligands binding to protein, coloured according to the cluster they group into. Only ligands found clusters 1–7 are in coloured based on their cluster membership. The rest are coloured in grey. The tree on (d) represents only a part of the tree, showing 7/18 binding sites defined on P18031. This is represented by a dash line pointing downwards on the tree.

Irel is thus a similarity metric that can be used to perform hierarchical clustering on the ligands. Single-linkage hierarchical clustering was performed with the OC software59. After exploring several threshold Irel values to cut the resulting tree, we settled on Irel = 0.66. Since this is a similarity metric, it means that a ligand shares at least two thirds of its binding residues with at least one other member of the same cluster. A total of 293 ligand binding sites across 37 protein domains were defined this way. For each protein, all structures were multiply aligned by STAMP60. Ligand binding sites were visualised in UCSF Chimera61.

Multiple sequence alignments

Two of the 35 proteins included fragment screening experiments targeting multiple domains, or protein products, resulting in 39 protein-fragments sets. A representative sequence was selected for each of the 39 sets of structures, and used to search SwissProt62 for homologues with jackHMMER63 with default parameters and 5 iterations to generate multiple sequence alignments. Evolutionary divergence within the alignments was quantified with the Shenkin divergence score, VShenkin64, and the normalised NShenkin, as defined in Utgés et al.65.

Human variants and enrichment

VarAlign57 was used to retrieve genetic variants from gnomAD v2.166 found in the human sequences within the multiple sequence alignment generated for each target protein. gnomAD contains exomes and genomes of 141,456 unrelated individuals with no known phenotypic conditions and is therefore a reasonable representation of the general healthy population. Variants found in the human sequences within the alignments were mapped to individual alignment columns and missense enrichment scores (MES) were calculated. MES represents the enrichment in missense variants of an alignment column relative to the average of the other columns in the alignment15. 95% confidence intervals (CI) and p-values were used to assess the significance of these ratios67. MES was also calculated for the defined ligand binding sites. The MES of a binding site represents the enrichment in missense variants of a binding site relative to the rest of protein residues. Alignment columns as well as binding sites were classified as enriched (MES > 0), depleted (MES < 0) or neutral (MES = 0). Enrichment was not calculated for two of the 39 proteins since no human homologues were identified.

Binding site clustering

Secondary structures were defined with DSSP68 via ProIntVar, and relative solvent accessibility (RSA) was calculated with the method of Tien et al.69. The defined binding sites were grouped according to the pattern of RSA as follows and summarised in Fig. 9.

Fig. 9. Binding site clustering algorithm.

Fig. 9

The method here clusters ligand binding sites defined across different proteins based on their solvent accessibility profiles. a Example of a defined ligand binding site. b Relative solvent accessibility profile of a binding site, represented by the RSA of the site residues. c Formula of our distance metric: distance U, UD. d Multidimensional scaling (MDS) representation of binding sites coloured according to the four clusters determined by the K-means algorithm. Dashed lines represent the cluster limits.

Given two binding sites, A and B, with RSA profiles rA and rB and sizes nA and nB, respectively, in amino acid residues, UA and UB can be calculated (Eq. 3). The Mann–Whitney U statistic70, as implemented in SciPy71, was chosen as it has a maximum theoretical value (Umax) (Eq. 4). A relative U value, Urel, ranging 0-1 is obtained by dividing the U value by Umax. The more similar rA and rB are, the bigger U and Urel are. Thus, Urel is a similarity score. Subtracting Urel from 1 gives the U distance, UD, (Eq. 5). UD is indicative of how different rA and rB are and can be used to cluster binding sites according to their RSA profiles.

UA=RAnAnA+12,UB=RBnBnB+12 3
UA+UB=nAnB,U=minUA,UBUmax=nAnB2 4
Urel=UUmaxUD=1-Urel 5

After calculating pairwise distances between the RSA profiles of the defined binding sites, K-means clustering72 was performed. Several clustering algorithms were tried to realise this task, including some hierarchical, or connectivity-based, such as single and complete-linkage73, unweighted average linkage clustering (UPGMA)74, or Ward linkage75, as well as centroid-based, such as K-means. Overall, the clusters obtained by the different methods were similar. Ward linkage and K-means resulted in the most similar clusters, displaying an average similarity between clusters of 85% (see Supplementary Fig. 1). Finally, multidimensional scaling (MDS)76 with N = 2 dimensions was performed to visualise the clusters. We settled on K-means, as it presented better contained clusters, i.e., less overlapping between members of distinct clusters. The silhouette77, elbow78, as well as Calinski–Harabasz index (CHI)79 and Davies-Bouldin index (DBI)80 methods were used for finding optimal K (see Supplementary Fig. 2), in conjunction with the MDS, trees resulting from hierarchical clustering algorithms, and the visual representation of the RSA profiles, to decide on a final number of K = 4 clusters: C1, C2, C3, and C4. Clustering was repeated 1000 times with different random states and 289/293 (98.6%) sites were always present in the same cluster, thus suggesting the clusters are robust.

Binding site cluster prediction

Two different predictive models were developed with the aim of classifying binding sites into the defined RSA-based clusters obtained with K-means, as described above. The first uses the K-nearest neighbour (KNN) algorithm as implemented in Scikit-learn81, with K = 3. The input for this KNN model is the rows of the UD matrix, containing the distances between pairs of binding site RSA profiles.

The second model is a multilayer perceptron (MLP)82, a type of artificial neural network (ANN) constructed with Keras83 with a single hidden fully connected layer between the input layer of 11 neurons, and the output layer of 4 neurons, one for each cluster label. RSA profiles present different lengths depending on the size (number of amino acids) of the binding site. As this input is not suitable for the neural network, binding sites were encoded as an 11-element vector. The first element of the vector encodes the size of the binding site relative to the maximum site size of 40 residues. The other 10 elements represent the proportion of residues forming the binding site with an RSA % within a 10-unit interval: [0, 10), [10, 20), …, and [90, 100]. In developing the method, we explored the hyperparameter space including number of hidden layers, neurons per layer, activation and loss functions, weight initialisers and optimisers (see Supplementary Note 1, Supplementary Fig. 3 and Supplementary Table 1).

The complete dataset (N = 293) was split into a blind test set (1/11 = 27), and a training set (10/11 = 266). Ten repeats of a stratified 10-fold cross-validation were performed to assess the robustness of the ANN and compare it with the KNN model, as well as a baseline of the same models trained on randomly shuffled data and completely random label assignment (p = 0.25). The reliability of the ANN predictions was assessed by means of a confidence score calculated as in Cuff and Barton84, which represents how certain the MLP is of each individual prediction (Eq. 6). The score is based on the difference between the top- and second-class probabilities assigned by the network to each of the classes, p1, and p2, respectively. For example, if the output of the network were p = [0.95, 0.02, 0.03, 0.0]. The probabilities would be sorted, so p1 = 0.95, p2 = 0.03, and a confidence score of 9 would be obtained.

confidencescore=10×p1p2 6

The KNN is based on distances to all training data and so, as expected, consistently gives higher classification accuracy than the ANN model where sites are represented by their binned RSA profile, and are thus completely unaware of other sites, and their distances to them (Fig. 10a). Both methods are significantly better than random. The average cross-validation accuracy across all repeats is of 98%, 90%, 33%, 31%, and 24% for KNN, ANN models, their randomly trained versions, and completely random label assignment, respectively. The baseline accuracy of the randomly trained models is higher than 25% since the dataset is unbalanced, with classes, C1 and C2 overrepresented.

Fig. 10. MLP cross-validation and blind test results.

Fig. 10

a Average accuracy of the 10-repeat 10-fold (N = 100) cross-validation of the KNN, and ANN predictive models compared to a baseline of the same models trained on randomly shuffled data, as well as complete random prediction (p=0.25). The box represents the central 50% of the data, i.e., Q1 — median (Q2) — Q3, also known as interquartile range (IQR). Whiskers extend to 1.5 × IQR, and beyond them are the outliers. b Cross-validation accuracy and proportion of binding sites against cumulative confidence score from the trained ANN. Sites presenting a confidence score greater or equal to 5, the average accuracy is 97%, and the percentage of sites with this score is 75%. Predictions are for the 2660 cross-validation data points, 10 different repeats of 10 distinct splits of 26–27 binding sites each. Accuracy error bars indicate 95% CI of the proportion101. c MDS representation of the 293 binding sites. Training data are coloured according to the average confidence of their prediction in the cross-validation. Test data are coloured according to whether they were correctly predicted or not. Dashed lines indicate the limits of K-means clusters.

Figure 10b shows the confidence of the ANN predictions across the 10 repeats of the 10-fold cross-validation. The overall accuracy is 90%. Those predictions presenting a confidence score greater or equal to 5 present an accuracy of 97% and cover 75% of all predictions. Finally, Fig. 10c shows the same two-dimensional representation of the K-means clusters found on Fig. 5c and demonstrates that those binding sites with lower prediction confidence are mostly located at the borders between clusters. Sites that switch cluster labels depending on the seed are also located in these regions.

Once the model hyperparameters were optimised, 50 models were trained on 10/11 of the data (N = 266) for the ten different seeds used to initialise the models. From a final pool of 500 models, the one presenting the highest validation accuracy and lowest validation loss was chosen, with a validation accuracy of 96%. This model, as well as KNN were used to predict on the blind test set. There is no significant difference in performance of the ANN and KNN models. Accuracies are 26/27 = 0.96, 95% CI = [0.82, 0.99], and 27/27 = 1.0, 95% CI = [0.88, 1.0], for ANN and KNN, respectively. The adjusted Rand index (ARI)85,86, as well as adjusted mutual information (AMI)87,88 were calculated. ARIANN = 0.93, 95% CI = [0.81, 1.0]89, AMIANN = 0.93, 95% CI = [0.82, 1.0]. ARIKNN = 1.0, AMIKNN = 1.0. 95% CI of AMI was calculated by bootstrap resampling (N = 10,000). The three metrics all agree on the high performance of the MLP. Figure 10c illustrates how the binding site, which label was wrongly predicted by the ANN model is located on the limits between adjacent clusters C3 and C4. This result agrees with the K-means clustering reliability, and confidence score analysis of the cross-validation, where the same inter-cluster regions are highlighted due to their lower clustering reliability, and low confidence prediction. This suggests that the core of the clusters is stable, and that the ANN confidence score may be used to identify binding sites that are at the borders of clusters.

Site function classification

Ligand binding sites were divided into two groups known function and unknown function by searching UniProt90 for feature annotations indicative of function, e.g., metal, substrate binding, active site, etc via the UniProt proteins API91. Seventeen out of the 35 proteins presented at least one UniProt annotated residue in one binding site. Manual curation using protein homology within the proteins in the dataset added 9 more functionally annotated proteins. This gave a total of 44 sites from 26 proteins classified as of known function. All other sites were classified as unknown function.

Statistics and reproducibility

All data analysis was carried out primarily with the following Python libraries: NumPy92, Pandas93,94 and SciPy. Keras and Scikit-learn were used for machine learning, with Matplotlib95, and Seaborn96 for plotting. All statistical tests performed are two-tailed, and significance level α = 0.05. Sample sizes and measures of significance are reported and described in the text, figures and legends.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Supplementary information

Peer Review File (5.1MB, pdf)
Reporting Summary (1.6MB, pdf)

Acknowledgements

We thank Drs. Marek Gierliński, and James Abbott for their insightful suggestions. We also thank the IT service of the University of Dundee for their support of the HPC infrastructure this study was carried out on. We thank a referee for suggesting the MPro case study that highlights the utility of our work in that system. This work was supported by grants to G.J.B. from UKRI-Biotechnology and Biological Sciences [BB/J019364/1; BB/R014752/1] and Wellcome Trust [101651/Z/13/Z; 218259/Z/19/Z]. J.S.U. was supported by a BBSRC EASTBIO Ph.D. Studentship [BB/J01446X/1].

Author contributions

G.J.B., J.S.U., S.A.M. and C.M.I. conceived, designed, and developed the research. J.S.U. and C.M.I. analysed the data. J.S.U., C.M.I., and S.A.M. developed the software. J.S.U. and G.J.B. wrote the manuscript. J.S.U. and G.J.B. reviewed and edited the manuscript. G.J.B. secured funding and supervised.

Peer review

Peer review information

Communications Biology thanks Pedro Torres, Debarati DasGupta, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: Dario Ummarino and Gene Chong. A peer review file is available.

Data availability

The main summary result tables resulting from this analysis are available in the following repository: https://github.com/bartongroup/FRAGSYS (10.5281/zenodo.10606595)97.

Code availability

Software developed to carry out this analysis is also found in our GitHub repository: https://github.com/bartongroup/FRAGSYS (10.5281/zenodo.10606595).

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

The online version contains supplementary material available at 10.1038/s42003-024-05970-8.

References

  • 1.Murray CW, Rees DC. The rise of fragment-based drug discovery. Nat. Chem. 2009;1:187–192. doi: 10.1038/nchem.217. [DOI] [PubMed] [Google Scholar]
  • 2.Congreve M, et al. A ‘rule of three’ for fragment-based lead discovery? Drug Discov. Today. 2003;8:876–877. doi: 10.1016/S1359-6446(03)02831-9. [DOI] [PubMed] [Google Scholar]
  • 3.Rees DC, et al. Fragment-based lead discovery. Nat. Rev. Drug Discov. 2004;3:660–672. doi: 10.1038/nrd1467. [DOI] [PubMed] [Google Scholar]
  • 4.Schiebel J, et al. Six biophysical screening methods miss a large proportion of crystallographically discovered fragment hits: a case study. ACS Chem. Biol. 2016;11:1693–1701. doi: 10.1021/acschembio.5b01034. [DOI] [PubMed] [Google Scholar]
  • 5.Krivak R, Hoksza D. P2Rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure. J. Cheminform. 2018;10:39. doi: 10.1186/s13321-018-0285-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Le Guilloux V, Schmidtke P, Tuffery P. Fpocket: an open source platform for ligand pocket detection. BMC Bioinforma. 2009;10:168. doi: 10.1186/1471-2105-10-168. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Lexa KW, Carlson HA. Full protein flexibility is essential for proper hot-spot mapping. J. Am. Chem. Soc. 2011;133:200–202. doi: 10.1021/ja1079332. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Ghanakota P, et al. Large-scale validation of mixed-solvent simulations to assess hotspots at protein-protein interaction interfaces. J. Chem. Inf. Model. 2018;58:784–793. doi: 10.1021/acs.jcim.7b00487. [DOI] [PubMed] [Google Scholar]
  • 9.Alvarez-Garcia D, Barril X. Molecular simulations with solvent competition quantify water displaceability and provide accurate interaction maps of protein binding sites. J. Med. Chem. 2014;57:8530–8539. doi: 10.1021/jm5010418. [DOI] [PubMed] [Google Scholar]
  • 10.Faller CE, et al. Site identification by ligand competitive saturation (SILCS) simulations for fragment-based drug design. Methods Mol. Biol. 2015;1289:75–87. doi: 10.1007/978-1-4939-2486-8_7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Shin JM, Cho DH. PDB-Ligand: a ligand database based on PDB for the automated and customized classification of ligand-binding structures. Nucleic Acids Res. 2005;33:D238–D241. doi: 10.1093/nar/gki059. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Kozakov D, et al. Optimal clustering for detecting near-native conformations in protein docking. Biophys. J. 2005;89:867–875. doi: 10.1529/biophysj.104.058768. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Wass MN, Kelley LA, Sternberg MJ. 3DLigandSite: predicting ligand-binding sites using similar structures. Nucleic Acids Res. 2010;38:W469–W473. doi: 10.1093/nar/gkq406. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.McGreig, J. E. et al. 3DLigandSite: structure-based prediction of protein-ligand binding sites. Nucleic Acids Res.50, W13–W20 (2022). [DOI] [PMC free article] [PubMed]
  • 15.MacGowan, S. A. et al. Human missense variation is constrained by domain structure and highlights functional and pathogenic residues. Preprint at https://www.biorxiv.org/content/biorxiv/early/2017/04/13/127050.full.pdf (2017).
  • 16.MacGowan, S. A. et al. A unified approach to evolutionary conservation and population constraint in protein domains highlights structural features and pathogenic sites. Preprint at https://europepmc.org/article/PPR/PPR691021 (2023). [DOI] [PMC free article] [PubMed]
  • 17.Chothia C, Lesk AM. The relation between the divergence of sequence and structure in proteins. EMBO J. 1986;5:823–826. doi: 10.1002/j.1460-2075.1986.tb04288.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Russell RB, Barton GJ. Structural features can be unconserved in proteins with similar folds. An analysis of side-chain to side-chain contacts secondary structure and accessibility. J. Mol. Biol. 1994;244:332–350. doi: 10.1006/jmbi.1994.1733. [DOI] [PubMed] [Google Scholar]
  • 19.Luo D, Vasudevan SG, Lescar J. The flavivirus NS2B-NS3 protease-helicase as a target for antiviral drug development. Antiviral Res. 2015;118:148–158. doi: 10.1016/j.antiviral.2015.03.014. [DOI] [PubMed] [Google Scholar]
  • 20.Tian H, et al. The crystal structure of Zika virus helicase: basis for antiviral drug design. Protein Cell. 2016;7:450–454. doi: 10.1007/s13238-016-0275-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Mottin M, et al. Molecular dynamics simulations of Zika virus NS3 helicase: insights into RNA binding site activity. Biochem. Biophys. Res. Commun. 2017;492:643–651. doi: 10.1016/j.bbrc.2017.03.070. [DOI] [PubMed] [Google Scholar]
  • 22.Raubenolt BA, Wong K, Rick SW. Molecular dynamics simulations of allosteric motions and competitive inhibition of the Zika virus helicase. J. Mol. Graph. Model. 2021;108:108001. doi: 10.1016/j.jmgm.2021.108001. [DOI] [PubMed] [Google Scholar]
  • 23.Durgam L, Guruprasad L. Molecular mechanism of ATP and RNA binding to Zika virus NS3 helicase and identification of repurposed drugs using molecular dynamics simulations. J. Biomol. Struct. Dyn. 2022;40:12642–12659. doi: 10.1080/07391102.2021.1973909. [DOI] [PubMed] [Google Scholar]
  • 24.Naqvi AAT, et al. Insights into SARS-CoV-2 genome, structure, evolution, pathogenesis and therapies: Structural genomics approach. Biochim. Biophys. Acta Mol. Basis Dis. 2020;1866:165878. doi: 10.1016/j.bbadis.2020.165878. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Yue K, et al. The stalk domain of SARS-CoV-2 NSP13 is essential for its helicase activity. Biochem. Biophys. Res. Commun. 2022;601:129–136. doi: 10.1016/j.bbrc.2022.02.068. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Shu T, et al. SARS-coronavirus-2 Nsp13 possesses NTPase and RNA helicase activities that can be inhibited by bismuth salts. Virol. Sin. 2020;35:321–329. doi: 10.1007/s12250-020-00242-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Zeng JK, et al. Identifying SARS-CoV-2 antiviral compounds by screening for small molecule inhibitors of nsp13 helicase. Biochem. J. 2021;478:2405–2423. doi: 10.1042/BCJ20210201. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Romeo, I. et al. Targeting SARS-CoV-2 nsp13 helicase and assessment of druggability pockets: identification of two potent inhibitors by a multi-site in silico drug repurposing approach. Molecules27, 7522 (2022). [DOI] [PMC free article] [PubMed]
  • 29.Ricci, F. et al. In silico insights towards the identification of SARS-CoV-2 NSP13 helicase druggable pockets. Biomolecules12, 482 (2022). [DOI] [PMC free article] [PubMed]
  • 30.Yan L, et al. Architecture of a SARS-CoV-2 mini replication and transcription complex. Nat. Commun. 2020;11:5874. doi: 10.1038/s41467-020-19770-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Newman JA, et al. Structure, mechanism and crystallographic fragment screening of the SARS-CoV-2 NSP13 helicase. Nat. Commun. 2021;12:4848. doi: 10.1038/s41467-021-25166-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Bhattacharyya S, Midwood KS, Varga J. Tenascin-C in fibrosis in multiple organs: translational implications. Semin. Cell Dev. Biol. 2022;128:130–136. doi: 10.1016/j.semcdb.2022.03.019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Zuliani-Alvarez L, Piccinini AM. A virological view of tenascin-C in infection. Am. J. Physiol. Cell Physiol. 2023;324:C1–C9. doi: 10.1152/ajpcell.00333.2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Wang, Y., Wang, G. & Liu, H. Tenascin-C: a key regulator in angiogenesis during wound healing. Biomolecules12, 1689 (2022). [DOI] [PMC free article] [PubMed]
  • 35.Khomtchouk BB, et al. Targeting the cytoskeleton and extracellular matrix in cardiovascular disease drug discovery. Expert Opin. Drug Discov. 2022;17:443–460. doi: 10.1080/17460441.2022.2047645. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Lepucki, A. et al. The role of extracellular matrix proteins in breast cancer. J. Clin. Med.11, 1250 (2022). [DOI] [PMC free article] [PubMed]
  • 37.Akhtar M, et al. Mechanism and stereochemistry of enzymic reactions involved in porphyrin biosynthesis. Philos. Trans. R Soc. Lond. B Biol. Sci. 1976;273:117–136. doi: 10.1098/rstb.1976.0005. [DOI] [PubMed] [Google Scholar]
  • 38.Munakata H, et al. Purification and structure of rat erythroid-specific delta-aminolevulinate synthase. J. Biochem. 1993;114:103–111. doi: 10.1093/oxfordjournals.jbchem.a124123. [DOI] [PubMed] [Google Scholar]
  • 39.Srivastava G, et al. Regulation of 5-aminolevulinate synthase mRNA in different rat tissues. J. Biol. Chem. 1988;263:5202–5209. doi: 10.1016/S0021-9258(18)60700-8. [DOI] [PubMed] [Google Scholar]
  • 40.Bailey HJ, et al. Human aminolevulinate synthase structure reveals a eukaryotic-specific autoinhibitory loop regulating substrate binding and product release. Nat. Commun. 2020;11:2813. doi: 10.1038/s41467-020-16586-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Whatley SD, et al. C-terminal deletions in the ALAS2 gene lead to gain of function and cause X-linked dominant protoporphyria without anemia or iron overload. Am. J. Hum. Genet. 2008;83:408–414. doi: 10.1016/j.ajhg.2008.08.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Ducamp S, et al. Sideroblastic anemia: molecular analysis of the ALAS2 gene in a series of 29 probands and functional studies of 10 missense mutations. Hum. Mutat. 2011;32:590–597. doi: 10.1002/humu.21455. [DOI] [PubMed] [Google Scholar]
  • 43.Furuyama K, Sassa S. Interaction between succinyl CoA synthetase and the heme-biosynthetic enzyme ALAS-E is disrupted in sideroblastic anemia. J. Clin. Investig. 2000;105:757–764. doi: 10.1172/JCI6816. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.DasGupta D, Chan WKB, Carlson HA. Computational identification of possible allosteric sites and modulators of the SARS-CoV-2 main protease. J. Chem. Inf. Model. 2022;62:618–626. doi: 10.1021/acs.jcim.1c01223. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Douangamath A, et al. Crystallographic and electrophilic fragment screening of the SARS-CoV-2 main protease. Nat. Commun. 2020;11:5047. doi: 10.1038/s41467-020-18709-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Santana CA, et al. GRaSP: a graph-based residue neighborhood strategy to predict binding sites. Bioinformatics. 2020;36:i726–i734. doi: 10.1093/bioinformatics/btaa805. [DOI] [PubMed] [Google Scholar]
  • 47.Jumper J, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–589. doi: 10.1038/s41586-021-03819-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Varadi M, et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 2022;50:D439–D444. doi: 10.1093/nar/gkab1061. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Bekes M, Langley DR, Crews CM. PROTAC targeted protein degraders: the past is prologue. Nat. Rev. Drug Discov. 2022;21:181–200. doi: 10.1038/s41573-021-00371-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Siriwardena SU, et al. Phosphorylation-inducing chimeric small molecules. J. Am. Chem. Soc. 2020;142:14052–14057. doi: 10.1021/jacs.0c05537. [DOI] [PubMed] [Google Scholar]
  • 51.Simpson LM, et al. An affinity-directed phosphatase, AdPhosphatase, system for targeted protein dephosphorylation. Cell Chem. Biol. 2023;30:188–202.e6. doi: 10.1016/j.chembiol.2023.01.003. [DOI] [PubMed] [Google Scholar]
  • 52.Heitel P. Emerging TACnology: heterobifunctional small molecule inducers of targeted posttranslational protein modifications. Molecules. 2023;28:690. doi: 10.3390/molecules28020690. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Peng Y, et al. Targeted protein posttranslational modifications by chemically induced proximity for cancer therapy. J. Biol. Chem. 2023;299:104572. doi: 10.1016/j.jbc.2023.104572. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Pearce NM, et al. A multi-crystal method for extracting obscured crystallographic states from conventionally uninterpretable electron density. Nat. Commun. 2017;8:15123. doi: 10.1038/ncomms15123. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55.ww PDBc. Protein Data Bank: the single global archive for 3D macromolecular structure data. Nucleic Acids Res. 2019;47:D520–D528. doi: 10.1093/nar/gky949. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Krissinel E, Henrick K. Inference of macromolecular assemblies from crystalline state. J. Mol. Biol. 2007;372:774–797. doi: 10.1016/j.jmb.2007.05.022. [DOI] [PubMed] [Google Scholar]
  • 57.MacGowan SA, et al. The Dundee resource for sequence analysis and structure prediction. Protein Sci. 2020;29:277–297. doi: 10.1002/pro.3783. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Jubb HC, et al. Arpeggio: a web server for calculating and visualising interatomic interactions in protein structures. J. Mol. Biol. 2017;429:365–371. doi: 10.1016/j.jmb.2016.12.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59.Barton, G. J. OC - A Cluster Analysis Program (University of Dundee, UK, 1993).
  • 60.Russell RB, Barton GJ. Multiple protein sequence alignment from tertiary structure comparison: assignment of global and residue confidence levels. Proteins. 1992;14:309–323. doi: 10.1002/prot.340140216. [DOI] [PubMed] [Google Scholar]
  • 61.Pettersen EF, et al. UCSF Chimera-a visualization system for exploratory research and analysis. J. Comput. Chem. 2004;25:1605–1612. doi: 10.1002/jcc.20084. [DOI] [PubMed] [Google Scholar]
  • 62.Boutet E, et al. UniProtKB/Swiss-Prot, the manually annotated section of the UniProt KnowledgeBase: how to use the entry view. Methods Mol. Biol. 2016;1374:23–54. doi: 10.1007/978-1-4939-3167-5_2. [DOI] [PubMed] [Google Scholar]
  • 63.Eddy SR. Multiple alignment using hidden Markov models. Proc. Int. Conf. Intell. Syst. Mol. Biol. 1995;3:114–120. [PubMed] [Google Scholar]
  • 64.Shenkin PS, Erman B, Mastrandrea LD. Information-theoretical entropy as a measure of sequence variability. Proteins. 1991;11:297–313. doi: 10.1002/prot.340110408. [DOI] [PubMed] [Google Scholar]
  • 65.Utgés JS, et al. Ankyrin repeats in context with human population variation. PLoS Comput. Biol. 2021;17:e1009335. doi: 10.1371/journal.pcbi.1009335. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66.Karczewski KJ, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581:434–443. doi: 10.1038/s41586-020-2308-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67.Szumilas M. Explaining odds ratios. J. Can. Acad. Child Adolesc. Psychiatry. 2010;19:227–229. [PMC free article] [PubMed] [Google Scholar]
  • 68.Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983;22:2577–2637. doi: 10.1002/bip.360221211. [DOI] [PubMed] [Google Scholar]
  • 69.Tien MZ, et al. Maximum allowed solvent accessibilites of residues in proteins. PLoS ONE. 2013;8:e80635. doi: 10.1371/journal.pone.0080635. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70.Mann HB, Whitney DR. On a test of whether one of two random variables is stochastically larger than the other. Annal. Math. Stat. 1947;18:50–60. doi: 10.1214/aoms/1177730491. [DOI] [Google Scholar]
  • 71.Virtanen P, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods. 2020;17:261–272. doi: 10.1038/s41592-019-0686-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72.Lloyd S. Least squares quantization in PCM. IEEE Trans. Inf. Theory. 1982;28:129–137. doi: 10.1109/TIT.1982.1056489. [DOI] [Google Scholar]
  • 73.Sørensen, T. A Method of Establishing Groups of Equal Amplitude in Plant Sociology Based on Similarity of Species and Its Application to Analyses of the Vegetation on Danish Commons. Kongelige Danske Videnskabernes Selskab, 5, 1–34 (1948).
  • 74.Sokal RR, Michener CD. A statistical method for evaluating systematic relationships. University Kansas Sci. Bull. 1958;38:1409–1438. [Google Scholar]
  • 75.Ward JH. Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 1963;58:236–244. doi: 10.1080/01621459.1963.10500845. [DOI] [Google Scholar]
  • 76.Mead A. Review of the development of multidimensional scaling methods. J. Royal Stat. Soc. Ser. D (The Statistician) 1992;41:27–39. [Google Scholar]
  • 77.Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. App. Math. 1987;20:53–65. doi: 10.1016/0377-0427(87)90125-7. [DOI] [Google Scholar]
  • 78.Thorndike RL. Who belongs in the family? Psychometrika. 1953;18:267–276. doi: 10.1007/BF02289263. [DOI] [Google Scholar]
  • 79.Caliński T, Harabasz J. A dendrite method for cluster analysis. Commun. Stat. 1974;3:1–27. [Google Scholar]
  • 80.Davies DL, Bouldin DW. A cluster separation measure. IEEE Trans. Pattern Anal. Machine Intell. 1979;PAMI-1:224–227. doi: 10.1109/TPAMI.1979.4766909. [DOI] [PubMed] [Google Scholar]
  • 81.Pedregosa F, et al. Scikit-learn: machine learning in Python. J. Machine Learn. Res. 2011;12:2825–2830. [Google Scholar]
  • 82.Cybenko G. Approximation by superpositions of a sigmoidal function. Math. Control, Signals Syst. 1989;2:303–314. doi: 10.1007/BF02551274. [DOI] [Google Scholar]
  • 83.Chollet, F. et al. Keras GitHub. Available from: https://github.com/fchollet/keras (2015).
  • 84.Cuff JA, Barton GJ. Application of multiple sequence alignment profiles to improve protein secondary structure prediction. Proteins. 2000;40:502–511. doi: 10.1002/1097-0134(20000815)40:3&#x0003c;502::AID-PROT170&#x0003e;3.0.CO;2-Q. [DOI] [PubMed] [Google Scholar]
  • 85.Rand WM. Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 1971;66:846–850. doi: 10.1080/01621459.1971.10482356. [DOI] [Google Scholar]
  • 86.Hubert L, Arabie P. Comparing partitions. J. Classif. 1985;2:193–218. doi: 10.1007/BF01908075. [DOI] [Google Scholar]
  • 87.Vinh, N. X., Epps, J. & Bailey, J. Information theoretic measures for clusterings comparison: is a correction for chance necessary? In Proceedings of the 26th Annual International Conference on Machine Learning. 1073–1080 (Association for Computing Machinery: Montreal, Quebec, Canada, 2009).
  • 88.Vinh N, Epps J, Bailey J. Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. J. Machine Learn. Res. 2010;11:2837–2854. [Google Scholar]
  • 89.Steinley D, Brusco MJ, Hubert L. The variance of the adjusted Rand index. Psychol. Methods. 2016;21:261–272. doi: 10.1037/met0000049. [DOI] [PubMed] [Google Scholar]
  • 90.UniProt C. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 2019;47:D506–D515. doi: 10.1093/nar/gky1049. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 91.Nightingale A, et al. The Proteins API: accessing key integrated protein and genome information. Nucleic Acids Res. 2017;45:W539–W544. doi: 10.1093/nar/gkx237. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 92.Harris CR, et al. Array programming with NumPy. Nature. 2020;585:357–362. doi: 10.1038/s41586-020-2649-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 93.McKinney, W. Data Structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference (eds van der Walt, S. & Millman, J.) 56–61 (2010). https://pandas.pydata.org/about/citing.html.
  • 94.Team, T. P. D. pandas-dev/pandas: Pandas. Zenodo Available from: 10.5281/zenodo.3509134 (2020).
  • 95.Hunter JD. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 2007;9:90–95. doi: 10.1109/MCSE.2007.55. [DOI] [Google Scholar]
  • 96.Waskom ML. seaborn: statistical data visualization. J. Open Sour. Softw. 2021;6:3021. doi: 10.21105/joss.03021. [DOI] [Google Scholar]
  • 97.Utgés, J. S. bartongroup/FRAGSYS: second release. Zenodo Available from: 10.5281/zenodo.10606595 (2024).
  • 98.Lee Rodgers J, Nicewander WA. Thirteen ways to look at the correlation coefficient. Am. Stat. 1988;42:59–66. doi: 10.1080/00031305.1988.10475524. [DOI] [Google Scholar]
  • 99.Bowley AL. The standard deviation of the correlation coefficient. J. Am. Stat. Assoc. 1928;23:31–34. doi: 10.1080/01621459.1928.10502991. [DOI] [Google Scholar]
  • 100.Yee VC, et al. Crystal structure of a 30 kDa C-terminal fragment from the gamma chain of human fibrinogen. Structure. 1997;5:125–138. doi: 10.1016/S0969-2126(97)00171-8. [DOI] [PubMed] [Google Scholar]
  • 101.Wilson EB. Probable inference, the law of succession, and statistical inference. J. Am. Stat. Assoc. 1927;22:209–212. doi: 10.1080/01621459.1927.10502953. [DOI] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Peer Review File (5.1MB, pdf)
Reporting Summary (1.6MB, pdf)

Data Availability Statement

The main summary result tables resulting from this analysis are available in the following repository: https://github.com/bartongroup/FRAGSYS (10.5281/zenodo.10606595)97.

Software developed to carry out this analysis is also found in our GitHub repository: https://github.com/bartongroup/FRAGSYS (10.5281/zenodo.10606595).


Articles from Communications Biology are provided here courtesy of Nature Publishing Group

RESOURCES