ABSTRACT
Proteins form arguably the most significant link between genotype and phenotype. Understanding the relationship between protein sequence and structure, and applying this knowledge to predict function, is difficult. One way to investigate these relationships is by considering the space of protein folds and how one might move from fold to fold through similarity, or potential evolutionary relationships. The many individual characterisations of fold space presented in the literature can tell us a lot about how well the current Protein Data Bank represents protein fold space, how convergence and divergence may affect protein evolution, how proteins affect the whole of which they are part, and how proteins themselves function. A synthesis of these different approaches and viewpoints seems the most likely way to further our knowledge of protein structure evolution and thus facilitate improved protein structure design and prediction.
Keywords: protein folds, protein fold switches, protein similarity networks, protein structure networks, protein evolution
I. INTRODUCTION
(1). Background
Many attempts have been made to map how different proteins are evolutionarily and structurally related; that is, to visualise protein space (Maynard‐Smith, 1970; Skolnick et al., 2009; Holm & Sander, 1996a,b; Dokholyan, Shakhnovich & Shakhnovich, 2002; Edwards & Deane, 2015; Cui et al., 2016; Valavanis, Spyrou & Nikita, 2010; Çamoĝlu, Can & Singh, 2006; Meyerguz, Kleinberg & Elber, 2007; Cao & Elber, 2010; Hoffmann, Wrabl & Hilser, 2016; Grahnen et al., 2011; Chi et al., 2018; Kleinman et al., 2010). These maps can be roughly divided into three types of networks: those based on similarity of sequence or structure (Skolnick et al., 2009; Holm & Sander, 1996a,b; Dokholyan et al., 2002; Edwards & Deane, 2015), those predicting the tolerance of mutations based on known proteins (Meyerguz et al., 2007; Cao & Elber, 2010; Hoffmann et al., 2016) and those based on physics principles (Grahnen et al., 2011; Chi et al., 2018). These types of networks will be referred to as similarity, informational, and physics‐based (respectively) in this review.
These visualisations are required to understand protein structure as differences in protein sequence and structure are correlated, but not as closely as might be expected (Bryan & Orban, 2010; Roessler et al., 2008; Alexander et al., 2007). That is, until recently, it has been difficult to predict protein structure from sequence alone. Small changes in sequence can often produce dramatic changes in structure and alteration or loss of function, and the consequences of these remain difficult to predict. Conversely, proteins with very low sequence identity can perform the same function and exist in a similar conformation. This makes it difficult to define and classify protein structures confidently and to determine mutational pathways that connect folds.
The complex nature of protein structure is further demonstrated by the ambiguity around what constitutes a protein fold, and how protein structures can be hierarchically organised. According to the two main hierarchical classifications of the Protein Data Bank (PDB), there are around 1200 to 1400 different folds (Fox, Brenner & Chandonia, 2014; Csaba, Birzele & Zimmer, 2009). A protein fold can generally be defined as a structure shared by a number of protein sequences that do not necessarily share an evolutionary history. The degree to which different protein domains must be similar to be considered the same fold varies depending on the hierarchy of structural organisation used. The main organisational hierarchies for protein structures are Structural Classification of Proteins [SCOP/SCOP extended (SCOPe)] (Fox et al., 2014) and Class, Architecture, Topology and Homologous Superfamily (CATH) (Knudsen & Wiuf, 2010). Both involve four layers of increasing structural similarity – class, fold, superfamily, and family in SCOP/SCOPe and class, architecture, topology and homologous superfamily in CATH. However, when treated as labelled trees, their structural groupings only agree in about 70% of cases (Csaba et al., 2009). Many other structural hierarchies exist, with a notable example being ECOD (Evolutionary Classification of Protein Domains) which, as its name suggests, focuses on evolutionary relationships (Cheng et al., 2014). This has five levels of similarity: architecture, possible homology, homology, topology and family.
If we are to construct a model of protein space, it would be helpful to know the extent to which our current library of proteins is representative of this space. Some small fraction of the approximately 20^n possible amino acid sequences code for viable proteins (where n is the length of a protein). It may not be necessary to identify all ‘foldable sequences’ if we can instead identify the set of folds it is possible for them to take (Zhang et al., 2006; Cossio et al., 2010; Taylor et al., 2009; Dai & Zhou, 2011; Skolnick, Zhou & Brylinski, 2012). First, using some calculation of protein structural similarity, proteins can be clustered into those of the same fold. Then, a database of pseudoproteins can be constructed in silico to compare to these real proteins. If there are equivalent folds in the existing protein group for each pseudoprotein, we can say with some confidence that it is likely our understanding of protein fold space is complete. This technique has been used a number of times with different methods of pseudoprotein construction and similarity assessment, with varied results (Zhang et al., 2006; Cossio et al., 2010; Taylor et al., 2009; Dai & Zhou, 2011; Skolnick et al., 2012).
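The completeness check described above can be sketched in a few lines. This is only a toy illustration: the function names, the string encoding of structures (H, E) and the similarity measure are all assumptions made for the example, not the methods used in the cited studies, which rely on dedicated structural alignment scores.

```python
from typing import Callable, Sequence


def fold_coverage(
    pseudo: Sequence[str],
    real: Sequence[str],
    similarity: Callable[[str, str], float],
    threshold: float,
) -> float:
    """Fraction of pseudoproteins with at least one real structure
    scoring at or above the similarity threshold."""
    if not pseudo:
        raise ValueError("need at least one pseudoprotein")
    matched = sum(
        1 for p in pseudo if any(similarity(p, r) >= threshold for r in real)
    )
    return matched / len(pseudo)


def toy_similarity(a: str, b: str) -> float:
    """Hypothetical stand-in for a structural score: fraction of matching
    positions between two secondary-structure strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / longest


# Toy data: structures encoded as secondary-structure strings (assumption).
real_structures = ["HHHHEEEE", "EEEEEEEE", "HHHHHHHH"]
pseudo_structures = ["HHHHEEEH", "EEEEEEEE", "HEHEHEHE"]

coverage = fold_coverage(pseudo_structures, real_structures, toy_similarity, 0.8)
print(f"pseudoproteins with a real equivalent: {coverage:.2f}")
```

A coverage close to 1 would correspond to the situation in which every pseudoprotein has an equivalent among known structures; the result depends strongly on the similarity measure and threshold chosen.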
Protein structural change on a large scale follows a number of patterns. Convergence in protein structure evolution occurs when multiple protein sequences from different origins evolve to the same fold, potentially due to its robustness or fitness. Divergence is the opposite of this; the evolutionary direction is primarily away from an ancestral protein fold. Past attempts to map protein space have provided evidence for the influence of both of these patterns on protein evolution (Holm & Sander, 1996a; Meyerguz et al., 2007; Cao & Elber, 2010; Dokholyan et al., 2002; Meunier & Broz, 2017; Kaur et al., 2018).
Additionally, some maps of protein space have considered proteins as elements of a whole, or at least contributors to the same enzyme pathway or system (Goldstein, 2011, 2013; Zeldovich & Shakhnovich, 2008; Orlenko et al., 2016). Nodes still represent proteins, but restrictions are enforced to ensure their functionality as part of a pathway or organism. This area of investigation is in its infancy but some observations are still possible.
This review describes the ways that protein structure and sequence are correlated (as well as exceptions to these rules), then explores the different models available for protein structural space along with the challenges they face. We discuss the ways these models have informed potential answers to important questions concerning protein evolution and proteins, such as: is our understanding of protein space complete, what patterns can be used to describe the evolution of proteins, how does protein evolution affect pathways and organisms, and how does the evolution of proteins affect their functionality? Finally, a best‐of‐all‐worlds approach to planning an exhaustive model of protein structure evolution is proposed, considering perspectives and ideas from as many different sources as possible.
(2). Correlating protein sequence, structure and function
Differences in protein structure and sequence have been shown to be correlated, for the most part, in a non‐linear manner. Chothia & Lesk (1986) provided evidence for this, using root mean square deviation (RMSD) as a measure of structural divergence and percentage residue identity for sequence differences in homologous proteins. An exponential model fitted the data well within the range examined, where two pairs of proteins with high percentage residue identity have small differences in structural similarity, but two pairs with low percentage residue identity have large differences in structural similarity. For example, through this model, a pair of proteins with 80% residue identity may only have an RMSD 0.2 Å greater than a pair with 90% residue identity, but a pair of proteins with 20% residue identity may have an RMSD 0.6 Å greater than a pair with 30% residue identity. This was an expansion of earlier, more specific work concerning sequence–structure relationships in globins (Lesk & Chothia, 1980) and immunoglobulins (Lesk & Chothia, 1982).
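To illustrate why an exponential relationship produces this pattern, the sketch below assumes a model of the form RMSD = a · exp(b · H), where H is the fraction of non‐identical residues. The parameter values a and b are placeholders chosen for illustration only; they are not the fitted values reported by Chothia & Lesk (1986).

```python
import math


def expected_rmsd(identity_pct: float, a: float = 0.8, b: float = 1.8) -> float:
    """RMSD (in angstroms) under an assumed exponential model.

    a and b are illustrative placeholders, not fitted parameters.
    """
    if not 0.0 <= identity_pct <= 100.0:
        raise ValueError("identity must be a percentage between 0 and 100")
    non_identical_fraction = 1.0 - identity_pct / 100.0
    return a * math.exp(b * non_identical_fraction)


# The same 10-point drop in identity costs more RMSD at low identity.
gap_high_identity = expected_rmsd(80) - expected_rmsd(90)
gap_low_identity = expected_rmsd(20) - expected_rmsd(30)

print(f"80% vs 90% identity: {gap_high_identity:.2f} A")
print(f"20% vs 30% identity: {gap_low_identity:.2f} A")
```

Under any model of this form, the ratio between the two gaps is exp(0.6 · b), independent of a, so the low‐identity gap is always the larger one when b is positive.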
Rahman & Rackovsky (1995) successfully correlated distance matrices representing protein sequences and structure and found secondary structure element propensity and side chain bulk to be important in this correlation. Wilson, Kreychman & Gerstein (2000) showed a similar relationship with a significantly broader database of proteins.
As our knowledge of protein structure and sequence grows, so do the complexities that must be considered when investigating them. Exceptions to the exponential correlation rule have been explored for decades (Rost, 1999). They are particularly numerous in the so‐called ‘Twilight Zone’ of protein sequence alignment, between percentage sequence identity of 20 and 35%. Sequence–structure relationships in the twilight zone become more predictable when considering intermediate sequences, the length of alignments and emphasising identity over similarity (Rost, 1999). More specific exceptions have been found since (Bryan & Orban, 2010; Roessler et al., 2008; Alexander et al., 2007), leading us to question the universal applicability of this rule. For example, proteins have been found that switch folds (Bryan & Orban, 2010), proteins have been designed with high similarity in sequence but different structure and function, and proteins with completely different sequences have been found to have similar structures (Roessler et al., 2008; Alexander et al., 2007).
Protein fold switches are a well‐documented phenomenon in nature (Bryan & Orban, 2010). For example, enzyme function is reliant on structural change, from small‐scale active site flexibility to complete conformational change (Tokuriki & Tawfik, 2009). Lymphotactin, mitotic arrest deficiency 2 protein and chloride intracellular channel protein all switch folds to perform their functions (Bryan & Orban, 2010). Lymphotactin binds two different targets in the human body depending on whether it is in its monomeric chemokine fold, or dimeric β‐sandwich fold, the equilibrium between which is dependent on salt and temperature levels (Fig. 1). The function of the mitotic arrest deficiency 2 protein (monitoring mitosis to ensure sufficient microtubules are available for the process to occur) depends on its ability to switch folds to reveal an active site through catalysis. The N domain of the chloride intracellular channel protein switches from an α‐β‐α group to a group of three α‐helices under oxidative conditions. Porter & Looger (2018) identified almost 100 extant fold‐switching proteins in the PDB (including those described above) and used their shared features to predict that there are many more to be found. Prions are other examples of fold‐switching proteins; their pathology is linked to a transition from a soluble form to an aggregated form.
Fig. 1.

Lymphotactin exists in its monomeric (left) and dimeric (right) forms in approximately equal amounts in vivo (Bryan & Orban, 2010). In the transition from monomer to dimer, the C‐terminus helix becomes unstructured and the N‐terminus strands form new hydrogen bonds. An unstructured region at the N‐terminus also forms a new strand. The equilibrium between these two forms shifts with changes in salt and temperature levels.
Intrinsically disordered proteins represent naturally occurring conformationally diverse molecules (Tokuriki & Tawfik, 2009). As their name suggests, these are proteins for which disorder is inherent to their structure and, often, their function. This property may be present along the entire chain, or merely in small parts of the protein.
Further exceptions to the correlation between sequence and structure can be found in the form of proteins with high residue identity and different structure and/or function. Alexander et al. (2007) found that a domain consisting of three α‐helices and one consisting of an α‐helix and a β‐sheet that bound different ligands could be progressively mutated from 16 to 88% similarity whilst retaining unique functional and structural attributes, building on previous work on the same molecules (Alexander et al., 2005; He et al., 2005). This level of identity was later raised to 95% whilst still maintaining separate structures (He et al., 2008). Alexander et al. (2009) presented a paper demonstrating a mutational pathway from one of these proteins to the other, including a bi‐functional protein and a single point mutation that changed the sequence from a strong preference for one fold to the other. Many other examples of fold‐switch mutations exist, including a series of three different substitution mutations that take one of the proteins of interest in this pathway three times back and forth between two folds (Stewart et al., 2013; Anderson, Cordes & Sauer, 2005; He et al., 2012; Porter et al., 2015).
Similarly, Roessler et al. (2008) found a naturally occurring pair of Cro proteins (a family of bacteriophage transcription factors) with 40% sequence identity and completely separate folds. Subsequently, this pair was used to design two hybrid proteins to demonstrate how protein evolution may proceed through fold changes (Kumirov et al., 2018). Each of these hybrid proteins was about 85% sequentially similar and very structurally similar to the protein on which it was based, about 70% sequentially similar to the other hybrid and about 55% sequentially similar to the original protein on which it was not based (see Fig. 2). Progression from one of the original, naturally occurring proteins to the other involves abrupt global structural change, gradual loss of α‐helices and gain of β‐sheets, and a chameleon section of sequence that could form two different structural patterns depending on the surrounding residues. More generally, Chen, Meller & Elber (2016) presented a way of detecting fold‐switch mutations in silico. They found that such mutations were theoretically most likely in the N‐ or C‐terminal region of the examined protein, which seems likely to be related to the varied contact patterns in these regions.
Fig. 2.

Progression from Cro protein Xfaso 1 to Cro protein Pfl 6 via two hybrids. These two naturally occurring and two designed proteins demonstrate a possible path for a fold switch. Xfaso 1 shares 85% sequence identity (ID) with XPH1, and is structurally very similar. XPH1 is about 70% sequentially similar to XPH2, although the transition between them involves loss of alpha helices and gain of beta sheets. XPH2 is, once again, about 85% sequentially similar to Pfl 6, and structurally very similar. The path via which this fold switch could occur is demonstrated by the bold arrows. The original Cro proteins share about 40% of their sequence, and each hybrid shares 55% of its sequence with the original protein it is not based on.
Two elements of the Cro pathway mentioned above were also found with only 26% sequence identity and identical structure and function (Roessler et al., 2008). Examples of sequences of low similarity with the same fold are, to some extent, expected, due to the relative lack of diversity in protein folds (numbering in the thousands) compared with protein sequences (astronomical).
Substitution mutations are by no means the only mutations that are relevant to the discussion of protein structure, sequence and function. Residue insertions and deletions often have more dramatic effects than substitutions and are important to consider, although they have been explored in the literature to a lesser extent. Research into these kinds of changes has been focused on demonstrating tolerance rather than observing change. Since a protein fold's tolerance of mutations is representative of its sequence capacity (the number of sequences that can take its shape), this still has important functional implications. One study on the effect of single‐residue deletions in green fluorescent protein found almost half were tolerated, and that some even increased fluorescent function (Arpino et al., 2014). In a similar experiment, out of 32 insertion and deletion events performed upon a lysozyme enzyme (involving up to four residues), only one failed to produce a protein, demonstrating the ability of this protein fold to tolerate extreme changes in sequence (Vetter et al., 1996). A survey of insertions and deletions that occurred between human, mouse and rat protein‐coding genes found that structural cores and transmembrane domains were less tolerant to these mutations than loop regions and other areas of low structural complexity (Taylor, Ponting & Copley, 2004).
There is evidence that perhaps the phenomenon of mutation tolerance (that is, mutations that allow proteins to maintain the same structure and, potentially, function), can be explained by the limitations of the experiments that are conducted (Rockah‐Shmuel, Tóth‐Petróczy & Tawfik, 2015). An experiment performed on a DNA methyltransferase found that with multiple generations and accumulation of mutations (arguably more representative of natural processes), deleterious effects of seemingly neutral mutations are revealed (Rockah‐Shmuel et al., 2015). Proteins in nature may therefore deviate in their tolerance over generations and not necessarily align with theoretical expectations.
In spite of the complex relationship between them, recent developments have revolutionised the potential to predict protein structure from sequence. CASP, an international effort to examine the state of the art of protein structure prediction, has demonstrated various incarnations of AlphaFold to be extremely effective tools (Jumper et al., 2021). In the CASP14 set, AlphaFold's median backbone deviation from solved structures was less than the width of a carbon atom. However, this has an inherent reliance on evolutionarily related protein sequences, so its capacity to predict the effects of new mutations has been shown to be limited (Buel & Walters, 2022). Further developments in this tool may seek to solve this problem, possibly through increased reliance on the physical properties of residues.
Even if the relationships between protein sequence, structure and function were quantifiable and predictable, they would still be extremely complex. This necessitates considering the protein universe as a whole rather than a collection of related amino acid sequences; network approaches are well suited to this task.
II. NETWORKS
A number of publications have considered the topic of visualising protein structure space; that is, mapping how amino acid changes occur in and influence protein structures (Skolnick et al., 2009; Holm & Sander, 1996a,b; Dokholyan et al., 2002; Edwards & Deane, 2015; Cui et al., 2016; Valavanis et al., 2010; Çamoĝlu et al., 2006; Meyerguz et al., 2007; Cao & Elber, 2010; Hoffmann et al., 2016; Grahnen et al., 2011; Chi et al., 2018; Kleinman et al., 2010). Many of these take the form of networks. Similarity networks involve the construction of networks with protein folds as nodes (Skolnick et al., 2009; Holm & Sander, 1996a,b; Dokholyan et al., 2002; Edwards & Deane, 2015) (A in Fig. 3). Then, alignment is used to create edges between nodes based on structural similarity, sequential similarity, or a combination of the two. These edges can be directional or non‐directional, depending on how similarity is assessed. Informational networks are also based on folds as nodes; however, directional edges are formed based on the potential for sequences from one fold to mutate to favour another (Meyerguz et al., 2007; Cao & Elber, 2010; Hoffmann et al., 2016) (B in Fig. 3). The energy calculations that determine these are reliant on data about residue placement and mutations in pre‐existing proteins. Physics‐based investigations also identify potential directional edges through the probability of a protein sequence migrating from one fold to another through mutation, but in this case the necessary calculations are based on first‐principles physics and the physical properties of the residues involved, without relying on knowledge from existing proteins (Grahnen et al., 2011; Chi et al., 2018). These are usually used to model the potential evolution of a few proteins rather than map all known protein structures, as they are very computationally expensive (C in Fig. 3).
Fig. 3.

Different network visualisations of fold space applied to the same four‐subdomain ‘proteins’. ‘Proteins’ that take the same fold are encircled, with different background colours for different folds. Connections labelled (A), (B), and (C) are different types of networks. Subdomains of the same sequence are represented by the same shapes. All subdomains can be considered structurally equivalent for the purpose of this diagram. (A) Similarity connections represent instances wherein some of the proteins that take one fold are similar to those that take another (in this case, structurally). (B) Informational connections are directional and represent instances wherein a protein sequence originally taking one fold could be ‘lost’ to another fold via residue changes. These networks generally do not retain sequence information, as each edge represents the migration of a high volume of sequences from one fold to another. (C) Physics‐based connections also represent instances where one protein sequence could mutate to a different fold, but the likelihood of this happening is calculated based on first‐principles physics. Some are found through simulation to be more likely than others, as represented in this case by arrow thickness. Physics‐based models are complex; their purpose is generally to look at how accurately we can model protein evolution rather than to answer questions about overall protein fold space (for which simpler but more broadly applicable models are generally used).
Many publications are centred around establishing the nature of the protein universe according to a number of network patterns, most often variations on a scale‐free network. Scale‐free networks are defined by the relationship between an integer, k, and the fraction of nodes with k edges (connections to other nodes). In a scale‐free network, the fraction of nodes with k edges can be modelled by applying a negative exponent to k. This is known as power‐law behaviour, and is often associated with a tree‐like pattern of connections.
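As a rough illustration of power‐law behaviour, the sketch below computes the fraction of nodes with k edges from a list of node degrees and estimates the exponent by a least‐squares fit on log‐transformed values. The toy degree list and function names are assumptions for the example, and a log‐log fit is a crude estimator that is used here only for simplicity.

```python
import math
from collections import Counter


def degree_fractions(degrees):
    """Map each degree k to the fraction of nodes that have k edges."""
    counts = Counter(degrees)
    total = len(degrees)
    return {k: count / total for k, count in sorted(counts.items())}


def fit_power_law_exponent(fractions):
    """Estimate gamma in P(k) ~ k**(-gamma) from the slope of
    log P(k) against log k (simple least squares)."""
    points = [
        (math.log(k), math.log(p)) for k, p in fractions.items() if k > 0 and p > 0
    ]
    if len(points) < 2:
        raise ValueError("need at least two distinct positive degrees")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return -sxy / sxx


# Toy degree list in which the node count falls off exactly as k**(-2).
toy_degrees = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
gamma = fit_power_law_exponent(degree_fractions(toy_degrees))
print(f"estimated exponent: {gamma:.2f}")
```

Real degree distributions are noisy and finite, so in practice a fit like this would only suggest, rather than establish, scale‐free behaviour.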
‘Small world’ networks are characterised by the number of edges, p, that make up the typical shortest path between two randomly selected nodes. In these networks, p is proportional to the log of the number of nodes. For example, in a small world network of 10,000 nodes, the distance between most node pairs would only be about twice the equivalent distance in a small world network of 100 nodes. These graphs are characterised as about halfway between highly clustered/poorly connected and poorly clustered/highly connected graphs and are often scale‐free as well.
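The arithmetic behind this example follows directly from the logarithmic scaling. The short sketch below assumes only that the typical shortest path length is proportional to the logarithm of the number of nodes, with the same proportionality constant in both networks; the function name is hypothetical.

```python
import math


def path_length_ratio(nodes_large: int, nodes_small: int) -> float:
    """Ratio of typical shortest path lengths in two small world networks,
    assuming path length is proportional to log(number of nodes)."""
    if nodes_large < 2 or nodes_small < 2:
        raise ValueError("each network needs at least two nodes")
    return math.log(nodes_large) / math.log(nodes_small)


ratio = path_length_ratio(10_000, 100)
print(f"path length ratio: {ratio:.1f}")
```

A hundredfold increase in network size therefore only doubles the typical distance between nodes under this assumption.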
(1). Similarity networks
(a). Structural similarity networks
A network of similarity with regards to structure can be made based on any protein structure alignment (PSA) method (see A in Fig. 3), with edges representing structural similarity between proteins.
One such study described the arrangement of protein space through the number of protein pairs that are kth neighbours (e.g. second neighbours are proteins connected by one other protein but not to each other) and the size of the largest strongly connected component (LSCC; a sub‐graph in which every pair of vertices is connected, in both directions) (Skolnick et al., 2009). Networks were constructed at a number of template modelling score (TM‐score) thresholds for edges between proteins. These were calculated through TM‐align (Zhang & Skolnick, 2005), a PSA method that quantifies the similarity in amino acid arrangement and secondary structure arrangement (α‐helix and β‐sheet position) between two proteins, and then corrects for length.
It was found across all thresholds that short (40–80 amino acids long) and long (170+ amino acids long) proteins were underrepresented in the set of proteins included in the LSCC (Skolnick et al., 2009). This makes logical sense, as proteins of a moderate rather than unusual size would be more likely to be similar to each other. It was also found that domains of size greater than 300 amino acids could act as bridges between the largest sub‐graph and molecules of underrepresented length, as parts of these proteins shared structural similarity with proteins from both sections of the graph. The model used allowed for gaps at the start and end of protein sequences; thus, smaller proteins could be aligned to parts of larger proteins without penalty.
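The general idea of building a network at several score thresholds and tracking its largest component can be sketched as follows. The pairwise scores are invented toy values, not TM‐align output, and the edges are treated as undirected, in which case a strongly connected component reduces to an ordinary connected component. The function names are hypothetical.

```python
from collections import deque


def threshold_graph(scores, threshold):
    """Build an undirected adjacency map with an edge for every pair
    whose similarity score is at or above the threshold."""
    adjacency = {}
    for (a, b), score in scores.items():
        adjacency.setdefault(a, set())
        adjacency.setdefault(b, set())
        if score >= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def largest_component(adjacency):
    """Return the node set of the largest connected component (BFS)."""
    seen, best = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Toy pairwise similarity scores between five hypothetical proteins.
toy_scores = {
    ("A", "B"): 0.62,
    ("B", "C"): 0.55,
    ("A", "C"): 0.35,
    ("C", "D"): 0.42,
    ("D", "E"): 0.30,
}

component_sizes = {
    t: len(largest_component(threshold_graph(toy_scores, t))) for t in (0.4, 0.5, 0.6)
}
print(component_sizes)
```

Raising the threshold removes edges, so the largest component can only shrink or stay the same size as the cutoff becomes stricter.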
Another study (Edwards & Deane, 2015) considered not only TM‐align, but three other methods as well: Mammoth (Ortiz, Strauss & Olmea, 2009), FATCAT (Veeramalai & Gilbert, 2008) and Elastic Shape Analysis (ESA) (Srivastava et al., 2016). Despite extensive standardisation, a high degree of variation between the networks yielded by each of these four methods was found (see Fig. 4). Twenty to thirty percent of the edges in the TM‐based network appeared only in that network; for each of the other three alignment methods, approximately 50% of edges were unique to that method. This is important to consider when interpreting the results of the TM‐based network described above, as although TM‐score was the most consistent with the others, 20–30% of its edges could still depend on the alignment method used.
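A comparison of this kind amounts to asking, for each method, what fraction of its edges appear in no other method's network. The sketch below shows one way to compute that; the edge sets are invented toy data and do not reproduce the networks of Edwards & Deane (2015), and the helper names are hypothetical.

```python
def edge(a, b):
    """Undirected edge as an order-independent, hashable pair."""
    return frozenset((a, b))


def unique_edge_fraction(networks):
    """For each method, the fraction of its edges found in no other method."""
    result = {}
    for name, edges in networks.items():
        others = set().union(
            *(other for other_name, other in networks.items() if other_name != name)
        )
        result[name] = len(edges - others) / len(edges) if edges else 0.0
    return result


# Toy edge sets keyed by alignment method (illustrative only).
toy_networks = {
    "TM-align": {edge("A", "B"), edge("B", "C"), edge("C", "D"), edge("D", "E")},
    "Mammoth": {edge("A", "B"), edge("B", "C"), edge("E", "F"), edge("F", "G")},
    "FATCAT": {edge("A", "B"), edge("C", "D"), edge("G", "H"), edge("H", "I")},
    "ESA": {edge("B", "C"), edge("C", "D"), edge("I", "J"), edge("J", "K")},
}

unique_fractions = unique_edge_fraction(toy_networks)
for method, fraction in unique_fractions.items():
    print(f"{method}: {fraction:.0%} of edges unique")
```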
Fig. 4.

Structural similarity networks for four different protein structure alignment (PSA) methods. Note the differing network shapes between different protein structure alignment methods and the ‘network collapse’ as the similarity score threshold for edges (shown in the top row) is increased. Figure reproduced from Edwards & Deane (2015) in accordance with the Creative Commons Attribution (CC BY) license.
Edwards & Deane (2015) also explored the links between structure and time of fold emergence through projection of estimated fold age onto the structural networks. It was found that connected fold pairs had more similar estimated ages than unconnected ones. This trend was found to be more apparent when multiple alignment methods predicted the connection; the median age difference of folds that were connected in at least two networks dropped to zero. This implies that folds that emerged at approximately the same time are more likely to be structurally similar. Whether this also meant that the folds were related was not explored. Highly connected, central folds were also found to be much older than peripheral ones.
Building on previous network work (Holm & Sander, 1996a,b), Dali was used to create networks of proteins with variable similarity score cutoffs for edges (Dokholyan et al., 2002). The database consisted of proteins with no more than 25% pairwise sequence similarity, which were then clustered. Random networks using the same number of nodes were also created with the same score distribution but different edges as a control. It was found that the graph representing the actual structural similarity between proteins showed scale‐free behaviour, but the control did not. This behaviour is associated with a tree‐like network, with an explosion of nodes from a few central points. In a separate study, Dali was used to create a protein similarity map in three dimensions (Hou et al., 2003). This separated the classes well, and seemed to sort folds by complexity. The separation of α/β folds from a plane containing other fold types was suggested to indicate the later evolution of this class of proteins, as would be expected. This link between protein fold age and their position in this three‐dimensional space was tested with the fold usage of three clades of varying evolutionary ages, two eubacterial and one archaeal, and the results were consistent.
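A control of the kind described above, which keeps the nodes and the overall score distribution but changes which pairs carry which score, could be sketched by shuffling scores across pairs. This is an assumption about how such a randomisation might be done, not the procedure used by Dokholyan et al. (2002); the scores and function name are hypothetical.

```python
import random


def shuffled_control(scores, seed=0):
    """Reassign the observed scores to randomly chosen pairs.

    The set of pairs and the multiset of scores are preserved, so any
    threshold yields the same number of edges, but between different nodes.
    """
    rng = random.Random(seed)
    pairs = list(scores)
    values = list(scores.values())
    rng.shuffle(values)
    return dict(zip(pairs, values))


# Toy pairwise similarity scores between four hypothetical proteins.
observed_scores = {
    ("A", "B"): 9.1,
    ("A", "C"): 2.3,
    ("A", "D"): 1.8,
    ("B", "C"): 7.4,
    ("B", "D"): 0.9,
    ("C", "D"): 6.2,
}

control_scores = shuffled_control(observed_scores, seed=1)
print(control_scores)
```

Comparing the degree distribution of the observed network with that of several such controls gives a baseline for judging whether an apparent scale‐free pattern is more than a by‐product of the score distribution.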
Friedberg & Godzik (2005) looked at the structural similarity of fragments within proteins rather than the overall similarity of protein domains. These fragments were 5, 10, 15 or 20 residues in length and a score was constructed for the similarity of protein domains based on the proportion of fragments they had in common. The rationale presented for this study was that small areas of common structural similarity may reveal more about the relatedness of proteins (particularly with regard to function) than overall similarity, in accordance with a continuous view of protein fold space. Functional similarity between folds was quantified with a score based on the frequency and overlap of gene ontology terms. Significant correlation was found between the structural fragment similarity score and this functional similarity of folds, which supported this rationale. Structurally compact fragments of length 10, 15 and 20 were found to have higher fold diversity (they were found in a greater number of folds) than their more dispersed counterparts, supporting a ‘building block’ view of proteins. Also, power‐law behaviour (see above) was observed in the connectivity of the graph.
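A score based on the proportion of shared fragments can be sketched with a simple set overlap. The example below is a deliberate simplification: it encodes each domain as a string of structural states and treats fragments as shared only when they match exactly, whereas the study compared fragments by structural similarity. The encoding, the Jaccard index and the function names are all assumptions for illustration.

```python
def fragments(encoded_domain: str, length: int) -> set:
    """All distinct overlapping windows of the given length."""
    return {
        encoded_domain[i : i + length]
        for i in range(len(encoded_domain) - length + 1)
    }


def shared_fragment_score(domain_a: str, domain_b: str, length: int = 5) -> float:
    """Jaccard index of the two fragment sets (0 = none shared, 1 = identical)."""
    frags_a = fragments(domain_a, length)
    frags_b = fragments(domain_b, length)
    if not frags_a or not frags_b:
        return 0.0
    return len(frags_a & frags_b) / len(frags_a | frags_b)


# Toy domains written as strings of structural states (H, E, L); assumption.
domain_1 = "HHHHHEEEEE"
domain_2 = "HHHHHLLEEEEE"

score = shared_fragment_score(domain_1, domain_2, length=5)
print(f"shared fragment score: {score:.3f}")
```

Here the two toy domains share only their all‐helix and all‐strand fragments, which gives a low score even though both contain the same two secondary structure elements.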
(b). Co‐occurrence networks
Quaternary structure (the joining of multiple single‐domain proteins together to form a functional unit) is an extremely common element in the bank of known proteins (Tung et al., 2016). Some work has been done incorporating information about this into networks of single domains (referred to as ‘co‐occurrence networks’). Wuchty (2001) built networks of single‐domain nodes wherein edges were drawn if the two domains occurred in at least one of the same proteins. These can be considered in a similar way to the above structure‐based networks, but instead of global approximate similarity in single‐domain proteins, edges are determined by local exact similarity in multi‐domain proteins. Both interspecies networks and networks of a single species' proteome were found to display scale‐free behaviour, albeit with different characteristics. Unlike the models in which edges were based on structural similarity, no evidence was found that older domains were more likely to be highly connected and central. It seemed the opposite was true, with newly evolved folds exhibiting higher connectivity. Highly connected folds were also more likely to be involved in signalling pathways.
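The construction rule for a co‐occurrence network (an edge whenever two domains appear in at least one of the same proteins) can be sketched as below. The protein identifiers and domain architectures are invented toy data, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(proteins):
    """Undirected edges between domains found together in any protein."""
    edges = set()
    for domains in proteins.values():
        # sorted(set(...)) removes repeats and gives each pair a fixed order
        for a, b in combinations(sorted(set(domains)), 2):
            edges.add((a, b))
    return edges


def node_degrees(edges):
    """Number of distinct co-occurrence partners for each domain."""
    degrees = Counter()
    for a, b in edges:
        degrees[a] += 1
        degrees[b] += 1
    return dict(degrees)


# Toy architectures: protein identifier -> ordered list of domains.
toy_proteins = {
    "P1": ["SH3", "SH2", "Kinase"],
    "P2": ["SH2", "Kinase"],
    "P3": ["PH", "Kinase"],
    "P4": ["SH3"],
}

edges = cooccurrence_edges(toy_proteins)
degrees = node_degrees(edges)
print(sorted(edges))
print(degrees)
```

The resulting degree list is what would be examined for scale‐free behaviour; in this toy case a single domain acts as the hub.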
A similar study focused on comparing domain networks between 53 different species' genomes (16 archaeal, 30 bacterial, and seven eukaryotic) (Ye & Godzik, 2004). All were found to show scale‐free behaviour. The number of domains, number of combinations and the size of the giant component (a highly interconnected graph feature) were found to be correlated with the complexity of the organism in question. Sets of domain nodes formed clusters that were strongly connected internally but weakly connected to the rest of the graph. Some degree of functional homogeneity was observed within these, which was quantified and found to be inversely correlated with cluster size (i.e. greater functional homogeneity was observed in smaller clusters). This observation was used in functional annotation. A small number of common domain combinations (focused on functions such as DNA regulation) were found in all genomes examined. Areas of domain graph structural similarity unique to eukaryotes (involved in ubiquitination, DNA‐binding and RNA‐binding) and bacteria (involved in DNA replication and repair) were also found.
Another co‐occurrence domain network study completed by Kummerfeld & Teichmann (2009) included a directional component (edges preserved information concerning the order in which the domains occurred). Again, scale‐free topology was observed in the non‐directed graphs. In the directed graphs, both the in‐ and out‐degrees followed a power‐law (scale‐free) distribution. Restricting the graph to a single species did not affect this behaviour. Bidirectional pairs of domains (i.e. pairs that appeared in both possible orders) were found to occur more often than would be expected at random, as were domain clusters. Domains in clusters were, again, found to have common functionality in many cases; some were even found to belong to the same signalling pathway. Clusters also followed some logical evolutionary patterns (for example, most of those found in the human proteome were eukaryote‐specific).
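A directed version of such a network can be sketched as follows. The sketch assumes that an edge is drawn only between sequentially adjacent domains, read from the N‐ to the C‐terminus; this adjacency rule and the toy architectures are illustrative assumptions rather than details taken from Kummerfeld & Teichmann (2009).

```python
from collections import Counter


def directed_domain_edges(architectures):
    """Collect directed edges (a, b) where domain a is immediately followed by b."""
    edges = set()
    for domains in architectures:
        for a, b in zip(domains, domains[1:]):
            if a != b:  # ignore tandem repeats of the same domain
                edges.add((a, b))
    return edges


def degree_counts(edges):
    """Return (in-degree, out-degree) counters for a set of directed edges."""
    in_degree = Counter(target for _, target in edges)
    out_degree = Counter(source for source, _ in edges)
    return in_degree, out_degree


def bidirectional_pairs(edges):
    """Pairs of domains observed in both possible orders."""
    return {tuple(sorted(edge)) for edge in edges if (edge[1], edge[0]) in edges}


# Hypothetical domain architectures, each listed from N- to C-terminus
toy_architectures = [["A", "B", "C"], ["B", "A"], ["A", "C"]]

edges = directed_domain_edges(toy_architectures)
in_degree, out_degree = degree_counts(edges)
both_orders = bidirectional_pairs(edges)
```

Comparing the observed number of bidirectional pairs with the number found in randomised architectures would be one way to test whether such pairs are over‐represented.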
Levitt (2009) found similar patterns when looking at the properties of single domains when forming multiple‐domain architectures. They found that the number of domains included and the number of repeats of each domain both followed a power‐law rule.
Nepomnyachiy, Ben‐Tal & Kolodny (2014) looked at co‐occurrence from the perspective of both domains and motifs. An all versus all alignment of proteins with less than 70% sequence similarity was performed with the structural alignment program SSM. RMSD, alignment length and sequence similarity were used to assess whether a domain or motif was truly shared. At the laxest thresholds, protein space was found to be continuous (highly connected), but when requirements for similarity were stricter, a continuous and a discrete component were found. The continuous component was found to consist of many α/β domains.
(c). Sequence similarity networks
Some authors returned to the property of proteins that can be most easily measured and altered – the sequence. This also allows the use of more tools and well‐established procedures, due to its similarity to nucleotide sequence investigations.
Alva et al. (2010) used profile hidden Markov models to allow proteins to be attracted to those with similar sequences and repelled from those with dissimilar ones in a virtual two‐dimensional space. Clustering was then performed to look at potential relationships. Proteins from the same class seemed loosely clustered together, which was thought to be primarily due to similarities in composition. Some degree of similarity in sequence was found in proteins from vastly different superfamilies and folds; sequences of 20–40 amino acids were present in disparate groups.
Sequences can also be represented as a series of properties. Yu et al. (2013) present a way of representing protein sequences as points in 60‐dimensional space. This allows natural calculation of distances between protein sequences and can be used as a space for networks, which was illustrated using a protein kinase C data set. This group of proteins was well classified in this space (according to theoretical expectations), despite being difficult to classify with tree‐constructing methods. Residue sequences that fold to proteins seemed to occupy a specific part of this space, which led to a possible test of the potential for a given sequence to code for a functional protein (Yau et al., 2015). The boundaries of the space occupied by proteins were found to be reasonably resistant to change with new protein discoveries. The same group compared their natural vector method with a number of other sequence‐based methods for mapping and classifying protein space (Wan & Tan, 2019). Natural vector space was compared with a representation of protein sequence involving 10 property strings averaged over protein length (Rackovsky, 2009), a position‐specific scoring matrix (originally designed for functional annotation; Jeong, Lin & Chen, 2011) and a pseudo amino acid composition representation (Shen & Chou, 2008). It was found that the natural vector method outperformed others in classification and that convex hulls created well‐defined boundaries for this purpose.
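To make the 60‐dimensional representation concrete, the sketch below computes a vector of this kind for a short sequence. It assumes the commonly described form of the natural vector, in which each of the 20 amino acids contributes three values: its count, the mean of its positions, and a normalised second central moment of its positions. The exact normalisation used by Yu et al. (2013) is not given here, so this formulation should be treated as an assumption.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def natural_vector(sequence):
    """Return a 60-dimensional vector: 20 counts, 20 mean positions, 20 moments."""
    n = len(sequence)
    counts, means, moments = [], [], []
    for aa in AMINO_ACIDS:
        positions = [i + 1 for i, res in enumerate(sequence) if res == aa]
        n_k = len(positions)
        if n_k == 0:
            # Absent residues contribute zeros (assumed convention)
            counts.append(0.0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        # Second central moment, normalised by n_k * n (assumed normalisation)
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        counts.append(float(n_k))
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


def euclidean_distance(u, v):
    """Distance between two sequences represented as natural vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


example_vector = natural_vector("ACCA")
```

Because every sequence maps to a point in the same space regardless of its length, distances can be calculated without an alignment step.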
Sequence space can be modelled through mutual information (Wan, Zhao & Yau, 2017). These values were used to construct a matrix of similarities between sequences, which were then converted to adjacency in a network. The resulting network enabled detection of evolutionary relationships between proteins from different species, but did not seem to fare so well with identifying CATH structural classification (as is to be expected due to the unpredictable relationship between protein sequence and protein structure).
(d). Combined networks
Although structural and sequential similarity network research provides interesting insights into similarity relationships between proteins, the use of either property alone may not be enough to understand how protein structure space is organised. Çamoĝlu et al. (2006) produced a combined network involving both sequential and structural properties. Dali, combinatorial extension (CE) and vector alignment search tool (VAST) were used to assess structural similarity (through Z‐scores and p values) and position‐specific iterative basic local alignment search tool (PSI‐BLAST) and a hidden Markov model tool (HMMER) were used for sequence similarity (quantified through E‐values). Scores were normalised from 0 to 1 to form similarity networks of proteins and random walks were performed, followed by Bayesian classification. It was found that integrated networks (that is, networks considering more than one type of information) were more effective at classifying proteins into their SCOP categories than individual networks. This technique was also found to be more effective than rival classification techniques [kernel‐based support vector machine (SVM) classifier, a nearest‐neighbour classifier, and an integrated sum classifier].
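The idea of placing heterogeneous scores on a common 0 to 1 scale before integration can be sketched as below. Min–max scaling, the negative logarithm transform of E‐values and the simple averaging step are assumptions made for illustration; they are not the normalisation, random walk or Bayesian classification procedure used by Çamoĝlu et al. (2006), and the scores are hypothetical.

```python
import math


def min_max(values):
    """Rescale a list of scores linearly to the interval [0, 1]."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


def integrate_scores(z_scores, e_values):
    """Combine structural Z-scores and sequence E-values into one similarity.

    Higher Z-scores and lower E-values both indicate greater similarity, so
    E-values are converted with -log10 before rescaling.
    """
    structural = min_max(z_scores)
    sequence = min_max([-math.log10(e) for e in e_values])
    return [(s + q) / 2 for s, q in zip(structural, sequence)]


# Hypothetical scores for three protein pairs
combined = integrate_scores([2.0, 10.0, 6.0], [1e-2, 1e-10, 1e-6])
```

The combined values could then serve as edge weights in an integrated similarity network.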
Conversely, Valavanis et al. (2010) analysed separate structural and sequential networks of 311 proteins (296 structures) concurrently. Two sequence networks were constructed, one based on sequence properties (amino acid composition, predicted secondary structure, hydrophobicity, normalised van der Waals volume, polarity and polarisability) and the other on BLAST comparison. One structural network was constructed based on DaliLite Z‐scores. Fold and class structure were shown to be echoed in the networks to various extents (Valavanis et al., 2010). It was also shown that proteins from the α/β class had high ‘betweenness’; that is, they were very likely to connect to other proteins and act as hubs in the structure and sequence networks. This is consistent with Nepomnyachiy et al.’s (2014) finding that these types of domains were associated with continuous protein space. The main focus of this analysis was to establish the ‘small world’ nature of protein networks (Valavanis et al., 2010), which was successfully demonstrated.
(2). Informational networks
Networks of potential evolutionary relationships between protein structures can be produced with edges indicating the likelihood of a mutational connection based on knowledge regarding residue positions in existing protein domains (Meyerguz et al., 2007; Cao & Elber, 2010; Hoffmann et al., 2016). These networks are often referred to as informational models of protein evolution. Informational models, by their very nature, often have little ability to express sequence data explicitly, as each edge represents the migration of many sequences from one fold to another. However, they are based on the predicted effect of sequence mutations (derived from existing proteins), which is itself defined by the amino acid sequence.
One of the first such models was produced by Meyerguz et al. (2007). A matrix of contact energies was constructed from two variables, the amino acid type and the number of contacts, to estimate the energy of proteins following point mutations. Empirically calculated energy thresholds were used to determine whether a step in a protein mutation series was viable, and whether the protein had the potential to be ‘lost’ to another fold following mutations. Following these calculations, a network was formed with various thresholds for c, where c is the minimum fraction of sequences migrating from one fold to another represented by an edge between the two folds. Protein mutability was in this way based on the sequence capacity of each protein fold. This was calculated both with and without competition from other folds.
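The role of the threshold c can be illustrated with a small sketch. The fold names, sequence capacities and migration counts are hypothetical, and the fraction is assumed to be taken relative to the sequence capacity of the source fold; the energy calculations that would produce such counts are not modelled.

```python
def fold_migration_network(migration_counts, capacity, c):
    """Return directed edges (source, target) between folds.

    An edge is kept when the fraction of the source fold's sequences that
    migrate to the target fold is at least c.
    """
    edges = set()
    for (source, target), moved in migration_counts.items():
        if source == target or capacity[source] <= 0:
            continue
        if moved / capacity[source] >= c:
            edges.add((source, target))
    return edges


# Hypothetical sequence capacities (number of sequences compatible with each fold)
capacity = {"alpha": 1000, "beta": 500, "alpha_beta": 2000}

# Hypothetical numbers of sequences that migrate between folds after mutation
migration_counts = {
    ("alpha", "beta"): 50,
    ("alpha", "alpha_beta"): 5,
    ("beta", "alpha"): 1,
    ("alpha_beta", "beta"): 400,
}

strict_edges = fold_migration_network(migration_counts, capacity, c=0.01)
lax_edges = fold_migration_network(migration_counts, capacity, c=0.001)

# Folds that receive many incoming edges behave as 'sinks'
beta_in_degree = sum(1 for _, target in strict_edges if target == "beta")
```

Lowering c admits weaker mutational connections, so the apparent interconnectedness of the network depends directly on this choice.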
The major finding of this investigation was that protein space appeared a lot more mutationally interconnected than was originally thought (Meyerguz et al., 2007). It was also found that β‐sheet‐rich folds acted as protein ‘sinks’; that is, many of their surrounding proteins in sequence space were able to mutate to their conformation.
This model was revisited by Cao & Elber (2010). The library of folds was more thoroughly vetted for redundancies and three different methods of assessing mutation viability were used instead of the single one from Meyerguz et al. (2007), but a highly connected space was still observed. It was found that the distribution of incoming edges per fold was much wider than that of outgoing edges (Cao & Elber, 2010). That is, the number of other folds whose sequences can mutate into a given fold was observed to be more variable than the number of folds into which sequences from that fold can mutate. Nodes with up to 800 in‐edges were observed, but the maximum number of out‐edges per node was 150. This not only demonstrates high sequence capacity for some proteins, but also suggests a potential cap on the number of outgoing edges an existing fold may have. It stands to reason that there is some limit on the number of sequences a fold can potentially lose to others before it is no longer observed; perhaps intermediate ‘donor’ folds once existed.
It was also found that α/β proteins and enzymes formed an anomalous peak in the otherwise scale‐free log–log plot of number of incoming edges versus frequency (Cao & Elber, 2010). This suggests that the number of incoming edges has particular significance for this type of protein. Longer proteins, like the aforementioned β proteins, were shown to be likely protein sinks (and partially responsible for the wide distribution of in‐edges).
A similar study based on predicting the effects of mutations through existing information was published by Hoffmann et al. (2016). The aim of this work was to look at sequence restrictions in relation to structure and structural restrictions in relation to sequence. Protein structure was described as a series of thermodynamic environments (one per residue); each amino acid was considered in terms of how easy it would be for it to ‘sit’ in a certain environment. Proteins in the PDB were split into overlapping sequences of 13 amino acids and the 13 thermodynamic environments they sit in (Hoffmann et al., 2016). Energy values were calculated for pairs of sequences and environments on an all versus all basis. Analysing the ease with which sequences surrounding a residue site fit into different structures, and vice versa, allowed positive and negative compatibility indices (PCI/NCI) to be calculated for each environment fragment (structure) and amino acid sequence within a protein. PCI with respect to structure represents the number of sequences that fit well into it; NCI represents the number of sequences that dramatically do not (and vice versa). Each site's PCI and NCI were compared with the median for the protein to determine how permissive or restrictive that particular site is.
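The counting logic behind the compatibility indices can be sketched as follows. The energy values and the two cut‐offs are hypothetical, and lower values are assumed to indicate a better fit; the thermodynamic calculation used by Hoffmann et al. (2016) is not reproduced.

```python
from statistics import median


def compatibility_indices(energies, good_cutoff, bad_cutoff):
    """Compute PCI and NCI for each environment fragment (structure).

    energies[s][e] is the score of sequence fragment s placed in environment
    fragment e. PCI counts sequences that fit well (score <= good_cutoff);
    NCI counts sequences that fit very poorly (score >= bad_cutoff).
    """
    n_environments = len(energies[0])
    pci = [sum(1 for row in energies if row[e] <= good_cutoff)
           for e in range(n_environments)]
    nci = [sum(1 for row in energies if row[e] >= bad_cutoff)
           for e in range(n_environments)]
    return pci, nci


# Hypothetical scores: 4 sequence fragments (rows) x 3 environment fragments (columns)
energies = [
    [-2.0, 0.5, 3.0],
    [-1.5, 2.5, 2.8],
    [0.0, -1.2, 3.5],
    [-3.0, 0.1, -1.1],
]

pci, nci = compatibility_indices(energies, good_cutoff=-1.0, bad_cutoff=2.0)

# Sites whose NCI exceeds the protein-wide median are treated as restrictive
restrictive_sites = [e for e, value in enumerate(nci) if value > median(nci)]
```

Transposing the matrix gives the same indices with respect to sequence rather than structure.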
Principal component analysis showed that the NCI with respect to sequence and structure had a much stronger influence than PCI on which amino acid and structure pairs ended up viable (Hoffmann et al., 2016). This primary result could have some effect on the shape of an evolutionary network. It seems to imply that restrictive sites in proteins may be more important than permissive sites with regard to mutation and evolution. That is, that peptide evolution may be shaped primarily by which structures and sequences are extremely disadvantageous.
Although many informational studies have shown how a number of important aspects of protein space interact, this work on constructing protein space is concerned primarily with the stability of proteins (represented by thermodynamic energy of either the overall fold or of parts thereof). This can be affected by a number of aspects of the environment, such as pH, temperature and interaction with other proteins; and, while important, is by no means the only aspect of a protein's biochemistry that determines how well it can perform its function (DePristo, Weinreich & Hartl, 2005). Ability to aggregate, interactions with other proteins and substrates, and flexibility required for function are also important factors that determine how proteins evolve.
Furthermore, it has been found that proteins are only marginally stable in nature (that is, proteins often do not exist in their most theoretically stable form) (Taverna & Goldstein, 2002). Some believe this to be a functional adaptation, but there is also evidence for this as an inherent property of proteins in both simulations where residues are fixed to a three‐dimensional lattice and more flexible non‐lattice simulations (Taverna & Goldstein, 2002; Goldstein, 2011; Williams, Pollock & Goldstein, 2006). However, a connection has also been implied by others (Ferrada & Wagner, 2008). Marginal stability, combined with the NCI/PCI findings described above (Hoffmann et al., 2016), seems to describe proteins as being halfway up a slope in the stability landscape, their course being more driven by avoiding the peaks than by approaching the valleys.
Whatever the effects and causes of marginal stability in proteins, it definitely throws into question the use of stability as the sole determinant of the success or failure of a mutational evolutionary step. The determination of network edges is often based primarily on maximising stability. If simulations are to be accurate, stability should not be maximised, but marginal. This is difficult to define, and leads us to question whether other protein properties are necessary to the formation of an accurate network of protein structure evolution.
(3). Physics‐based models
Physics‐based models are based on physical properties of individual amino acids. They often map parts of protein space, rather than the complete PDB, due to the computational complexity required to map structure physically from residue type. Currently, work concerning these models is primarily focused on the accuracy with which we can model the evolution of proteins, rather than drawing specific conclusions about protein structural evolution in a universal sense. However, mutational patterns in single structures, if applied to a library of protein folds, could result in a much more detailed picture of protein fold evolution than information‐based or similarity networks. These visualisations of how protein structure works are often related to protein structure prediction.
One study showing the benefits of models based on the intricacies of amino acid physics was published by Grahnen et al. (2011). It was primarily concerned with comparing two models to look at potential evolutionary developments in a representative protein to test how well each model reflected expectations. One of these was an informational model similar to those presented above, wherein protein mutations were predicted based on residues' proximity to each other (and thus, potential interaction) in existing proteins. The other was a physics‐based model involving a coarse‐grained representation of protein structure, which incorporated data about protein geometry (based on backbone bond angles, disulphide bridges and ability to form α‐helices/β‐sheets) as well as stability and other functional information (Lennard‐Jones potential, Coulombic potential, and a solvation potential).
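For readers unfamiliar with the potentials named above, the standard textbook forms of the Lennard‐Jones (12‐6) and Coulombic terms are sketched below in reduced units. The parameter values are placeholders, the solvation term is omitted, and nothing here reflects the specific parameterisation of Grahnen et al. (2011).

```python
def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """12-6 Lennard-Jones potential: 4 * epsilon * [(sigma/r)^12 - (sigma/r)^6].

    epsilon is the well depth and sigma the separation at which the
    potential crosses zero.
    """
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 ** 2 - sr6)


def coulomb(r, q1, q2, k=1.0, dielectric=1.0):
    """Coulombic interaction between charges q1 and q2 at separation r."""
    return k * q1 * q2 / (dielectric * r)


# The Lennard-Jones minimum lies at r = 2^(1/6) * sigma, where V = -epsilon
r_min = 2 ** (1 / 6)
energy_at_minimum = lennard_jones(r_min)
```

Summing such pairwise terms over all interacting sites gives a stability estimate whose cost grows rapidly with system size, which is one reason these models are computationally demanding.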
Neither of these ways of predicting evolutionary change resulted in the native, unmutated sequence of the protein in question being near the thermodynamic stability optimum for its fold, which is incompatible with what was expected (Grahnen et al., 2011) (see Fig. 5). It is, however, possible that this relates to marginal stability. Both models produced non‐synonymous to synonymous substitution ratios smaller than 1, which is expected (as non‐synonymous mutations, those that change amino acid sequence, are more likely to be disruptive and are therefore selected against). They also both showed patterns of site‐based rate heterogeneity similar to those expected from real proteins. The physics‐based model seemed to favour stronger selection pressure and was more compatible with theoretical expectations, whereas the informational model supported a more evenly dispersed and highly connected view of the protein universe with lower selection pressure (see Fig. 5). This implies that previous findings around a highly connected evolutionary protein space (Meyerguz et al., 2007; Cao & Elber, 2010) may be more to do with the informational model used than the true nature of this space. The authors acknowledged that, although the physics‐based model would seem to be more accurate, it is much more computationally complex than the informational model.
Fig. 5. The landscapes of folding scores for near‐native serum amyloid‐P (SAP) sequences. The native sequence of this protein is not near the optimum stability when calculated with (A) an informational and (B) a physics‐based model. The physics‐based model does, however, exhibit more selection pressure, with steeper slopes and a more dramatic minimum (most stable region). Reproduced from Grahnen et al. (2011) in accordance with the Creative Commons Attribution (CC BY) license.
A similar site‐specific model was published by Chi et al. (2018); weights of considered factors in this case were allowed to vary with the residue site in question. Predicted mutations were compared to observed substitutions in a family‐wise alignment. The site‐specific nature of this model was found to increase its accuracy in predicting the amino acid substitutions favoured by a particular SH2 domain (a well‐researched protein motif involved in signal transduction). It also provided a greater distinction between residue sequences taking the SH2 conformation and decoys than the global model.
A study that seems to combine informational and physics‐based aspects was produced by Kleinman et al. (2010). Physics‐based considerations were examined from an informational perspective to account for site interdependencies. It seemed that the more physical properties included, the more accurate was the model. However, it was still not as realistic as some models that assume site independence.
Physics‐based models represent much potential for progress in mapping the protein universe. They combine geometric and stability information, as well as some degree of functional information and data about the site of certain amino acids within proteins. These calculations are, however, only performed for a select number of proteins, due to the immense amount of computing power required and the complexity of the model. As computing power improves, there is potential for this model to be applied to the complete database of known proteins. This would not only allow insight into how proteins may change in the foreseeable future, but also show connections between existing proteins from a different perspective to that provided by the informational models.
III. KEY QUESTIONS
(1). Is our understanding of protein space complete?
Protein structure alignment can be used to provide evidence for or against the hypothesis that all protein folds that exist have been documented. Protein pairs can be compared with these methods and clusters of proteins created, each corresponding to a separate fold. This process can be applied to the current bank of known proteins to determine the folds present, which can then be compared with a bank of pseudoproteins (generally constructed in silico) to explore the completeness of protein space. Equivalence can then be assessed through a number of ways, such as a similarity score threshold, or by examining the closeness and broadness (RMSD and coverage) of the best scoring alignments between pseudo proteins and real proteins.
A series of papers, taken together, covers an intriguing debate on our current understanding of protein space. Zhang et al. (2006) published results suggesting that our understanding of protein space was complete (in relation to single‐domain proteins). That is, that all possible conformations of single‐domain proteins are represented in some way in the protein database. Homopolypeptides (consisting of only one repeated amino acid) of up to 100 residues in length were produced and folded according to hydrogen bonding, excluded volume, and a uniform, pairwise attractive potential between side chains. These sequences were representative of polyalanine chains, as each residue had a β‐carbon side chain. This can be considered as showing expected minimal steric hindrance for most protein chains, as the only smaller amino acid is glycine. The homopolypeptides were compared with a database of 30,000 real single‐domain proteins of 150 amino acids or fewer. The TM‐scores yielded by these comparisons suggested that each synthetic molecule had a partner of equivalent structure in the bank of real molecules and that every real protein had an equivalent in the bank of synthetic molecules. This work provoked a number of similar investigations that contradicted it.
Cossio et al. (2010) estimated that we could only be aware of 5% of possible protein conformations with the use of 60 amino acid polyvaline chains (another form of homopolypeptide) and TM‐align as a comparison to existing folds. This is counterintuitive, as the increased steric hindrance involved with using valine instead of alanine as a base residue would imply fewer conformations could be attained (although there is evidence to suggest that alanine and valine can be considered close to structurally equivalent). Polyvaline proteins with natural equivalents were found to have lower contact order (smaller average separation between contacting residues) and higher contact locality (more contacts between residue pairs from the same half of the protein chain) than the other polyvaline chains. This structural difference was seen as evidence of some evolutionary bias towards this arrangement of interresidue contacts (Cossio et al., 2010).
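The two contact measures described above can be made concrete with a short sketch. Contact order is normalised here by chain length and number of contacts, a commonly used convention that is assumed rather than taken from Cossio et al. (2010), and contact locality is computed as the fraction of contacts whose residues lie in the same half of the chain. The contact list is hypothetical.

```python
def relative_contact_order(contacts, length):
    """Average sequence separation of contacting residues, divided by chain length."""
    if not contacts:
        return 0.0
    total_separation = sum(abs(i - j) for i, j in contacts)
    return total_separation / (len(contacts) * length)


def contact_locality(contacts, length):
    """Fraction of contacts in which both residues lie in the same half of the chain."""
    if not contacts:
        return 0.0
    half = length / 2
    same_half = sum(1 for i, j in contacts if (i < half) == (j < half))
    return same_half / len(contacts)


# Hypothetical contacts (0-based residue indices) in a 10-residue chain
toy_contacts = [(0, 3), (1, 4), (6, 9), (0, 9)]

contact_order = relative_contact_order(toy_contacts, length=10)
locality = contact_locality(toy_contacts, length=10)
```

A chain dominated by short‐range contacts within each half therefore scores low on contact order and high on contact locality, the pattern reported for polyvaline structures with natural equivalents.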
Likewise, Taylor et al. (2009) used rearrangement of secondary structure elements from multiple sequence alignments of proteins from the same family to construct molecules from the α‐carbons up. These were compared with existing proteins using a combination of three different protein structure alignment methods [Structural Alignment of Proteins (SAP) (Taylor, 2008), TM‐align (Zhang & Skolnick, 2005) and Dali (Holm & Sander, 1993) – focusing primarily on SAP] and visual inspection. They found less than 10% of their pseudoproteins to be reflected in known protein space.
Dai & Zhou (2011) also found that they could produce many protein structures not yet observed in the PDB through five rounds of N‐ and C‐terminus loop closure and re‐opening at a different site. Of the 180–200 residue structures created, 82% belonged to novel folds (assessed using TM‐score); as did 23% of the 60–80 residue structures created.
Skolnick et al. (2012) provided further evidence that all possible protein folds have been discovered. An updated version of the TM‐align procedure was used (Fr‐TM‐align; Pandit & Skolnick, 2008) to increase the accuracy of the alignment. This work utilised a number of ways of looking at proteins, including a quasi‐spherical protein model, and those used in the contradictory work mentioned above. It showed that programs designed to produce artificial protein structures do not produce any that cannot be sufficiently accounted for in the PDB and vice versa.
Skolnick et al. (2012) also attempted to explain how the differing results were derived. They argued that the threshold for similarity used by Cossio et al. (2010) and Dai & Zhou (2011) (0.45 and 0.5, respectively) was too conservative and that it would have resulted in pairs of real and theoretical proteins that were structurally similar being categorised as different. A threshold of 0.4 was suggested as more appropriate (Skolnick et al., 2012). The use of TM‐align rather than the updated Fr‐TM‐align was also discussed as a possible source of this discrepancy, as Fr‐TM‐align is more sensitive to subtle structural similarity. Skolnick et al. (2012) also argued that the size of the templates in some of these works (particularly Taylor et al., 2009) was too small to produce meaningful results.
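The sensitivity of the completeness estimate to the similarity threshold can be shown with a small numerical sketch. The scores below are hypothetical best TM‐scores for eight artificial structures against a library of known structures, and a structure is counted as novel when its best score falls below the threshold; this counting convention is an assumption made for illustration.

```python
def novel_fraction(best_tm_scores, threshold):
    """Fraction of structures whose best match scores below the threshold."""
    novel = sum(1 for score in best_tm_scores if score < threshold)
    return novel / len(best_tm_scores)


# Hypothetical best TM-score of each artificial structure against known structures
best_scores = [0.38, 0.42, 0.44, 0.47, 0.52, 0.61, 0.35, 0.49]

# Thresholds discussed in the text: 0.4, 0.45 and 0.5
fractions = {t: novel_fraction(best_scores, t) for t in (0.4, 0.45, 0.5)}
```

With these illustrative numbers, the apparent fraction of novel structures triples between the lowest and highest thresholds, which shows how the choice of cut‐off alone could account for much of the disagreement between studies.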
The recent slow rate of novel protein fold discovery could provide further evidence that protein fold space has been completely documented (Moult et al., 2014) (Fig. 6). However, sampling bias has been presented as a potential reason for existing protein folds not being represented in the PDB. The hypothesis tested by Barry Roche & Brüls (2015) was that undersampled regions of the tree of life (for example unculturable bacteria) may correspond to gaps in fold space. The fraction of proteins that could be assigned to known folds or protein families from a multiple sequence alignment and hidden Markov model database (Pfam) was significantly lower in the predicted structures of proteins from underrepresented species than in the structures of proteins from well‐known species. It is important to note that the sequence novelty observed in undersampled regions of the tree of life contained a lot of short sequences, which are known to be more ambiguous in their folding. When sequences of size less than 100 amino acids were excluded, the differences in coverage between data sets from under‐sampled and well‐known species were no longer observed (Barry Roche & Brüls, 2015). It is still, however, plausible that these sequences could represent genuine novelty in protein structures.
Fig. 6. Number of folds recorded in the protein database (PDB) over time. Note the slowing rate of discovery in the last decade.
Most of the work done on the completeness of protein space concerns single‐domain proteins. That is, the conclusion to be drawn is that we may know of the building blocks of all multi‐domain protein structures. Even if this conclusion is accepted, we cannot say that we necessarily know of all protein structures that exist (that is, we may not know all arrangements of these building blocks). As it is, we cannot assign structures to all known sequences of multi‐domain proteins.
Levitt (2009) assessed the structural coverage (percentage of sequences for which structure is known) of the known single‐ and multi‐domain protein sequence universe. Patterns common to the sequences of all proteins in a family (sequence profiles) were used to gain an understanding of sequence–structure relationships. Total coverage was improved through repetitious coverage (coverage of all sequences from a family instead of merely a representative) and merged coverage (allowing duplicated sequence profiles to merge and thus inform one another) (Levitt, 2009). These processes better allowed the sequence profiles of known structures to inform those of unknown structures. The analysis characterised a total of 78% of protein sequences longer than 50 residues, and three potential reasons were given for the remaining 22%, the ‘dark matter’ in protein sequence space. Firstly, some of the protein sequences occupying this dark matter may be deduced from DNA that is, in fact, non‐coding. Secondly, the proteins may be low‐complexity forms with little structure to cover, a possibility supported by the fact that dark matter protein sequences were shorter, on average, than characterised ones. Finally, the current search methods may not detect sequence profiles completely, and further sequence profiles may be yet to be found.
Furthermore, it is possible that the proteins that do exist do not represent all those that could exist, but show the result of evolutionary pressures. Given the slowing rate of discovery coupled with the fact that some authors have found motifs that potentially do not align with known entries in the PDB, this seems like a theory that fits the currently available evidence. Further research is required to determine whether there are protein conformations that are evolutionarily unfavourable or have not been identified.
(2). What patterns dominate protein structure evolution?
Convergence in protein structure evolution occurs when multiple protein sequences from different origins converge on the same place in protein structure space. Divergence is the opposite: protein sequences evolve primarily away from an ancestral protein fold. It is rarely possible to tell whether convergence or divergence has produced a given set of proteins (Fig. 7). Structural and informational networks have provided evidence for the involvement of both patterns in protein evolution.
Fig. 7. Convergent and divergent protein evolution. The simplified molecules could be an indicator of divergence from (left) or convergence towards (right) a U‐shaped fold. The overall number of discrete folds increases with divergence but decreases with convergence.
‘Attractors’ and ‘protein sinks’ found in structural similarity (Holm & Sander, 1996a ) and informational networks (Meyerguz et al., 2007; Cao & Elber, 2010) have been suggested as evidence for convergent evolution towards beneficial protein chain conformations. It is impossible to infer with 100% confidence the evolutionary process from the end result, but the authors provided convergence as the explanation. The discrepancy between the wide range of possible protein sequences and comparatively small range of folds they take also seems to suggest convergence (Dokholyan et al., 2002). Proteins with independent evolutionary origins could have evolved into a particularly effective conformation in parallel.
Conversely, a Dali study intended specifically to investigate the convergent or divergent nature of protein evolution found scale‐free behaviour in protein similarity networks but not in a random control (Dokholyan et al., 2002). This is more consistent with theoretical expectations for a divergent pattern of structural evolution (as it implies a more tree‐like, predictable, and even distribution of edges formed through this process). This has since been observed in other networks of protein structure, as mentioned above (Valavanis et al., 2010; Cao & Elber, 2010). A good example of the observation of divergence in specific protein structures can be found in the treatment of Rossmann‐like folds in the ECOD database (Medvedev et al., 2021). Having defined a minimal structure to determine members of this superfold group, Medvedev et al. (2021) demonstrated the likelihood of divergence dominating its evolution, although convergent patterns could also plausibly be observed.
Many studies show examples of both convergence and divergence simultaneously. There is evidence that similar domain architectures of nucleotide‐binding oligomerization domain‐like (NOD‐like) receptors in animals and nucleotide‐binding site leucine‐rich repeat proteins in plants evolved independently despite their structural similarity (Meunier & Broz, 2017). Also, zebrafish (Danio rerio) have receptors that, while not in the NOD‐like receptor family, perform the same function with the same targets. Both of these are examples of convergence. Divergence can be observed in the similarity and overall conservation of these genes in related organisms (homology) with the addition of different domains in some cases. Kaur et al. (2018) showed examples of the difficulties of disentangling the evolutionary history of proteins when similar evidence can point to both convergence and divergence. They presented a case involving both these patterns concerning the evolution of the chromo‐like superfamily of SH3 domains. They showed that these domains evolved from bacterial precursors rather than their archaeal structural homologues (which themselves diverged from a zinc ribbon ancestor, and then structurally converged to their current similarity with these signal transduction SH3 domains). Mai, Hu & Chen's (2016) combined network clustering showed examples of both functional divergence (sequences from the same cluster with similar structures and different functions) and structural convergence (unrelated sequences in the same structural cluster).
Evidence showing that proteins may evolve greater sequence capacity may further complicate this relationship. Cao & Elber (2010) found that, in the Cro family (used as an example due to pre‐existing knowledge about its evolutionary progression), protein sequences appear to evolve towards folds with greater sequence capacity. A similar relationship has been shown using two‐dimensional lattice simulation (Taverna & Goldstein, 2002). Taverna & Goldstein (2002) found that proteins that were the result of evolution were more robust to site mutations than random proteins. Mutation robustness implies greater sequence capacity as the more mutations that can be tolerated, the more sequences can take a fold. Tian & Best (2017) also found that older proteins had higher sequence capacity. This was shown via a thorough exploration of 10 proteins based on a stability threshold for sequences that could fit into a fold. Another study in yeast found that proteins with higher sequence capacity were more likely to evolve (via a relationship with inter‐residue contact density), potentially demonstrating a reversal or complication of the correlation observed in other works (Ferrada & Wagner, 2008).
Should proteins that tolerate more different sequences be selected for, the question of divergence versus convergence could be moot, or at least much more difficult to answer. The domination of protein space by old folds with high sequence capacity, as seen in Edwards & Deane (2015), creates similar patterns to those expected by convergence. This seems to be the biochemical equivalent of the ‘survival of the flattest’ concept in evolutionary biology. This sees flatter areas of fitness space favoured by organisms over time, as these are more robust to mutations (Wilke et al., 2001). The tendency to evolve greater sequence capacity may also relate to the observation of marginally stable proteins in nature. Folds that are able to take many different sequences are unlikely to be maximally stable as they must accommodate different amino acid shapes.
Sequence‐based networks (see Section II.1.c) have provided another potential pattern associated with protein evolution. Alva et al. (2010) found sequentially similar proteins from vastly different superfamilies and folds; potential evidence of evolution from ancestral peptide molecules. An attempt to uncover what these ancestral peptides might be was made by Alva, Soding & Lupas (2015). Examples of high sequence and structural similarity in evolutionarily distant proteins were dissected to provide a list of candidates. Biophysical constraints were eliminated as the cause as there was no correlation between sequential and structural differences when these peptides were identified in a wide variety of proteins. While the peptides described were not present in all folds, the authors suggested that there may be more ancestral peptides to be found due to the conservative nature of their search. This theory has roots much earlier in different experimental results (Lupas, Ponting & Russell, 2001; Shindyalov & Bourne, 2000; Swindells et al., 1998) and similar ‘building blocks’ have been demonstrated from a structural perspective (Friedberg & Godzik, 2005). Nepomnyachiy et al. (2014) even suggested that the classification and splitting of proteins should be into these smaller fragments as much as domains.
The potentially modular nature of proteins may mean a more continuous view of protein structural space is appropriate, where protein structures are defined by their position in the space rather than a particular group (as in a discrete view). Networks of protein structures can often be seen as continuous when low levels of structural similarity are required to form edges, as shown through most of the structural similarity networks described above (particularly Skolnick et al., 2009). An ancient step of peptide arrangement in the history of protein evolution could help to explain this, as proteins would display some level of structural similarity to many proteins from other structures (Pascual‐García et al., 2009). The lack of distinct boundaries in some regions of the ECOD database also lends support to a continuous view of protein space (Cheng et al., 2014). This database also features homology relationships between proteins in folds that would otherwise be seen as completely disparate, further emphasising this property.
Perhaps there should even be some duality in the view we hold of protein structural organisation. It has been suggested that a discrete model of protein structure is appropriate only when focusing on proteins that are very structurally similar (for example, up to a superfamily level), and a continuous model is required as focus broadens (to proteins that merely share a fold or class) (Pascual‐García et al., 2009; Nepomnyachiy et al., 2014; Daniels et al., 2012; Sadreyev, Kim & Grishin, 2009). This reconciles a somewhat continuous view of protein space with the power‐law results of many protein structural networks – which imply a not only divergent, but also discrete model of protein evolution. Evidence for this can be seen in a study that uses transitive homology violations to measure how discrete the protein universe is at different thresholds for structural similarity (Pascual‐García et al., 2009). These violations occur, for example, when A is considered similar to B, and B is considered similar to C but A is not considered similar to C. They found that the number of transitive homology violations increased at a slow rate with a decrease in structural similarity requirements for network edges, until a threshold was reached and the increase became much more rapid. A similar description of protein space is that it is evolutionarily discrete but geometrically continuous (Sadreyev et al., 2009). Some of the network approaches described above are also associated with these theories, for example, Skolnick et al. (2009) emphasised the continuity of protein space and Nepomnyachiy et al. (2014) demonstrated α/β domains as areas of continuity.
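The idea of a transitive homology violation can be made concrete with a minimal sketch. The Python below is an illustration only and is not the procedure of Pascual‐García et al. (2009): the function name `count_transitivity_violations`, the toy similarity scores and the thresholds are all hypothetical. It counts triples in which two of the three pairs pass a similarity threshold while the third, closing pair does not.

```python
from itertools import combinations


def count_transitivity_violations(similarity, threshold):
    """Count unordered triples with exactly two of the three pairs linked.

    `similarity` maps frozenset({a, b}) to a score; a pair counts as
    'similar' when its score is >= threshold. Missing pairs score 0.
    """
    nodes = sorted({n for pair in similarity for n in pair})

    def linked(a, b):
        return similarity.get(frozenset((a, b)), 0.0) >= threshold

    violations = 0
    for a, b, c in combinations(nodes, 3):
        links = linked(a, b) + linked(b, c) + linked(a, c)
        if links == 2:  # two edges present, the closing edge missing
            violations += 1
    return violations


# Hypothetical toy scores (not real structural alignment data).
toy_scores = {
    frozenset(("A", "B")): 0.9,
    frozenset(("B", "C")): 0.6,
    frozenset(("A", "C")): 0.3,
    frozenset(("C", "D")): 0.8,
    frozenset(("B", "D")): 0.5,
    frozenset(("A", "D")): 0.2,
}

for t in (0.85, 0.55, 0.25):
    print(t, count_transitivity_violations(toy_scores, t))
```

In this toy example no violations occur at the strictest threshold, and they appear once the threshold is relaxed enough to admit weaker links. Real data sets would need a similarity measure and thresholds chosen to suit the alignment method in use.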
A number of attempts have been made to classify proteins whilst considering this potential continuity of protein structural space. Xu & Zhang (2016) used TM‐score to classify proteins into folds but included a measure of within‐fold structural heterogeneity to allow more accurate inclusion or exclusion of proteins. Daniels et al. (2012) produced similar results with Matt. It is clear that a discrete view of protein space is still useful, even if it does not encompass some of the intricacies of true molecular evolution (Pascual‐García et al., 2009; Cheng et al., 2014).
(3). How does protein evolution influence populations, organisms and enzyme pathways?
A primary difficulty with understanding protein evolution is that most work needs to consider proteins as individual units, as opposed to elements acting within a whole system.
A series of papers concerning populations of single proteins has examined how population genetics concepts can be applied to protein evolution. It was first found that the larger the population of proteins, the greater the power of selection on any mutations that occur (Goldstein, 2011), which is consistent with classical population genetics results. With greater selection there was an increase in the stability of proteins, in accordance with previous findings. Another paper looked at differences in fitness effects with population size (Goldstein, 2013), finding that population size had little influence on mutation fitness effects, when considering both protein models that rewarded folding and protein models that punished unfolding (focusing on stability and aggregation, respectively). The explanation given for this was that the increased stability and increased selection pressure due to the larger population size cancelled each other out; that is, that increased stability resulted in mutations having less influence, even though there was greater selection on these mutations. However, it was found that when epistatic interactions between residues were accounted for, and thus removed from consideration, there was an effect of population size on mutation fitness effects. A strong influence of population size fluctuations was also observed. Fluctuations in both directions resulted in an increase in the population‐scaled fitness effects of mutations.
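The classical result that selection is more effective in larger populations can be illustrated with a short sketch. The Python below uses one common form of the diffusion approximation for the fixation probability of a new mutant in a haploid population; it is a textbook formula used here as an assumption, not the model of Goldstein (2011, 2013), and the function name and parameter values are hypothetical.

```python
import math


def fixation_probability(s, n):
    """Diffusion-approximation fixation probability of a new mutant.

    `s` is the selection coefficient and `n` the (haploid) population
    size: P = (1 - exp(-2s)) / (1 - exp(-2ns)), with 1/n when s is ~0.
    """
    if abs(s) < 1e-12:
        return 1.0 / n  # neutral limit
    return (1.0 - math.exp(-2.0 * s)) / (1.0 - math.exp(-2.0 * n * s))


# Compare slightly beneficial and slightly deleterious mutations with
# the neutral expectation (1/n) at several population sizes.
for n in (10, 1000, 100000):
    neutral = 1.0 / n
    beneficial = fixation_probability(1e-3, n)
    deleterious = fixation_probability(-1e-3, n)
    print(n, beneficial / neutral, deleterious / neutral)
```

Under this approximation the ratios move away from 1 as the population grows: in small populations both mutations behave almost neutrally, whereas in large populations the beneficial mutation is strongly favoured and the deleterious one is effectively purged.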
Additionally, attempts have been made to model entire organisms with regard to their proteins. For example, Zeldovich & Shakhnovich (2008) conducted a simulation of an organism with a small genome that was assumed to be alive if all of its proteins were stable (a simplistic but practical definition). This was based on varying rates of two molecular events (mutation and duplication) and two organismal events (replication and death). It was found that the population of organisms exponentially increased once a small number of stable proteins was found. The authors drew a link from this finding to the larger genome size of DNA viruses when compared with RNA viruses. RNA viruses have a higher mutation rate; they may therefore have reached this point of exponential population increase in a shorter time with fewer proteins than possessed by DNA viruses and thus require less genetic material.
Some work has also examined how enzymatic pathways evolve together. Orlenko et al. (2016) conducted such a study on a simplified glycolysis/methylglyoxal metabolism pathway. A number of clusters of co‐evolving parameters (such as concentration, catalytic constant and the Michaelis constant for the substrate) were found between enzymes, which is consistent with what we would expect as the enzymes must operate in conjunction with each other. The authors took a wide range of factors related to enzyme function into account, including concentration of aggregative substrates, expression cost and flux (the turnover of reactant into product). It was also found that rate‐limiting steps (the slowest steps in enzymatic pathways) are not evolutionarily stable. That is, they are likely to be acted upon by evolution, potentially to reduce their effects on productivity. It stands to reason that mutations of proteins involved in the slowest step of a pathway have more potential to improve functionality than mutation in a quicker step. Perhaps these evolutionarily unsteady enzymes are in some way related to the anomaly in degrees for these proteins found by Cao & Elber (2010). The inclusion of enzyme pathway data in a universal network model of protein space would enrich it greatly (as enzymes are a major constituent of this space), and seems the most achievable way to consider proteins as elements of a greater system.
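To show how the three parameters named above jointly set the speed of a step, the following sketch applies the standard Michaelis–Menten rate law to a toy pathway. It is an illustration under stated assumptions and not the model of Orlenko et al. (2016): the enzyme names, parameter values and the shared substrate concentration are hypothetical, and comparing isolated rates is only a crude proxy for identifying a rate‐limiting step.

```python
def michaelis_menten_rate(kcat, enzyme_conc, km, substrate_conc):
    """Michaelis-Menten rate v = kcat * [E] * [S] / (Km + [S])."""
    return kcat * enzyme_conc * substrate_conc / (km + substrate_conc)


# Hypothetical parameters: (kcat in 1/s, [E] in uM, Km in uM).
pathway = {
    "enzyme_1": (200.0, 0.5, 50.0),
    "enzyme_2": (20.0, 1.0, 10.0),
    "enzyme_3": (500.0, 0.2, 100.0),
}
substrate_conc = 100.0  # uM, assumed equal for every step for simplicity

rates = {
    name: michaelis_menten_rate(kcat, e, km, substrate_conc)
    for name, (kcat, e, km) in pathway.items()
}
slowest = min(rates, key=rates.get)  # crude stand-in for the rate-limiting step
print(rates, slowest)
```

Because the rate depends on the product of the catalytic constant and enzyme concentration, and on the Michaelis constant relative to substrate availability, a change in one parameter can be compensated by a change in another. This gives some intuition for why such parameters might be expected to evolve in clusters.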
(4). How does the evolution of proteins affect their functionality?
Protein functionality is another potential influence on protein space that is difficult to take into account. In some cases (such as that of structural proteins) this can be quantified to a fairly accurate degree with thermodynamic stability measures. Functionality of enzymatic pathways could be measured through coevolution of various parameters (Orlenko et al., 2016). The function of many other proteins (for example, hormones, enzymes and storage molecules) relies on their ability to bind a ligand. This can often result in mutations that slow protein folding through increased complexity (energetic trapping) or destabilise protein folding through extra interactions (topological trapping). Both of these phenomena may make a protein seem unrealistic when assessed by stability only, due to the folding–function trade‐off (Gosavi, 2013).
As we learn more about how the ligand‐binding active site of a protein works, we may be able to include considerations for this in our models of protein space. Already, it has been shown that there are a finite number of forms of active site (Zhang et al., 2006), which may make this task less daunting than it initially appears. Methodologies have been presented to identify protein sites that may be involved with interactions. One recent and comprehensive example of this is the program Cofex (Hu & Chan, 2017). This is based on the concept that sites of protein–protein interaction are likely to evolve together. The frequencies of certain sequences in the interacting proteins are compared with those in the wider protein universe. Should the frequency be much higher than expected, they are considered to be covariations and thus, potential active sites. Methodologies have also been presented to identify the sites of proteins involved with binding small‐molecule ligands. These include FINDSITE (Brylinski & Skolnick, 2008), based on similarity between a query and known sites (equivalent to informational investigations described for proteins).
Many networks treat proteins as rigid objects to compare, but flexibility is vital to protein functionality. Minary & Levitt (2008) used this property to construct networks. Loop torsion angles were taken as degrees of freedom, preserving the secondary, but not tertiary, structure of proteins. Different types of proteins were found to have different ‘fingerprints’ of conformations they were likely to flex into (measured by RMSD from the original conformation). Domains consisting of all α‐helices were found to be easily denatured and thus, easily flexed into a different conformation. Domains containing alternating α and β elements were particularly resistant to denaturing.
Some believe that studies of dynamics, the internal mobility patterns of proteins, might fill some of the gaps in similarity between proteins left by structure and sequence (Rackovsky, 1990; Tokuriki & Tawfik, 2009). This certainly seems clear in the case of intrinsically disordered proteins (Tokuriki & Tawfik, 2009). It has even been suggested that dynamics has a more predictable relationship with function than structure does (Hensen et al., 2012). Hensen et al. (2012) constructed a protein dynasome space determined by 34 dynamics measures, such as ruggedness of the free energy landscape, solvent‐accessible surface and average RMSD from the standard X‐ray structure. This was compared with a similar structure space, using measures such as number of contacts, number of hydrogen bonds and secondary structure content, with a total of 20 dimensions. They found that centroids determined by function could be used to classify proteins in only four of these dynasome dimensions with a success rate of 57%, compared to a structure‐based success rate of 36%. This could support the theory that dynamics may have a stronger relationship with function than structure, but could also simply show that dynamics lends itself better to expression in a vector format. Some semblance of a relationship between dynamics and structure was also observed, with protein pairs of similar or very different structure often having similar or very different dynamics (respectively), although exceptions to this rule were reasonably frequent (a Pearson correlation coefficient of 0.38 was observed).
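Classification by function‐determined centroids can be sketched in a few lines. The example below is a generic nearest‐centroid classifier in Python, offered as an assumption about the general technique rather than a reproduction of the method of Hensen et al. (2012): the functional labels, the four‐dimensional descriptor vectors and the helper names `centroid` and `nearest_centroid` are all hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def nearest_centroid(vector, centroids):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(vector, centroids[label]))


# Hypothetical 4-dimensional descriptors grouped by functional label.
training = {
    "hydrolase": [[0.9, 0.1, 0.4, 0.2], [1.1, 0.0, 0.5, 0.3]],
    "transferase": [[0.2, 0.8, 0.9, 0.7], [0.1, 1.0, 1.1, 0.6]],
}
centroids = {label: centroid(vecs) for label, vecs in training.items()}

# Assign an unseen (hypothetical) protein to the closest functional centroid.
query = [1.0, 0.2, 0.5, 0.2]
print(nearest_centroid(query, centroids))
```

The same routine could be applied to either dynamics‐derived or structure‐derived descriptors, which is what makes a comparison of success rates between the two kinds of space possible in principle.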
This structure–dynamics relationship has been similarly observed in other work less focused on protein function. Tobi (2018) found that structure (measured by TM‐score) and dynamics (measured through a Dynamics Similarity Score of an alignment of motion indicators based on an anisotropic network model alignment of superimposed structures) were correlated well in tetrameric globin family members. A similar dynamics similarity score was able to cluster a wider set of globins by quaternary structure (Tobi, 2018).
Wrabl & Hilser (2010) observed local thermodynamic stability being correlated in homologous pairs of proteins to a much greater extent than in non‐homologous pairs. Entropy and enthalpy showed similar effects to a lesser extent. The dynamics correlation coefficient between proteins was also found to decrease as structural similarity (according to SCOP) decreased (Wrabl & Hilser, 2010). Exceptions were, again, observed. Proteins with little structural relationship and similar dynamics were thought to be evidence of fold changes. Highly homologous proteins with low similarity in dynamics were suggested to have functional relevance. These ideas may also explain the clusters of exceptions observed by Hensen et al. (2012). The fact that variations in local stability may be the fingerprints of structural homology throws further into question the use of thermodynamic stability as a measure for the realism of folds.
IV. SUMMARY AND SYNTHESIS
The findings of the investigations mentioned above can be summarised to give an idea about what we know about the evolutionary network of proteins. The potential of proteins to switch folds through a relatively small number of mutations is high (Bryan & Orban, 2010; Roessler et al., 2008; Alexander et al., 2007; Zhang et al., 2006; Meyerguz et al., 2007; Cao & Elber, 2010), resulting in a densely connected network. High connectivity is also observed in networks of structural similarity (Skolnick et al., 2009; Holm & Sander, 1996a, b; Dokholyan et al., 2002; Edwards & Deane, 2015). However, physics‐based methods show a less densely connected picture (Grahnen et al., 2011). The number of proteins that can mutate into a particular fold is much more variable than the number of folds into which a protein can mutate according to informational work (Cao & Elber, 2010). Folds incompatible with protein sequences have a strong effect on the shape of protein space (Hoffmann et al., 2016). Folds that are compatible with protein sequences have a comparatively small effect. Proteins and domains from the α/β class continually occupy unique spaces in networks, as highly connected bridges in co‐occurrence (Nepomnyachiy et al., 2014), informational (Cao & Elber, 2010) and sequence‐similarity networks (Valavanis et al., 2010) or a distinct cluster in structural networks (Hou et al., 2003). Most importantly, the accuracy with which we can model the above is currently limited and we have much more to learn.
Aside from these general findings, we can look at what these network investigations have to say about a number of questions we may want answered about protein fold space. The conclusion that protein space is complete seems to depend on the definition of structural equivalence used (Zhang et al., 2006; Cossio et al., 2010; Taylor et al., 2009; Dai & Zhou, 2011; Skolnick et al., 2012). Whether protein space is complete or not, we know there must be some reason for anything that has not come to attention yet, be it undersampled regions of the tree of life (Barry Roche & Brüls, 2015) or evolutionary selection. Evidence for both convergent and divergent models of protein evolution has been forthcoming (Holm & Sander, 1996a; Meyerguz et al., 2007; Cao & Elber, 2010; Dokholyan et al., 2002; Meunier & Broz, 2017; Kaur et al., 2018), suggesting that, as for entire organisms, protein evolution frequently exhibits both these trends. Small ‘building blocks’ may also have a role in the way proteins evolved (Alva et al., 2010, 2015; Shindyalov & Bourne, 2000; Lupas et al., 2001; Friedberg & Godzik, 2005), resulting in some suggestions of a continuous structure space (Pascual‐García et al., 2009). Proteins have been found to be marginally stable in nature (Taverna & Goldstein, 2002; Goldstein, 2011; Williams et al., 2006), which must be considered when approaching thermodynamic energy‐based networks. A link between the sequence capacity of proteins and their evolution has been posited (Cao & Elber, 2010; Taverna & Goldstein, 2002; Goldstein, 2011), although the direction of causation behind this correlation is unclear. This seems to support a ‘survival of the flattest’ view of protein evolution, which may act in concert with the tendency of proteins to be marginally, rather than maximally, stable to place paths of evolution halfway up shallow stability slopes.
Proteins have been shown to obey some rules of classic population genetics when considered as part of a whole (Goldstein, 2011, 2013), which suggests there may be other links that could be exploited with this field. Dynamics has emerged as a potential way to include protein functionality to some extent in maps of protein structure space (Hensen et al., 2012; Tobi, 2018).
The existing research is invaluable as we move towards a complete understanding of the network of protein evolution; however, a comprehensive, universal model of protein space is as yet out of reach. Those models involving large proportions of the protein universe mentioned above tend to consider only one or two aspects of protein structure evolution at a time (Meyerguz et al., 2007; Cao & Elber, 2010; Hoffmann et al., 2016; Skolnick et al., 2009; Holm & Sander, 1996a, b; Dokholyan et al., 2002; Edwards & Deane, 2015). The emphasis in these works is on breadth; a top‐down view of protein space. Physics‐based work (Grahnen et al., 2011; Chi et al., 2018) provides a picture of protein structure and sequence that considers many more elements but is restricted to a small set of proteins from the PDB; a bottom‐up view. These methods are also combined in some cases (Kleinman et al., 2010). Building on what we have learned from these existing models, and introducing elements such as measures of function, enzymatic pathway interactions, and an organismal view, will allow us to develop a complete picture of protein space.
It seems that as the computational resources available to researchers improve, so too will the sophistication and, therefore, the efficacy, of the models we can use for protein mutation. The recent excitement around developments in protein structure prediction is an excellent example of the impact software development can have. With close‐to‐infallible protein structure prediction on the horizon, it could become easier than ever to visualise complex and detailed models of protein structure evolution.
Moving forward, the emphasis of thinking about protein structure should be one that considers both sides to all the questions discussed in this review. It seems from the literature that the answer to most of these dichotomous queries is ‘both’. Evidence for both convergence and divergence can be observed in protein structural networks; protein structural space can be considered both discrete and continuous and protein evolution should be considered from both a top‐down and bottom‐up perspective. Even marginal stability can be seen as an example of the difficulty in designating protein structures to one property or another. It has been demonstrated in the past that including different perspectives in one protein structure alignment method can result in improved performance (Sykes, Holland & Charleston, 2020). Similarly, Edwards & Deane (2015) found that if multiple protein structure alignment programs confirmed a structural similarity link, it was more likely to be reflected in protein age. Kleinman et al. (2010) found that the more physics properties included in their informational model, the more accurate it seemed. Integrated sequence and structure similarity networks perform better at classification than individual ones (Çamoĝlu et al., 2006). AlphaFold (Jumper et al., 2021) demonstrated an extremely successful reliance on both evolutionary and physics‐based techniques for predicting protein structure. Only through consideration of multiple aspects of protein structural space will an ideal, exhaustive model of protein evolution be possible.
V. CONCLUSIONS
(1) Most models of protein structure evolution and similarity suggest a densely connected network, with the exception of physics‐based models, which suggest much stronger selection and thus, a sparser network of connection.
(2) α/β protein structures occupy special places in many protein networks, as bridges in co‐occurrence, informational and sequence similarity networks, and as a distinct cluster in structural networks.
(3) There is evidence for protein structure evolution as convergent and divergent, for protein structures themselves to be both stable and unstable and for protein structure space to be both complete and incomplete, both discrete and continuous.
(4) Moving forward, considering protein structure evolution from as many different points of view and using as many different models as possible seems the most likely route to a complete understanding of the way these structures evolve.
ACKNOWLEDGEMENTS
J. S. was supported to complete this work through an SET Research Training Program (RTP) Stipend and by the University of Tasmania Discipline of Mathematics. Open access publishing facilitated by University of Tasmania, as part of the Wiley ‐ University of Tasmania agreement via the Council of Australian University Librarians.
REFERENCES
- Alexander, P. A. , He, Y. , Chen, Y. , Orban, J. & Bryan, P. N. (2007). The design and characterization of two proteins with 88% sequence identity but different structure and function. Proceedings of the National Academy of Sciences 104, 11963–11968. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Alexander, P. A. , He, Y. , Chen, Y. , Orban, J. & Bryan, P. N. (2009). A minimal sequence code for switching protein structure and function. Proceedings of the National Academy of Sciences 106, 21149–21154. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Alexander, P. A. , Rozak, D. A. , Orban, J. & Bryan, P. N. (2005). Directed evolution of highly homologous proteins with different folds by phage display: implications for the protein folding code. Biochemistry 44, 14045–14054. [DOI] [PubMed] [Google Scholar]
- Alva, V. , Remmert, M. , Biegert, A. , Lupas, A. N. & Soding, J. (2010). A galaxy of folds. Protein Science 19, 124–130. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Alva, V. , Soding, J. & Lupas, A. N. (2015). A vocabulary of ancient peptides at the origin of folded proteins. eLife 4, e09410. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Anderson, T. A. , Cordes, M. H. J. & Sauer, R. T. (2005). Sequence determinants of a conformational switch in a protein structure. Proceedings of the National Academy of Sciences 102, 18344–18349. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Arpino, J. A. J. , Reddington, S. C. , Halliwell, L. M. , Rizkallah, P. J. & Jones, D. D. (2014). Random single amino acid deletion sampling unveils structural tolerance and the benefits of helical registry shift on GFP folding and structure. Structure 22, 889–898. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Barry Roche, D. & Brüls, T. (2015). An assessment of the amount of untapped fold level novelty in under‐sampled areas of the tree of life. Scientific Reports 5, srep14717. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bryan, P. N. & Orban, J. (2010). Proteins that switch folds. Current Opinion in Structural Biology 20, 482–488. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Brylinski, M. & Skolnick, J. (2008). A threading‐based method (FINDSITE) for ligand‐binding site prediction and functional annotation. Proceedings of the National Academy of Sciences of the United States of America 105, 129–134. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Buel, G. R. & Walters, K. J. (2022). Can AlphaFold2 predict the impact of missense mutations on structure? Nature Structural & Molecular Biology 29, 1–2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Çamoĝlu, O. , Can, T. & Singh, A. K. (2006). Integrating multi‐attribute similarity networks for robust representation of the protein space. Bioinformatics 22, 1585–1592. [DOI] [PubMed] [Google Scholar]
- Cao, B. & Elber, R. (2010). Computational exploration of the network of sequence flow between protein structures. Proteins: Structure, Function and Bioinformatics 78, 985–1003. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chen, S. H. , Meller, J. & Elber, R. (2016). Comprehensive analysis of sequences of a protein switch. Protein Science 25, 135–146. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cheng, H. , Schaeffer, R. D. , Liao, Y. , Kinch, L. N. , Pei, J. , Shi, S. , Kim, B. H. & Grishin, N. V. (2014). ECOD: an evolutionary classification of protein domains. PLoS Computational Biology 10(12), e1003926. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chi, P. B. , Kim, D. , Lai, J. K. , Bykova, N. , Weber, C. C. , Kubelka, J. & Liberles, D. A. (2018). A new parameter‐rich structure‐aware mechanistic model for amino acid substitution during evolution. Proteins: Structure, Function and Bioinformatics 86, 218–228. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chothia, C. & Lesk, A. M. (1986). The relation between the divergence of sequence and structure in proteins. The EMBO Journal 5, 823–826. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cossio, P. , Trovato, A. , Pietrucci, F. , Seno, F. , Maritan, A. & Laio, A. (2010). Exploring the universe of protein structures beyond the protein data bank. PLoS Computational Biology 6, e1000957. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Csaba, G. , Birzele, F. & Zimmer, R. (2009). Systematic comparison of SCOP and CATH: a new gold standard for protein structure analysis. BMC Structural Biology 9, 23. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cui, X. , Lu, Z. , Wang, S. , Jing‐Yan Wang, J. & Gao, X. (2016). CMsearch: simultaneous exploration of protein sequence space and structure space improves not only protein homology detection but also protein structure prediction. Bioinformatics 32, i332–i340. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Dai, L. & Zhou, Y. (2011). Characterizing the existing and potential structural space of proteins by large‐scale multiple loop permutations. Journal of Molecular Biology 408, 585–595. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Daniels, N. M. , Kumar, A. , Cowen, L. J. & Menke, M. (2012). Touring protein space with Matt. IEEE/ACM Transactions on Computational Biology and Bioinformatics 9, 286–293. [DOI] [PMC free article] [PubMed] [Google Scholar]
- DePristo, M. A. , Weinreich, D. M. & Hartl, D. L. (2005). Missense meanderings in sequence space: a biophysical view of protein evolution. Nature Reviews Genetics 6, 678–687. [DOI] [PubMed] [Google Scholar]
- Dokholyan, N. V. , Shakhnovich, B. & Shakhnovich, E. I. (2002). Expanding protein universe and its origin from the biological big bang. Proceedings of the National Academy of Sciences 99, 14132–14136. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Edwards, H. & Deane, C. M. (2015). Structural bridges through fold space. PLoS Computational Biology 11, e1004466. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ferrada, E. & Wagner, A. (2008). Protein robustness promotes evolutionary innovations on large evolutionary time‐scales. Proceedings of the Royal Society B: Biological Sciences 275, 1595–1602. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fox, N. K. , Brenner, S. E. & Chandonia, J. M. (2014). SCOPe: structural classification of proteins ‐ extended, integrating SCOP and ASTRAL data and classification of new structures. Nucleic Acids Research 42, D304–D309. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Friedberg, I. & Godzik, A. (2005). Connecting the protein structure universe by using sparse recurring fragments. Structure 13, 1213–1224. [DOI] [PubMed] [Google Scholar]
- Goldstein, R. A. (2011). The evolution and evolutionary consequences of marginal thermostability in proteins. Proteins: Structure, Function and Bioinformatics 79, 1396–1407. [DOI] [PubMed] [Google Scholar]
- Goldstein, R. A. (2013). Population size dependence of fitness effect distribution and substitution rate probed by biophysical model of protein thermostability. Genome Biology and Evolution 5, 1584–1593. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gosavi, S. (2013). Understanding the folding‐function tradeoff in proteins. PLoS One 8, e61222. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Grahnen, J. A., Nandakumar, P., Kubelka, J. & Liberles, D. A. (2011). Biophysical and structural considerations for protein sequence evolution. BMC Evolutionary Biology 11, 361.
- He, Y., Chen, Y., Alexander, P., Bryan, P. N. & Orban, J. (2008). NMR structures of two designed proteins with high sequence identity but different fold and function. Proceedings of the National Academy of Sciences of the United States of America 105, 14412–14417.
- He, Y., Chen, Y., Alexander, P. A., Bryan, P. N. & Orban, J. (2012). Mutational tipping points for switching protein folds and functions. Structure 20, 283–291.
- He, Y., Yeh, D. C., Alexander, P., Bryan, P. N. & Orban, J. (2005). Solution NMR structures of IgG binding domains with artificially evolved high levels of sequence identity but different folds. Biochemistry 44, 14055–14061.
- Hensen, U., Meyer, T., Haas, J., Rex, R., Vriend, G. & Grubmüller, H. (2012). Exploring protein dynamics space: the dynasome as the missing link between protein structure and function. PLoS One 7, e33931.
- Hoffmann, J., Wrabl, J. O. & Hilser, V. J. (2016). The role of negative selection in protein evolution revealed through the energetics of the native state ensemble. Proteins: Structure, Function and Bioinformatics 84, 435–447.
- Holm, L. & Sander, C. (1993). Protein structure comparison by alignment of distance matrices. Journal of Molecular Biology 233, 123–138.
- Holm, L. & Sander, C. (1996a). Mapping the protein universe. Science 273, 595–602.
- Holm, L. & Sander, C. (1996b). The FSSP database: fold classification based on structure‐structure alignment of proteins. Nucleic Acids Research 24, 206–209.
- Hou, J., Sims, G. E., Zhang, C. & Kim, S. H. (2003). A global representation of the protein fold space. Proceedings of the National Academy of Sciences of the United States of America 100, 2386–2390.
- Hu, L. & Chan, K. C. (2017). Extracting coevolutionary features from protein sequences for predicting protein‐protein interactions. IEEE/ACM Transactions on Computational Biology and Bioinformatics 14, 155–166.
- Jeong, J. C., Lin, X. & Chen, X. W. (2011). On position‐specific scoring matrix for protein function prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics 8, 308–315.
- Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., Bridgland, A., Meyer, C., Kohl, S. A. A., Ballard, A. J., Cowie, A., et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589.
- Kaur, G., Iyer, L. M., Subramanian, S. & Aravind, L. (2018). Evolutionary convergence and divergence in archaeal chromosomal proteins and chromo‐like domains from bacteria and eukaryotes. Scientific Reports 8, 6196.
- Kleinman, C. L., Rodrigue, N., Lartillot, N. & Philippe, H. (2010). Statistical potentials for improved structurally constrained evolutionary models. Molecular Biology and Evolution 27, 1546–1560.
- Knudsen, M. & Wiuf, C. (2010). The CATH database. Human Genomics 4(3), 207–212.
- Kumirov, V. K., Dykstra, E. M., Hall, B. M., Anderson, W. J., Szyszka, T. N. & Cordes, M. H. (2018). Multistep mutational transformation of a protein fold through structural intermediates. Protein Science 27, 1767–1779.
- Kummerfeld, S. K. & Teichmann, S. A. (2009). Protein domain organisation: adding order. BMC Bioinformatics 10, 39.
- Lesk, A. M. & Chothia, C. (1980). How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of the globins. Journal of Molecular Biology 136, 225–270.
- Lesk, A. M. & Chothia, C. (1982). Evolution of proteins formed by β‐sheets. Journal of Molecular Biology 160, 325–342.
- Levitt, M. (2009). Nature of the protein universe. Proceedings of the National Academy of Sciences of the United States of America 106, 11079–11084.
- Lupas, A. N., Ponting, C. P. & Russell, R. B. (2001). On the evolution of protein folds: are similar motifs in different protein folds the result of convergence, insertion, or relics of an ancient peptide world? Journal of Structural Biology 134, 191–203.
- Mai, T. L., Hu, G. M. & Chen, C. M. (2016). Visualizing and clustering protein similarity networks: sequences, structures, and functions. Journal of Proteome Research 15, 2123–2131. doi:10.1021/acs.jproteome.5b01031.
- Maynard‐Smith, J. (1970). Natural selection and the concept of a protein space. Nature 225, 563–564.
- Medvedev, K. E., Kinch, L. N., Schaeffer, R. D., Pei, J. & Grishin, N. V. (2021). A fifth of the protein world: Rossmann‐like proteins as an evolutionarily successful structural unit. Journal of Molecular Biology 433(4), 166788.
- Meunier, E. & Broz, P. (2017). Evolutionary convergence and divergence in NLR function and structure. Trends in Immunology 38, 744–757.
- Meyerguz, L., Kleinberg, J. & Elber, R. (2007). The network of sequence flow between protein structures. Proceedings of the National Academy of Sciences of the United States of America 104, 11627–11632.
- Minary, P. & Levitt, M. (2008). Probing protein fold space with a simplified model. Journal of Molecular Biology 375, 920–933.
- Moult, J., Fidelis, K., Kryshtafovych, A., Schwede, T. & Tramontano, A. (2014). Critical assessment of methods of protein structure prediction (CASP) – round X. Proteins: Structure, Function and Bioinformatics 82, 1–6.
- Nepomnyachiy, S., Ben‐Tal, N. & Kolodny, R. (2014). Global view of the protein universe. Proceedings of the National Academy of Sciences of the United States of America 111, 11691–11696.
- Orlenko, A., Teufel, A. I., Chi, P. B. & Liberles, D. A. (2016). Selection on metabolic pathway function in the presence of mutation‐selection‐drift balance leads to rate‐limiting steps that are not evolutionarily stable. Biology Direct 11, 31.
- Ortiz, A. R., Strauss, C. E. & Olmea, O. (2009). MAMMOTH (Matching molecular models obtained from theory): an automated method for model comparison. Protein Science 11, 2606–2621.
- Pandit, S. B. & Skolnick, J. (2008). Fr‐TM‐align: a new protein structural alignment method based on fragment alignments and the TM‐score. BMC Bioinformatics 9, 531.
- Pascual‐García, A., Abia, D., Ortiz, Á. R. & Bastolla, U. (2009). Cross‐over between discrete and continuous protein structure space: insights into automatic classification and networks of protein structures. PLoS Computational Biology 5, e1000331.
- Porter, L. L., He, Y., Chen, Y., Orban, J. & Bryan, P. N. (2015). Subdomain interactions foster the design of two protein pairs with 80% sequence identity but different folds. Biophysical Journal 108, 154–162.
- Porter, L. L. & Looger, L. L. (2018). Extant fold‐switching proteins are widespread. Proceedings of the National Academy of Sciences of the United States of America 115, 5968–5973.
- Rackovsky, S. (1990). Quantitative organization of the known protein x‐ray structures. I. Methods and short‐length‐scale results. Proteins: Structure, Function and Bioinformatics 7, 378–402.
- Rackovsky, S. (2009). Sequence physical properties encode the global organization of protein structure space. Proceedings of the National Academy of Sciences of the United States of America 106, 14345–14348.
- Rahman, R. S. & Rackovsky, S. (1995). Protein sequence randomness and sequence/structure correlations. Biophysical Journal 68, 1531–1539.
- Rockah‐Shmuel, L., Tóth‐Petróczy, Á. & Tawfik, D. S. (2015). Systematic mapping of protein mutational space by prolonged drift reveals the deleterious effects of seemingly neutral mutations. PLoS Computational Biology 11, e1004421.
- Roessler, C. G., Hall, B. M., Anderson, W. J., Ingram, W. M., Roberts, S. A., Montfort, W. R. & Cordes, M. H. J. (2008). Transitive homology‐guided structural studies lead to discovery of Cro proteins with 40% sequence identity but different folds. Proceedings of the National Academy of Sciences of the United States of America 105, 2343–2348.
- Rost, B. (1999). Twilight zone of protein sequence alignments. Protein Engineering 12, 85–94.
- Sadreyev, R. I., Kim, B. H. & Grishin, N. V. (2009). Discrete‐continuous duality of protein structure space. Current Opinion in Structural Biology 19, 321–328.
- Shen, H. B. & Chou, K. C. (2008). PseAAC: a flexible web server for generating various kinds of protein pseudo amino acid composition. Analytical Biochemistry 373, 386–388.
- Shindyalov, I. N. & Bourne, P. E. (2000). An alternative view of protein fold space. Proteins: Structure, Function and Genetics 38, 247–260.
- Skolnick, J., Arakaki, A. K., Lee, S. Y. & Brylinski, M. (2009). The continuity of protein structure space is an intrinsic property of proteins. Proceedings of the National Academy of Sciences of the United States of America 106, 15690–15695.
- Skolnick, J., Zhou, H. & Brylinski, M. (2012). Further evidence for the likely completeness of the library of solved single domain protein structures. Journal of Physical Chemistry B 116, 6654–6664.
- Srivastava, S., Lal, S. B., Mishra, D. C., Angadi, U. B., Chaturvedi, K. K., Rai, S. N. & Rai, A. (2016). An efficient algorithm for protein structure comparison using elastic shape analysis. Algorithms for Molecular Biology 11, 27.
- Stewart, K. L., Dodds, E. D., Wysocki, V. H. & Cordes, M. H. J. (2013). A polymetamorphic protein. Protein Science 22, 641–649.
- Swindells, M. B., Orengo, C. A., Jones, D. T., Hutchinson, E. G. & Thornton, J. M. (1998). Contemporary approaches to protein structure classification. BioEssays 20(11), 884–891.
- Sykes, J., Holland, B. R. & Charleston, M. A. (2020). Benchmarking methods of protein structure alignment. Journal of Molecular Evolution 88, 575–597.
- Taverna, D. M. & Goldstein, R. A. (2002). Why are proteins marginally stable? Proteins: Structure, Function and Genetics 46, 105–109.
- Taylor, M. S., Ponting, C. P. & Copley, R. R. (2004). Occurrence and consequences of coding sequence insertions and deletions in mammalian genomes. Genome Research 14, 555–566.
- Taylor, W. R. (2008). Protein structure comparison using iterated double dynamic programming. Protein Science 8, 654–665.
- Taylor, W. R., Chelliah, V., Hollup, S. M., MacDonald, J. T. & Jonassen, I. (2009). Probing the "dark matter" of protein fold space. Structure 17, 1244–1252.
- Tian, P. & Best, R. B. (2017). How many protein sequences fold to a given structure? A coevolutionary analysis. Biophysical Journal 113, 1719–1730.
- Tobi, D. (2018). Dynamics based clustering of globin family members. PLoS One 13, e0208465.
- Tokuriki, N. & Tawfik, D. S. (2009). Protein dynamism and evolvability. Science 324, 203–207.
- Tung, C. H., Chen, C. W., Guo, R. C., Ng, H. F. & Chu, Y. W. (2016). QuaBingo: a prediction system for protein quaternary structure attributes using block composition. BioMed Research International 2016, 9480276.
- Valavanis, I., Spyrou, G. & Nikita, K. (2010). A similarity network approach for the analysis and comparison of protein sequence/structure sets. Journal of Biomedical Informatics 43, 257–267.
- Veeramalai, M. & Gilbert, D. (2008). A novel method for comparing topological models of protein structures enhanced with ligand information. Bioinformatics 24, 2698–2705.
- Vetter, I. R., Baase, W. A., Heinz, D. W., Xiong, J. P., Snow, S. & Matthews, B. W. (1996). Protein structural plasticity exemplified by insertion and deletion mutants in T4 lysozyme. Protein Science 5, 2399–2415.
- Wan, X. & Tan, X. (2019). A study on separation of the protein structural types in amino acid sequence feature spaces. PLoS One 14, e0226768.
- Wan, X., Zhao, X. & Yau, S. S. (2017). An information‐based network approach for protein classification. PLoS One 12, e0174386.
- Wilke, C. O., Wang, J. L., Ofria, C., Lenski, R. E. & Adami, C. (2001). Evolution of digital organisms at high mutation rates leads to survival of the flattest. Nature 412, 331–333.
- Williams, P. D., Pollock, D. D. & Goldstein, R. A. (2006). Functionality and the evolution of marginal stability in proteins: inferences from lattice simulations. Evolutionary Bioinformatics 2, 91–101.
- Wilson, C. A., Kreychman, J. & Gerstein, M. (2000). Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. Journal of Molecular Biology 297, 233–249.
- Wrabl, J. O. & Hilser, V. J. (2010). Investigating homology between proteins using energetic profiles. PLoS Computational Biology 6, e1000722.
- Wuchty, S. (2001). Scale‐free behavior in protein domain networks. Molecular Biology and Evolution 18, 1694–1702.
- Xu, J. & Zhang, J. (2016). Impact of structure space continuity on protein fold classification. Scientific Reports 6, 23263.
- Yau, S. S., Mao, W. G., Benson, M. & He, R. L. (2015). Distinguishing proteins from arbitrary amino acid sequences. Scientific Reports 5, 7972.
- Ye, Y. & Godzik, A. (2004). Comparative analysis of protein domain organization. Genome Research 14, 343–353.
- Yu, C., Deng, M., Cheng, S. Y., Yau, S. C., He, R. L. & Yau, S. S. (2013). Protein space: a natural method for realizing the nature of protein universe. Journal of Theoretical Biology 318, 197–204.
- Zeldovich, K. B. & Shakhnovich, E. I. (2008). Understanding protein evolution: from protein physics to Darwinian selection. Annual Review of Physical Chemistry 59, 105–127.
- Zhang, Y., Hubner, I. A., Arakaki, A. K., Shakhnovich, E. & Skolnick, J. (2006). On the origin and highly likely completeness of single‐domain protein structures. Proceedings of the National Academy of Sciences of the United States of America 103, 2605–2610.
- Zhang, Y. & Skolnick, J. (2005). TM‐align: a protein structure alignment algorithm based on the TM‐score. Nucleic Acids Research 33, 2302–2309.
