Ancestral sequences of a large promiscuous enzyme family correspond to bridges in sequence space in a network representation

Patrick C F Buchholz; Bert van Loo; Bernard D G Eenink; Erich Bornberg-Bauer; Jürgen Pleiss

doi:10.1098/rsif.2021.0389

2021 Nov 3;18(184):20210389. doi: 10.1098/rsif.2021.0389

PMCID: PMC8564622 PMID: 34727710

Abstract

Evolutionary relationships of protein families can be characterized either by networks or by trees. Whereas trees allow for hierarchical grouping and reconstruction of the most likely ancestral sequences, networks lack a time axis but allow for thresholds of pairwise sequence identity to be chosen and, therefore, the clustering of family members with presumably more similar functions. Here, we use the large family of arylsulfatases and phosphonate monoester hydrolases to investigate similarities, strengths and weaknesses in tree and network representations. For varying thresholds of pairwise sequence identity, values of betweenness centrality and clustering coefficients were derived for nodes of the reconstructed ancestors to measure the propensity to act as a bridge in a network. Based on these properties, ancestral protein sequences emerge as bridges in protein sequence networks. Interestingly, many ancestral protein sequences appear close to extant sequences. Therefore, reconstructed ancestor sequences might also be interpreted as yet-to-be-identified homologues. The concept of ancestor reconstruction is compared to consensus sequences, too. It was found that hub sequences in a network, e.g. reconstructed ancestral sequences that are connected to many neighbouring sequences, share closer similarity with derived consensus sequences. Therefore, some reconstructed ancestor sequences can also be interpreted as consensus sequences.

Keywords: ancestor reconstruction, consensus sequence, network biology, phylogeny, protein evolution

1. Background

Protein sequences can be characterized and analysed by multiple computational approaches. A major concept for grasping the complexity of protein evolution and of the relationship between protein structure and function is the grouping of proteins into families [1]. Conceptually, protein families are assumed to have arisen from a common ancestor by means of gene duplication followed by divergence in sequence and function [2–4], while the structure remains better conserved in general [5,6]. Conversely, to some extent, this apparent conservation is a consequence of structure limiting the extent to which sequence changes are tolerated before a protein becomes non-functional. Consequently, continuous mutational paths [7] transforming one structure into another without rendering the protein non-functional [8] are believed to be rare [9]. This separation of protein families in sequence space has been postulated by simulations using coarse-grained biophysical models [10,11]. However, mutational paths within a structural family that gradually change protein function have been repeatedly inferred in experiments and by reconstructing the evolutionary history of enzyme families [4,12,13]. The detailed nature of such transitions is difficult to investigate [14,15], but finer scale changes between members within a family are more amenable to both computational and experimental studies [16]. Because the evolution of protein sequences occurs predominantly in small successive steps of substitutions and indels, it has been discussed how much of sequence space has been explored during evolution [17] and whether there are continuous mutational paths connecting protein families [7].

Two computational methods are frequently used to characterize protein families: protein sequence networks and phylogenetic trees. A tree builds on the assumption of gradual divergence that can be placed in a hierarchical context and reflects evolutionary history. Binary trees can be regarded as a special case of network topology without reticulation. Phylogenetic trees are generated from a set of all theoretically feasible trees and selected by a criterion such as maximum likelihood [18], whereas the protein sequence networks used herein are derived from pairwise alignments based on a dynamic algorithm from Needleman & Wunsch [19]. A network does not make any assumptions about evolutionary relationships between protein sequences but describes their relation by some measure of similarity. Atkinson et al. [20] were among the first who visualized and interpreted protein families by networks as particularly instructive representations for large and diverse sequence datasets. In a protein sequence network, the nodes represent individual sequences, and two nodes are connected by an edge if the similarity of two sequences exceeds a threshold. Protein families are formed by clusters or sub-networks within the overall network, and sequences and families are connected by multiple paths which indicate alternative evolutionary pathways [21]. For various protein families of different sequence length, structural fold and domain arrangement, it was found that their protein sequence networks were heterogeneous with few highly connected sequences (hub sequences) and a large number of smaller and less connected sequences; the degree distributions, the number of nodes with a given number of neighbours in a network, followed a power law distribution with similar scaling exponents [22]. The hub sequences have many functional neighbours and are thus expected to be robust toward possible deleterious effects of mutations. 
Furthermore, the cluster size of different protein families followed a similar power law distribution, in accordance with percolation theory, strongly supporting overall connectedness of extant sequence space [23].

On the other hand, phylogenetic trees allow for reconstructing the likely ancestors of protein families, which in turn provides a glimpse into the past of a protein family's function [4,12,13]. A key factor for creating evolutionary opportunities seems to be promiscuity, in particular for enzymes [24]. While multiple latent functions might exist in an enzyme variant that are not physiologically relevant, they may become selected for upon a change in environmental conditions [25]. These ancestral proteins can be reconstructed using maximum-likelihood methods applied to protein families, the resulting most likely ancestral genes can be synthesized, and their gene products can be expressed and characterized in the laboratory [26].

Reconstructing ancestral sequences has also been proposed as a strategy to design (thermo)stable proteins under the assumption of thermophilic common ancestor organisms. Using ancestral sequence reconstruction, active and highly stable enzymes have been obtained [27,28]. In an alternative approach, (thermo)stable proteins were designed following the consensus of conserved sequence regions, i.e. by deriving a characteristic average sequence [29]. In back-to-consensus design, amino acids are substituted by consensus residues, the residues most frequently occurring at a given position in a multiple sequence alignment (MSA) of known sequences [30]. Back-to-consensus mutations resulted in increased thermostability in some enzymes with unchanged catalytic properties [31]. Interestingly, most ancestor mutations overlap consensus mutations [32]. Due to (thermo)stability and promiscuity that have been observed in ancestral enzymes, ancestral enzymes have been interpreted as promising starting points for directed evolution or protein design [33].

Here, we compare network versus tree representations of protein families. Specifically, we study the large families of arylsulfatases (ASs) and phosphonate monoester hydrolases (PMHs), two members of the sulfatase family that share a common ancestor [34]. ASs and PMHs have alkaline phosphatase-like folds with conserved active site residues and are between 500 and 600 amino acids long. Previously, the phylogeny of these two sulfatase subfamilies was visualized by a maximum-likelihood phylogenetic tree, with annotated choline sulfatases (CSs) serving as an outgroup [35–37]. In an alternative approach, the distances for homologues of CSs could be treated by weights in the phylogenetic tree. As we focus on the two families of ASs and PMHs, we ignore CSs for further analyses. Here, we focus on the role of reconstructed ancestral sequences in protein sequence networks to compare essential features of phylogenetic representations. We would like to understand the role of the previously reconstructed ancestor sequences in protein sequence networks and the evolutionary interpretations of these properties. We finally discuss implications, generality and limitations of our results by comparing them to earlier studies employing model systems that enabled a complete coverage of sequence space using coarse-grained structural representations of proteins [10,11,38].

2. Methods

2.1. Dataset

The dataset used herein was selected from previous work and comprises 95 protein sequences annotated as homologues of ASs, and 85 protein sequences annotated as homologues of PMHs [35–37]. In addition to these 180 protein sequences, 58 ancestral protein sequences had been reconstructed from the maximum-likelihood phylogenetic tree in [35–37], increasing the overall sample size to 238 protein sequences. The outgroup of CS homologues from the original phylogenetic tree was ignored for the analyses presented herein.

2.2. Ancestral sequence reconstruction

The coding sequences of the 58 ancestral enzymes were inferred by maximum likelihood (PAML 4) [39], using the previously published maximum-likelihood tree and a 3D-Coffee-generated MSA of the 267 protein sequences mentioned above [35,36]. The coding DNA sequences of each of the 267 proteins were obtained from the NCBI RefSeq database [40], and each sequence was labelled with the respective names of the corresponding amino acid sequence. These coding sequences and their corresponding amino acid sequences were used as input for the TranslatorX server [41]. The resulting nucleotide sequence alignment and the phylogenetic tree generated from the amino acid sequences using RAxML HPC2 8.0.24 [35,36,42] were used as input for the codeml program of the PAML software package [39]. The resulting ancestral DNA sequences were manually curated to correct for the inability of PAML to deal with insertions and deletions (indels), i.e. PAML calculates a most likely amino acid for each position in the alignment regardless of the presence of indel characters. The manual curation of the ancestors was based on the maximum-parsimony principle, i.e. assuming the lowest possible number of indel events across the tree (see electronic supplementary material, figure S15 for an example). The positions of the reconstructed ancestors in the maximum-likelihood tree are indicated in the electronic supplementary material, figure S16. The protein sequences, including the reconstructed ancestor sequences, are available as a FASTA file under https://doi.org/10.18419/darus-1801.

2.3. Network construction and visualization

Pairs of protein sequences were aligned by the Needleman–Wunsch algorithm implemented in the EMBOSS software suite (v. 6.6.0) applying the BLOSUM62 substitution matrix with gap opening and gap extension penalties of 10 and 0.5, respectively [19,43]. Protein sequence networks were constructed as undirected graphs with edge weights of pairwise global sequence identity (given as percentage) from Needleman–Wunsch alignments. The topology of a protein sequence network and the separation of sub-networks or clusters depends on the threshold that is used to select the edges. Here, edges were selected by thresholds of sequence identity, thereby discarding unconnected nodes from the network.
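As an illustration of how a pairwise global sequence identity can be obtained, the following is a minimal Python sketch of a Needleman–Wunsch alignment. It is a simplification and not the EMBOSS implementation used here: it assumes a plain match/mismatch score and a linear gap penalty instead of BLOSUM62 with gap opening and extension penalties of 10 and 0.5, and it assumes that identity is reported as identical columns divided by alignment length. The function name `global_identity` and its parameters are hypothetical.

```python
def global_identity(a, b, match=1, mismatch=-1, gap=-2):
    """Percent identity over the columns of one optimal global alignment.

    Assumption: simple match/mismatch scores and a linear gap penalty,
    not BLOSUM62 with affine gaps as in the EMBOSS settings of the text.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back one optimal alignment and count identical columns.
    i, j, identical, columns = n, m, 0, 0
    while i > 0 or j > 0:
        pair = match if (i > 0 and j > 0 and a[i - 1] == b[j - 1]) else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            identical += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
        columns += 1
    return 100.0 * identical / columns if columns else 0.0


print(global_identity("ACDE", "ACE"))  # 3 identical columns out of 4
```

With a substitution matrix and affine gap penalties the optimal alignment, and therefore the identity value, can differ from this simplified scoring.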

Networks were stored in GraphML format using the NetworkX library in Python (v. 1.9) to assign node and edge attributes [44]. The complete protein sequence network with 238 nodes and 28 203 edges is available as GraphML file under https://doi.org/10.18419/darus-817.
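The edge selection by threshold can be sketched in a few lines of Python. This is an illustrative sketch and not the code used for this study: it uses plain dictionaries rather than NetworkX, the sequence names and identity values are invented, and it assumes that an edge is kept when its identity is greater than or equal to the threshold (the text does not specify whether the comparison is strict). The helper `complete_edge_count` reproduces the number of edges in the complete network of 238 nodes.

```python
def complete_edge_count(n):
    """Number of edges in a complete undirected graph with n nodes."""
    return n * (n - 1) // 2


def threshold_network(identity, threshold):
    """Keep edges with identity >= threshold (assumed comparison).

    identity maps (u, v) pairs to percent identity. Nodes without any
    retained edge do not appear in the result, which mirrors discarding
    unconnected nodes from the network.
    """
    adjacency = {}
    for (u, v), value in identity.items():
        if value >= threshold:
            adjacency.setdefault(u, set()).add(v)
            adjacency.setdefault(v, set()).add(u)
    return adjacency


# Hypothetical toy data, not taken from the AS/PMH dataset.
identity = {
    ("seq_a", "seq_b"): 81.0,
    ("seq_a", "seq_c"): 57.5,
    ("seq_b", "seq_c"): 49.0,
    ("seq_c", "seq_d"): 30.2,
}
network = threshold_network(identity, 56)
print(sorted(network))           # seq_d is dropped as unconnected
print(complete_edge_count(238))  # 28203
```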

Protein sequence networks were visualized in the software Cytoscape (v. 3.8.2) by prefuse force-directed layout with respect to the edge weights, i.e. sequence pairs with higher identity were preferably placed in closer vicinity to each other [45].

The software Visone (v. 2.17) was used for visualization of the backbone layout [46,47]. For the backbone layout, edges were ranked in order of decreasing sequence identity and a chosen fraction of the highest ranked edges was selected as ‘backbone’ while maintaining connectivity of the network.

2.4. Statistical properties of network nodes

The software Cytoscape mentioned above was used to derive the following node properties: betweenness centrality, degree centrality and clustering coefficient.

The term betweenness centrality was formally derived for individual nodes in social networks [48] to find the shortest possible paths between the nodes of a network, i.e. the minimal number of edges between all pairs of nodes: for a given node, the betweenness centrality equals the fraction of shortest paths that pass through that node over all shortest paths. Nodes with a comparably high betweenness centrality are bridges between the remaining nodes, as exemplified in the electronic supplementary material, figure S5.
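The definition above can be made concrete with a short Python sketch. It is not the Cytoscape implementation; it assumes an unweighted, undirected network, uses the breadth-first accumulation scheme of Brandes as a swapped-in standard algorithm, and normalizes by the number of node pairs that exclude the node under consideration, (n − 1)(n − 2)/2. The function name and the toy network (two triangles joined by a single node `anc`) are hypothetical.

```python
from collections import deque


def betweenness_centrality(adjacency):
    """Normalized betweenness for an unweighted, undirected graph.

    adjacency maps each node to the set of its neighbours; every
    neighbour must itself be a key of the dictionary.
    """
    nodes = list(adjacency)
    raw = dict.fromkeys(nodes, 0.0)
    for source in nodes:
        # Breadth-first search counting shortest paths from source.
        sigma = dict.fromkeys(nodes, 0)
        sigma[source] = 1
        dist = {source: 0}
        preds = {v: [] for v in nodes}
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path fractions, starting from the farthest nodes.
        delta = dict.fromkeys(nodes, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                raw[w] += delta[w]
    n = len(nodes)
    pairs = (n - 1) * (n - 2) / 2
    # Each undirected pair was visited from both of its end points.
    return {v: (raw[v] / 2) / pairs if pairs > 0 else 0.0 for v in nodes}


# Two triangles joined only through the node "anc".
toy = {
    "a1": {"a2", "a3", "anc"}, "a2": {"a1", "a3"}, "a3": {"a1", "a2"},
    "p1": {"p2", "p3", "anc"}, "p2": {"p1", "p3"}, "p3": {"p1", "p2"},
    "anc": {"a1", "p1"},
}
centrality = betweenness_centrality(toy)
print(max(centrality, key=centrality.get))  # anc
```

In this toy network all nine shortest paths between the two triangles pass through `anc`, out of 15 node pairs that exclude it, so `anc` obtains the highest value.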

The term degree centrality describes the number of direct neighbours of a given node. The clustering coefficient of a node is a further property that helps to identify ‘bridges’ in a network [49]. Let n_i be the number of edges between the direct neighbours of node i and d_i the degree centrality of node i. Then, the clustering coefficient c_i is defined as

c_{i} = \frac{n_{i}}{(1 / 2) d_{i} (d_{i} - 1)}

with the denominator being the number of theoretically possible edges between the neighbours of i (only defined for degree centrality d_i greater than one), as exemplified in the electronic supplementary material, figure S6. If c_i equals zero, the neighbours of node i are only linked by a path through node i. If c_i equals one, all neighbours of node i are directly linked to each other. Thus, bridges tend to have comparably small clustering coefficients. Whereas the betweenness centrality of a given node takes the whole network into account (by counting shortest paths between all remaining nodes), the clustering coefficient focuses on the local neighbourhood of a given node.
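The clustering coefficient as defined above can be computed directly from an adjacency structure. The following Python sketch is illustrative only; the function name and the toy network (two triangles joined by a single node `anc`) are hypothetical, and the function returns `None` where c_i is undefined.

```python
def clustering_coefficient(adjacency, node):
    """c_i = n_i / ((1/2) * d_i * (d_i - 1)) for an undirected graph.

    adjacency maps each node to the set of its neighbours.
    Returns None if the degree d_i is below two (c_i undefined).
    """
    neighbours = adjacency[node]
    d_i = len(neighbours)
    if d_i < 2:
        return None
    # n_i: edges among the neighbours; each edge is seen from both ends.
    n_i = sum(1 for u in neighbours
              for v in adjacency[u] if v in neighbours) // 2
    return n_i / (0.5 * d_i * (d_i - 1))


toy = {
    "a1": {"a2", "a3", "anc"}, "a2": {"a1", "a3"}, "a3": {"a1", "a2"},
    "p1": {"p2", "p3", "anc"}, "p2": {"p1", "p3"}, "p3": {"p1", "p2"},
    "anc": {"a1", "p1"},
}
print(clustering_coefficient(toy, "anc"))  # 0.0, neighbours not linked
print(clustering_coefficient(toy, "a2"))   # 1.0, neighbours linked
```

The bridging node `anc` has a clustering coefficient of zero because its two neighbours are connected only through it, whereas nodes inside a triangle reach values up to one.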

Spearman's correlation coefficients were determined by MATLAB (v. R2020a, The MathWorks, Natick, MA, USA) using the implemented Statistics and Machine Learning Toolbox (v. 11.7).
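For readers without access to MATLAB, Spearman's rank correlation coefficient can be sketched in plain Python as the Pearson correlation of the ranks. This is an illustrative sketch and not the toolbox implementation; it assumes that tied values receive their average rank, and the function names are hypothetical.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation of the ranks of x and y (equal lengths).

    Raises ZeroDivisionError if either variable is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Monotonically decreasing toy data give a coefficient close to -1.
print(spearman([1, 2, 3, 4], [16, 9, 4, 1]))
```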

2.5. Closest homologues of reconstructed ancestors

To identify the closest homologue for a given reconstructed ancestor from [35–37], the Basic Local Alignment Search Tool (BLAST) was used to scan the NCBI's non-redundant protein database [50,51] (as available on 4 August 2020). The closest BLAST hit was identified by a minimal expectation value and maximal query coverage. To derive global sequence identity, in contrast with the local sequence identity provided by BLAST, Needleman–Wunsch alignments between a reconstructed ancestor from [35–37] and its closest BLAST hit were applied using the implementation and settings mentioned above.
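The selection of the closest hit can be sketched as a simple ranking. The following Python snippet is illustrative only: it assumes that the expectation value is the primary criterion and that query coverage breaks ties, which is one possible reading of the selection rule stated above. The field names and accessions are hypothetical and do not correspond to a specific BLAST output format.

```python
def closest_hit(hits):
    """Hit with minimal e-value; ties broken by maximal query coverage.

    Assumption: the e-value takes precedence over the query coverage.
    """
    return min(hits, key=lambda hit: (hit["evalue"], -hit["query_coverage"]))


# Hypothetical hits, not actual search results.
hits = [
    {"accession": "hit_A", "evalue": 1e-150, "query_coverage": 92.0},
    {"accession": "hit_B", "evalue": 0.0, "query_coverage": 97.0},
    {"accession": "hit_C", "evalue": 0.0, "query_coverage": 99.0},
]
best = closest_hit(hits)
print(best["accession"])  # hit_C
```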

2.6. Derivation of consensus sequences

An MSA was constructed by Clustal Omega (v. 1.2.4) for the 95 protein sequences annotated as homologues of ASs and the 85 protein sequences annotated as homologues of PMHs, respectively [52]. The HMMER software suite (v. 3.1.b2, http://hmmer.org) was used to derive a profile hidden Markov model (HMM) from an MSA via the command hmmbuild and to derive a consensus sequence from a profile HMM via the command hmmemit. The derived AS and PMH consensus sequences, the underlying MSAs and profile HMMs for ASs and PMHs are available under https://doi.org/10.18419/darus-1838.
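To illustrate the idea of a consensus sequence, the following Python sketch derives a simple majority-rule consensus directly from the columns of an MSA. This is a swapped-in simplification and not the procedure used here: a consensus emitted from a profile HMM is based on the model's estimated probabilities and can differ from a plain column majority. The function name and the toy alignment are hypothetical.

```python
from collections import Counter


def majority_consensus(alignment, gap="-"):
    """Most frequent character per column; gap-majority columns are skipped.

    alignment is a list of equally long aligned sequences. In case of a
    tie, the character that occurs first in the column is chosen.
    """
    consensus = []
    for column in zip(*alignment):
        residue, _count = Counter(column).most_common(1)[0]
        if residue != gap:
            consensus.append(residue)
    return "".join(consensus)


# Hypothetical toy alignment.
msa = ["ACD-F", "ACDEF", "AC-EF", "GCDEF"]
print(majority_consensus(msa))  # ACDEF
```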

3. Results

3.1. Protein sequence networks and the maximum-likelihood phylogenetic tree

Protein sequences from the previously published phylogenetic tree [35–37], including 180 characterized or putative enzymes annotated as ASs or PMHs and 58 reconstructed ancestors, were characterized and visualized as protein sequence networks. The pairwise global sequence identity derived from a Needleman–Wunsch alignment was used as edge weight, i.e. a measure for the strength of a link between two sequences. At a threshold of 50% sequence identity, the 180 characterized or putative ASs and PMHs and the 58 reconstructed ancestors formed a single network of 238 nodes connected by 6694 edges (electronic supplementary material, figure S1). With increasing threshold, i.e. requiring higher sequence identity to connect two sequences, the number of edges gradually decreased. At a threshold of 56% sequence identity, the 238 nodes were still connected by 4855 edges (figure 1). The two sub-networks annotated as AS and PMH homologues were connected only by the last common ancestor of ASs and PMHs named AncAS-PMH. The separation into two AS and PMH sub-networks is in accordance with the topology of the phylogenetic tree that formed two main clades annotated as AS and PMH [35–37], and clusters in the protein sequence network corresponded to sub-branches in the phylogenetic tree (electronic supplementary material, figure S3). In this network, AncAS-PMH was connected to nine other nodes (including the reconstructed ancestors of ASs and PMHs named AncAS and AncPMH, respectively) resulting in alternative paths between the AS and the PMH sub-networks. In the phylogenetic tree, however, AncAS-PMH can only be connected to the two reconstructed ancestors named AncAS and AncPMH. For thresholds above 56% sequence identity, several disconnected networks (clusters) emerged, and the average cluster size decreased (electronic supplementary material, figure S2).
The 58 reconstructed ancestors for AS and PMH homologues emerged mostly as bridges between more densely connected sub-networks (electronic supplementary material, figure S1). Edges that linked the reconstructed ancestors appeared as the backbone of the protein sequence network (electronic supplementary material, figure S4).
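The break-up of a network into clusters with increasing threshold can be sketched as a count of connected components. The following Python snippet is illustrative and uses invented identity values, not the AS/PMH data; it assumes that edges are kept at identities greater than or equal to the threshold and, unlike the networks described here, counts unconnected nodes as clusters of size one. The function names are hypothetical.

```python
def count_components(nodes, edges):
    """Number of connected components, using a union-find structure."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(v) for v in nodes})


def components_by_threshold(nodes, identity, thresholds):
    """Map each threshold to the component count of the filtered network."""
    return {
        t: count_components(
            nodes, [pair for pair, value in identity.items() if value >= t])
        for t in thresholds
    }


# Hypothetical toy data: a chain of four sequences.
nodes = ["a", "b", "c", "d"]
identity = {("a", "b"): 80.0, ("b", "c"): 60.0, ("c", "d"): 52.0}
print(components_by_threshold(nodes, identity, [50, 56, 70]))
# {50: 1, 56: 2, 70: 3}
```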

Figure 1. — (a) The maximum-likelihood phylogenetic tree for the protein sequences from [35–37], including 85 homologues of PMHs depicted in red, 95 homologues of ASs depicted in blue and 58 exemplary reconstructed ancestral sequences depicted as dots. The last reconstructed ancestors are labelled AncPMH, AncAS-PMH and AncAS, respectively, and depicted in magenta. The annotated sequences of characterized enzymes are listed on the right with Protein Data Bank accessions in brackets. (b) At a threshold of 56% pairwise global sequence identity, the 238 protein sequences are connected by 4855 edges forming a single protein sequence network. PMH homologues are depicted as triangles, AS homologues as squares and reconstructed ancestors as dots. Different sub-networks can be annotated within the overall network (electronic supplementary material, figure S3).

3.2. Network properties of ancestral sequences

In the connected network formed at a threshold of 56% pairwise sequence identity, the nodes representing the 58 reconstructed ancestors formed bridges between the different sub-networks (figure 1). The bridging characteristics of nodes representing the reconstructed ancestors were quantified by two network properties, the betweenness centrality and the clustering coefficient (see electronic supplementary material, figures S5 and S6 for illustrative examples). The betweenness centrality measures the propensity of a node to act as a bridge between all other nodes in a network. For each node representing the reconstructed ancestor, the betweenness centrality was compared to its relative distance (shortest path) to the last common ancestor AncAS-PMH. At a threshold of 56% sequence identity, the betweenness centrality correlated with the relative distance to AncAS-PMH with a Spearman's rank correlation coefficient of −0.81. Thus, for older reconstructed ancestors that were closer to AncAS-PMH, higher betweenness centralities were observed, implying that the older a reconstructed ancestor the more likely it forms a bridge (figure 2). This observation was also supported by the correlation between the clustering coefficient of a node and its relative distance to AncAS-PMH with a Spearman's rank correlation coefficient of 0.69. The clustering coefficient measures the propensity of a node to act as a bridge between neighbouring nodes in a network, with smaller values hinting at bridges. By contrast, no correlation was observed between the degree centrality (the number of neighbours of a node) and the relative distance to AncAS-PMH (electronic supplementary material, figure S7).

Figure 2. — The protein sequence network generated at a threshold of 56% sequence identity was used to compare different properties for the nodes representing the 58 reconstructed ancestors and their relative distance from the last common ancestor AncAS-PMH in the phylogenetic tree (figure 1). The betweenness centrality ((a), compare with electronic supplementary material, figure S5) and the clustering coefficient ((b), compare with electronic supplementary material, figure S6) were determined to measure the propensity to act as a bridge in a network. The three deepest (oldest) reconstructed ancestors AncPMH, AncAS and AncAS-PMH are depicted as triangles, upside-down triangles and diamonds, respectively.

The dependencies of the betweenness centrality, the clustering coefficient and the degree centrality on the threshold were analysed for the three oldest reconstructed ancestors, AncAS-PMH, AncAS and AncPMH (electronic supplementary material, figures S8–S10). The last common ancestor, AncAS-PMH, remained connected for thresholds up to 63% sequence identity with almost constant betweenness centrality and became disconnected at 64% sequence identity (electronic supplementary material, figure S8). The disruption of the network was also demonstrated by sudden changes in the clustering coefficient. By contrast, the degree centrality gradually decreased with an increasing threshold value.

The disruption of the network at thresholds greater than 64% sequence identity resulted in an abrupt increase of the betweenness centrality of the last common PMH ancestor, named AncPMH (electronic supplementary material, figure S9), which became the central bridge between the three PMH sub-clusters (electronic supplementary material, figures S1 and S2). At a threshold of 73% sequence identity, the PMH network disrupted. Consequently, the betweenness centrality of AncPMH dropped and the clustering coefficient increased.

In contrast to AncPMH, the reconstructed last common AS ancestor, named AncAS (electronic supplementary material, figure S10), had a maximum betweenness centrality just before the disruption of the network at 63% sequence identity, which was similar to the value for AncAS-PMH. At 63% sequence identity, AncAS showed its lowest clustering coefficient, which indicated its role as a bridge between neighbouring nodes. However, for thresholds greater than 64% sequence identity, the betweenness centrality was low, indicating that other nodes played the roles of bridges between the AS sub-clusters (electronic supplementary material, figures S1 and S2).

3.3. Homologues of ancestral sequences

BLAST was applied to search for homologous protein sequences in the NCBI's non-redundant protein database using each of the 58 reconstructed ancestors as query sequences, and global sequence identity between each query sequence and its closest BLAST hit was derived by a pairwise Needleman–Wunsch alignment. The relative distance of each ancestor to the last common ancestor, AncAS-PMH, correlated with the sequence identity to its closest BLAST hit with a Spearman's rank correlation coefficient of 0.87 (electronic supplementary material, figure S11). Thus, close homologues with almost 100% global sequence identity were found for the younger ancestors, whereas the hits for the oldest ancestors AncPMH, AncAS and AncAS-PMH were found at global sequence identities of 68%, 65% and 48%, respectively. Thus, the sequence space close to the older ancestors appeared more sparsely populated in contrast with the sequence space close to the younger ancestors.

3.4. Hubs, ancestors and consensus sequences

AS and PMH consensus sequences were derived from profile HMMs for the 95 AS homologues and the 85 PMH homologues, respectively. For the connected protein sequence network at a threshold of 56% sequence identity, the degree centralities were derived for each node. The sequence identity of each AS or PMH sequence to the respective consensus sequence correlated with its degree centrality, with Spearman correlation coefficients of 0.85 and 0.83 for PMHs and ASs, respectively (figure 3). Thus hubs, which are nodes with many neighbours, were more similar to a consensus sequence. The last reconstructed ancestor AncAS-PMH shared approximately 50% sequence identity with both consensus sequences (electronic supplementary material, figures S12 and S13). Interestingly, the AS and PMH ancestors AncAS and AncPMH were not the closest ancestors to their respective AS or PMH consensus sequence. Instead, younger ancestors with higher degree centralities were found to be more similar to the consensus sequences (electronic supplementary material, figure S14).

Figure 3. — The correlation between degree centralities for the protein sequence network from figure 1 and the pairwise sequence identity against the respective consensus sequence for PMHs (a) and ASs (b). Homologous protein sequences are represented as filled dots, and reconstructed PMH or AS ancestors are represented as crosses, with the last reconstructed ancestors AncPMH and AncAS depicted as triangles and upside-down triangles, respectively. Complete plots with all protein sequences under investigation are available as electronic supplementary material, figures S12 and S13 for comparison.

4. Discussion

4.1. Phylogenetic tree versus protein sequence network

Protein sequence networks are insightful representations for more diverse and larger sequence datasets that are difficult to describe by an MSA [53]. Protein sequence networks neither imply a timescale nor assume a detailed evolutionary model, in contrast with phylogenetic trees [18]. However, a substitution matrix is assumed for the pairwise alignments for a protein sequence network, which implies a model on the frequency of amino acid exchanges, e.g. in the Needleman–Wunsch algorithm [19]. Both trees and networks, however, include assumptions on the frequency of amino acid exchanges or gaps. Thus, each of these two representations of protein families has its specific assumptions and limitations. Because a phylogenetic analysis seeks the optimal (i.e. most likely) evolutionary path from a common ancestor, a phylogenetic tree shows only a single relation between two sequences through an internal node (i.e. an inferred ancestral sequence). By contrast, a protein sequence network allows for sequences to be connected through multiple paths consisting of extant sequences. Thus, protein sequence networks can capture alternative possible routes of evolution between known sequences, as was previously shown for a network of point variants of TEM β-lactamases [21]. Protein sequence networks of protein families that differed in the fold, domain composition and sequence length showed a scale-free cluster size distribution with a similar scaling exponent which indicated the connectedness of the protein sequence space [23]. The concept of a continuous protein sequence space has been proposed to describe possible routes of protein evolution by a continuous exchange of amino acids without losing fitness [7] and further explored using coarse-grained structural lattice protein models [10,38,54].
In addition, the concept of evolution occurring in networks has been developed further to include events of neutral evolution, as a way to avoid local optima during the exploration of more or less beneficial mutations [11,55,56]. Thus, network representations have inspired the interpretation of protein evolution. Other evolutionary events such as horizontal or lateral gene transfer further challenge the representation by binary trees; for instance, it was suggested to represent microbial phylogeny by a network topology or a reticulated tree rather than a tree alone [57,58].

4.2. Ancestral sequence versus bridge

Ancestral protein sequences are reconstructed under the assumption of a common, now extinct predecessor of two currently known homologous proteins. In a network, bridges are characterized by high betweenness centrality or low clustering coefficient [48,49]. When added to the protein sequence network, the older ancestral sequences appeared in sparsely connected regions of the protein sequence networks. They were densely linked among each other and formed a network that connected the sub-networks, whereas younger ancestors were tightly connected to extant sequences. Thus, ancestral sequences can be considered as bridging sequences in a network of extant proteins. The lack of neighbours could reflect extinction or that their extant neighbouring sequences are still waiting to be discovered, given that the number of known sequences (approximately 10⁸ in UniProtKB/TrEMBL) is still far below the estimated number of sequences explored during evolution (estimated 10⁴⁰) [17].

When comparing the AS and PMH branches of the network, ancestral nodes of both clades appear as bridges linking to each other and to different sub-networks. The sub-networks that appear as the global sequence identity threshold is raised broadly resemble the three PMH and two AS subgroups identified in phylogenetic analysis (electronic supplementary material, figures S3 and S16). When comparing the ancestors AncAS and AncPMH, a noteworthy difference is that each reaches its maximum betweenness centrality at a different threshold (electronic supplementary material, figures S9 and S10). The threshold at which ancestral sequences appear as bridging sequences may provide insight into the degree to which bridging sequences connecting two diverging protein families still exist in extant sequence space.

Inferred ancestral sequences have been proposed as starting points for protein design due to properties such as increased (thermo)stability and wide substrate range [33,59]. One downside of inferring ancestral sequences is that sequence uncertainty is inherent in the method [26]. Therefore, close extant homologues to inferred ancestral sequences could be promising candidates for protein design because they are expected to have similar beneficial properties to the inferred ancestors but do not suffer from sequence uncertainty. The identification of functionally relevant amino acid positions for a certain (sub)family would require structural information on the active site, too. Various approaches have been reported previously for other protein families, for instance thiamine diphosphate-dependent decarboxylases [60] or cytochrome P450 monooxygenases [61].

4.3. Consensus versus hub

Hubs are nodes with a high degree centrality in a network. The respective proteins are promising candidates as starting points for directed evolution experiments in biotechnology, because they are highly connected to extant proteins and thus are expected to have a high functional robustness toward possible deleterious effects of mutations [22,33]. In the AS-PMH family, hub sequences resemble back-to-consensus sequences as indicated by the correlation between the degree centrality of a node and the sequence identity to the consensus sequence of a protein family. Whereas both the AS and PMH clade show a correlation between clustering and degree centrality, degree centrality seems to play a more pronounced role in the AS clade (figure 3; electronic supplementary material, figures S12 and S13), which might indicate that AS sequences more closely resemble hubs than PMH sequences. That younger ancestors appear more similar to the consensus sequences may initially seem counterintuitive. However, the consensus sequences were derived from known extant sequences. As individual sequences acquire mutations and duplications occur, the consensus sequence of a protein family likewise changes. Therefore, if younger ancestors resemble extant hub sequences, deeper ancestors may be interpreted as hubs in ancestral sequence space, even though they appear less connected in extant sequence space.

Thus, mutants resulting from back-to-consensus mutations can be interpreted as nodes with an increased degree centrality in a network, and the success of back-to-consensus design in protein evolution can be explained by increased robustness towards introduced amino acid substitutions [62–64]. Because not all back-to-consensus mutations are stabilizing [64], extant hub sequences could serve as a natural reservoir of ‘consensus-like’ sequences as promising starting points for protein design strategies.

The observation that younger ancestors were more similar to the consensus sequences than older ancestors implies that ‘ancestor’ and ‘consensus sequence’ should not be used interchangeably. Back-to-consensus sequences are derived from known sequences without any further assumptions on the frequency of amino acid exchanges over time, except for the parameters that are used for MSAs such as a scoring or substitution matrix. Younger reconstructed ancestor sequences are implicitly more similar to known, existing sequences, which are in turn used to derive back-to-consensus sequences, whereas the reconstruction of older ancestors tries to link rather distant sequences by several amino acid exchanges that differ more from the consensus.
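To make the distinction concrete, the following sketch derives a consensus sequence by a simple per-column majority rule and scores each sequence by its identity to that consensus. The majority rule is an assumption chosen for brevity; it is not necessarily how the consensus sequences of this study were derived, and the toy alignment and the helper names `consensus` and `identity_to` are hypothetical. Note that no tree and no substitution model enter the calculation, in contrast to ancestral reconstruction.

```python
from collections import Counter

def consensus(alignment):
    """Most frequent residue per alignment column (gaps are ignored).

    Ties are resolved by first occurrence; the toy data below has none.
    """
    result = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        result.append(counts.most_common(1)[0][0] if counts else "-")
    return "".join(result)

def identity_to(reference, seq):
    """Percent of aligned positions at which seq matches the reference."""
    matches = sum(1 for x, y in zip(reference, seq) if x == y)
    return 100.0 * matches / len(reference)

# Hypothetical toy alignment of four equal-length sequences.
toy_alignment = {
    "hub": "MKTAYIAK",
    "a":   "MKTAYIAR",
    "b":   "MKTAYLAK",
    "c":   "MRTAYIAK",
}

cons = consensus(list(toy_alignment.values()))
scores = {name: identity_to(cons, seq) for name, seq in toy_alignment.items()}
print(cons)
print(scores)
```

In this toy case the consensus coincides with the most connected sequence, which mirrors the hub-consensus correspondence discussed above, although a single constructed example cannot demonstrate the general trend.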

4.4. Outlook

The results of this study have ramifications both for a conceptual and an applied perspective on protein evolution.

Conceptually, our results from a large family of extant proteins align intriguingly well with earlier results on evolutionary transitions, which were obtained from coarse-grained biophysical models of lattice protein structures [10,11,38]. In both approaches, sequences with a high degree centrality correspond to the consensus sequences for the respective (sub)families. In the lattice model, consensus sequences were also thermodynamically most stable, a property that has often been ascribed to reconstructed ancestors as well. However, the stability of reconstructed ancestors may be an artefact of the applied models, and the true, less stable ancestral sequence might deviate by just a few substitutions. A similar parallelism between the two approaches applies to bridging sequences, which mediate the evolutionary transition between (sub)families along a continuous mutational path. In the case of the lattice protein models, bridge sequences do so by successively lowering the probability of folding into one particular structure while increasing the probability of folding into a new one corresponding to another protein family. Similar features have previously been found for sequences bridging between folds [65] or functions [66].
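The two network properties used in this study to measure the propensity of a node to act as a bridge, betweenness centrality and clustering coefficient, can be sketched on a toy graph. The example below is a plain-Python illustration under stated assumptions: the graph (two triangles joined through a single node "X") and the function names are hypothetical, betweenness is computed with the Brandes algorithm for unweighted graphs and is left unnormalised, and none of this reproduces the software pipeline of the study.

```python
from collections import deque

def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def betweenness(adj):
    """Unnormalised betweenness centrality (Brandes, unweighted, undirected)."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        order = []                          # nodes in order of discovery
        preds = {v: [] for v in adj}        # predecessors on shortest paths
        sigma = dict.fromkeys(adj, 0)       # number of shortest paths from s
        dist = dict.fromkeys(adj, -1)
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):           # accumulate dependencies
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: b / 2.0 for v, b in bc.items()}  # each pair was counted twice

def clustering(adj):
    """Local clustering coefficient: fraction of neighbour pairs that are linked."""
    cc = {}
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            cc[v] = 0.0
            continue
        links = sum(1 for u in nbrs for w in nbrs if u < w and w in adj[u])
        cc[v] = 2.0 * links / (k * (k - 1))
    return cc

# Hypothetical toy network: two triangles joined only through node "X".
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"),
             ("D", "E"), ("E", "F"), ("D", "F"),
             ("C", "X"), ("X", "D")]
adj = build_adjacency(toy_edges)
bc, cc = betweenness(adj), clustering(adj)
for node in sorted(adj):
    print(f"{node}: betweenness={bc[node]:.1f} clustering={cc[node]:.2f}")
```

Node "X" lies on every shortest path between the two triangles and none of its neighbours are linked to each other, so it combines the highest betweenness with a clustering coefficient of zero. This combination is the signature of a bridge-like node in the sense used here.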

Accordingly, we propose that the existence of mutationally robust consensus sequences with a high degree centrality [22], the scale-free clustering of protein families in sequence space [23] and the smooth transitions between protein families by bridge sequences may be universal features of protein evolution and deserve further investigation. For example, larger fractions of sequence space could be sampled for enzymes using high-throughput directed evolution approaches applying microfluidics [67], complemented by novel computational techniques for reliable structure prediction (https://predictioncenter.org/casp14/).

5. Conclusion

By combining the two complementary approaches of phylogenetic tree and protein sequence network, a deeper insight into a protein family can be obtained, and the blind spots of either approach can be circumvented, leading to a more robust analysis. For instance, the combination of tree and network can be used to select enzyme candidates, such as sequences for the design of more stable, consensus-like proteins or promiscuous, bridge-like candidates. Protein sequence networks could be assessed further to elucidate the evolvability of substrate ambiguity or promiscuity as starting points for protein design studies [24,32,68]. The alignment of functional protein domains or selected active site positions could further facilitate the selection of interesting enzyme candidates [60] and the design of highly enriched, minimal mutant libraries.

Acknowledgements

The authors thank Dr Magdalena Heberlein for her preparatory work in the generation of a maximum-likelihood phylogenetic tree and the reconstruction of ancestral protein sequences.

Data accessibility

Electronic supplementary data are available under https://darus.uni-stuttgart.de/ as three datasets. The 238 protein sequences used herein, including 58 reconstructed ancestor sequences, are available under https://doi.org/10.18419/darus-1801. The protein sequence network of 238 nodes connected by 28 203 edges is available as a GraphML file under https://doi.org/10.18419/darus-817. The consensus sequences for ASs and PMHs are available under https://doi.org/10.18419/darus-1838 together with the underlying MSAs and profile HMMs.
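A short arithmetic check helps to interpret the size of the deposited network: 238 nodes admit at most 238 × 237 / 2 = 28 203 undirected edges, which is exactly the stated edge count. This suggests that the file stores every pairwise comparison and that identity thresholds are applied afterwards, although that reading is an inference from the numbers alone. The sketch below shows the calculation and a hypothetical thresholding step; the function names, the edge tuples and the 60% cut-off are illustrative assumptions and do not describe the actual attributes of the GraphML file.

```python
def complete_graph_edges(n_nodes):
    """Number of undirected edges in a complete graph without self-loops."""
    return n_nodes * (n_nodes - 1) // 2

def edges_above_threshold(weighted_edges, threshold):
    """Keep the node pairs whose pairwise identity reaches the threshold."""
    return [(u, v) for u, v, identity in weighted_edges if identity >= threshold]

# Hypothetical weighted edges: (node, node, percent identity).
toy_weighted_edges = [("s1", "s2", 92.0), ("s1", "s3", 41.5), ("s2", "s3", 67.0)]

print(complete_graph_edges(238))
print(edges_above_threshold(toy_weighted_edges, 60.0))
```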

Authors' contributions

P.C.F.B. performed the network construction and carried out data analysis, participated in the design of the study and drafted the manuscript; B.v.L. and B.D.G.E. constructed and analysed the phylogenetic trees and contributed to the manuscript; E.B.B. contributed to the manuscript and critically revised the manuscript; J.P. conceived of the study, designed the study, coordinated the study, contributed to and critically revised the manuscript. All authors gave final approval for publication and agree to be held accountable for the work performed therein.

Competing interests

We declare we have no competing interests.

Funding

B.v.L., B.D.G.E. and E.B.B. acknowledge funding by the EU under the Horizon 2020 Research and Innovation Framework Programme (grant no. 722610). P.C.F.B. and J.P. acknowledge funding by the BMBF (grant no. 031B0571A).

References

1. Worth CL, Gong S, Blundell TL. 2009. Structural and functional constraints in the evolution of protein families. Nat. Rev. Mol. Cell Biol. 10, 709-720. (doi:10.1038/nrm2762)
2. Ohno S. 1970. Evolution by gene duplication. Berlin, Germany: Springer.
3. Laurent JM, Garge RK, Teufel AI, Wilke CO, Kachroo AH, Marcotte EM. 2020. Humanization of yeast genes with multiple human orthologs reveals functional divergence between paralogs. PLoS Biol. 18, e3000627. (doi:10.1371/journal.pbio.3000627)
4. Voordeckers K, Brown CA, Vanneste K, van der Zande E, Voet A, Maere S, Verstrepen KJ. 2012. Reconstruction of ancestral metabolic enzymes reveals molecular mechanisms underlying evolutionary innovation through gene duplication. PLoS Biol. 10, e1001446. (doi:10.1371/journal.pbio.1001446)
5. Murzin AG, Lesk AM, Chothia C. 1994. Principles determining the structure of β-sheet barrels in proteins II. The observed structures. J. Mol. Biol. 236, 1382-1400. (doi:10.1016/0022-2836(94)90065-5)
6. Ingles-Prieto A, et al. 2013. Conservation of protein structure over four billion years. Structure 21, 1690-1697. (doi:10.1016/j.str.2013.06.020)
7. Smith JM. 1970. Natural selection and the concept of a protein space. Nature 225, 563-564. (doi:10.1038/225563a0)
8. Lipman DJ, Wilbur WJ. 1991. Modelling neutral and selective evolution of protein folding. Proc. R. Soc. B 245, 7-11. (doi:10.1098/rspb.1991.0081)
9. Taverna DM, Goldstein RA. 2000. The distribution of structures in evolving protein populations. Biopolymers 53, 1-8. (doi:10.1002/(SICI)1097-0282(200001)53:1<1::AID-BIP1>3.0.CO;2-X)
10. Bornberg-Bauer E. 1997. How are model protein structures distributed in sequence space? Biophys. J. 73, 2393-2403. (doi:10.1016/S0006-3495(97)78268-7)
11. Bornberg-Bauer E, Chan HS. 1999. Modeling evolutionary landscapes: mutational stability, topology, and superfunnels in sequence space. Proc. Natl Acad. Sci. USA 96, 10 689-10 694. (doi:10.1073/pnas.96.19.10689)
12. Harms MJ, Thornton JW. 2014. Historical contingency and its biophysical basis in glucocorticoid receptor evolution. Nature 512, 203-207. (doi:10.1038/nature13410)
13. Steindel PA, Chen EH, Wirth JD, Theobald DL. 2016. Gradual neofunctionalization in the convergent evolution of trichomonad lactate and malate dehydrogenases. Protein Sci. 25, 1319-1331. (doi:10.1002/pro.2904)
14. Cordes MHJ, Davidson AR, Sauer RT. 1996. Sequence space, folding and protein design. Curr. Opin. Struct. Biol. 6, 3-10. (doi:10.1016/S0959-440X(96)80088-1)
15. Dalal S, Regan L. 2000. Understanding the sequence determinants of conformational switching using protein design. Protein Sci. 9, 1651-1659. (doi:10.1110/ps.9.9.1651)
16. Alexander PA, He Y, Chen Y, Orban J, Bryan PN. 2009. A minimal sequence code for switching protein structure and function. Proc. Natl Acad. Sci. USA 106, 21 149-21 154. (doi:10.1073/pnas.0906408106)
17. Dryden DTF, Thomson AR, White JH. 2008. How much of protein sequence space has been explored by life on Earth? J. R. Soc. Interface 5, 953-956. (doi:10.1098/rsif.2008.0085)
18. Felsenstein J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17, 368-376. (doi:10.1007/BF01734359)
19. Needleman SB, Wunsch CD. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443-453. (doi:10.1016/0022-2836(70)90057-4)
20. Atkinson HJ, Morris JH, Ferrin TE, Babbitt PC. 2009. Using sequence similarity networks for visualization of relationships across diverse protein superfamilies. PLoS ONE 4, e4345. (doi:10.1371/journal.pone.0004345)
21. Zeil C, Widmann M, Fademrecht S, Vogel C, Pleiss J. 2016. Network analysis of sequence–function relationships and exploration of sequence space of TEM beta-lactamases. Antimicrob. Agents Chemother. 60, 2709-2717. (doi:10.1128/AAC.02930-15)
22. Buchholz PCF, Zeil C, Pleiss J. 2018. The scale-free nature of protein sequence space. PLoS ONE 13, e0200815. (doi:10.1371/journal.pone.0200815)
23. Buchholz PCF, Fademrecht S, Pleiss J. 2017. Percolation in protein sequence space. PLoS ONE 12, e0189646. (doi:10.1371/journal.pone.0189646)
24. Khersonsky O, Tawfik DS. 2010. Enzyme promiscuity: a mechanistic and evolutionary perspective. Annu. Rev. Biochem. 79, 471-505. (doi:10.1146/annurev-biochem-030409-143718)
25. Pandya C, Farelli JD, Dunaway-Mariano D, Allen KN. 2014. Enzyme promiscuity: engine of evolutionary innovation. J. Biol. Chem. 289, 30 229-30 236. (doi:10.1074/jbc.R114.572990)
26. Eick GN, Bridgham JT, Anderson DP, Harms MJ, Thornton JW. 2016. Robustness of reconstructed ancestral protein functions to statistical uncertainty. Mol. Biol. Evol. 34, 247-261.
27. Watanabe K, Ohkuri T, Yokobori SI, Yamagishi A. 2006. Designing thermostable proteins: ancestral mutants of 3-isopropylmalate dehydrogenase designed by using a phylogenetic tree. J. Mol. Biol. 355, 664-674. (doi:10.1016/j.jmb.2005.10.011)
28. Miyazaki J, Nakaya S, Suzuki T, Tamakoshi M, Oshima T, Yamagishi A. 2001. Ancestral residues stabilizing 3-isopropylmalate dehydrogenase of an extreme thermophile: experimental evidence supporting the thermophilic common ancestor hypothesis. J. Biochem. 129, 777-782. (doi:10.1093/oxfordjournals.jbchem.a002919)
29. Steipe B. 2004. Consensus-based engineering of protein stability: from intrabodies to thermostable enzymes. Methods Enzymol. 388, 176-186. (doi:10.1016/S0076-6879(04)88016-9)
30. Sternke M, Tripp KW, Barrick D. 2019. Consensus sequence design as a general strategy to create hyperstable, biologically active proteins. Proc. Natl Acad. Sci. USA 116, 11 275-11 284. (doi:10.1073/pnas.1816707116)
31. Jochens H, Aerts D, Bornscheuer UT. 2010. Thermostabilization of an esterase by alignment-guided focussed directed evolution. Protein Eng. Des. Sel. 23, 903-909. (doi:10.1093/protein/gzq071)
32. Khersonsky O, et al. 2009. Directed evolution of serum paraoxonase PON3 by family shuffling and ancestor/consensus mutagenesis, and its biochemical characterization. Biochemistry 48, 6644-6654. (doi:10.1021/bi900583y)
33. Trudeau DL, Tawfik DS. 2019. Protein engineers turned evolutionists—the quest for the optimal starting point. Curr. Opin. Biotechnol. 60, 46-52. (doi:10.1016/j.copbio.2018.12.002)
34. Barbeyron T, Brillet-Guéguen L, Carré W, Carrière C, Caron C, Czjzek M, Hoebeke M, Michel G. 2016. Matching the diversity of sulfated biomolecules: creation of a classification database for sulfatases reflecting their substrate specificity. PLoS ONE 11, e0164846. (doi:10.1371/journal.pone.0164846)
35. van Loo B, Schober M, Valkov E, Heberlein M, Bornberg-Bauer E, Faber K, Hyvönen M, Hollfelder F. 2018. Structural and mechanistic analysis of the choline sulfatase from Sinorhizobium melliloti: a class I sulfatase specific for an alkyl sulfate ester. J. Mol. Biol. 430, 1004-1023. (doi:10.1016/j.jmb.2018.02.010)
36. van Loo B, et al. 2019. Balancing specificity and promiscuity in enzyme evolution: multidimensional activity transitions in the alkaline phosphatase superfamily. J. Am. Chem. Soc. 141, 370-387. (doi:10.1021/jacs.8b10290)
37. Heberlein M. 2016. Evolution of substrate specificity in the alkaline phosphatase superfamily. PhD thesis, University of Münster, Münster, Germany.
38. Wroe R, Chan HS, Bornberg-Bauer E. 2007. A structural model of latent evolutionary potentials underlying neutral networks in proteins. HFSP J. 1, 79-87. (doi:10.2976/1.2739116)
39. Yang Z. 2007. PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24, 1586-1591. (doi:10.1093/molbev/msm088)
40. Tatusova T, Ciufo S, Fedorov B, O'Neill K, Tolstoy I. 2014. RefSeq microbial genomes database: new representation and annotation strategy. Nucleic Acids Res. 42, D553-D559. (doi:10.1093/nar/gkt1274)
41. Abascal F, Zardoya R, Telford MJ. 2010. TranslatorX: multiple alignment of nucleotide sequences guided by amino acid translations. Nucleic Acids Res. 38(Suppl. 2), W7-W13. (doi:10.1093/nar/gkq291)
42. Stamatakis A, Hoover P, Rougemont J. 2008. A rapid bootstrap algorithm for the RAxML web servers. Syst. Biol. 57, 758-771. (doi:10.1080/10635150802429642)
43. Rice P, Longden L, Bleasby A. 2000. EMBOSS: the European molecular biology open software suite. Trends Genet. 16, 276-277. (doi:10.1016/S0168-9525(00)02024-2)
44. Hagberg AA, Schult DA, Swart PJ. 2008. Exploring network structure, dynamics, and function using NetworkX. In 7th Python Sci. Conf. (SciPy 2008), 21 August, Pasadena, CA, pp. 11-15. See https://www.osti.gov/biblio/960616-exploring-network-structure-dynamics-function-using-networkx.
45. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. 2003. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498-2504. (doi:10.1101/gr.1239303)
46. Baur M, Benkert M, Brandes U, Cornelsen S, Gaertler M, Köpf B, Lerner J, Wagner D. 2002. Visone software for visual social network analysis. In Graph drawing. GD 2001 (eds Mutzel P, Jünger M, Leipert S), pp. 463-464, vol. 2265, Lecture notes in computer science. Berlin, Heidelberg: Springer. (doi:10.1007/3-540-45848-4_47)
47. Nocaj A, Ortmann M, Brandes U. 2015. Untangling the hairballs of multi-centered, small-world online social media networks. J. Graph Algorithms Appl. 19, 595-618. (doi:10.7155/jgaa.00370)
48. Freeman L. 1977. A set of measures of centrality based on betweenness. Sociometry 40, 35-41. (doi:10.2307/3033543)
49. Watts DJ, Strogatz SH. 1998. Collective dynamics of ‘small-world’ networks. Nature 393, 440-442. (doi:10.1038/30918)
50. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. J. Mol. Biol. 215, 403-410. (doi:10.1016/S0022-2836(05)80360-2)
51. Agarwala R, et al. 2016. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 44, D7-D19. (doi:10.1093/nar/gkv1290)
52. Sievers F, Higgins DG. 2014. Clustal Omega, accurate alignment of very large numbers of sequences. In Multiple sequence alignment methods (ed. Russell D), pp. 105-116, vol. 1079. Methods in molecular biology (methods and protocols). Totowa, NJ: Humana Press. (doi:10.1007/978-1-62703-646-7_6)
53. Rost B. 1999. Twilight zone of protein sequence alignments. Protein Eng. 12, 85-94. (doi:10.1093/protein/12.2.85)
54. Cui Y, Wong WH, Bornberg-Bauer E, Chan HS. 2002. Recombinatoric exploration of novel folded structures: a heteropolymer-based model of protein evolutionary landscapes. Proc. Natl Acad. Sci. USA 99, 809-814. (doi:10.1073/pnas.022240299)
55. van Nimwegen E, Crutchfield JP, Huynen M. 1999. Neutral evolution of mutational robustness. Proc. Natl Acad. Sci. USA 96, 9716-9720. (doi:10.1073/pnas.96.17.9716)
56. Bershtein S, Goldin K, Tawfik DS. 2008. Intense neutral drifts yield robust and evolvable consensus proteins. J. Mol. Biol. 379, 1029-1044. (doi:10.1016/j.jmb.2008.04.024)
57. Doolittle WF. 1999. Phylogenetic classification and the universal tree. Science 284, 2124-2128. (doi:10.1126/science.284.5423.2124)
58. Kunin V, Goldovsky L, Darzentas N, Ouzounis CA. 2005. The net of life: reconstructing the microbial phylogenetic network. Genome Res. 15, 954-959. (doi:10.1101/gr.3666505)
59. Gumulya Y, Gillam EMJ. 2017. Exploring the past and the future of protein evolution with ancestral sequence reconstruction: the ‘retro’ approach to protein engineering. Biochem. J. 474, 1-19. (doi:10.1042/BCJ20160507)
60. Buchholz PCF, Ferrario V, Pohl M, Gardossi L, Pleiss J. 2019. Navigating within thiamine diphosphate-dependent decarboxylases: sequences, structures, functional positions, and binding sites. Proteins Struct. Funct. Bioinforma. 87, 774-785. (doi:10.1002/prot.25706)
61. Pleiss J. 2014. Systematic analysis of large enzyme families: identification of specificity- and selectivity-determining hotspots. ChemCatChem 6, 944-950. (doi:10.1002/cctc.201300950)
62. Amin N, Liu AD, Ramer S, Aehle W, Meijer D, Metin M, Wong S, Gualfetti P, Schellenberger V. 2004. Construction of stabilized proteins by combinatorial consensus mutagenesis. Protein Eng. Des. Sel. 17, 787-793. (doi:10.1093/protein/gzh091)
63. Lehmann M, Loch C, Middendorf A, Studer D, Lassen SF, Pasamontes L, van Loon AP, Wyss M. 2002. The consensus concept for thermostability engineering of proteins: further proof of concept. Protein Eng. 15, 403-411. (doi:10.1093/protein/15.5.403)
64. Porebski BT, Buckle AM. 2016. Consensus protein design. Protein Eng. Des. Sel. 29, 245-251. (doi:10.1093/protein/gzw015)
65. He Y, Chen Y, Alexander PA, Bryan PN, Orban J. 2012. Mutational tipping points for switching protein folds and functions. Structure 20, 283-291. (doi:10.1016/j.str.2011.11.018)
66. Yang G, et al. 2019. Higher-order epistasis shapes the fitness landscape of a xenobiotic-degrading enzyme. Nat. Chem. Biol. 15, 1120-1128. (doi:10.1038/s41589-019-0386-3)
67. van Loo B, et al. 2019. High-throughput, lysis-free screening for sulfatase activity using Escherichia coli autodisplay in microdroplets. ACS Synth. Biol. 8, 2690-2700. (doi:10.1021/acssynbio.9b00274)
68. Aharoni A, Gaidukov L, Khersonsky O, Gould SM, Roodveldt C, Tawfik DS. 2005. The ‘evolvability’ of promiscuous protein functions. Nat. Genet. 37, 73-76. (doi:10.1038/ng1482)


[RSIF20210389C52] 52.Sievers F, Higgins DG. 2014. Clustal Omega, accurate alignment of very large numbers of sequences. In Multiple sequence alignment methods (ed. Russell D), pp. 105-116, vol. 1079. Methods in molecular biology (methods and protocols). Totowa, NJ: Humana Press. ( 10.1007/978-1-62703-646-7_6) [DOI] [PubMed] [Google Scholar]

[RSIF20210389C53] 53.Rost B. 1999. Twilight zone of protein sequence alignments. Protein Eng. 12, 85-94. ( 10.1093/protein/12.2.85) [DOI] [PubMed] [Google Scholar]

[RSIF20210389C54] 54.Cui Y, Wong WH, Bornberg-Bauer E, Chan HS. 2002. Recombinatoric exploration of novel folded structures: a heteropolymer-based model of protein evolutionary landscapes. Proc. Natl Acad. Sci. USA 99, 809-814. ( 10.1073/pnas.022240299) [DOI] [PMC free article] [PubMed] [Google Scholar]

[RSIF20210389C55] 55.van Nimwegen E, Crutchfield JP, Huynen M. 1999. Neutral evolution of mutational robustness. Proc. Natl Acad. Sci. USA 96, 9716-9720. ( 10.1073/pnas.96.17.9716) [DOI] [PMC free article] [PubMed] [Google Scholar]

[RSIF20210389C56] 56.Bershtein S, Goldin K, Tawfik DS. 2008. Intense neutral drifts yield robust and evolvable consensus proteins. J. Mol. Biol. 379, 1029-1044. ( 10.1016/j.jmb.2008.04.024) [DOI] [PubMed] [Google Scholar]

[RSIF20210389C57] 57.Doolittle WF. 1999. Phylogenetic classification and the universal tree. Science 284, 2124-2128. ( 10.1126/science.284.5423.2124) [DOI] [PubMed] [Google Scholar]

[RSIF20210389C58] 58.Kunin V, Goldovsky L, Darzentas N, Ouzounis CA. 2005. The net of life: reconstructing the microbial phylogenetic network. Genome Res. 15, 954-959. ( 10.1101/gr.3666505) [DOI] [PMC free article] [PubMed] [Google Scholar]

[RSIF20210389C59] 59.Gumulya Y, Gillam EMJ. 2017. Exploring the past and the future of protein evolution with ancestral sequence reconstruction: the ‘retro’ approach to protein engineering. Biochem. J. 474, 1-19. ( 10.1042/BCJ20160507) [DOI] [PubMed] [Google Scholar]

[RSIF20210389C60] 60.Buchholz PCF, Ferrario V, Pohl M, Gardossi L, Pleiss J. 2019. Navigating within thiamine diphosphate-dependent decarboxylases: sequences, structures, functional positions, and binding sites. Proteins Struct. Funct. Bioinforma 87, 774-785. ( 10.1002/prot.25706) [DOI] [PubMed] [Google Scholar]

[RSIF20210389C61] 61.Pleiss J. 2014. Systematic analysis of large enzyme families: identification of specificity- and selectivity-determining hotspots. ChemCatChem 6, 944-950. ( 10.1002/cctc.201300950) [DOI] [Google Scholar]

[RSIF20210389C62] 62.Amin N, Liu AD, Ramer S, Aehle W, Meijer D, Metin M, Wong S, Gualfetti P, Schellenberger V. 2004. Construction of stabilized proteins by combinatorial consensus mutagenesis. Protein Eng. Des. Sel. 17, 787-793. ( 10.1093/protein/gzh091) [DOI] [PubMed] [Google Scholar]

[RSIF20210389C63] 63.Lehmann M, Loch C, Middendorf A, Studer D, Lassen SF, Pasamontes L, van Loon AP, Wyss M. 2002. The consensus concept for thermostability engineering of proteins: further proof of concept. Protein Eng. 15, 403-411. ( 10.1093/protein/15.5.403) [DOI] [PubMed] [Google Scholar]

[RSIF20210389C64] 64.Porebski BT, Buckle AM. 2016. Consensus protein design. Protein Eng. Des. Sel. 29, 245-251. ( 10.1093/protein/gzw015) [DOI] [PMC free article] [PubMed] [Google Scholar]

[RSIF20210389C65] 65.He Y, Chen Y, Alexander PA, Bryan PN, Orban J. 2012. Mutational tipping points for switching protein folds and functions. Structure 20, 283-291. ( 10.1016/j.str.2011.11.018) [DOI] [PMC free article] [PubMed] [Google Scholar]

[RSIF20210389C66] 66.Yang G, et al. 2019. Higher-order epistasis shapes the fitness landscape of a xenobiotic-degrading enzyme. Nat. Chem. Biol. 15, 1120-1128. ( 10.1038/s41589-019-0386-3) [DOI] [PubMed] [Google Scholar]

[RSIF20210389C67] 67.Van Loo B, et al. 2019. High-throughput, lysis-free screening for sulfatase activity using Escherichia coli autodisplay in microdroplets. ACS Synth. Biol. 8, 2690-2700. ( 10.1021/acssynbio.9b00274) [DOI] [PubMed] [Google Scholar]

[RSIF20210389C68] 68.Aharoni A, Gaidukov L, Khersonsky O, Gould SM, Roodveldt C, Tawfik DS. 2005. The ‘evolvability’ of promiscuous protein functions. Nat. Genet. 37, 73-76. ( 10.1038/ng1482) [DOI] [PubMed] [Google Scholar]
