Evolution of protein interfaces in multimers and fibrils

W Jeffrey Zabel; Kyle P Hagner; Benjamin J Livesey; Joseph A Marsh; Sima Setayeshgar; Michael Lynch; Paul G Higgs

doi:10.1063/1.5086042

. 2019 Jun 12;150(22):225102. doi: 10.1063/1.5086042

Evolution of protein interfaces in multimers and fibrils

W Jeffrey Zabel ¹, Kyle P Hagner ², Benjamin J Livesey ³, Joseph A Marsh ³, Sima Setayeshgar ², Michael Lynch ⁴, Paul G Higgs ^1,^a)

PMCID: PMC6561775 EMSID: EMS83405 PMID: 31202237

Abstract

A majority of cellular proteins function as part of multimeric complexes of two or more subunits. Multimer formation requires interactions between protein surfaces that lead to closed structures, such as dimers and tetramers. If proteins interact in an open-ended way, uncontrolled growth of fibrils can occur, which is likely to be detrimental in most cases. We present a statistical physics model that allows aggregation of proteins as either closed dimers or open fibrils of all lengths. We use pairwise amino-acid contact energies to calculate the energies of interacting protein surfaces. The probabilities of all possible aggregate configurations can be calculated for any given sequence of surface amino acids. We link the statistical physics model to a population genetics model that describes the evolution of the surface residues. When proteins evolve neutrally, without selection for or against multimer formation, we find that a majority of proteins remain as monomers at moderate concentrations, but strong dimer-forming or fibril-forming sequences are also possible. If selection is applied in favor of dimers or in favor of fibrils, then it is easy to select either dimer-forming or fibril-forming sequences. It is also possible to select for oriented fibrils with protein subunits all aligned in the same direction. We measure the propensities of amino acids to occur at interfaces relative to noninteracting surfaces and show that the propensities in our model are strongly correlated with those that have been measured in real protein structures. We also show that there are significant differences between amino acid frequencies at isologous and heterologous interfaces in our model, and we observe that similar effects occur in real protein structures.

I. INTRODUCTION

Many features of cellular biology are governed by the actions and interactions of proteins, and an understanding of their evolution is crucial to understand the evolution of life itself. An important property of proteins is the formation of complexes consisting of two or more subunits, with 30%–50% forming homo-oligomers composed of identical monomers.¹ Homodimers constitute the majority (41%) of oligomeric proteins of known structure.² Two identical proteins can aggregate in a closed way, with isologous (i.e., head-to-head) interfaces, or in an open way, with heterologous (i.e., head-to-tail) interfaces. If open, they have the possibility of forming infinite fibrils. Amyloid fibrils, formed by normally soluble proteins that assemble to form open insoluble fibers, are resistant to degradation, and their formation can accompany a variety of human diseases, including Alzheimer’s disease, type-2 diabetes, and spongiform encephalopathies.³ Given the importance of homo-oligomers in the cellular repertoire, from mediating gene expression to functioning as enzymes, ion channels, and receptors,⁴ it is important to understand the competition between these different ways of assembling. More generally, mutations of amino acids at protein-protein interfaces are known to have large effects on human health because they affect the formation of protein complexes.⁵

Previous theoretical works have modeled protein fibrillogenesis based on mass action kinetics⁶ and thermodynamics of peptide solutions including formation of protofilament intermediates.^7,8 In this work, we present a simple model that allows both the physical and the evolutionary aspects of protein aggregation to be addressed. Our approach is similar to other previous works^9,10 in adopting a transfer matrix approach to obtain the equilibrium concentrations of oligomers of different lengths as a function of the free energies of interaction between proteins.

The novelty of our work is that it connects the statistical physics of protein aggregation to the evolution of higher-order protein structure by using population genetics theory to calculate the expected frequency of each protein in the ensemble of sequences generated by mutation and natural selection. We consider cases where the fitness is independent of whether the protein aggregates and cases where fitness is a function of structure, including selection for the formation of dimers and selection both for and against the formation of fibrils.

Our model considers a protein with two possible interacting surfaces, labeled A and B. There are two possible isologous interfaces (AA and BB) and one heterologous interface (AB). The energies of these interfaces depend on the amino acids on the two surfaces, as described in Sec. II. The model allows for the formation of closed dimers, which occur when one or the other of the isologous interfaces is strongly attractive and the other interfaces are weak. It also allows for the formation of fibrils with proteins oriented in the same direction in cases where the heterologous interface is strong or fibrils with proteins aligned in alternating directions in cases where both isologous interfaces are strong. Using the transfer matrix method given in Sec. III, it is possible to calculate the probabilities P_n that a protein is found in an assembly of n units. These probabilities depend on the values of the three interface energies.

The multimeric states of proteins are sometimes observed to change rapidly on an evolutionary time scale.^11,12 This may be an indication of selection for or against multimers or may simply be a result of neutral evolution. Within our model, it is possible to ask how likely dimer and fibril structures are to form under neutral evolution. We include selection in the model using the strong-selection weak-mutation approximation,¹³ which allows the expected frequencies of sequences in the presence of selection to be calculated from their frequencies under neutral evolution. We use a Monte Carlo Markov chain method (Sec. IV) to generate a set of representative protein sequences with frequencies given by evolutionary theory.

Thus, our model provides a simple way of linking evolutionary observations to the underlying statistical physics of protein aggregation. Within this framework, we consider probabilities of formation of dimers and fibrils, both under neutral evolution and under the action of several different kinds of selection. The model also predicts that the frequencies of amino acids at strongly binding interfaces are significantly different from their frequencies under the mutation process alone and from their frequencies at noninteracting, exposed surfaces. Furthermore, the frequencies of amino acids at isologous and heterologous interfaces are found to differ from one another. These predictions are compared with observations of amino acid frequencies at interfaces in databases of real proteins.

II. CALCULATION OF INTERFACE ENERGIES

We consider two opposing faces of the protein, denoted A and B, as potential binding surfaces (as shown in Fig. 1). There are two possible isologous interfaces (AA and BB) and one heterologous interface (AB). The energies of the three interfaces E_AA, E_BB, and E_AB depend on the sequences of residues on the surfaces. Nonsurface residues play no role in this model. A surface is modeled as a 4 × 4 array of amino acids. The energy of an interface is modeled as the sum of the 16 pairwise interactions between amino acids that are formed when two surfaces are brought together (see Fig. 1). We consider four possible 90° rotations of two surfaces. The three energies E_AA, E_BB, and E_AB are defined to be the lowest of the four energies that arise from the four possible rotations.

The square array of 16 amino acids is used for convenience because we require a simple model for which energies can be calculated for hundreds of thousands of protein sequences during evolutionary simulations. However, the fact that we consider the lowest energy of the multiple rotations of the two surfaces is an important feature of the model. When two identical proteins form an isologous interface, each pairwise interaction between the two surfaces is present twice. This means that the variance of the energy of the interface is twice what it would be for an interface between two independent proteins with the same number of amino acid contacts. The relevant interaction energy controlling binding of two proteins is the lowest energy of the rotational configurations possible when they are brought into contact. As the distribution of energies is broader for homodimers than heterodimers, the lowest energy tends to be lower.^14,15 This contributes to the excess of interactions between identical proteins and between closely related paralogs that are observed in the analysis of protein-protein interaction networks.^16,17 This factor is relevant here because we wish to consider relative probabilities of aggregation of multimer proteins in configurations that can involve either isologous or heterologous interfaces. If we simplified our model further by allowing only one rotational configuration, we would lose this effect.

As we wish to distinguish strongly and weakly interacting surfaces, it is useful to define the A surface such that the AA interface is stronger than the BB interface, i.e., E_AA ≤ E_BB (negative energies denote favorable interactions). For each amino acid sequence considered, we simply relabel the A and B surfaces if necessary so that this condition is true.

We use a simple model of pairwise contact energies because we wish to study evolution of large numbers of protein sequences using a model where the fitness depends on the energies of the surface interactions (as described in Sec. III). Thus, it is necessary to be able to evaluate the surface energies of any given sequence very rapidly, which could not be done if a more realistic, three dimensional model of a protein surface was used (for example, as in Refs. 18–20). Although pairwise amino acid potentials leave out many details (e.g., water and ion-mediated interactions, local flexibility of proteins, and the atomic structure of each residue), they have proved to be useful in many ways. The frequently cited early work of Miyazawa and Jernigan²¹ used the frequencies of contacts between amino acid pairs in globular protein structures to construct an effective pair potential matrix. This matrix has continued to be used in many applications such as coarse-grained simulations of protein complexes^22,23 and is also used as a basis of recent structural models of rates of amino acid substitutions.^24–26

Interactions between a solvent and amino acids were not included originally,²¹ but Betancourt and Thirumalai²⁷ showed that this could be accounted for by shifting the elements relative to the amino acid threonine. The original matrix would not be suitable for our study here because all the energies are negative, meaning all random surfaces would be attractive. This is not true in the transformed matrix, which has both positive and negative elements. The transformed matrix, B_ij, is shown in Table I of the supplementary material. It captures the fact that the interactions between pairs of hydrophobic amino acids are substantially negative and those between hydrophobic and polar or between two polar residues are on-average weaker and can be either positive or negative. It also captures specific features such as attractions and repulsions between charged amino acids. As a concrete example, this matrix has been successfully used in a study of protein folding in the GroEL cavity.²⁸

It should also be noted that the B_ij matrix we use is derived from contact frequencies within globular protein structures, not from specific frequencies of amino acids at surfaces and interfaces. It is therefore essentially independent of data on interface propensities. We will show here that use of this energy matrix in our model leads to useful predictions on interface propensities that correlate with experimental observations. These predictions are noncircular, whereas they would be if we had used statistical potentials derived from surface data.

III. CALCULATION OF AGGREGATION PROBABILITIES

For any given sequence of surface residues, we calculate the interface energies as in Sec. II. We then use the interface energies to calculate the probabilities of protein-protein interactions. We consider a solution of a single kind of protein with total concentration ϕ moles per unit volume. We determine the equilibrium concentration of monomers c and of aggregates of n subunits, C_n, in the following way.

For each of the three types of interface ij ∈ {AA, AB, BB}, we define

a_{i j} = \frac{1}{ω} e^{- β E_{i j}},

(1)

where ω is the number of possible orientational configurations of one protein relative to its neighbor. For the simple cubic lattice considered here, ω = 24, which is the number of possible orientations of a cubic object on a cubic lattice. In the calculations below, the statistical weight of an interface of type ij is given by a_ijc/c₀, where c₀ = 1M is the reference concentration.

The concentrations of the different possible aggregates can be calculated by considering formation of chains that grow from one end only. If chains grow from both ends or if chains can aggregate with other chains (rather than just chains with monomers), this does not alter the equilibrium frequencies of the different aggregates. Therefore, we give the simplest case of the calculation here, which allows for the growth of one monomer at a time from one end only.

Letting C_n(A) and C_n(B) denote the equilibrium concentrations of a chain of length n, with the A or the B face exposed at the growing end, the equilibrium concentrations can be calculated using a transfer matrix method,

(\begin{matrix} C_{n} (A) \\ C_{n} (B) \end{matrix}) = (c / c_{0}) (\begin{matrix} a_{A B} & a_{B B} \\ a_{A A} & a_{A B} \end{matrix}) (\begin{matrix} C_{n - 1} (A) \\ C_{n - 1} (B) \end{matrix}) .

(2)

We define C₁(A) = C₁(B) = c/2 so that the sum of the two orientations is equal to the total free monomer concentration, c. The eigenvalues of the transfer matrix, a, are given by $λ_{\pm} = a_{A B} \pm \sqrt{a_{A A} a_{B B}}$ , and from these, the abundance of chains of length n, given by C_n = C_n(A) + C_n(B), can be obtained as

\frac{C_{n}}{c_{0}} = {(\frac{c}{c_{0}})}^{n} [A_{+} λ_{+}^{n - 1} - A_{-} λ_{-}^{n - 1}],

(3)

where

A_{\pm} = \frac{a_{A A} + a_{B B} \pm 2 \sqrt{a_{A A} a_{B B}}}{4 \sqrt{a_{A A} a_{B B}}} .

(4)

For example, it is easily verified from Eq. (3) that for n = 2, the dimer concentration is

\frac{C_{2}}{c_{0}} = \frac{1}{2} {(\frac{c}{c_{0}})}^{2} (a_{A A} + 2 a_{A B} + a_{B B}),

(5)

where the terms for the two dimer configurations with isologous interfaces and the two orientations of the dimer with the heterologous interface can be clearly seen.

The concentration of monomers, c, can now be determined. The concentration of proteins in clusters of size n is given by ϕ_n = nC_n. The total protein subunit concentration is ϕ = ∑_nϕ_n. This sum gives an equation from which the free monomer concentration, c, can be calculated,

\frac{ϕ}{c_{0}} = \frac{A_{+} (c / c_{0})}{{(1 - λ_{+} c / c_{0})}^{2}} - \frac{A_{-} (c / c_{0})}{{(1 - λ_{-} c / c_{0})}^{2}} .

(6)

There is always a single solution to Eq. (6) in the physical range where 0 < c < ϕ.

The probability of a subunit being present in an n-mer is P_n = ϕ_n/ϕ. The fractions of proteins present as monomers and dimers are P₁ and P₂. We refer to all aggregates of 3 or more units as fibrils; hence, the fraction of proteins in fibrils is P_fib = ∑_n≥3P_n. In some cases, we wish to distinguish closed dimers with the strong AA interface from other dimers. The fraction of proteins in closed dimers is

P_{2}^{*} = P_{2} \frac{a_{A A}}{a_{A A} + a_{B B} + 2 a_{A B}} .

(7)

Likewise, in other cases, we wish to distinguish oriented fibrils containing only AB interfaces from general fibrils containing mixtures of all three types of interface. The concentration of proteins in oriented fibrils of length n is

ϕ_{n}^{o r i} = \frac{n c^{n} a_{A B}^{(n - 1)}}{{c_{0}}^{(n - 1)}},

(8)

and the fraction of proteins in oriented fibrils is

P_{o r i} = \frac{1}{ϕ} \sum_{n \geq 3} ϕ_{n}^{o r i} .

(9)

IV. EVOLUTIONARY COMPUTATIONS

We now consider the evolution of proteins whose interactions are described by the statistical physics model above. We consider a population of individuals, each with a gene for the protein in question. The fitness of an individual is a function of the protein sequence. If mutation is weak in comparison to selection, as we will assume below, there is a dominant variant of the protein in the population at any one time, and occasionally, a new variant spreads through the population and replaces the old one. We would like to calculate the long-term steady state frequencies of sequences in the ensemble of sequences generated by this evolutionary process.

We consider protein sequences evolving under a mutational model in which the rate of mutation from amino acid i to j is r_ij = uπ_j, where u is a rate constant and π_j is the steady state frequency of amino acid j under the mutational process. For simplicity, we deal with mutations at the level of the protein sequence and do not consider the underlying DNA. In the neutral case, protein sequences evolve under the influence of mutation, and there is no selection. Let $f_{k}^{m u t}$ be the steady state frequency of sequence k under mutation. We consider the simplest case where all 20 amino acids have equal frequency (π_j = 0.05 for all j). Hence, there are 20³² possible amino acid sequences, each with steady state frequency $f_{k}^{m u t} = {(0.05)}^{32}$ .

We define the fitness of a sequence as $w$ = 1 + s, where positive and negative values of the selection coefficient, s, denote advantageous or deleterious sequences and s = 0 for neutral variants. For any amino acid sequence, we assume that s is a function of the multimer configuration probabilities P_n for that sequence. We consider several choices of fitness functions: (i) a neutral case, where s = 0 for every sequence; (ii) positive selection in favor of dimer formation, where $s = σ P_{2}^{*}$ ; (iii) selection against fibril formation, where s = −σP_{f ib}; (iv) selection in favor of fibril formation, where s = σP_{f ib}; and (v) selection in favor of oriented fibrils containing only AB interfaces, where s = σP_ori. In all these cases, σ is a positive constant that determines the strength of selection.

In order to calculate the steady state frequencies of sequences in the presence of selection as well as mutation, we assume that mutations are rare enough so that only one mutation is segregating at a time in the same gene. This is a common approximation in population genetics that allows analytical progress in a simple way. In this approximation, the stationary frequency of a sequence k under the influence of selection is weighted by a factor $e^{2 N_{e} s (k)}$ relative to the case with no selection,^29,30 where s(k) is the selection coefficient for this sequence and N_e is the effective population size. The frequency of sequence k under selection and mutation is

f_{k}^{s e l} = \frac{f_{k}^{m u t} e^{2 N_{e} s (k)}}{\sum_{j} f_{j}^{m u t} e^{2 N_{e} s (j)}} .

(10)

The practical issue with Eq. (10) is the exponential number of sequences in the sum. It is not possible to exhaustively consider all 20³² sequences. We therefore use a Markov Chain Monte Carlo (MCMC) sampling method that generates a large sample of representative protein sequences such that the probability of any sequence arising in the sample is proportional to its steady state frequency. The average properties of the full ensemble are closely approximated by the simple mean of the properties of the sequences in the sample.

The MCMC simulations begin with a random sequence of 32 amino acids. We then generate a descendant sequence via replication with mutation. The probability that an amino acid i in the parent is replaced by j in the descendant is r_ij. The probability that the amino acid remains unchanged is r_ii = 1 − ∑_jr_ij. The value of u is not critical as it does not influence steady state frequencies. We found u = 0.05 to allow efficient exploration of the sequence space. If there is no selection, then every descendant sequence is accepted into the sample, and the method generates a sample with frequencies proportional to $f_{k}^{m u t}$ . If selection is acting, we accept or reject the descendant according to its fitness. Let the current sequence be k₁ and the descendant be k₂, and let the selection coefficients for these sequences be s(k₁) and s(k₂). The difference in fitness between the sequences is Δs = s(k₂) − s(k₁). To ensure that the frequency of any sequence k in the sample is proportional to $f_{k}^{m u t} e^{2 N_{e} s (k)}$ , as is required, the ratio of acceptance of mutations that increase and decrease fitness must be $e^{2 N_{e} Δ s}$ . Our MCMC algorithm does this in the simplest way: it accepts the new sequence with probability 1 if Δs is positive and with probability $e^{2 N_{e} Δ s}$ if Δs is negative. If the new sequence is rejected, a second copy of the old sequence goes into the sample. This method is equivalent to the Metropolis algorithm used for Boltzmann-weighted sampling in physics. We also note that a similar method of evolutionary simulation was used in another model of protein evolution³¹ in which the fitness of a sequence depends on its folding ability and its affinity to another target model.

V. PHENOTYPE DISTRIBUTIONS

The two most useful quantities to summarize the phenotype of a sequence are the frequency of AA dimers, $P_{2}^{*}$ , and the frequency of fibrils, P_{f ib}. Figure 2(a) shows the distribution of a sample of sequences generated by the MCMC evolutionary simulation in the neutral case with a total concentration of ϕ = 0.01M. The MCMC routine ran for 300 000 generations, and the first 5000 generations were discarded to allow for equilibration. As all sequences have equal frequency under this mutational model when there is no selection, the sequences generated are simply random amino acid sequences. The figure shows that sequences are spread over a broad range of $P_{2}^{*}$ and P_{f ib}. Sequences close to the origin (where $P_{2}^{*}$ and P_{f ib} are close to zero) exist mostly as monomers (P₁ is close to 1). Sequences in the bottom right corner are mostly dimers. Sequences in the top corner are mostly fibrils. It can be seen, however, that strong fibril formers are rare under neutral evolution at this concentration. Thus, no points are found very close to the top corner in Fig. 2(a). The mean values of these probabilities for all sequences in the sample are $⟨ P_{2}^{*} ⟩ = 0.04$ and ⟨P_{f ib}⟩ = 0.003. Thus, typical sequences are usually monomers.

FIG. 2. — Phenotype distribution in the space $P_{2}^{*}$ vs P_fib for samples of sequences arising under evolution using the MCMC method. (a) Neutral, (b) selection for dimers (N_eσ = 25), and (c) selection for fibrils (N_eσ = 25). For each of these plots, the red symbol denotes the mean value of $P_{2}^{*}$ and P_fib for all the sequences in the sample.

Figures 2(b) and 2(c) show the way the phenotype distribution shifts when selection is applied for dimers and for fibrils. When selection is applied for dimers, the distribution shifts close to the bottom right corner, with $⟨ P_{2}^{*} ⟩ = 0.89$ and ⟨P_{f ib}⟩ = 0.01. When selection is applied for fibrils, the distribution shifts close to the top corner, with $⟨ P_{2}^{*} ⟩ = 0.04$ and ⟨P_{f ib}⟩ = 0.90. This means that sequences that are either very strong fibril-formers or very strong dimer-formers are possible in this model and that they arise easily when selection favors them. Nevertheless, they are relatively rare compared to the large number of random sequences with weaker interface interactions, so they do not arise frequently in the mixture of random sequences generated under neutral evolution.

To illustrate the range of behaviors shown by individual sequences, we chose the six example sequences A–F described in Table I. For each of these sequences, the distribution of n-mer probabilities, P_n, is shown in Fig. 3 at concentration ϕ = 0.01M. This value is consistent with cellular concentrations of the enzymes that are present at the highest quantities in cells as these are the ones for which aggregation is most relevant. Various mechanisms of subcellular protein localization would additionally enhance their concentrations.^32–34

TABLE I.

Energies of the three interfaces for example sequences A-F discussed in Figs. 3 and 4.

Sequence	Description	E_AA/kT	E_BB/kT	E_AB/kT
A	Monomer	−2.68	−2.32	−2.80
B	Dimer-former	−12.06	−3.14	−5.25
C	Strong dimer-former	−15.82	−0.60	−2.80
D	Fibril former	−9.08	−8.67	−5.68
E	Strong fibril-former	−12.28	−12.06	−12.01
F	Oriented fibril-former	−4.53	−4.52	−8.18

Open in a new tab

FIG. 3. — Histograms of P_n for six example sequences illustrating different behaviors at ϕ = 0.01M. Panels (a)–(f) refer to the six sequences described in Table I.

The probabilities P_n change significantly as the concentration is varied. The changes with concentration can be illustrated as trajectories in the $P_{2}^{*}$ vs P_{f ib} triangle. Figure 4 shows the trajectories for sequences A–F as the concentration is increased from 10⁻⁶M to 1M. All sequences begin at the origin (all monomers) for low concentration and eventually move toward the fibril corner for very high concentration. Dimer-forming sequences approach the dimer corner at intermediate concentrations. The concentration ϕ = 0.01M, which was used in Fig. 3, is shown as red diamonds in Fig. 4. Extreme concentrations higher than this are included in order to illustrate the predictions of the model. The highest concentration point is 1M, shown as purple triangles.

FIG. 4. — Plot showing trajectories in the space $P_{2}^{*}$ vs P_fib for the six sequences A–F in Table I. The concentration varies from ϕ = 10⁻⁶M to 1M, with the reference concentration ϕ = 0.01M labeled as red diamonds and the highest concentration ϕ = 1M labeled as purple triangles.

Sequence A is a typical sequence chosen randomly from the sample generated by the neutral simulation [Fig. 2(a)]. The energies of all three interfaces are weak; hence, this sequence is almost entirely monomers at ϕ = 0.01M [see Fig. 3(a)]. The trajectory does not move close to the dimer corner at any concentration, and it is still not close to the fibril corner, even at ϕ = 1M.

Sequence B is a dimer-former found in the neutral sample. It is the sequence with the highest $P_{2}^{*}$ in Fig. 2(a). This sequence is mostly a dimer at ϕ = 0.01M [see Fig. 3(b)] and gradually becomes a fibril at concentrations higher than this. Sequence B forms dimers because the AA interface is strong. E_AA is much lower than the other two energies (see Table I).

Sequence C is a strong dimer-former found in the sample generated under selection for dimer formation [Fig. 2(b)]. It is almost entirely a dimer at the reference concentration and remains very close to the dimer corner even at ϕ = 1M. The AA interface is even stronger than for sequence B.

Sequence D is a fibril-former found in the neutral sample. All three interface energies are fairly strong. This sequence has P_{f ib} = 0.61 at ϕ = 0.01M, which is the highest in Fig. 2(a), and the distribution of P_n has significant weight at larger n. Sequence E is a strong fibril-former found in the sample of sequences selected for fibril formation [Fig. 2(c)]. All three interface energies are very strong. This sequence has P_{f ib} close to 1 already at ϕ = 0.01M.

Sequence F is a fairly strong fibril-former found in the neutral sample, which has P_{f ib} = 0.44 at ϕ = 0.01M. It differs from the other fibril-formers (D and E) in that the heterologous interface energy E_AB is much lower than the others. This means that it forms mostly oriented fibrils. The frequency of closed AA dimers, $P_{2}^{*}$ , is very low at all concentrations; hence, the trajectory in Fig. 4 moves almost along the P_{f ib} axis.

VI. PROPERTIES OF PROTEIN INTERFACES

Figure 5 shows the mean energies of the three possible interfaces for random sequences evolving neutrally (shown as horizontal lines) and compares these with the mean energies for sequences generated under four different kinds of selection (shown as points). It can be seen that for neutral evolution, E_AA is significantly lower than E_BB even though the sequences are random. This occurs by definition because we have labeled surfaces A and B for each sequence such that A forms the stronger interface of the two. The heterologous interface energy E_AB is intermediate between the two isologous interface energies. All three energies are negative because the mean interaction energy of random amino acid pairs (from the B_ij matrix in Table I of the supplementary material) is slightly negative: ⟨B_ij⟩ = −0.057. The mean energy for an interface of 16 random pairs is therefore −0.912. The average energies of the three kinds of interface under neutral evolution are all lower than this because we consider four rotations of the two surfaces, as shown in Fig. 1, and take the lowest of these to define the energy of the interface.

Figure 5 illustrates the way the energies of the interfaces change when selection is applied. In the case of selection for dimers, E_AA decreases substantially with respect to the neutral case, as we would expect, because we are selecting for sequences with high $P_{2}^{*}$ . It can be seen that E_BB actually increases slightly with respect to the neutral case. It is important that the BB interface should remain weak because if both AA and BB interfaces become strong, the sequence will form fibrils with proteins in alternating directions.

The second column in Fig. 5 illustrates the case of selection against fibrils. We were interested in this case because we expect that uncontrolled fibril formation should be harmful to the cell. Selection against fibrils eliminates the rare sequences with high P_{f ib} from the neutral phenotype distribution, but since these sequences are rare and since the mean value of P_{f ib} in the neutral case is already very low, selection against fibrils has only a small effect on the mean energies of the interfaces. It can be seen that all three energies increase slightly with respect to their neutral values, making all kinds of multimers and fibrils less frequent.

There are some proteins whose function requires fibril formation, such as actin and tubulin. Therefore, we also considered the case when selection acts for fibrils. In this case, all three interface energies become much lower than the neutral values. It can be seen that E_AA and E_BB decrease more than E_AB, meaning that the orientation of proteins in these fibrils will be rather random, but there will be relatively few heterologous interfaces. In Sec. III, we defined P_ori, the fraction of proteins in oriented fibrils. For the case where selection is for fibrils of all kinds, we find ⟨P_ori⟩ is only 0.05, even though ⟨P_{f ib}⟩, which includes fibrils with proteins in all possible arrangements, is 0.90. By contrast, the fourth case in Fig. 5 shows selection for oriented fibrils only. In this case, E_AB decreases much more than E_AA and E_BB. Hence, most interfaces will be heterologous. In this case, we find ⟨P_{f ib}⟩ is also 0.90, but ⟨P_ori⟩ is 0.87, meaning that almost all the fibrils are oriented.

Taken together, these results show that this model of protein interfaces is quite versatile. It allows selection for both increased and decreased strength of interfaces, and it allows separate selection for either heterologous interfaces (as in the case of dimers) or isologous interfaces (as in the case of oriented fibrils).

VII. INTERFACE PROPENSITIES OF AMINO ACIDS

Our model allows us to study the way that the frequencies of amino acids at interfaces vary with respect to the frequencies that would be expected under random mutation. The propensities of the amino acids to occur at interfaces have been measured in real proteins (Jones and Thornton³⁵ and Levy et al.³⁶). In this section, we show that our 20-amino acid model generates interface propensities that are similar to these.

We generated 2 × 10⁷ random sequences of amino acids on the A and B surfaces. We calculated E_AA and E_BB, and where necessary, we relabelled the A and B surfaces so that A is the stronger interface. We measured the frequencies p_A and p_B of amino acids on each surface, relative to the expected frequency under neutral mutations, π_i = 0.05. These are shown in Fig. 6, and the data are given in Table II of the supplementary material. When we compare pairs of interfaces in this way and distinguish the stronger from the weaker, the frequencies of amino acids on the two surfaces are different, even though the average frequency on both surfaces has to be equal for all amino acids. Hence, in the figure, we see that p_A can be significantly higher or lower than p_B for many of the amino acids, even though the average of p_A and p_B has to be 1 for every amino acid.

FIG. 6. — Relative amino acid frequencies at the A surface (p_A, blue) and the B surface (p_B, red) for neutrally evolving proteins. The frequencies are not equal on the two surfaces because the A and B surfaces have been defined as the stronger and weaker of the two, respectively.

From these probabilities, we define an interface propensity for each amino acid as

S_{i n t} = \ln (p_{A} / p_{B}) .

(11)

This score is positive for amino acids that have increased frequency at strong interfaces and negative for those that have decreased frequency.

The S_int scores are related to the energies in the B_ij matrix, as shown in Fig. 7(a). We define B_self as the self-interaction energy of the amino acid (the diagonal element B_ii of the matrix) and B_ave as the mean of the interaction of one amino acid with the 20 possible partners. The more negative the B_self and B_ave, the higher the interface propensity. The data are given in Table II of the supplementary material. The correlation coefficients r and the p values for the t-test of correlation are given in the caption and in Table III of the supplementary material. These correlations are highly significant, and the correlation with B_ave is stronger than with B_self, as can be seen graphically in Fig. 7(a).

The S_int scores are also related to two previous scales of interface propensities measured from protein structure data. In Table II of the supplementary material, ln(RIP) is the “relative interface propensity” from Ref. 35, and “stickiness” is the interface propensity scale from Ref. 36. Both of these scales are derived from the observed frequencies of amino acids at protein-protein interfaces relative to their frequencies at noninteracting surfaces. The score from our model, obtained from the relative frequencies at the A and B surfaces, is directly comparable to these. Figure 7(b) shows that there is a strong positive correlation between S_int and the two other interface propensities. The correlation coefficients and t-test parameters are given in Table III of the supplementary material, showing that there is a highly significant correlation between all three scores.

This observation tells us that the B_ij matrix contains detailed information about the strengths of interactions between the different amino acids that is sufficient to quantitatively predict which amino acids increase or decrease in frequency at interfaces. It also tells us something about why this occurs. Since surface amino acids interact with other copies of themselves and with all possible other amino acids at the interface, those which have the most negative B_self and B_ave increase the most in frequency at the strongly binding interface.

Figure 8 and Table IV of the supplementary material show the frequencies of amino acids at surfaces A and B in the sets of sequences generated by the MCMC sampling method when selection is present. Selection for dimers [as in Fig. 8(a)] accentuates the difference between the A and B surfaces that is already seen in the neutral case (Fig. 6). In the dimer case, the hydrophobic amino acids on the left are very much more frequent on the dimer-forming A interface than on the noninteracting B interface, whereas the hydrophilic amino acids on the right are much more frequent on the noninteracting B interface.

In the case of selection for fibrils, both AA and BB interfaces become strong. Thus, p_A and p_B both show a decreasing trend from left to right in Fig. 8(b), and there is not much difference between p_A and p_B. The case of selection against fibrils is shown in Table IV of the supplementary material. The frequencies do not change very much from the neutral case because most sequences have a low fibril forming probability, as we saw previously.

Figure 8(c) compares the p_A frequencies in the case of selection for dimers with the p_A frequencies in the case of selection for oriented fibrils. In the dimer case, we select for isologous AA interfaces, whereas in the case of oriented fibrils, we select for heterologous AB interfaces. A comparison of these two shows that amino acids differ significantly in frequency between isologous and heterologous interfaces. In particular, it can be seen that Cys is more frequent at isologous interfaces and the charged amino acids (Arg, Lys, Glu, and Asp) are more frequent at heterologous interfaces. From this, we define a propensity for amino acids at isologous vs heterologous interfaces as

S_{i s o} = \ln (p_{A} (d i m e r s) / p_{A} (o r i e n t e d fi b r i l s)) .

(12)

This propensity is the highest for Cys and the most negative for the charged amino acids (see Table IV of the supplementary material).

This effect occurs because amino acids in isologous interfaces have a significant probability of interacting with the copy of themselves on the other side of the interface, whereas amino acids in heterologous interfaces only interact with themselves if there is an independently evolved amino acid of the same kind on the other surface. We define the difference between average and self-energies as ΔB = B_ave − B_self. We expect amino acids with positive ΔB to be favored at isologous interfaces, and vice versa. Figure 9(a) shows that there is a highly significant correlation of S_iso and ΔB. The significance values are given in the caption and in Table III of the supplementary material. From the B_ij matrix in Table I of the supplementary material, it can be seen that Cys has a particularly low value of B_self, presumably reflecting the presence of disulphide bridges in the data from which the B_ij matrix was derived. This results in a large positive ΔB for Cys. The charged amino acids have positive values of B_self, presumably due to repulsions between like charges. This results in negative ΔB for the charged amino acids.

FIG. 9. — (a) Correlation of the difference between average and self-energies ΔB = B_ave − B_self with the isologous vs heterologous interface propensity, S_iso. (b) Correlation of the relative enrichment of amino acids in isologous vs heterologous interfaces, ln(REI), found in analysis of real protein complex structures with S_iso obtained from our model. The correlation coefficients and significance values for these plots are ΔB, r = 0.897, p = 8.7 × 10⁻⁸; and ln(REI), r = 0.615, p = 3.9 × 10⁻³.

It can be seen in Fig. 1(b) that when there is a 90° rotation of one protein with respect to the other, four of the 16 amino acids (numbered 1, 6, 11, and 16) form pairwise contacts with themselves. The same happens in the 270° rotation but does not happen in the 0° and 180° rotations. Although some of these details are particular to the square lattice we are using, the point that amino acids in isologous interfaces can interact with copies of themselves is still true in real proteins where there is no square lattice. For example, the same effect occurs in the circular patch model used in Refs. 14 and 15. Therefore, it is reasonable to ask whether the systematic difference in amino acid frequencies between isologous and heterologous interfaces that we observe in our model also arises in real proteins. To address this, we performed a systematic analysis of the amino acid residues present in the homomeric interfaces of real protein complex structures present in the Protein Data Bank.

Starting from a snapshot of all of the structures in the Protein Data Bank (9-26-2018), all protein residues present in homomeric interfaces were identified as those burying any solvent-accessible surface area with an identical polypeptide chain. Incomplete residues missing any nonhydrogen atoms from the side chain were excluded. Isologous interfaces were classified as those where the correlation between the residue-specific buried surface areas for each subunit in an interacting pair was >0.7, as defined previously.⁴² Since the dataset of interface residues initially contained data from many proteins with closely related or identical sequences, we used the PISCES protein sequence culling server⁴³ to remove chains with 90% or greater similarity. This resulted in a total number of 932 536 interface residues from 18 777 nonredundant chains. The total number of occurrences of each amino acid was then counted for isologous and heterologous interfaces, and the proportion was calculated by dividing by the total number of residues at each type of interface. Relative enrichment at isologous interfaces, REI, was calculated by dividing the proportion of each amino acid at isologous interfaces by the proportion at heterologous interfaces (see Table IV of the supplementary material).

Interestingly, we observe a significant correlation between S_iso and ln(REI) [see Table III of the supplementary material and Fig. 9(b)], thus validating the utility of our simplified model and demonstrating its power in capturing genuine sequence differences between the different types of interfaces. The deviations between our model and the pattern observed in real structures could be due to a number of factors. In particular, there are likely to be systematic differences in the functions of homo-oligomers with isologous vs heterologous interfaces as there is a strong association between symmetry and function.⁴ For example, transmembrane channels will be enriched in heterologous interfaces due to their strong association with higher-order cyclic symmetry. Thus, if the interfaces of transmembrane proteins tend to differ in amino acid composition compared to other proteins, this could add a degree of bias.

VIII. DISCUSSION AND CONCLUSIONS

This work presents a first attempt at a theoretical description of the evolution of multimers and fibrils. The pairwise contact-energy matrix that we used is a simple way of defining interface energies that does not account for the three dimensional structure of surfaces. It was not optimized in any way for the present model. We are therefore very satisfied that several features of the interface propensities and the isologous/heterologous propensities are quite close to those seen in real proteins.

The inevitable presence of hydrophobic residues means that all proteins will be aggregation prone to some extent. Hydrophobic residues in the interior are necessary for proper folding of proteins, and hydrophobic residues on the surface can lead to formation of functional multimeric states. While uncontrolled protein aggregation has been shown to be associated with an increasing number of pathological conditions, including human diseases, due to loss of normal function or gain in toxic activity, fibril formation can also serve functional roles in cases such as adhesion and biofilm formation in bacteria³⁷ and defense against micro-organisms.³⁸ Cells employ a range of strategies to control aggregation at both the sequence levels (for example, through modulation of aggregation-prone regions or protein stability) and at the cellular level (for example, through compartmentalization and modulation of protein abundance).³⁹

With the interaction energies used here, we find that strongly aggregating proteins will be rare under neutral evolution at concentrations that are likely to arise in the cell. This conclusion needs to be treated with caution because of the simplicity of the interaction energy rules. We have only considered solutions of a single kind of protein. It may be possible to extend the model to consider mixtures of many kinds of proteins in the future. It should also be noted that there is a parameter ω for rotational entropy in Eq. (1) that is not known with certainty. Lower values of ω would lead to higher probabilities of aggregation at any given concentration. Also, we have arbitrarily chosen surfaces with 16 amino acids. Increasing or decreasing the number of interacting amino acids in a patch would increase or decrease the strength of interactions, which would also affect the frequency of strongly aggregating proteins expected under neutral evolution.

A further caveat is that the evolutionary calculations were done under the approximation that a single mutation is segregating in the sequence at once, which is not always true. This could be improved using full-scale population genetics simulations in the future. It would also be possible to consider evolution at the DNA level and determine the protein sequence by translation of the gene. This would allow us to consider cases where the steady state frequencies of the amino acids in the proteins and the four nucleotides in the genes are biased by mutation.

Future extensions of this work include developing this model to consider other multimer structures, such as cyclic and dihedral tetramers, by allowing more than two sticky faces on each protein or by considering proteins with two sticky faces at an angle of 90° to one another. An important aim will be to predict the relative frequencies of multimers of different symmetries and different numbers of subunits, as is tabulated in the “periodic table” classification of the protein structure database.⁴⁰ Furthermore, as our model is able to predict the way the multimer structures will change when mutations are made to the surface residues, we will be able to study evolution of multimer structures over time in a family of related species and compare this with studies of structural evolution.⁴¹ The present approach therefore introduces a method from which a wide range of new developments will be possible for the study of the evolution of higher order protein structure.

SUPPLEMENTARY MATERIAL

The supplementary material file Supplementary Tables.xlsx contains the data in the four supplementary tables. The file Supplementary Table Descriptions.docx contains descriptions of these tables.

ACKNOWLEDGMENTS

This work was supported by Discovery Grant No. 2017–05911 to Paul Higgs from the Natural Sciences and Engineering Research Council of Canada. This collaboration was initiated at KITP and supported in part by the National Science Foundation Grant No. NSF PHY-1125915, Multidisciplinary University Research Initiative Award Nos. W911NF-09-1-0444 and W911NF-14-1-0411, and National Institutes of Health Award No. R35-GM122566-01. J.A.M. was supported by a Medical Research Council Career Development Award (No. MR/M02122X/1) and a Lister Institute Research Prize Fellowship.

REFERENCES

1.Levy E. D. and Teichmann S., “Structural, evolutionary, and assembly principles of protein oligomerization,” in Progress in Molecular Biology and Translational Science (Elsevier, 2013), pp. 25–51. [DOI] [PubMed] [Google Scholar]
2.Changeux J.-P., “Allostery and the Monod-Wyman-Changeux model after 50 years,” Annu. Rev. Biophys. 41, 103–133 (2012). 10.1146/annurev-biophys-050511-102222 [DOI] [PubMed] [Google Scholar]
3.Eisenberg D. and Jucker M., “The amyloid state of proteins in human diseases,” Cell 148, 1188–1203 (2012). 10.1016/j.cell.2012.02.022 [DOI] [PMC free article] [PubMed] [Google Scholar]
4.Bergendahl L. T. and Marsh J. A., “Functional determinants of protein assembly into homomeric complexes,” Sci. Rep. 7, 4932 (2017). 10.1038/s41598-017-05084-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
5.Jubb H. C., Pandurangan A. P., Turner M. A., ochoa-Montano B., Blundell T. L., and Asher S. B., “Mutations at protein-protein interfaces: Small changes over big surfaces have large impacts on human health,” Prog. Biophys. Mol. Biol. 128, 3–13 (2017). 10.1016/j.pbiomolbio.2016.10.002 [DOI] [PubMed] [Google Scholar]
6.Oosawa F. and Kasai M., “A theory of linear and helical aggregations of macromolecules,” J. Mol. Biol. 4, 10–21 (1962). 10.1016/s0022-2836(62)80112-0 [DOI] [PubMed] [Google Scholar]
7.Nyrkova I. A., Semenov A. N., Aggeli A., Bell M., Boden N., and McLeish T. C. B., “Self-assembly and structure transformations in living polymers forming fibrils,” Eur. Phys. J. B 17, 499–513 (2000). 10.1007/s100510070128 [DOI] [Google Scholar]
8.Schmit J. D., Ghosh K., and Dill K., “What drives amyloid molecules to assemble into oligomers and fibrils?,” Biophys. J. 100, 450–458 (2011). 10.1016/j.bpj.2010.11.041 [DOI] [PMC free article] [PubMed] [Google Scholar]
9.van Gestel J., van der Schoot P., and Michels M., “Helical transition of polymer-like assemblies in solution,” J. Phys. Chem. B 105, 10691–10699 (2001). 10.1021/jp011733o [DOI] [Google Scholar]
10.van Gestel J. and de Leeuw S. W., “A statistical-mechanical theory of Fibril formation in dilute protein solutions,” Biophys. J. 90, 3134–3145 (2006). 10.1529/biophysj.105.076000 [DOI] [PMC free article] [PubMed] [Google Scholar]
11.Lynch M., “Evolutionary diversification of the multimeric states of proteins,” Proc. Natl. Acad. Sci. U. S. A. 110, E2821–E2828 (2013). 10.1073/pnas.1310980110 [DOI] [PMC free article] [PubMed] [Google Scholar]
12.Lynch M., “The evolution of multimeric protein assemblages,” Mol. Biol. Evol. 29, 1353–1366 (2011). 10.1093/molbev/msr300 [DOI] [PMC free article] [PubMed] [Google Scholar]
13.Gillespie J. H., The Causes of Molecular Evolution (Oxford University Press, 1991). [Google Scholar]
14.Lukatsky D. B., Zeldovich K. B., and Skakhnovich E. I., “Statistically enhanced self-attraction of random patterns,” Phys. Rev. Lett. 97, 178101 (2006). 10.1103/physrevlett.97.178101 [DOI] [PubMed] [Google Scholar]
15.Lukatsky D. B., Skakhnovich B. E., Mintseris J., and Skakhnovich E. I., “Structural similarity enhances interaction propensity of proteins,” J. Mol. Biol. 365, 1596–1606 (2007). 10.1016/j.jmb.2006.11.020 [DOI] [PMC free article] [PubMed] [Google Scholar]
16.Ispolatov I., Yuryev A., Mazo I., and Maslov S., “Binding properties and evoution of homodimers in protein-protein interaction networks,” Nucleic Acids Res. 33, 3629–3635 (2005). 10.1093/nar/gki678 [DOI] [PMC free article] [PubMed] [Google Scholar]
17.Zeldovich K. B., Chen P., Shakhnovich B. E., and Shakhnovich E. I., “A first-principles model of early evolution: Emergence of gene families, species and preferred protein folds,” PLoS Comput. Biol. 3(7), e139 (2007). 10.1371/journal.pcbi.0030139 [DOI] [PMC free article] [PubMed] [Google Scholar]
18.Kortemmea T., Morozov A. V., and Baker D., “An orientation-dependent hydrogen bonding potential improves prediction of specificity and structure for proteins and protein-protein complexes,” J. Mol. Biol. 326, 1239–1259 (2003). 10.1016/s0022-2836(03)00021-4 [DOI] [PubMed] [Google Scholar]
19.Zheng W., Schafer N. P., Davtyan A., Papoian G. A., and Wolynes P. G., “Predictive energy landscapes for protein protein association,” Proc. Natl. Acad. Sci. U. S. A. 109, 19244–19249 (2012). 10.1073/pnas.1216215109 [DOI] [PMC free article] [PubMed] [Google Scholar]
20.Gromiha M. M. and Yugandhar K., “Integrating computational methods and experimental data for understanding the recognition mechanism and binding affinity of protein-protein complexes,” Prog. Biophys. Mol. Biol. 128, 33–38 (2017). 10.1016/j.pbiomolbio.2017.01.001 [DOI] [PubMed] [Google Scholar]
21.Miyazawa S. and Jernigan R. J., “Residue-residue potentials with a favourable contact pair term and an unfavourable high packing density term, for simulation and threading,” J. Mol. Biol. 256, 623–644 (1996). 10.1006/jmbi.1996.0114 [DOI] [PubMed] [Google Scholar]
22.Kim Y. C. and Hummer G., “Coarse-grained models for simulations of multiprotein complexes: Application to ubiquitin binding,” J. Mol. Biol. 375, 1416–1433 (2008). 10.1016/j.jmb.2007.11.063 [DOI] [PMC free article] [PubMed] [Google Scholar]
23.Dignon G. L., Zheng W., Kim Y. C., Best R. B., and Mittal J., “Sequence determinants of protein phase behavior from a coarse-grained model,” PLoS Comput. Biol. 14(1), e1005941 (2018). 10.1371/journal.pcbi.1005941 [DOI] [PMC free article] [PubMed] [Google Scholar]
24.Taverna D. M. and Goldstein R. A., “Why are proteins so robust to site mutations?,” J. Mol. Biol. 315, 479–484 (2002). 10.1006/jmbi.2001.5226 [DOI] [PubMed] [Google Scholar]
25.Williams P. D., Pollock D. D., and Goldstein R. A., “Functionality and the evolution of marginal stability in proteins: Inferences from lattice simulations,” Evol. Bioinf. 2, 117693430600200 (2006). 10.1177/117693430600200013 [DOI] [PMC free article] [PubMed] [Google Scholar]
26.Goldstein R. A. and Pollock D. D., “Sequence entropy of folding and the absolute rate of amino acid substitutions,” Nature Ecol. Evol. 1, 1923–1930 (2017). 10.1038/s41559-017-0338-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
27.Betancourt M. R. and Thirumalai D., “Pair potentials for protein folding: Choice of reference states and sensitivity of predicted native states to variations in the interaction schemes,” Protein Sci. 8, 361–369 (1999). 10.1110/ps.8.2.361 [DOI] [PMC free article] [PubMed] [Google Scholar]
28.Betancourt M. R. and Thirumalai D., “Exploring the kinetic requirements for enhancement of protein folding rates in the GroEL cavity,” J. Mol. Biol. 287, 627–644 (1999). 10.1006/jmbi.1999.2591 [DOI] [PubMed] [Google Scholar]
29.Bulmer M., “The selection-mutation-drift theory of synonymous codon usage,” Genetics 129, 897–907 (1991); available at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1204756/. [DOI] [PMC free article] [PubMed] [Google Scholar]
30.Higgs P. G. and Ran W., “Coevolution of codon usage and tRNA genes leads to alternative stable states of biased codon usage,” Mol. Biol. Evol. 25, 2279–2291 (2008). 10.1093/molbev/msn173 [DOI] [PubMed] [Google Scholar]
31.Manhart M. and Morozov A. V., “Protein folding and binding can emerge as evolutionary spandrels through structural coupling,” Proc. Natl. Acad. Sci. U. S. A. 112, 1797–1802 (2015). 10.1073/pnas.1415895112 [DOI] [PMC free article] [PubMed] [Google Scholar]
32.Albe K. R., Butler M. H., and Wright B. E., “Cellular concentrations of enzymes and their substrates,” J. Theor. Biol. 143, 163–195 (1990). 10.1016/s0022-5193(05)80266-8 [DOI] [PubMed] [Google Scholar]
33.Milo R., “What is the total number of protein molecules per cell volume? A call to rethink some published values,” BioEssays 35, 1050–1055 (2013). 10.1002/bies.201300066 [DOI] [PMC free article] [PubMed] [Google Scholar]
34.Levy E. D., Kowarzyk J., and Michnick S. W., “High-resolution mapping of protein concentration reveals principles of proteome architecture and adaptation,” Cell Rep. 7, 1333–1340 (2014). 10.1016/j.celrep.2014.04.009 [DOI] [PubMed] [Google Scholar]
35.Jones S. and Thornton J. M., “Analysis of protein-protein interaction sites using surface patches,” J. Mol. Biol. 272, 121–132 (1997). 10.1006/jmbi.1997.1234 [DOI] [PubMed] [Google Scholar]
36.Levy E. D., De S., and Teichmann S. A., “Cellular crowding imposes global constraints on the chemistry and evolution of proteomes,” Proc. Natl. Acad. Sci. U. S. A. 109, 20461–20466 (2012). 10.1073/pnas.1209312109 [DOI] [PMC free article] [PubMed] [Google Scholar]
37.Vidal O., Longin R., Prigent-Combaret C., Dorel C., Hooreman M., and Lejeune P., “Isolation of an Escherichia coli K-12 mutant strain able to form biofilms on inert surfaces: Involvement of a new ompR allele that increases Curli expression,” J. Bacteriol. 180, 2442–2449 (1998); available at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC107187/. [DOI] [PMC free article] [PubMed] [Google Scholar]
38.Kagan B. L., Jang H., Capone R., Arce F. T., Ramachandran S., Lal R., and Nussinov R., “Antimicrobial properties of amyloid peptides,” Mol. Pharm. 9, 708–717 (2011). 10.1021/mp200419b [DOI] [PMC free article] [PubMed] [Google Scholar]
39.de Groot N. S., Torrent M., Villar-Piqué A., Lang B., Ventura S., Gsponer J., and Babu M. M., “Evolutionary selection for protein aggregation,” Biochem. Soc. Trans. 40, 1032–1037 (2012). 10.1042/bst20120160 [DOI] [PubMed] [Google Scholar]
40.Ahnert S. E., Marsh J. A., Hernández H., Robinson C. V., and Teichmann S. A., “Principles of assembly reveal a periodic table of protein complexes,” Science 350, aaa2245 (2015). 10.1126/science.aaa2245 [DOI] [PubMed] [Google Scholar]
41.Dayhoff J. E., Shoemaker B. A., Bryant S. H., and Panchenko A. R., “Evolution of protein binding modes in homooligomers,” J. Mol. Biol. 395, 860–870 (2010). 10.1016/j.jmb.2009.10.052 [DOI] [PMC free article] [PubMed] [Google Scholar]
42.Marsh J. A. and Teichmann S. A., “Protein flexibility facilitates quaternary structure assembly and evolution,” PLoS Biol. 12, e1001870 (2014). 10.1371/journal.pbio.1001870 [DOI] [PMC free article] [PubMed] [Google Scholar]
43.Wang G. and R. L. Dunbrack, Jr., “Pisces: A protein sequence culling server,” Bioinformatics 19, 1589–1591 (2003). 10.1093/bioinformatics/btg224 [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

The supplementary material file Supplementary Tables.xlsx contains the data in the four supplementary tables. The file Supplementary Table Descriptions.docx contains descriptions of these tables.

[c1] 1.Levy E. D. and Teichmann S., “Structural, evolutionary, and assembly principles of protein oligomerization,” in Progress in Molecular Biology and Translational Science (Elsevier, 2013), pp. 25–51. [DOI] [PubMed] [Google Scholar]

[c2] 2.Changeux J.-P., “Allostery and the Monod-Wyman-Changeux model after 50 years,” Annu. Rev. Biophys. 41, 103–133 (2012). 10.1146/annurev-biophys-050511-102222 [DOI] [PubMed] [Google Scholar]

[c3] 3.Eisenberg D. and Jucker M., “The amyloid state of proteins in human diseases,” Cell 148, 1188–1203 (2012). 10.1016/j.cell.2012.02.022 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c4] 4.Bergendahl L. T. and Marsh J. A., “Functional determinants of protein assembly into homomeric complexes,” Sci. Rep. 7, 4932 (2017). 10.1038/s41598-017-05084-8 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c5] 5.Jubb H. C., Pandurangan A. P., Turner M. A., ochoa-Montano B., Blundell T. L., and Asher S. B., “Mutations at protein-protein interfaces: Small changes over big surfaces have large impacts on human health,” Prog. Biophys. Mol. Biol. 128, 3–13 (2017). 10.1016/j.pbiomolbio.2016.10.002 [DOI] [PubMed] [Google Scholar]

[c6] 6.Oosawa F. and Kasai M., “A theory of linear and helical aggregations of macromolecules,” J. Mol. Biol. 4, 10–21 (1962). 10.1016/s0022-2836(62)80112-0 [DOI] [PubMed] [Google Scholar]

[c7] 7.Nyrkova I. A., Semenov A. N., Aggeli A., Bell M., Boden N., and McLeish T. C. B., “Self-assembly and structure transformations in living polymers forming fibrils,” Eur. Phys. J. B 17, 499–513 (2000). 10.1007/s100510070128 [DOI] [Google Scholar]

[c8] 8.Schmit J. D., Ghosh K., and Dill K., “What drives amyloid molecules to assemble into oligomers and fibrils?,” Biophys. J. 100, 450–458 (2011). 10.1016/j.bpj.2010.11.041 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c9] 9.van Gestel J., van der Schoot P., and Michels M., “Helical transition of polymer-like assemblies in solution,” J. Phys. Chem. B 105, 10691–10699 (2001). 10.1021/jp011733o [DOI] [Google Scholar]

[c10] 10.van Gestel J. and de Leeuw S. W., “A statistical-mechanical theory of Fibril formation in dilute protein solutions,” Biophys. J. 90, 3134–3145 (2006). 10.1529/biophysj.105.076000 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c11] 11.Lynch M., “Evolutionary diversification of the multimeric states of proteins,” Proc. Natl. Acad. Sci. U. S. A. 110, E2821–E2828 (2013). 10.1073/pnas.1310980110 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c12] 12.Lynch M., “The evolution of multimeric protein assemblages,” Mol. Biol. Evol. 29, 1353–1366 (2011). 10.1093/molbev/msr300 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c13] 13.Gillespie J. H., The Causes of Molecular Evolution (Oxford University Press, 1991). [Google Scholar]

[c14] 14.Lukatsky D. B., Zeldovich K. B., and Skakhnovich E. I., “Statistically enhanced self-attraction of random patterns,” Phys. Rev. Lett. 97, 178101 (2006). 10.1103/physrevlett.97.178101 [DOI] [PubMed] [Google Scholar]

[c15] 15.Lukatsky D. B., Skakhnovich B. E., Mintseris J., and Skakhnovich E. I., “Structural similarity enhances interaction propensity of proteins,” J. Mol. Biol. 365, 1596–1606 (2007). 10.1016/j.jmb.2006.11.020 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c16] 16.Ispolatov I., Yuryev A., Mazo I., and Maslov S., “Binding properties and evoution of homodimers in protein-protein interaction networks,” Nucleic Acids Res. 33, 3629–3635 (2005). 10.1093/nar/gki678 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c17] 17.Zeldovich K. B., Chen P., Shakhnovich B. E., and Shakhnovich E. I., “A first-principles model of early evolution: Emergence of gene families, species and preferred protein folds,” PLoS Comput. Biol. 3(7), e139 (2007). 10.1371/journal.pcbi.0030139 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c18] 18.Kortemmea T., Morozov A. V., and Baker D., “An orientation-dependent hydrogen bonding potential improves prediction of specificity and structure for proteins and protein-protein complexes,” J. Mol. Biol. 326, 1239–1259 (2003). 10.1016/s0022-2836(03)00021-4 [DOI] [PubMed] [Google Scholar]

[c19] 19.Zheng W., Schafer N. P., Davtyan A., Papoian G. A., and Wolynes P. G., “Predictive energy landscapes for protein protein association,” Proc. Natl. Acad. Sci. U. S. A. 109, 19244–19249 (2012). 10.1073/pnas.1216215109 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c20] 20.Gromiha M. M. and Yugandhar K., “Integrating computational methods and experimental data for understanding the recognition mechanism and binding affinity of protein-protein complexes,” Prog. Biophys. Mol. Biol. 128, 33–38 (2017). 10.1016/j.pbiomolbio.2017.01.001 [DOI] [PubMed] [Google Scholar]

[c21] 21.Miyazawa S. and Jernigan R. J., “Residue-residue potentials with a favourable contact pair term and an unfavourable high packing density term, for simulation and threading,” J. Mol. Biol. 256, 623–644 (1996). 10.1006/jmbi.1996.0114 [DOI] [PubMed] [Google Scholar]

[c22] 22.Kim Y. C. and Hummer G., “Coarse-grained models for simulations of multiprotein complexes: Application to ubiquitin binding,” J. Mol. Biol. 375, 1416–1433 (2008). 10.1016/j.jmb.2007.11.063 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c23] 23.Dignon G. L., Zheng W., Kim Y. C., Best R. B., and Mittal J., “Sequence determinants of protein phase behavior from a coarse-grained model,” PLoS Comput. Biol. 14(1), e1005941 (2018). 10.1371/journal.pcbi.1005941 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c24] 24.Taverna D. M. and Goldstein R. A., “Why are proteins so robust to site mutations?,” J. Mol. Biol. 315, 479–484 (2002). 10.1006/jmbi.2001.5226 [DOI] [PubMed] [Google Scholar]

[c25] 25.Williams P. D., Pollock D. D., and Goldstein R. A., “Functionality and the evolution of marginal stability in proteins: Inferences from lattice simulations,” Evol. Bioinf. 2, 117693430600200 (2006). 10.1177/117693430600200013 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c26] 26.Goldstein R. A. and Pollock D. D., “Sequence entropy of folding and the absolute rate of amino acid substitutions,” Nature Ecol. Evol. 1, 1923–1930 (2017). 10.1038/s41559-017-0338-9 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c27] 27.Betancourt M. R. and Thirumalai D., “Pair potentials for protein folding: Choice of reference states and sensitivity of predicted native states to variations in the interaction schemes,” Protein Sci. 8, 361–369 (1999). 10.1110/ps.8.2.361 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c28] 28.Betancourt M. R. and Thirumalai D., “Exploring the kinetic requirements for enhancement of protein folding rates in the GroEL cavity,” J. Mol. Biol. 287, 627–644 (1999). 10.1006/jmbi.1999.2591 [DOI] [PubMed] [Google Scholar]

[c29] 29.Bulmer M., “The selection-mutation-drift theory of synonymous codon usage,” Genetics 129, 897–907 (1991); available at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1204756/. [DOI] [PMC free article] [PubMed] [Google Scholar]

[c30] 30.Higgs P. G. and Ran W., “Coevolution of codon usage and tRNA genes leads to alternative stable states of biased codon usage,” Mol. Biol. Evol. 25, 2279–2291 (2008). 10.1093/molbev/msn173 [DOI] [PubMed] [Google Scholar]

[c31] 31.Manhart M. and Morozov A. V., “Protein folding and binding can emerge as evolutionary spandrels through structural coupling,” Proc. Natl. Acad. Sci. U. S. A. 112, 1797–1802 (2015). 10.1073/pnas.1415895112 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c32] 32.Albe K. R., Butler M. H., and Wright B. E., “Cellular concentrations of enzymes and their substrates,” J. Theor. Biol. 143, 163–195 (1990). 10.1016/s0022-5193(05)80266-8 [DOI] [PubMed] [Google Scholar]

[c33] 33.Milo R., “What is the total number of protein molecules per cell volume? A call to rethink some published values,” BioEssays 35, 1050–1055 (2013). 10.1002/bies.201300066 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c34] 34.Levy E. D., Kowarzyk J., and Michnick S. W., “High-resolution mapping of protein concentration reveals principles of proteome architecture and adaptation,” Cell Rep. 7, 1333–1340 (2014). 10.1016/j.celrep.2014.04.009 [DOI] [PubMed] [Google Scholar]

[c35] 35.Jones S. and Thornton J. M., “Analysis of protein-protein interaction sites using surface patches,” J. Mol. Biol. 272, 121–132 (1997). 10.1006/jmbi.1997.1234 [DOI] [PubMed] [Google Scholar]

[c36] 36.Levy E. D., De S., and Teichmann S. A., “Cellular crowding imposes global constraints on the chemistry and evolution of proteomes,” Proc. Natl. Acad. Sci. U. S. A. 109, 20461–20466 (2012). 10.1073/pnas.1209312109 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c37] 37.Vidal O., Longin R., Prigent-Combaret C., Dorel C., Hooreman M., and Lejeune P., “Isolation of an Escherichia coli K-12 mutant strain able to form biofilms on inert surfaces: Involvement of a new ompR allele that increases Curli expression,” J. Bacteriol. 180, 2442–2449 (1998); available at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC107187/. [DOI] [PMC free article] [PubMed] [Google Scholar]

[c38] 38.Kagan B. L., Jang H., Capone R., Arce F. T., Ramachandran S., Lal R., and Nussinov R., “Antimicrobial properties of amyloid peptides,” Mol. Pharm. 9, 708–717 (2011). 10.1021/mp200419b [DOI] [PMC free article] [PubMed] [Google Scholar]

[c39] 39.de Groot N. S., Torrent M., Villar-Piqué A., Lang B., Ventura S., Gsponer J., and Babu M. M., “Evolutionary selection for protein aggregation,” Biochem. Soc. Trans. 40, 1032–1037 (2012). 10.1042/bst20120160 [DOI] [PubMed] [Google Scholar]

[c40] 40.Ahnert S. E., Marsh J. A., Hernández H., Robinson C. V., and Teichmann S. A., “Principles of assembly reveal a periodic table of protein complexes,” Science 350, aaa2245 (2015). 10.1126/science.aaa2245 [DOI] [PubMed] [Google Scholar]

[c41] 41.Dayhoff J. E., Shoemaker B. A., Bryant S. H., and Panchenko A. R., “Evolution of protein binding modes in homooligomers,” J. Mol. Biol. 395, 860–870 (2010). 10.1016/j.jmb.2009.10.052 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c42] 42.Marsh J. A. and Teichmann S. A., “Protein flexibility facilitates quaternary structure assembly and evolution,” PLoS Biol. 12, e1001870 (2014). 10.1371/journal.pbio.1001870 [DOI] [PMC free article] [PubMed] [Google Scholar]

[c43] 43.Wang G. and R. L. Dunbrack, Jr., “Pisces: A protein sequence culling server,” Bioinformatics 19, 1589–1591 (2003). 10.1093/bioinformatics/btg224 [DOI] [PubMed] [Google Scholar]

PERMALINK

Evolution of protein interfaces in multimers and fibrils

W Jeffrey Zabel

Kyle P Hagner

Benjamin J Livesey

Joseph A Marsh

Sima Setayeshgar

Michael Lynch

Paul G Higgs

Abstract

I. INTRODUCTION

II. CALCULATION OF INTERFACE ENERGIES

FIG. 1.

III. CALCULATION OF AGGREGATION PROBABILITIES

IV. EVOLUTIONARY COMPUTATIONS

V. PHENOTYPE DISTRIBUTIONS

FIG. 2.

TABLE I.

FIG. 3.

FIG. 4.

VI. PROPERTIES OF PROTEIN INTERFACES

FIG. 5.

VII. INTERFACE PROPENSITIES OF AMINO ACIDS

FIG. 6.

FIG. 7.

FIG. 8.

FIG. 9.

VIII. DISCUSSION AND CONCLUSIONS

SUPPLEMENTARY MATERIAL

ACKNOWLEDGMENTS

REFERENCES

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

Evolution of protein interfaces in multimers and fibrils

W Jeffrey Zabel

Kyle P Hagner

Benjamin J Livesey

Joseph A Marsh

Sima Setayeshgar

Michael Lynch

Paul G Higgs

Abstract

I. INTRODUCTION

II. CALCULATION OF INTERFACE ENERGIES

FIG. 1.

III. CALCULATION OF AGGREGATION PROBABILITIES

IV. EVOLUTIONARY COMPUTATIONS

V. PHENOTYPE DISTRIBUTIONS

FIG. 2.

TABLE I.

FIG. 3.

FIG. 4.

VI. PROPERTIES OF PROTEIN INTERFACES

FIG. 5.

VII. INTERFACE PROPENSITIES OF AMINO ACIDS

FIG. 6.

FIG. 7.

FIG. 8.

FIG. 9.

VIII. DISCUSSION AND CONCLUSIONS

SUPPLEMENTARY MATERIAL

ACKNOWLEDGMENTS

REFERENCES

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases