Proceedings of the National Academy of Sciences of the United States of America. 2010 Apr 26;107(19):8623–8626. doi: 10.1073/pnas.1001299107

Global characteristics of protein sequences and their implications

S Rackovsky
PMCID: PMC2889366  PMID: 20421501

Abstract

Computational studies of the relationships between protein sequence, structure, and folding have traditionally relied on purely local sequence representations. Here we show that global representations, based on parameters that encode information about complete sequences, contain otherwise inaccessible information about the organization of sequences. By studying the spectral properties of these parameters, we demonstrate that amino acid physical properties fall into two distinct classes. One class comprises properties that favor sequentially localized interaction clusters. The other comprises properties that favor globally distributed interactions. This observation provides a bridge between two classic models of protein folding (the collapse model and the nucleation model) and a basis for understanding how any degree of intermediacy between these two extremes can occur.

Keywords: proteomics, sequence analysis


Bioinformatic studies of protein sequences have concentrated almost exclusively on their local properties. The relationship between local sequence properties and local folding has been extensively examined. Sequence homology studies have concentrated on developing methods for establishing local equivalences between corresponding residues in pairs of sequences. It has become increasingly clear, however, that a purely local view of protein sequences is not adequate. In a number of recent studies (1–6), we have demonstrated quantitatively that there are intrinsic limitations to the informatic power of local descriptions of protein sequence, particularly with respect to the encoding of structural information. It is clear, however, that sequence does completely determine protein structure, and it therefore follows that folding instructions must be encrypted in global, rather than local, sequence information. In the present work, we discuss some fundamental global properties of protein sequences and examine their implications for mechanisms of protein folding.

Model

A necessary preliminary to any meaningful discussion of sequence characteristics is the conversion of protein sequences into a numerical form amenable to systematic analysis. We follow a procedure set forth in previous work (7–9), using the 10 Kidera property factors (10, 11), which form an orthonormal and essentially complete basis set for the known physical properties of the amino acids, to represent an amino acid as a 10-vector. (The Kidera factors are given in Table 1.) A complete protein sequence is then represented by a set of 10 N-member numerical strings, each of which records the course of one property factor along the N-residue sequence. These strings can be Fourier transformed, leading to a representation of the sequence by a set of sine and cosine Fourier coefficients. Each of these coefficients, which is labeled by a wave number k and a property identifier l, encodes information about the entire sequence of the protein. Furthermore, the Fourier components are determined by information associated with different intrinsic length scales in the sequence: coefficients with wave number k contain information (7) about structural features of size ∼N/k. The Fourier decomposition of a sequence is therefore, by construction, complete and orthonormal with respect to both physical properties and wave number. This approach provides a method for systematically studying the presence in each property factor of features on specific scales and for doing so in a uniform manner in sequences of varying lengths. (This scaling is an advantage of working in k space. Inspection of features at a specified length scale along the sequence would, of course, involve different k values in sequences of different length.)
In this respect it differs from Fourier methods and other periodicity-based approaches previously proposed by a number of workers (12–18), who have used these tools to examine the role of sequence in local structure formation and particularly in studying the role of hydrophobicity in protein sequences. Recent studies (19) have used simple potentials to relate sequence and secondary structure prediction in model proteins. The present work is distinct both in the nature of the sequence representation used and in presenting a systematic study of the complete set of physical properties over a wide range of wave numbers.
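
The sequence representation described above can be illustrated with a minimal sketch. It assumes a plain discrete sine and cosine transform normalized by 1/N (the normalization used in the original work is not stated here), and it uses a hypothetical two-factor table, `TOY_FACTORS`, whose values are invented for illustration and are not the published Kidera factors. The function names are likewise hypothetical.

```python
import math

def property_strings(sequence, factor_table):
    """Convert a sequence into one numerical string per property factor."""
    n_factors = len(next(iter(factor_table.values())))
    return [[factor_table[res][l] for res in sequence] for l in range(n_factors)]

def fourier_coefficients(values, k):
    """Return the (cosine, sine) coefficients of a numerical string at wave number k.

    The 1/N normalization is an assumption made for this sketch.
    """
    n_res = len(values)
    cos_c = sum(v * math.cos(2 * math.pi * k * n / n_res) for n, v in enumerate(values)) / n_res
    sin_c = sum(v * math.sin(2 * math.pi * k * n / n_res) for n, v in enumerate(values)) / n_res
    return cos_c, sin_c

# Invented values for four residue types and two "factors" (NOT Kidera factors).
TOY_FACTORS = {"A": (1.0, 0.2), "G": (0.0, -0.4), "L": (-1.0, 0.3), "K": (0.0, -0.1)}

sequence = "AGLK" * 6  # N = 24 residues with a repeat every 4 residues
strings = property_strings(sequence, TOY_FACTORS)

# Combined sine and cosine power of the first toy factor for 0 <= k <= N/2.
power = {k: sum(c * c for c in fourier_coefficients(strings[0], k)) for k in range(13)}
peak_k = max(power, key=power.get)
feature_size = len(sequence) / peak_k  # length scale ~N/k associated with the peak
```

In this toy case the repeat of 4 residues in a 24-residue string shows up as a peak at k = 6, consistent with the ∼N/k relation between wave number and feature size.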

Table 1.

The Kidera property factors

1. Helix/bend preference
2. Side-chain size
3. Extended structure preference
4. Hydrophobicity
5. Double-bend preference
6. Partial specific volume
7. Flat extended preference
8. Occurrence in alpha region
9. pK-C
10. Surrounding hydrophobicity

The first four factors are essentially pure physical properties; the remaining six factors are superpositions of several physical properties and are labeled for convenience by the name of the most heavily weighted component.

In recent work (9) we studied the informatic properties of the k = 0 Fourier coefficient. It was demonstrated that this component contains information about protein sequences that correctly encodes the structural relationships between proteins. This finding is particularly remarkable in view of the fact that the k = 0 coefficient contains information about sequence composition but not about the actual sequential arrangement of residues along the chain. In the present work we ask how information is encoded by Fourier components of sequences with k > 0. We are particularly interested in relating the k-space properties of protein sequences to the folding mechanisms of proteins. We formulate this interest in terms of three specific questions:

  • At what values of k are unusually large Fourier coefficients observed, for each property factor?

  • Do the observed occurrences of large Fourier coefficients form coherent patterns?

  • What are the implications of these patterns for folding mechanisms?

Our methodology is straightforward. Because we are interested only in the magnitudes of the Fourier coefficients, we study the behavior of the sine and cosine power spectra, the elements of which are squared Fourier coefficients. It is necessary to determine whether the observed value of a power spectral element differs significantly from that one would expect at random. We determine the statistical significance of spectral magnitudes by calculating Z functions for the power spectra:

$$Z_k^{(l)} = \frac{\left[a_k^{(l)}\right]^2 - \left\langle \left[a_k^{(l)}\right]^2 \right\rangle_P}{\sigma} \tag{1}$$

Here $a_k^{(l)}$ is the sine or cosine Fourier coefficient with wave number k for property l (where 1 ≤ l ≤ 10), the subscripted brackets denote an average over all possible permutations of the sequence, and σ denotes the standard deviation of the power spectral element over the ensemble of possible sequence permutations. By measuring the value of the power spectral element relative to the expected value over all sequence permutations, we determine the contribution of the specific sequence to the power spectrum, beyond that provided by sequence amino acid composition. The Z function for k > 0 differs in this respect from the k = 0 Fourier coefficient and provides information complementary to that obtained at k = 0. Determination of the Z functions was greatly simplified by the fact that the averages and standard deviations in Eq. 1 can be calculated analytically and exactly (8).
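
As a rough illustration of the Z function, the sketch below estimates the permutation mean and standard deviation by random shuffling. This is a stand-in chosen for brevity: the text states that these quantities were obtained analytically and exactly (8), and that derivation is not reproduced here. The function names, the unnormalized cosine coefficient, the number of shuffles, and the synthetic input string are all assumptions.

```python
import math
import random
import statistics

def cosine_power(values, k):
    """Squared cosine Fourier coefficient at wave number k (unnormalized, by assumption)."""
    n_res = len(values)
    coeff = sum(v * math.cos(2 * math.pi * k * n / n_res) for n, v in enumerate(values))
    return coeff * coeff

def z_function(values, k, n_shuffles=2000, seed=0):
    """Estimate Z for the cosine power at k from randomly sampled permutations.

    Sampling replaces the exact analytical averages used in the source;
    shuffling preserves composition, so Z reflects residue order only.
    """
    rng = random.Random(seed)
    shuffled = list(values)
    samples = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        samples.append(cosine_power(shuffled, k))
    mean = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)
    return (cosine_power(values, k) - mean) / sigma

# Synthetic property string with a built-in periodicity at k = 3 (illustrative only).
n_res = 60
periodic = [math.cos(2 * math.pi * 3 * n / n_res) for n in range(n_res)]

z_at_3 = z_function(periodic, k=3)  # wave number matching the built-in periodicity
z_at_7 = z_function(periodic, k=7)  # wave number with no corresponding feature
```

With this input, `z_at_3` is large and positive while `z_at_7` is not, which is the behavior the Z function is designed to detect.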

We define a signal in the power spectrum by the equation

[Equation 2: threshold condition on the Z function of Eq. 1 that defines a signal; equation image not recovered.] [2]

This condition is the standard criterion for a power spectral element that is larger than average at the 5% significance level. We then seek those values of k and l at which spectral signals are observed with high frequency.

To study this question, a dataset was assembled from the CATH database (20, 21). The dataset was based on the CathDomainSeqs.S35.ATOM.v3.1.020 subset of CATH, which was constructed to be representative of the entire database while containing no sequence pairs with identity greater than 35%. The entire dataset thus lies in the “twilight zone” and contains no pairs that can be considered to be homologous in the traditional sense. This subset of CATH was edited in order to remove all sequences with missing segments or sequence uncertainties. This redaction left 7,056 sequences, each of which was subjected to the Fourier and spectral analyses described above.

The first two questions we have posed can be answered by counting the occurrences of spectral signals in the sequences of the dataset as a function of k and l. By the properties of the Fourier transform, the presence of a signal in a given sequence, at given k and l, is independent of the presence of signals in the same sequence at different values of k and l, or in other sequences. Therefore, the statistical significance of the observed number of signals at k, compared to the average number observed over all values of k, can be calculated by using Bernoulli statistics. We have determined this significance for a wide range of k (1 ≤ k ≤ 60), for each of the property factors (1 ≤ l ≤ 10). For some values of k and l, the observed number of signals Ns(k,l) will be significantly larger than average, and for others Ns(k,l) will be significantly smaller than average.
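
The counting argument above can be sketched as follows. The example treats each sequence as an independent Bernoulli trial with the signal rate pooled over all k, and uses a normal approximation to the binomial distribution with a two-sided critical value of 1.96. Both the approximation and the critical value are assumptions of this sketch, and the counts are invented for illustration, not taken from the dataset.

```python
import math

def classify_signal_counts(counts, n_sequences, z_crit=1.96):
    """Label each wave number as above (+1), below (-1), or at (0) the average signal usage.

    counts[i] is the number of sequences showing a signal at the i-th wave number.
    The pooled rate over all wave numbers serves as the Bernoulli success probability.
    """
    p_avg = sum(counts) / (len(counts) * n_sequences)
    expected = n_sequences * p_avg
    sd = math.sqrt(n_sequences * p_avg * (1.0 - p_avg))
    labels = []
    for c in counts:
        z = (c - expected) / sd  # normal approximation to the binomial
        labels.append(1 if z > z_crit else -1 if z < -z_crit else 0)
    return labels

# Invented counts for five wave numbers in a hypothetical set of 1,000 sequences.
toy_counts = [80, 52, 50, 48, 20]
labels = classify_signal_counts(toy_counts, n_sequences=1000)
```

Here the pooled rate is 0.05, so 50 signals are expected at each wave number; only the first and last counts deviate enough to be labeled.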

Results

In Fig. 1 we summarize the variation with k of the number of signals (in combined data from both sine and cosine spectra). Rather than plotting Ns(k,l), we plot, as a function of k, the values of an auxiliary function λ(k,l), which takes the value −1 if signal usage is significantly lower than average, +1 if usage is significantly higher than average, and 0 if usage does not differ significantly from average. The following points are evident by inspection:

  • Four of the property factors have very pronounced significance patterns, which fall into one of two classes. The first two property factors exhibit runs of high signal usage at low k and low signal usage at higher values of k. Conversely, factors 3 and 4 exhibit runs of low signal usage at low values of k, and property factor 4 exhibits a particularly pronounced run of high signal usage over a large range of elevated values of k.

  • Deviations from average for the remaining six property factors occur as isolated cases, and strong patterns are not as clearly visible by inspection. There are, however, hints in these properties too of behavior corresponding to one or the other of the two classes.

Fig. 1.

Plots of an auxiliary function that represents the statistical significance of signal usage (in the combined sine and cosine power spectra) for each of the 10 property factors. The function has the value -1 when signal usage is significantly lower than the average over all values of k, 1 when signal usage is significantly higher than average, and 0 when signal usage is indistinguishable statistically from the average.

In order to investigate quantitatively the accuracy of these empirical impressions, we measured pairwise distances between the signal usage patterns λ(k,l) by using a correlation function metric:

$$C(l,l') = \frac{\overline{\lambda(k,l)\,\lambda(k,l')} - \overline{\lambda(k,l)}\;\overline{\lambda(k,l')}}{\sqrt{\left[\overline{\lambda(k,l)^2} - \overline{\lambda(k,l)}^{\,2}\right]\left[\overline{\lambda(k,l')^2} - \overline{\lambda(k,l')}^{\,2}\right]}} \tag{3}$$

where the overbar indicates an average over all values of k. The matrix of correlation coefficients is a similarity matrix for the set of functions {λ(k,l)} and can be used as input for a message passing algorithm (22). Message passing is distinct from other clustering methods in that it does not require the presupposition of a specified number of clusters but rather determines the number of clusters present directly from the input data. We find that the 10 signal usage spectra fall into two classes, corresponding exactly to the visual impression produced by the plots in Fig. 1. These are

[Membership lists for classes C1 and C2; equation images not recovered. As stated in the Discussion, property factors 1 and 2 belong to C1 and property factors 3 and 4 belong to C2.]

C1 comprises those property factors that exhibit statistically elevated signal usage at low values of k and low usage at high k. C2 comprises those property factors that exhibit the opposite behavior.
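
The similarity matrix that feeds the clustering step can be sketched as below, assuming the correlation metric is the ordinary Pearson coefficient with averages taken over k. The four patterns are invented three-valued strings, not the published signal usage spectra, and their names are hypothetical. The message passing step itself (22) is not implemented here; the matrix is only the input it would receive.

```python
import math

def correlation(x, y):
    """Pearson correlation of two equal-length patterns, averaging over wave number.

    Raises ZeroDivisionError for a constant pattern, which has zero variance.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    return cov / math.sqrt(var_x * var_y)

# Invented usage patterns over eight wave numbers (values in {-1, 0, +1}).
patterns = {
    "low_k_a": [1, 1, 1, 0, 0, -1, -1, -1],
    "low_k_b": [1, 1, 0, 0, 0, -1, -1, 0],
    "high_k_a": [-1, -1, 0, 0, 1, 1, 1, 1],
    "high_k_b": [-1, 0, 0, 0, 0, 1, 1, 1],
}
names = list(patterns)

# Symmetric similarity matrix; rows and columns follow the order of `names`.
similarity = [[correlation(patterns[a], patterns[b]) for b in names] for a in names]
```

Patterns with elevated usage at the same end of the k range correlate positively and those at opposite ends correlate negatively, which is the structure a clustering algorithm would then separate into two groups.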

Discussion

These two distinct patterns of k dependence correspond to distinctly different physical behaviors. As we noted above, a signal in property l at wave number k = k0 arises from the existence of physical features in the sequence on a length scale ∼N/k0. It follows that large values of the properties in class C1 are distributed preferentially in a few sequentially long regions, separated by long regions in which the property value is low. Conversely, large values of the properties in C2 are preferentially distributed in many small, closely spaced regions. These contrasting behaviors are illustrated in Fig. 2. Large values of a particular property factor imply strong interactions arising from a corresponding term in the intramolecular potential energy. The two behavior types therefore correspond to different physical interaction patterns. These different interaction patterns in turn are likely to lead to different folding mechanisms. Folding governed by properties in C1 will take place under the influence of interactions strongly localized in a limited number of well-separated regions, leading to folding by a nucleation-like mechanism. Folding governed by properties in C2 will take place under the influence of interactions in small regions distributed proximally along the entire length of the chain. These regions can interact with neighboring regions, leading to delocalized folding modes (“collapse”), in much the same way that periodic nearest-neighbor interactions in solids lead to delocalized collective excitations.

Fig. 2.

The difference in behavior between signals at low and high values of the wave number k. The area shaded in black in each case is the region in which the value of the signal is above average. In (A) k is low, and the regions of high value extend over relatively large intervals of the x axis but are separated by large intervals. In (B) k is large, and the regions of high value are narrow and separated by small intervals.
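
The contrast drawn in Fig. 2 can be reproduced numerically with a small sketch. It assumes idealized pure sine waves of 120 points as stand-ins for property strings, with a half-step offset added only to avoid values that are exactly zero; the function names are hypothetical.

```python
import math

def above_average_runs(values):
    """Return the lengths of contiguous stretches whose value exceeds the string's mean."""
    mean = sum(values) / len(values)
    runs, current = [], 0
    for v in values:
        if v > mean:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs

def wave(n_res, k):
    """Pure sine wave with wave number k over n_res positions (half-step offset avoids zeros)."""
    return [math.sin(2 * math.pi * k * (n + 0.5) / n_res) for n in range(n_res)]

low_k_runs = above_average_runs(wave(120, 2))    # few long above-average regions
high_k_runs = above_average_runs(wave(120, 12))  # many short, closely spaced regions
```

At k = 2 the above-average values fall into two regions of 30 positions each, whereas at k = 12 they fall into twelve regions of 5 positions each, matching the ∼N/k scaling discussed in the text.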

With this picture in mind, we note that Kidera et al. showed (10, 11) that, of the 10 property factors, the first four carry the largest part (68%) of the variance of the dataset and that those four principal property factors are essentially single amino acid properties. Two of these principal factors occur in class C1, the class of localized interactions—helix/bend preference and side-chain size. The remaining principal property factors—extended structure preference and hydrophobicity (10, 11)—fall in C2, the class of collective, delocalized interactions. The latter observation is particularly intriguing, because it suggests that the formation of extended structure may occur by a collective mechanism that shares certain underlying physical features with hydrophobic collapse.

We have demonstrated the existence of two types of physical properties, with clear differences in global sequence behavior, which we suggest favor the two classic, diametrically opposite prototypical folding mechanisms—nucleation and collapse. In specific sequences, of course, multiple signals in properties from both classes may be present simultaneously, and the balance between the strengths of their corresponding interactions, and the relationships between their wave numbers, will determine the folding mechanism of the protein. This approach provides a unified framework in which an entire range of folding mechanisms can be induced. We are continuing to investigate the implications of these observations.

Acknowledgments.

This work was supported by the National Library of Medicine of the National Institutes of Health, through Grant LM06789.

Footnotes

The authors declare no conflict of interest.

*This Direct Submission article had a prearranged editor.

References

  • 1. Rackovsky S. On the nature of the protein folding code. Proc Natl Acad Sci USA. 1993;90:644–648. doi: 10.1073/pnas.90.2.644.
  • 2. Rackovsky S. On the existence and implications of an inverse folding code in proteins. Proc Natl Acad Sci USA. 1995;92:6861–6863. doi: 10.1073/pnas.92.15.6861.
  • 3. Solis AD, Rackovsky S. Optimized representations and maximal information in proteins. Proteins. 2000;38:149–164.
  • 4. Solis AD, Rackovsky S. Optimally informative backbone structural propensities in proteins. Proteins. 2002;48:463–486. doi: 10.1002/prot.10126.
  • 5. Solis AD, Rackovsky S. On the use of secondary structure in protein structure prediction: A bioinformatic analysis. Polymer. 2004;45:525–546.
  • 6. Solis AD, Rackovsky S. Property-based sequence representations do not adequately encode local protein folding information. Proteins. 2007;67:785–788. doi: 10.1002/prot.21434.
  • 7. Rackovsky S. “Hidden” sequence periodicities and protein architecture. Proc Natl Acad Sci USA. 1998;95:8580–8584. doi: 10.1073/pnas.95.15.8580.
  • 8. Rackovsky S. Characterization of architecture signals in proteins. J Phys Chem B. 2006;110:18771–18778. doi: 10.1021/jp0575097.
  • 9. Rackovsky S. Sequence physical properties encode the global organization of protein structure space. Proc Natl Acad Sci USA. 2009;106:14345–14348. doi: 10.1073/pnas.0903433106.
  • 10. Kidera A, et al. Statistical analysis of the physical properties of the 20 naturally occurring amino acids. J Protein Chem. 1985;4:23–55.
  • 11. Kidera A, et al. Relation between sequence similarity and structural similarity in proteins: Role of important properties of amino acids. J Protein Chem. 1985;4:265–297.
  • 12. Eisenberg D, et al. The hydrophobic moment detects periodicity in protein hydrophobicity. Proc Natl Acad Sci USA. 1984;81:140–144. doi: 10.1073/pnas.81.1.140.
  • 13. Xiong H, et al. Periodicity of polar and nonpolar amino acids is the major determinant of secondary structure in self-assembling oligomeric peptides. Proc Natl Acad Sci USA. 1995;92:6349–6353. doi: 10.1073/pnas.92.14.6349.
  • 14. West MW, Hecht MH. Binary patterning of polar and nonpolar amino acids in the sequences and structures of native proteins. Protein Sci. 1995;4:2032–2039. doi: 10.1002/pro.5560041008.
  • 15. Broome BM, Hecht MH. Nature disfavors sequences of alternating polar and non-polar amino acids: Implications for amyloidogenesis. J Mol Biol. 2000;296:961–968. doi: 10.1006/jmbi.2000.3514.
  • 16. Cornette JL, et al. Hydrophobicity scales and computational techniques for detecting amphipathic structures in proteins. J Mol Biol. 1987;195:659–685. doi: 10.1016/0022-2836(87)90189-6.
  • 17. Murray KB, Gorse D, Thornton JM. Wavelet transforms for the characterization and detection of repeating motifs. J Mol Biol. 2002;316:341–363. doi: 10.1006/jmbi.2001.5332.
  • 18. Giuliani A, et al. Nonlinear signal analysis methods in the elucidation of protein sequence-structure relationships. Chem Rev. 2002;102:1471–1491. doi: 10.1021/cr0101499.
  • 19. Bellesia G, Jewett AI, Shea J-E. Sequence periodicity and secondary structure propensity in model proteins. Protein Sci. 2010;19:141–154. doi: 10.1002/pro.288.
  • 20. Orengo CA, et al. CATH: A hierarchic classification of protein domain structures. Structure. 1997;5:1093–1108. doi: 10.1016/s0969-2126(97)00260-8.
  • 21. http://www.cathdb.info/wiki/doku.php?id=data:index.
  • 22. Frey BJ, Dueck D. Clustering by passing messages between data points. Science. 2007;315:972–976. doi: 10.1126/science.1136800.
