Abstract
Here, we describe a unique probabilistic evaluation of the 20 naturally occurring amino acids and their distributions within the Swiss-Prot and Complete Human Genebank databases. We have developed a computational technique that imparts both directionality and length constraints into searches for unique combinations of amino acids within protein sequences. Using statistical approaches, we have carried out searches of all possible two- and three-residue motifs contained within these databases. This technique is based on the unusually high occurrence of a small number of these motifs when compared to the expected probability of finding a specific residue grouping within a given database. Subsequent filtering of this search to identify such unique combinations has provided several examples that can be used as markers to identify particular proteins within or across databases. We focus on the three motifs that were of greatest interest to us: CC, CM and their combination, CCM. Each occurs either more or less frequently than would be predicted from the standard amino acid distributions within the entire human proteome.
Keywords: Amino acid motif, Fast detection method, Significance test
1. Introduction
The primary sequence of a protein refers to an ordered string of amino acids that belong to a finite set. It is therefore natural to expect that the distribution of amino acids, across all protein sequences, would appear to be fixed, given no mutations over a specific period of time. Of the 20 naturally occurring amino acids, each seems to appear randomly located at a particular position within the sequence of a protein. However, when the sequences of a group of proteins are compared to one another, specific patterns begin to emerge. These patterns are commonly referred to as motifs and frequently produce similar structural elements and, in many cases, similar functions between the proteins that contain them. In recent years, significant efforts have been made to reduce sequences into discrete motifs, so that each can be linked to some specific biological function based on particular rules [1], [2]. Databases that consist of identifiable amino acid motifs, such as PROSITE, Pratt and EMOTIF, have been constructed to allow researchers to easily classify unknown proteins into families based on their common motifs [3], [4], [5], [6], [7]. All of these methods share one property: they depend on specific alignment algorithms, such as the basic local alignment search tool (BLAST or PSI-BLAST), to perform the sequence comparison between the diverse families of proteins being analyzed [5], [8]. These methods have already led to a number of key discoveries and offer the promise of determining biologically meaningful information about function from a protein's primary sequence.
It is generally accepted that a strong correlation exists between a protein's amino acid sequence and its resulting structure and function [9], [10], [11]. Techniques that can quickly determine significant similarities between protein sequences are valuable in predicting an unknown protein's function within the cell. Here we describe a procedure that, by computing the frequency of key amino acid pairs, can quickly classify large groups of proteins into functional groups. In order to uncover any relevant amino acid strings, we have utilized statistical methods applicable to datasets with large fluctuations in the search of human protein sequences within both the Swiss-Prot and Complete Human Genebank databases.
The goal of this manuscript is to demonstrate how these two residue motifs can be used to develop rapid methods for searching protein or genomic databases in order to locate potentially interesting or aberrant protein sequences. As an example, we describe the distribution of select amino acid pairs within the genome of the Coronavirus, Tor2, which is responsible for severe acute respiratory syndrome (SARS) [12]. We have examined additional sets, but the limited format of this paper precludes a large presentation of potential applications.
2. Methods
2.1. Breakdown of sequences
The set of all 20 naturally occurring amino acids, described by their standard single-letter representations, is taken as the complete set contained within all protein sequences, as shown below:
$$V_{20} = \{A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V\} \tag{2.1}$$
Residues that are arranged in a certain order can be referred to as vectors and are represented symbolically as
$$b^{(k)} = (b_1, b_2, \ldots, b_k), \qquad b_i \in V_{20} \tag{2.2}$$
The coordinates of each vector represent the amino acid symbols contained within any given protein, where $k$ is the total length of the vector. The collection of all such vectors in the proteome can therefore be denoted simply by $V_{20}^{(k)}$.
The underlying structure of a protein sequence, which is the order of the amino acids, has unique and identifiable characteristics. As an analogy, in written English, the letter q is almost always followed by u. Also, the frequency of the word “and” is much higher than that of the related three-letter strings: “nad”, “nda”, “dan” and “dna”, which are simply permutations of the same set of individual letters to construct a word with a different meaning. Utilizing similar characteristics, we can intuitively scan for unique sequence vectors contained within any given protein. We expand on these characteristics, by utilizing the concepts of affinity and order stability in Section 2.2.
Using the entire Swiss-Prot database (release number 43.0) as the basis for our calculations, we have constructed a probability table for the occurrence of the 20 naturally occurring amino acids (see Table 1) [13]. In addition to the Swiss-Prot database, we have also included data from the Complete Human Genebank (NCBI Build 34, March 2004) as a source for all characterized and hypothetical human protein sequences.
Table 1.
[0.0704] | [0.0231] | [0.0484] | [0.0692] | [0.0378] |
[0.0675] | [0.0256] | [0.0450] | [0.0565] | [0.0984] |
[0.0237] | [0.0368] | [0.0610] | [0.0465] | [0.0552] |
[0.0799] | [0.0534] | [0.0613] | [0.0121] | [0.0282] |
The occurrence of each individual amino acid has been averaged over the total number of amino acids found within each of the databases.
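A frequency table of the kind shown in Table 1 could be computed along the following lines. This is a minimal sketch rather than the authors' code: the function name `residue_frequencies` and the two toy sequences are hypothetical, and a real run would iterate over every sequence in the database release instead.

```python
from collections import Counter

# The 20 standard amino acids in their single-letter representation.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def residue_frequencies(sequences):
    """Return the relative frequency of each standard residue,
    averaged over the total number of residues in all sequences."""
    counts = Counter()
    for seq in sequences:
        # Ignore any symbol outside the 20-letter alphabet (e.g. X, B, Z).
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard residues found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}

# Hypothetical toy input: 10 residues in total (4 C, 3 A, 2 M, 1 K).
toy_sequences = ["MACCK", "CCMAA"]
freqs = residue_frequencies(toy_sequences)
print({aa: f for aa, f in freqs.items() if f > 0})
```

Pooling all residues before dividing, as done here, weights long proteins more heavily than short ones; averaging per-protein frequencies would be a different estimator.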
2.2. Mathematical background
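We use the symbol $\Omega$ to denote the Swiss-Prot database, and the letter $m$ to denote the number of proteins in the database (133,723). Hence we can state that $\Omega = \{C_s : s = 1, 2, \ldots, m\}$, where $C_s = (c_{s,1}, c_{s,2}, \ldots, c_{s,n_s})$ is the $s$th protein, $c_{s,i}$ is the $i$th amino acid in the sequence, and $n_s$ is the length of the protein sequence $C_s$. For any vector $b^{(k)}$, if there is a sequence $C_s$ in $\Omega$ such that $b^{(k)}$ is a segment of $C_s$, then we can say that $b^{(k)}$ occurs in $\Omega$. We can then use $n(b^{(k)})$ to denote the total number of times that $b^{(k)}$ has occurred in $\Omega$, and let $n_k$ be the total number of all vectors occurring in $\Omega$ with length $k$. Therefore,
$$p(b^{(k)}) = \frac{n(b^{(k)})}{n_k} \tag{2.3}$$
is the probability that the vector $b^{(k)}$ occurs in $\Omega$. We then introduce
$$\kappa(p, q; b^{(k)}) = \log \frac{p(b^{(k)})}{q(b^{(k)})} \tag{2.4}$$
where $p$ and $q$ are any two probability distributions on $V_{20}^{(k)}$, and $b^{(k)}$ belongs to $V_{20}^{(k)}$. The quantity in Eq. (2.4) is defined for any two probability distributions $p$ and $q$ on $V_{20}^{(k)}$, where the distribution $q$ arises from independent letter (i.e., amino acid) probabilities. Then, $\kappa$ can be viewed as a random variable $\xi$ on $V_{20}^{(k)}$ and its probability distribution is defined as follows:
$$P\{\xi = \kappa(p, q; b^{(k)})\} = p(b^{(k)}), \qquad b^{(k)} \in V_{20}^{(k)} \tag{2.5}$$
The expectation value and variance of $\xi$ are expressed mathematically as
$$\mu(p, q) = \sum_{b^{(k)} \in V_{20}^{(k)}} p(b^{(k)}) \log \frac{p(b^{(k)})}{q(b^{(k)})} \tag{2.6}$$
$$\sigma^2(p, q) = \sum_{b^{(k)} \in V_{20}^{(k)}} p(b^{(k)}) \left[\kappa(p, q; b^{(k)}) - \mu(p, q)\right]^2 \tag{2.7}$$
Hence we see that $\mu(p, q)$ is simply the Kullback–Leibler entropy [14]. As a consequence, $\mu(p, q)$ is always non-negative and $\mu(p, q) = 0$ if and only if $p = q$. As a result, the amino acid sequence of can be represented by .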
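The log-ratio of Eq. (2.4) and its expectation and variance, Eqs. (2.6) and (2.7), can be illustrated numerically. The sketch below assumes the log-ratio has the form log(p/q) implied by the Kullback–Leibler remark, and it assumes base-2 logarithms; the two distributions are hypothetical values over a two-letter alphabet, chosen only to show that the mean is positive when p differs from q and zero when they coincide.

```python
import math

def log_ratio(p_b, q_b):
    """Log-ratio for a single vector (base 2 is an assumption)."""
    return math.log2(p_b / q_b)

def mean_and_variance(p, q):
    """Expectation and variance of the log-ratio when vectors are drawn from p.

    p and q map each vector (a string) to its probability; every vector
    with p > 0 must also have q > 0.
    """
    support = [(b, pb) for b, pb in p.items() if pb > 0]
    mean = sum(pb * log_ratio(pb, q[b]) for b, pb in support)
    var = sum(pb * (log_ratio(pb, q[b]) - mean) ** 2 for b, pb in support)
    return mean, var

# Hypothetical observed pair distribution p and an independent model q
# (each letter with probability 0.5, so each pair has probability 0.25).
p = {"CC": 0.4, "CM": 0.1, "MC": 0.1, "MM": 0.4}
q = {"CC": 0.25, "CM": 0.25, "MC": 0.25, "MM": 0.25}

mean, var = mean_and_variance(p, q)
same_mean, same_var = mean_and_variance(p, p)
print(mean, var, same_mean, same_var)
```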
In Eq. (2.4), the probability distributions , in the have several possible values. For example, if we let be the probability of , and only change the value of in , we obtain the following expression:
(2.8) |
Based on Eqs. (2.4) and (2.8), the value of gives us the affinity for all amino acids within vector . Here two situations are identified as especially interesting:
(i) When this value is positive, there is an increased affinity among the amino acids in the vector. That is, the vector occurs more frequently than expected.
(ii) When this value is negative, there is a decreased affinity among the amino acids in the vector. That is, the vector occurs less frequently than expected.
In a similar approach to the condition described above, we may instead take the second distribution in Eq. (2.4) to be the probability of a permutation of the vector, that is, of the same residues arranged in a different order. In this case, the order sensitivity of the vector is reflected by the value of Eq. (2.4), which produces the following outcomes:
(i) when the value is much greater than zero, the vector occurs much more often than its permutation;
(ii) when the value is much less than zero, the permutation occurs much more often than the vector.
In addition to the concepts of affinity and order sensitivity, we can also design other analysis techniques by choosing alternate representations of . Quantifying the terms , we can represent vector frequency numerically. In cases where the corresponding vector is called -positive, while in those cases where the opposite takes place: the vector is called -negative. By judiciously choosing a equal to 2, we obtained 16, two-residue, positive-pairs, and 8 two-residue negative-pairs (see Table 2 ) out of a total of 400 possibilities of amino acid pairs.
Table 2.
Amino acid pair | Calculated frequency (%) | Expected frequency (%) |
---|---|---|
AA | 0.78 | 0.60 |
RR | 0.36 | 0.28 |
NN | 0.22 | 0.18 |
CC | 0.04 | 0.02 |
CH | 0.04 | 0.03 |
HC | 0.04 | 0.03 |
0.24 | 0.15 | |
EE | 0.58 | 0.43 |
EK | 0.48 | 0.39 |
HH | 0.08 | 0.05 |
HP | 0.14 | 0.11 |
KK | 0.47 | 0.31 |
PP | 0.31 | 0.22 |
SS | 0.63 | 0.48 |
WW | 0.02 | 0.01 |
YY | 0.12 | 0.10 |
CM | 0.03 | 0.04 |
EP | 0.24 | 0.26 |
ES | 0.36 | 0.46 |
HE | 0.12 | 0.15 |
HK | 0.11 | 0.14 |
IM | 0.12 | 0.14 |
PM | 0.09 | 0.11 |
WP | 0.04 | 0.08 |
The first 16 residue pairs represent “positive” motifs and occur more frequently within the database than expected. The remaining eight pairs represent “negative” motifs (grayed boxes) and occur less frequently than expected. Data are taken from the Swiss-Prot database and it includes all the -positive and -negative pairs.
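As a worked illustration of how the sign of the log-ratio separates the two groups, the sketch below takes the calculated and expected percentages of six rows of Table 2 and computes log2(calculated/expected). It does not reproduce the threshold criterion used to select the pairs; the base-2 logarithm is an assumption, and because the table values are rounded to two decimals the ratios are only approximate.

```python
import math

# (calculated %, expected %) copied from six rows of Table 2.
table2 = {
    "AA": (0.78, 0.60),
    "CC": (0.04, 0.02),
    "KK": (0.47, 0.31),
    "CM": (0.03, 0.04),
    "ES": (0.36, 0.46),
    "WP": (0.04, 0.08),
}

def log2_ratio(calculated, expected):
    """Positive when a pair is over-represented, negative when under-represented."""
    return math.log2(calculated / expected)

ratios = {pair: log2_ratio(c, e) for pair, (c, e) in table2.items()}
over = sorted(pair for pair, r in ratios.items() if r > 0)
under = sorted(pair for pair, r in ratios.items() if r < 0)

for pair, r in ratios.items():
    print(f"{pair}: {r:+.2f}")
```

With these rounded inputs, CC comes out at about twice its expected frequency and WP at about half.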
2.3. Developing detection methods
Using any one of the positive motifs from Table 2, we can develop a test method to detect changes in their distribution through any protein database. If is -positive, then , where is the probability of the amino acids or vectors in the database. Since
(2.9) |
we have
(2.10) |
or
(2.11) |
and since
(2.12) |
we have
(2.13) |
and
As a consequence of Eq. (2.13), the detection of a deleterious mutation based on the increased appearance of an arbitrary string of these residues becomes possible. For example, according to Table 2, CC is an attractive motif with , then
(2.14) |
Let and be the number of CC and the number of C in the th protein. Then we have because and since
(2.15) |
Based upon the result that the CC pair represents a rare event, proteins satisfying , as defined in Eq. (2.15), suddenly become interesting. We can therefore consider the number as an index to develop detection methods to test for the presence of a CC containing protein within a database. Using this same reasoning, one can also use any other one of the remaining motifs in Table 2 to develop detection methods for additional proteins.
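The counting step behind such a detection method can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: overlapping occurrences are counted (so CCC contains two CC pairs), the helper names `count_pairs` and `flag_proteins` are hypothetical, and the toy sequences and the threshold of 2 are invented for the example.

```python
def count_pairs(sequence, pair):
    """Count occurrences of a two-residue pair, allowing overlaps."""
    return sum(1 for i in range(len(sequence) - 1) if sequence[i:i + 2] == pair)

def flag_proteins(proteins, pair, threshold):
    """Return the names of proteins whose pair count reaches the threshold."""
    return [name for name, seq in proteins.items()
            if count_pairs(seq, pair) >= threshold]

# Hypothetical toy proteins, not real database entries.
toy = {
    "toy1": "MACCKCCCA",  # CC at three positions (one overlapping)
    "toy2": "MAKCA",      # no CC pair
    "toy3": "CCMCC",      # two CC pairs, one CM pair
}

flagged = flag_proteins(toy, "CC", threshold=2)
print(flagged)
```

The same two functions apply unchanged to a negative pair such as CM, where the interesting proteins are again those with an unusually high count.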
2.4. Methods based on the negative two-residue motifs
In an identical fashion to the previous example using the positive two-residue motifs in Table 2, we can also use any of the “negative” two-residue motifs to develop search methods. For example, the CM pair is a “negative” two-residue “word” with . That is, . Using identical arguments to those described in Section 2.3, we obtain and . Since both and are small numbers for fixed proteins, the overall number of CM pairs in most proteins should be close to zero. Therefore, we can also use any of the remaining seven “negative” motifs in Table 2 to develop additional detection methods.
3. Results
3.1. The distribution of “positive” CC and “negative” CM pairs within Swiss-Prot
In our analysis of the positive and negative two-residue pairs, we have arbitrarily chosen CC and CM as candidates with which to search the Swiss-Prot database. The total number of proteins in Swiss-Prot is 133,723, amongst which the number satisfying is 128,922, while the total number containing at least one CC pair is 14,801. Since the total number of amino acids in Swiss-Prot is 46,120,418, we would expect to find only 11,368 CC pairs in the Swiss-Prot database based upon the amino acid frequencies given in Table 1, and thus at most 11,368 proteins containing a CC pair. Not surprisingly, there are only a small number of proteins that contain more than two of these CC pairs. For example, the total number of proteins in Swiss-Prot that contain seven or more CC pairs is only 104, amongst which only 23 are found in humans (Tables 3 and 4).
Table 3.
# pairs | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ≥10 |
---|---|---|---|---|---|---|---|---|---|---|
CC | 1678 | 385 | 136 | 35 | 14 | 9 | 7 | 5 | 4 | 7 |
CM | 1308 | 244 | 31 | 8 | 3 | 1 | 0 | 0 | 1 | 0 |
Statistical thresholds for CC (6) and CM (4) pairs are shown in bold case.
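The totals quoted in the text for human proteins can be recovered from the rows of Table 3 by summing columns. The sketch below assumes the final column collects proteins with ten or more pairs; the helper name `at_least` is hypothetical.

```python
# Rows of Table 3: number of proteins containing exactly 1, 2, ..., 9 pairs,
# with the last entry assumed to mean "10 or more".
cc_row = [1678, 385, 136, 35, 14, 9, 7, 5, 4, 7]
cm_row = [1308, 244, 31, 8, 3, 1, 0, 0, 1, 0]

def at_least(row, k):
    """Number of proteins containing k or more pairs (k between 1 and 10)."""
    return sum(row[k - 1:])

print("proteins with at least one CC:", at_least(cc_row, 1))
print("proteins with six or more CC:", at_least(cc_row, 6))
print("proteins with seven or more CC:", at_least(cc_row, 7))
print("proteins with at least one CM:", at_least(cm_row, 1))
print("proteins with four or more CM:", at_least(cm_row, 4))
```

These sums give 2280, 32, 23, 1596 and 13, matching the human Swiss-Prot counts reported in the text for the CC threshold of 6 and the CM threshold of 4.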
Table 4.
Protein name | Total length | Number of C | Number of CC | Function |
---|---|---|---|---|
NIC1 (NICE-1 protein) | 99 | 17 | 9 | (Uncharacterized) Involved in epidermal differentiation |
VTDB (Vitamin D-binding protein) | 474 | 28 | 7 | (Secreted-plasma) Prevents polymerization of actin |
AFAM (Afamin) | 599 | 34 | 8 | (Secreted) Contains albumin domains |
MCS (Sperm mitochondrial associated cysteine-rich protein) | 116 | 20 | 9 | (Cytoplasmic) Associated with male infertility |
ALBU (Serum albumin) | 609 | 35 | 8 | (Secreted-plasma) Familial dysalbuminemic hyperthyroxinemia |
DJC5 (DnaJ homolog) | 198 | 14 | 8 | (Membrane bound) Involved in presynaptic function |
DJCX (DnaJ homolog) | 199 | 14 | 7 | (Membrane) |
KRUA (Keratin, ultra high-sulfur matrix protein A) | 169 | 60 | 19 | (Extracellular) Cuticle layers of differentiating hair follicles |
KRUB (Keratin, ultra high-sulfur matrix protein B) | 194 | 67 | 22 | (Extracellular) Cuticle layers of differentiating hair follicles |
CIWC (Potassium channel subfamily K member 12) | 430 | 19 | 9 | (Membrane protein) Probable potassium channel subunit |
MDFI (Myogenic repressor I-mf) | 246 | 29 | 7 | (Cytoplasmic) |
GRN (Granulins) | 593 | 88 | 28 | (Secreted) May play a role in inflammation, wound repair, and tissue remodeling |
ECM1 (Extracellular matrix protein 1 [Precursor]) | 540 | 28 | 7 | (Extracellular matrix) Lipoid proteinosis |
MU5A (Mucin 5AC) | 1233 | 95 | 7 | (Extracellular matrix) |
LTBS (Latent transforming growth factor beta binding protein, isoform 1S) | 1394 | 139 | 7 | (Secreted) |
LTBL (Latent transforming growth factor beta binding protein) | 1595 | 138 | 8 | (Secreted) Involved in the assembly, secretion and targeting of TGFβ1 |
LYST (Lysosomal trafficking regulator) | 3801 | 93 | 9 | (Cytoplasmic) Chediak–Higashi syndrome (hypopigmentation) |
FBN1 (Fibrillin 1) | 2871 | 361 | 17 | (Extracellular matrix) Congenital contractural arachnodactyly |
FBN2 (Fibrillin 2) | 2911 | 363 | 17 | (Extracellular matrix) Congenital contractural arachnodactyly |
VWF (Von Willebrand factor) | 2813 | 234 | 8 | (Secreted) Von Willebrand disease |
Shown here are the top 20 of 28 human proteins from the Swiss-Prot database, which contain seven or more CC pairs. Proteins are ordered based upon their total CC content in relation to the total number of cysteine residues and the overall length of the protein.
3.1.1. Frequency of CM
In contrast to the CC motif, the total number of proteins in the Swiss-Prot database having more than one CM pair is quite small. For example, among the 133,723 proteins contained within this database, the number of proteins such that is 1595, and the number of proteins satisfying the condition is only 46. For human proteins in Swiss-Prot, the total number of proteins containing CM pairs is 1596, and there are only 13 proteins that satisfy (Table 3). However, we would expect to find about 17,233 proteins having CM pairs in the Swiss-Prot database, based upon the amino acid frequencies given in Table 1, which is greater than the number observed.
3.1.2. C-rich proteins do not necessarily contain numerous CC pairs
Cysteine rich proteins and the presence of CC pairs in proteins are both concepts that have been studied previously [15], [16]. If we combine these two concepts and describe them in mathematical terms, we observe that the rates and describe these two conditions, where , and are the numbers of all amino acids, cysteine, and CC, respectively. We can then derive
(3.1) |
where and are the standard deviations about
(3.2) |
and
respectively.
We have found the following computational results based on Swiss-Prot database
For , we have a much simpler method to determine that a protein sequence is both C- and CC-rich if
(3.3) |
So, a much simpler method to test whether a protein is a both C- and CC-rich for is
Using this principle, we found a total of eight proteins that are both C- and CC-rich within the Swiss-Prot database.
For , there is an easier method to determine whether a protein sequence is both C- and CC-rich if
(3.4) |
So, a much simpler method to test whether a protein is both C- and CC-rich for is based on the inequalities
Using this principle, we can get a total of 65 proteins that are both C- and CC-rich within the entire Swiss-Prot database.
For , there is also an easier method to determine if a protein sequence is both C- and CC-rich sequence when
(3.5) |
So, a much simpler method to test whether a protein is both C- and CC-rich for is based on the inequalities
Using this principle, we have identified a total of 200 proteins that are both C- and CC-rich within the Swiss-Prot database. From these results, it becomes clear that there are many C-rich proteins and many CC-rich proteins that are not mutually inclusive. The sets of both C- and CC-rich proteins are only subsets of the set containing all C-rich proteins or that containing all CC-rich proteins.
3.2. Distribution of CC and CM in human protein sequences in Genebank
Because the Swiss-Prot database is somewhat biased toward those proteins that are of interest to researchers, we decided to examine a database that is, in essence, complete. For this, we chose the human subset of sequences contained within the Complete Human Genebank database. This was the most complete and readily available source of human genetic sequences and should therefore provide us with an accurate representation of both CC and CM pair distributions. We observed that the total number of CC pairs within the human subset of sequences contained in Genebank (a total of 27,178 proteins) was 9631. These pairs were contained within a total of 6295 proteins (Table 5), a number that is substantially larger than the 3198 we obtained from Swiss-Prot data (Table 3). Since the total number of amino acids in the Complete Human Genebank is 5,777,674, the expected number of CC pairs is 1424, which is much less than the observed value of 9631. The same also holds true for the CM pair, where we would expect to see 2159 CM pairs, but actually observe 5007 (Table 5). This result is surprising and shows that there are fundamental differences between human proteins and those of other species in terms of the frequency of CC and CM pairs.
Table 5.
# pairs | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ≥10 |
---|---|---|---|---|---|---|---|---|---|---|
CC | 4651 | 1081 | 330 | 87 | 37 | 29 | 24 | 8 | 7 | 41 |
CM | 3488 | 520 | 94 | 36 | 5 | 2 | 1 | 0 | 1 | 0 |
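The Genebank figures quoted in Section 3.2 can be cross-checked against Table 5. The sketch below assumes the final column collects proteins with ten or more pairs; since that column is zero for CM, the CM pair total can be recovered exactly, whereas the CC pair total cannot. The observed and expected pair counts are copied from the text.

```python
# Rows of Table 5: proteins containing exactly 1, 2, ..., 9 pairs,
# with the last entry assumed to mean "10 or more".
cc_row = [4651, 1081, 330, 87, 37, 29, 24, 8, 7, 41]
cm_row = [3488, 520, 94, 36, 5, 2, 1, 0, 1, 0]

# Number of proteins that contain at least one CC pair.
proteins_with_cc = sum(cc_row)

# Total CM pairs: pairs-per-protein times number of proteins. This is exact
# only because no protein falls in the open-ended last column for CM.
cm_pairs = sum(k * n for k, n in enumerate(cm_row[:9], start=1))

# Observed versus expected pair counts quoted in the text.
cc_ratio = 9631 / 1424
cm_ratio = 5007 / 2159

print(proteins_with_cc, cm_pairs)
print(f"CC observed/expected: {cc_ratio:.2f}")
print(f"CM observed/expected: {cm_ratio:.2f}")
```

The sums reproduce the 6295 CC-containing proteins and the 5007 CM pairs reported in the text.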
3.3. The distribution of CC and CM in proteins with at least four pairs
To create a single CM pair, it is obvious that one needs only a single C. If a protein sequence also contains a CC pair, the total number of cysteines available to create a CM pair has been reduced by a factor of 2. Using the cutoff value of four CC pairs as statistically significant, we determined the probability of finding a CM pair within proteins that already contained a CC pair (Table 5). Even though both CC and CM can occur in the same protein, the total number of proteins in Genebank that contain both CC and CM pairs is 1400 and these pairs almost always occur at different locations in the sequence. However, in some rare cases the second C within the CC pair is shared with the first C within the CM pair, giving us a new triplet CCM.
3.4. Frequency of the triplet CCM
If we filter the set of proteins described above for redundant entries, we obtain a set of 120 unique proteins that contain the CCM triplet in the human proteome. Using the probability distributions obtained from Swiss-Prot, we obtained the probability of CCM as 0.000005866. Since the total of all protein lengths in the Swiss-Prot database is 49,190,847, there should be a total of 289 CCM triplets. This predicted value is almost identical to the observed number of CCM triplets, 290. This result suggests that the CCM triplet, at least in Swiss-Prot, occurs randomly. However, we can also analyze the CCM probability using a different approach, putting into Eq. (2.4). If CCM occurs randomly, then we have , and since , we will get
(3.6) |
Based on Eq. (3.6), it is clear that the presence of a CC pair strongly excludes M. Therefore, the CCM triplet should occur less frequently than either CC or M individually.
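The expected number of CCM triplets quoted above follows from multiplying the total sequence length by the triplet probability. The short sketch below reproduces that arithmetic with the two values given in the text; the function name `expected_count` is hypothetical, and using the total length as the number of triplet positions is the same approximation the text makes.

```python
def expected_count(total_length, probability):
    """Expected occurrences of a motif with the given per-position probability."""
    return total_length * probability

swissprot_total_length = 49_190_847   # total of all protein lengths (from the text)
p_ccm = 0.000005866                   # probability of CCM (from the text)

expected_ccm = expected_count(swissprot_total_length, p_ccm)
print(f"expected CCM triplets: {expected_ccm:.1f}")
```

Rounded to the nearest integer, this gives the 289 triplets stated in the text.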
4. Discussion
We have demonstrated that certain pairs of amino acids occur within protein sequences with a frequency that is either greater or lower than expected, as calculated from the observed frequencies of the individual amino acids. We expect that this can be used to rapidly classify proteins that carry these statistically uncommon pairs into large families. As an example, we focused on the double-cysteine (CC) pair, which occurred more frequently than expected within both the Swiss-Prot and Complete Human Genebank databases. We observed that within the human proteome, the actual frequency of CC was twice the calculated mean frequency.
The study of C-rich proteins has had a long history and many of these proteins have been identified as extracellular components [15]. While the fact that extracellular proteins contain numerous cysteine residues is not new, the distribution of these cysteines into distinct CC pairs is interesting. Proteins containing the CC motifs have previously been mentioned in the literature; however, their surprising prevalence within protein databases has not been mentioned [15], [16], [17]. To demonstrate the statistical relevance of the CC motif, we analyzed their distribution within both the Swiss-Prot and the Complete Human Genebank databases.
If we examine those proteins within the Swiss-Prot database that contain more CC pairs than the threshold, we can see that most of them are extracellular, being either secreted into the blood stream or part of the extracellular matrix (Table 4). It is understood that proteins with high numbers of cysteine residues are generally stabilized by disulfide bridges and that these proteins are usually found outside the cell. This, however, does not explain the subset of proteins that contain numerous CC pairs, with respect to the calculated probabilities. Further analysis is required to determine if these pairs have arisen due to some type of duplication event through evolution, leading to the increased stabilization of extracellular matrix proteins.
While the CC motif occurs more frequently than expected and the CM pair occurs less frequently than predicted by the amino acid frequencies within the Swiss-Prot database, the combination of the two, CCM, occurs more frequently than expected in human proteins. Using the Genebank database, we determined that the total number of human proteins containing the CCM motif is 119 out of a total of 27,178 proteins. In the Swiss-Prot database, the total number of proteins containing CCM is about 290, but the total number of proteins in this database is 128,494. In other words, the proportion of CCM-containing proteins among human proteins is about twice that in Swiss-Prot, implying that the rate of CCM proteins in human proteins is statistically significant. Using the probability distributions obtained from Swiss-Prot, the probability of a CCM triplet is 0.000005866, and the total length of all human protein sequences in Genebank is 5,777,674, so the expected number of CCM triplets is approximately 35, much less than the 119 observed. In other words, even though the occurrence of the CCM triplet seems to be random within the Swiss-Prot database, its frequency seems significant in human proteins.
One possible explanation for the increased frequency of the CCM triplet within the human proteome is an increased rate of mutation at either of the two C positions (XCM or CYM). Our results suggest that the CC pair strongly excludes M, because the conditional probability of M given that the first two positions are occupied by CC is much less than the unconditional probability of M. So, mathematically, not only is CM a negative motif, but so is CCM. The strong competitiveness of CC and the strong repulsion between C and M, and between CC and M, lead us to conclude that the mutational event leading to CCM would start from either XCM or CYM. If we let X and Y be the amino acids that occupy the positions of the CC pair, then in addition to X-Cys-Met, we may find that the triplet Cys-Y-Met also has a chance to change into CCM. In fact, since the repulsion between CC and M, as well as between C and M, is strong, it would imply that the number of Cys-Y-Met triplets is greater than the number of X-Cys-Met triplets in total protein sequences. Thus, the probability of changing Cys-Y-Met into CCM would be greater if the mutation at Y occurred at the same rate as that at X. Mutations to cysteine can arise from a single base substitution in codons for phenylalanine, serine, tyrosine, arginine and glycine. The most relevant mutation would seem to be serine to cysteine, where a total of four serine codons (TCT, TCC, AGT and AGC) can change to either TGT or TGC through a single substitution. A more detailed analysis of those residues and mutation frequencies at these positions is still required before any concrete conclusions can be made regarding the evolution of the CCM triplet in the human proteome and the discrepancy between it and the Swiss-Prot database. However, as an interesting aside, recent research has identified several proteins that contain the CCM triplet, although the function of CCM, if any, has not been characterized [18], [19], [20], [21], [22].
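The codon argument above can be checked by enumerating every codon that lies one base substitution away from the two cysteine codons. The sketch below assumes the standard genetic code and DNA (T rather than U) notation; the function name `single_substitution_sources` is hypothetical. Under these assumptions the enumeration confirms the four serine codons named in the text, and it also returns the tryptophan codon TGG, which the text does not list.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def single_substitution_sources(target_codons):
    """Map each amino acid to the codons that are exactly one base away from a target codon."""
    sources = {}
    for target in target_codons:
        for pos in range(3):
            for base in BASES:
                if base == target[pos]:
                    continue
                codon = target[:pos] + base + target[pos + 1:]
                aa = CODON_TABLE[codon]
                # Skip the other cysteine codon and stop codons.
                if codon in target_codons or aa == "*":
                    continue
                sources.setdefault(aa, set()).add(codon)
    return sources

to_cys = single_substitution_sources({"TGT", "TGC"})
for aa in sorted(to_cys):
    print(aa, sorted(to_cys[aa]))
```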
4.1. Example of the CC and CM detection method for SARS
Having developed a formal approach to the search for interesting amino acid pairs in protein sequences, we now discuss its application in the detection of specific diseases. In our first example, we will focus on developing detection methods for the Coronavirus responsible for causing severe acute respiratory syndrome (SARS) [23].
4.1.1. Statistical results for CC and CM in the Swiss-Prot database
We have obtained the following statistical results with regard to the total number of CC and CM pairs contained in the Swiss-Prot database:
(i) The total number of proteins in Swiss-Prot that contain at least one CC pair is 14,801, and 2280 of these are found in humans.
(ii) The total number of human proteins in Swiss-Prot that contain six or more CC pairs is 32.
(iii) The total number of proteins in Swiss-Prot that contain at least one CM pair is 12,438, and 1596 of these are found in humans.
(iv) The total number of human proteins in Swiss-Prot that contain four or more CM pairs is only 13.
4.1.2. Statistical results of CC and CM in the SARS proteins
The genome of the SARS-associated Coronavirus contains a total of 11 open reading frames, which code for six characterized and five uncharacterized proteins [12]. After examining the complete proteome of the SARS-associated Coronavirus, we observed a statistically significant increase in the occurrence of the CC and CM pairs.
Of the 11 putative proteins, four contain either CC or CM pairs. Two large poly-proteins, GI:29836505 and GI:29836495 contain 12 CC and 4 CM pairs, as well as 6 CC and 4 CM pairs, respectively. The putative spike glycoprotein, GI: 29836496 contains 3 CC and 1 CM, while the small envelope protein, GI: 29836499, contains only 1 CC pair. The remaining 6 proteins, of which two have been identified as a nucleocapsid and matrix protein, contain no CC or CM pairs. When compared to the occurrence of CC and CM pairs within all human proteins found in the Swiss-Prot database, it becomes clear that there is an increased occurrence of these pairs within the SARS-associated Coronavirus proteome.
It is clear that the frequency of CC and CM pairs is significantly higher in the SARS genome than it is in the human genome. It is our hope that these unique two-residue motifs could be used as a means of identifying a virus such as SARS. While this is currently impossible to determine, one can see a possible clinical use for determining the frequency of CC or CM pairs in an individual. If a person has been infected by SARS, we would expect the numbers of CC and CM pairs to be much higher than in a healthy individual. If we were then able to develop a clinically feasible technique that could identify the CC or CM pair content in total cellular protein, without damaging the cell, it could be possible to implement a detection method simply by counting the numbers of CC or CM pairs in all proteins examined. If the presence of increased CC or CM pairs was indicative of infection by SARS, we could develop new drugs to inhibit the translation of proteins that contain these pairs within infected cells. As a very specific example, this could be accomplished through the use of specific tRNA mimics that could block the translating ribosome when these pairs appear at the P and A sites simultaneously [24], [25]. While tRNA mimics would normally be seen as non-specific protein synthesis inhibitors, the lower frequency of the CC or CM motifs in host proteins, as compared to their relatively high occurrence in viral proteins, provides a starting point for the analysis of treatments using methods such as these.
5. Summary and conclusions
In this paper we have performed a detailed search of the Swiss-Prot and Complete Human Genebank database in relation to the frequencies of each of the 20 natural amino acids and motifs made up of two residues. This has provided us with a new insight into the distribution of proteins within these databases and the information that may be contained within a protein's primary amino acid sequence. Our focus in this quest has been to examine specific pairs or triplets of residues and use probability theory to assess whether or not particular strings of residues occur more frequently than their probability would indicate. By analyzing affinity and anti-affinity, we developed a procedure to search out all two- and three-residue clusters within these databases. Through this technique, we can rapidly classify large groups of proteins into families based on the occurrence of these motifs. Combining this concept with a specific knowledge of protein function, we propose novel ideas aimed at developing fast detection methods for specific diseases based on the unusually high occurrence of some of these motifs when compared with the expected probabilities. A timely example of the application of this method is given, namely the analysis of the SARS virus. This method will hopefully lead to the discovery of new diagnostic techniques and possible treatment of diseases, by recognizing the location of target domains in the expressed sequence.
References
- 1. Mount D.W. Cold Spring Harbor Laboratory Press; New York: 2001. Bioinformatics: Sequence and Genome Analysis.
- 2. Baldi P., Brunak S., Frasconi P., Soda G., Pollastri G. Exploiting the past and the future in protein secondary structure prediction. Bioinformatics. 1999;15:937–946. doi: 10.1093/bioinformatics/15.11.937.
- 3. Rost B. Review: protein secondary structure prediction continues to rise. J. Struct. Biol. 2001;134:204–218. doi: 10.1006/jsbi.2001.4336.
- 4. Hart R.K., Royyuru A.K., Stolovitzky G., Califano A. Systematic and fully automated identification of protein sequence patterns. J. Comput. Biol. 2000;7:585–600. doi: 10.1089/106652700750050952.
- 5. Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W., Lipman D.J. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389.
- 6. Bairoch A. PROSITE: a dictionary of sites and patterns in proteins. Nucleic Acids Res. 1992;20(Suppl):2013–2018. doi: 10.1093/nar/20.suppl.2013.
- 7. Bork P., Koonin E.V. Protein sequence motifs. Curr. Opin. Struct. Biol. 1996;6:366–376. doi: 10.1016/s0959-440x(96)80057-1.
- 8. Shen S.Y., Yang J., Yao A., Hwang P.I. Super pairwise alignment: an efficient approach to global alignment for homologous sequences. J. Comput. Biol. 2002;9:477–486. doi: 10.1089/106652702760138574.
- 9. Rost B., Casadio R., Fariselli P., Sander C. Transmembrane helices predicted at 95% accuracy. Protein Sci. 1995;4:521–533. doi: 10.1002/pro.5560040318.
- 10. Rost B., Sander C. Bridging the protein sequence-structure gap by structure predictions. Annu. Rev. Biophys. Biomol. Struct. 1996;25:113–136. doi: 10.1146/annurev.bb.25.060196.000553.
- 11. Gan H.H., Perlow R.A., Roy S., Ko J., Wu M., Huang J., Yan S., Nicoletta A., Vafai J., Sun D., Wang L., Noah J.E., Pasquali S., Schlick T. Analysis of protein sequence/structure similarity relationships. Biophys. J. 2002;83:2781–2791. doi: 10.1016/s0006-3495(02)75287-9.
- 12. Marra M.A., Jones S.J., Astell C.R., Holt R.A., Brooks-Wilson A., Butterfield Y.S., Khattra J., Asano J.K., Barber S.A., Chan S.Y., Cloutier A., Coughlin S.M., Freeman D., Girn N., Griffith O.L., Leach S.R., Mayo M., McDonald H., Montgomery S.B., Pandoh P.K., Petrescu A.S., Robertson A.G., Schein J.E., Siddiqui A., Smailus D.E., Stott J.M., Yang G.S., Plummer F., Andonov A., Artsob H., Bastien N., Bernard K., Booth T.F., Bowness D., Czub M., Drebot M., Fernando L., Flick R., Garbutt M., Gray M., Grolla A., Jones S., Feldmann H., Meyers A., Kabani A., Li Y., Normand S., Stroher U., Tipples G.A., Tyler S., Vogrig R., Ward D., Watson B., Brunham R.C., Krajden M., Petric M., Skowronski D.M., Upton C., Roper R.L. The genome sequence of the SARS-associated Coronavirus. Science. 2003;300:1399–1404. doi: 10.1126/science.1085953.
- 13. Boeckmann B., Bairoch A., Apweiler R., Blatter M.C., Estreicher A., Gasteiger E., Martin M.J., Michoud K., O’Donovan C., Phan I., Pilbout S., Schneider M. The Swiss-Prot protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003;31:365–370. doi: 10.1093/nar/gkg095.
- 14. Cover T.M., Thomas J.A. Wiley; New York: 1991. Elements of Information Theory.
- 15. Buchanan J.H. A cystine-rich protein fraction from oxidized alpha-keratin. Biochem. J. 1977;167:489–491. doi: 10.1042/bj1670489.
- 16. Zhang H., Serrero G. Inhibition of tumorigenicity of the teratoma PC cell line by transfection with antisense cDNA for PC cell-derived growth factor (PCDGF, epithelin/granulin precursor). Proc. Natl. Acad. Sci. USA. 1998;95:14202–14207. doi: 10.1073/pnas.95.24.14202.
- 17. Combadiere C., Ahuja S.K., Murphy P.M. Cloning and functional expression of a human eosinophil CC chemokine receptor. J. Biol. Chem. 1996;271:11034.
- 18. Strittmatter S.M., Igarashi M., Fishman M.C. GAP-43 amino terminal peptides modulate growth cone morphology and neurite outgrowth. J. Neurosci. 1994;14:5503–5513. doi: 10.1523/JNEUROSCI.14-09-05503.1994.
- 19. Liu Y., Fisher D.A., Storm D.R. Intracellular sorting of neuromodulin (GAP-43) mutants modified in the membrane targeting domain. J. Neurosci. 1994;14:5807–5817. doi: 10.1523/JNEUROSCI.14-10-05807.1994.
- 20. Reinhard E., Nedivi E., Wegner J., Skene J.H., Westerfield M. Neural selective activation and temporal regulation of a mammalian GAP-43 promoter in zebrafish. Development. 1994;120:1767–1775. doi: 10.1242/dev.120.7.1767.
- 21. Neel V.A., Young M.W. Igloo, a GAP-43-related gene expressed in the developing nervous system of Drosophila. Development. 1994;120:2235–2243. doi: 10.1242/dev.120.8.2235.
- 22. Kapfhammer J.P., Schwab M.E. Increased expression of the growth-associated protein GAP-43 in the myelin-free rat spinal cord. Eur. J. Neurosci. 1994;6:403–411. doi: 10.1111/j.1460-9568.1994.tb00283.x.
- 23. Drosten C., Gunther S., Preiser W., van der Werf S., Brodt H.R., Becker S., Rabenau H., Panning M., Kolesnikova L., Fouchier R.A., Berger A., Burguiere A.M., Cinatl J., Eickmann M., Escriou N., Grywna K., Kramme S., Manuguerra J.C., Muller S., Rickerts V., Sturmer M., Vieth S., Klenk H.D., Osterhaus A.D., Schmitz H., Doerr H.W. Identification of a novel Coronavirus in patients with severe acute respiratory syndrome. N. Engl. J. Med. 2003;348:1967–1976. doi: 10.1056/NEJMoa030747.
- 24. Schmeing T.M., Moore P.B., Steitz T.A. Structures of deacylated tRNA mimics bound to the E site of the large ribosomal subunit. RNA. 2003;9:1345–1352. doi: 10.1261/rna.5120503.
- 25. Selmer M., Al-Karadaghi S., Hirokawa G., Kaji A., Liljas A. Crystal structure of Thermotoga maritima ribosome recycling factor: a tRNA mimic. Science. 1999;286:2349–2352. doi: 10.1126/science.286.5448.2349.