Abstract
Sequences of the variable heavy (VH) and κ (Vκ) domains of Ig structures were divided into 21 fragments that correspond to strands, loops, or parts of these structural units of the variable domains. Amino acid sequences of fragments (termed “words”) were collected from the 1,172 human heavy and 668 human κ chains available in the Kabat database. Statistical analysis of words of 17 fragments was performed (fragments that comprise the complementarity-determining regions will not be discussed in this paper). The number of different words (those with different residues in at least one position) ranged, for various fragments, from 11 to 75 in the κ chains, and from 23 to 189 in the heavy chains. The main result of this study is that very few keywords, or main patterns of words, were necessary to describe over 90% of the sequences (no more than two keywords per fragment in the κ and no more than five per fragment in the heavy chains). No identical keywords were found for different fragments of the variable domains. Keywords of aligned fragments of the VH and Vκ domains were different in all but two instances. Thus, knowing the keywords, one can determine whether any given small part of a sequence belongs to a heavy or κ chain and predict its precise localization in the sequence. In addition, by using all of the keywords obtained through analysis of the Kabat database, it was possible to describe completely the sequences of the human VH and Vκ germ-line segments.
Keywords: Ig sequence, sequence comparison, pattern recognition
Analysis of residues that maintain a common Ig structure has been the subject of many investigations (1–13). At present, the numbers of Ig sequences in the Kabat database and Ig structures in the Protein Data Bank are large enough to carry out a statistical analysis of sequence repertoire of the variable domains and to find the relationship between the sequence of a protein and its three-dimensional structure. Analysis of approximately 5,300 sequences, 111 gene segments, and about 100 three-dimensional structures of the variable domains allowed us to determine how frequently a residue was encountered at each position and to give a description of sequence determinants of Ig fold (refs. 12 and 13; C. Chothia, I.M.G., and A.E.K., unpublished work).
In this paper we focus on the correlation among residues within small fragments of the human heavy and κ chains in the Kabat database. We term amino acid sequences of the fragments “words” (12). Comparison of aligned fragments in the chains revealed a large number of variations among words of each fragment. In spite of this diversity, statistical analysis of words allowed us to select a very limited number of keywords, or main patterns, for each fragment, which characterize more than 90% of all Ig chains. The possibility of defining the keywords presupposes the presence of a large number of correlations among residues. Our approach, which involved dividing the sequences into words, permits such correlations to be revealed.
Methods of Analysis and Results
Fragmentation of Sequences.
Prediction of secondary structures for all sequences of the Kabat database (12) allowed us to divide the human variable heavy (VH) and κ (Vκ) sequences into 21 fragments that correspond, approximately, to strands, loops, or parts of these structural units. A, A′, B, C, C′, C", D, E, F, and G fragments correspond to strands, and A′B, CC′, C′C", C"D, DE, EF, and FG fragments correspond to loops. Because of its unique two-arch conformation, the loop between B and C strands was divided into two fragments: BC and CB. The first three residues of the sequences were defined as the OA fragment.
A position in a fragment was assigned to each residue in a sequence. In this paper, a residue, in addition to its Kabat numbering, is referred to by an index that contains the name of the fragment and its position therein. Referencing amino acids this way permits us to compare amino acids that are located in the same positions of various sequences.
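The indexing scheme can be sketched in a few lines of Python. This is only an illustration: it assumes a lookup table of Kabat number ranges per fragment, and the only range stated in this paper is that of the heavy-chain BC fragment (Kabat 25 to 28, i.e., BC1 to BC4, given in the legend to Table 2). The names `FRAGMENT_RANGES` and `fragment_index` are hypothetical, and Kabat insertion letters are ignored.

```python
# Kabat number range covered by each fragment (heavy chains).
# Only the BC range (Kabat 25-28 = BC1-BC4) is stated in the paper;
# ranges for the other fragments are deliberately left out here.
FRAGMENT_RANGES = {"BC": (25, 28)}


def fragment_index(kabat_number, ranges=FRAGMENT_RANGES):
    """Return an index such as 'BC3' for a Kabat number, or None if unmapped."""
    for name, (first, last) in ranges.items():
        if first <= kabat_number <= last:
            # Positions inside a fragment are counted from 1.
            return f"{name}{kabat_number - first + 1}"
    return None


print(fragment_index(27))  # BC3
```

With such an index, residues from different chains can be compared position by position without reference to the full-length alignment.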
Collections of Words from the Kabat Database Chains.
In this work 1,172 human heavy chains and 668 human κ chains in the Kabat database were analyzed. The numbers of sequences containing particular fragments differ because not all chains in the database were complete. The numbers of sequences with each of the 17 fragments are presented in Table 1 (results for the remaining fragments, CB, C′C", C", and FG, will be considered elsewhere).
Table 1.
Fragment | Heavy chains: numbers of chains | Heavy chains: numbers of different words | κ chains: numbers of chains | κ chains: numbers of different words |
---|---|---|---|---|
OA | 743 (82) | 43 (6) | 444 (32) | 29 (9) |
A | 799 (82) | 35 (8) | 483 (32) | 12 (2) |
AA′ | 803 (82) | 23 (7) | 472 (32) | 29 (13) |
A′ | 840 (82) | 61 (13) | 506 (32) | 32 (12) |
A′B | 841 (82) | 38 (15) | 485 (32) | 26 (8) |
B | 811 (82) | 119 (15) | 465 (32) | 74 (12) |
BC | 828 (82) | 66 (10) | 439 (32) | 34 (5) |
C | 820 (83) | 95 (23) | 425 (32) | 43 (12) |
CC′ | 828 (82) | 84 (17) | 419 (32) | 47 (13) |
C′ | 807 (83) | 135 (33) | 427 (32) | 43 (11) |
C"D | 828 (82) | 124 (18) | 430 (32) | 50 (6) |
D | 825 (82) | 129 (22) | 421 (32) | 32 (4) |
DE | 830 (82) | 112 (21) | 425 (32) | 11 (1) |
E | 838 (82) | 189 (18) | 415 (32) | 46 (6) |
EF | 827 (82) | 140 (20) | 411 (32) | 58 (12) |
F | 844 (78) | 99 (15) | 411 (32) | 51 (7) |
G | 478 | 98 | 335 | 75 |
Numbers of sequences containing a given fragment (numbers of chains) and numbers of different words are shown for 17 of 21 fragments of the human heavy and κ chains; analogous data for 16 available VH and Vκ domain fragments of germ-line sequences are presented in parentheses.
Analysis of the words revealed that some words of a given fragment were encountered many times, whereas others were seen very rarely, or just once. In the F fragment, for example, the word AVYYCAR was found in 394 human heavy chains, but the word AVYYCTR was found in only five chains. (Data on frequency of occurrence of individual words are not presented in this paper.) For each fragment we selected different words (those with differing residues in at least one position). The numbers of different words varied considerably among fragments, ranging, in the heavy chains, from 23 for the AA′ fragment to 189 for the E fragment (Table 1).
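The two quantities reported in Table 1 (number of chains containing a fragment and number of different words) amount to a simple tally. The sketch below is a minimal illustration, not the procedure actually used: the function name `word_statistics` is hypothetical, and the toy input contains only the two F words whose frequencies are quoted above, not the full collection.

```python
from collections import Counter


def word_statistics(words):
    """Return (number of chains, number of different words, word frequencies).

    `words` holds one word per chain for a single fragment.
    """
    counts = Counter(words)
    return len(words), len(counts), counts


# Toy input: only the two F words whose frequencies are quoted in the text.
f_words = ["AVYYCAR"] * 394 + ["AVYYCTR"] * 5
n_chains, n_different, counts = word_statistics(f_words)
print(n_chains, n_different, counts.most_common(1))
```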
Classification of Amino Acids in Ig Sequences.
For the purpose of analysis of words, we divided amino acids into several groups. This classification of amino acids is based on our previous analysis of residues’ frequencies in 5,300 Ig sequences (C. Chothia, I.M.G., and A.E.K., unpublished work; ref. 14). Inspection of these data allowed us to group residues that usually occupy the same positions and are of similar chemical character. All residues were divided into nine groups: 1) L, M, V, I, F, and A (at positions OA2, A1, B2, B4, C′3, CB1, C1, E4, E6, EF2, G8, and G10); 2) S and T (at positions A4, B1, B5, BC1, D3, D5, and G6); 3) G, A, and S (at positions AA′3, A′B2, B8, C"D5, DE2, F1, G3, and G5); 4) F, Y, and W (at positions C3, F3, F4, and G2); 5) D and E (at positions OA1, A3, A′1, and EF6); 6) K, R, and H (at positions A′3, A′4, B3, CC′4, C5, C6, and D1); 7) Q and N (at positions OA3, A3, and A′4); 8) C (at positions B6 and F5); and 9) P (at position CC′2). Three residues (A, F, and S) each are found in two groups. This is because our classification takes into account not only chemical structure but also the structural role of residues. For instance, in loops, residues S and A are found mostly in the same positions and so, together with G, form one group. In strands, however, S and A are classified into two distinct groups: S “shares” positions with T, whereas A belongs to one group with the hydrophobic residues.
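The nine groups, including the double membership of A, F, and S, can be written down as a small lookup. This is a sketch under stated assumptions: the `in_loop` flag is a simplification of the loop-versus-strand rule described above for S and A, no context rule is given in the text for F (so it stays in both of its groups), and the function names are hypothetical.

```python
# The nine amino acid groups, numbered as in the text.
GROUPS = {
    1: "LMVIFA", 2: "ST", 3: "GAS", 4: "FYW", 5: "DE",
    6: "KRH", 7: "QN", 8: "C", 9: "P",
}


def groups_of(residue, in_loop=None):
    """Return the set of group numbers a residue belongs to.

    in_loop=True/False applies the loop/strand rule for S and A;
    in_loop=None returns every group that lists the residue.
    """
    found = {g for g, members in GROUPS.items() if residue in members}
    if in_loop is not None and residue in "SA":
        # In loops S and A join G (group 3); in strands they leave it.
        found = {3} if in_loop else found - {3}
    return found


def same_group(a, b, in_loop=None):
    """True if two residues share at least one group in the given context."""
    return bool(groups_of(a, in_loop) & groups_of(b, in_loop))


print(groups_of("S"))                     # {2, 3}
print(same_group("S", "A", in_loop=True))   # True
print(same_group("S", "A", in_loop=False))  # False
```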
Keywords, or Main Patterns of Words.
Statistical analysis of words in human heavy and κ chains (which involved determining the number of chains containing particular words, frequency of occurrence of each word in the chains, number of different words, residues’ frequencies at the positions of words, and so on) allowed us to define keywords, or main patterns of words, for the 17 fragments. Keywords demonstrate correlation among residues that most frequently are encountered in positions of words. We require that residues at any one position of a keyword belong to the same amino acid group (as per amino acid classification outlined above). The keywords are listed in Table 2. Let us consider, by way of an example, the keywords of the DE fragment of the heavy chains. The three keywords differ mostly in residues at the DE3 and DE4 positions. The first keyword represents the correlation between residues S or A (both of the same group) at DE2, K at DE3, and N at DE4. The DE1 position is variable (marked with an X) because no correlation was found between residues at this position and the residues at the other positions. This pattern (X at DE1, S or A at DE2, K at DE3, and N at DE4) describes 12 different words that were found in 461 human heavy chains. Similar analysis was performed for all keywords of the 17 fragments in the human heavy and κ chains.
Table 2.
The results of the statistical analysis of words in the a, b, and c levels of clusters are presented for 17 fragments of the human heavy and κ chains in the Kabat database; analogous data for germ-line database sequences are given in parentheses. We illustrate the use of Table 2 with the example of the BC fragment of the heavy chains. The Kabat numbers (25–28, in the first row) correspond to indices BC1, BC2, BC3, and BC4. Three keywords, SG(FY)X, SGG(ST), and SGDS, serve as a basis for dividing all words of the BC fragment into three clusters. The words of these clusters are found in 97% of all chains and cover 80% of different words (last two columns, Total). Six middle columns show the statistics for the a, b, and c levels of each cluster. The a columns list the numbers of different words (dwords) and numbers of chains (chains) containing all words of the a level of each of the three clusters. The b columns contain the data for the b levels, and the number of different words and the total number of chains in each of the clusters are presented in the c columns. Thus, in the a level of the first cluster, there are 19 different words (six in germ-line sequences), which are found in 533 chains (64 in germ-line sequences). The total number of chains in which all words of the first cluster were found (551) and the total number of different words in the a, b, and c levels (34) are given in the c columns (respective data for germ-line sequences are 64 and 6). The superscripts refer to the Comments to Table 2 section of the paper.
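The keyword notation used in this legend (a single letter for a fixed residue, parentheses for alternatives, X for a variable position) is easy to parse mechanically. The following sketch is illustrative only: the function names are hypothetical, and the test words other than the keywords themselves are made up for the example.

```python
def parse_keyword(pattern):
    """Split a keyword such as 'SG(FY)X' into per-position residue sets.

    A variable X position is represented by None.
    """
    positions = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "(":
            close = pattern.index(")", i)
            positions.append(frozenset(pattern[i + 1:close]))
            i = close + 1
        elif ch == "X":
            positions.append(None)
            i += 1
        else:
            positions.append(frozenset(ch))
            i += 1
    return positions


def matches(word, pattern):
    """True if every residue of the word is allowed by the keyword."""
    spec = parse_keyword(pattern)
    return len(word) == len(spec) and all(
        allowed is None or residue in allowed
        for residue, allowed in zip(word, spec)
    )


# The three BC keywords of the heavy chains quoted above.
for keyword in ("SG(FY)X", "SGG(ST)", "SGDS"):
    print(keyword, matches("SGFT", keyword))
```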
Clusters of Words.
The keywords defined above serve as a basis for dividing collections of words into clusters. Each cluster corresponds to a particular keyword and has three levels. The first level (a) includes those words that are exactly identical to the keyword. (The a level of the first cluster of DE words already has been discussed.)
In the second, b level, all words were included that had the following property: the residue at any position of these words is of the same amino acid group as the residue in the respective position of the keyword. For instance, the b level of the first cluster of the DE fragment of the heavy chain will include the words that have either R or H at DE3 position, as these two residues belong to the same group as K (which occupies DE3 in the keyword). The same considerations apply to all other positions. Our analysis showed that in the a and b levels of this cluster 16 different words were found in 467 human heavy chains (Table 2).
Inspection of words showed that many words cannot be included in the a or b level of a cluster because a residue in one position of these words is not of the same group as the residue(s) in the respective position of the keyword. It can be said that these words belong to a cluster with a mutation in one position. To incorporate these words into a cluster, we defined an additional c level. Our analysis revealed 53 different words in 539 human heavy chains that belong to all three levels (a, b, and c) of the first cluster of the DE fragment (Table 2).
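The three levels can be summarized as a small classification routine. The sketch below reflects one reading of the definitions above (level a: every position matches the keyword; level b: every position stays within the keyword's amino acid group; level c: exactly one position falls outside its group) and is not the procedure actually used for Table 2. The keyword is the first DE keyword of the heavy chains; because DE is a loop, S or A at DE2 is assigned to the G, A, S group. The name `cluster_level` and the test words are hypothetical.

```python
# First DE keyword of the heavy chains: X at DE1, S or A at DE2, K at DE3, N at DE4.
# Each position is (residues allowed at level a, amino acid group used at level b);
# None marks the variable X position.
DE_KEYWORD_1 = [None, ("SA", "GAS"), ("K", "KRH"), ("N", "QN")]


def cluster_level(word, keyword):
    """Return 'a', 'b', or 'c' for a word of the cluster, else None."""
    if len(word) != len(keyword):
        return None
    exact_misses = 0   # positions that differ from the keyword residues
    group_misses = 0   # positions that fall outside the keyword's group
    for residue, spec in zip(word, keyword):
        if spec is None:
            continue  # variable position: any residue is accepted
        allowed, group = spec
        if residue not in allowed:
            exact_misses += 1
        if residue not in group:
            group_misses += 1
    if exact_misses == 0:
        return "a"
    if group_misses == 0:
        return "b"
    if group_misses == 1:
        return "c"  # one position "mutated" out of its group
    return None


print(cluster_level("TSKN", DE_KEYWORD_1))  # a
print(cluster_level("TSRN", DE_KEYWORD_1))  # b
print(cluster_level("TSKD", DE_KEYWORD_1))  # c
```

The groups are attached to the keyword positions explicitly rather than looked up per residue, because the group of S and A depends on whether the position lies in a loop or a strand.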
In this work we defined the keywords for 17 fragments of the human heavy and κ chains and used them to form clusters (Table 2). For each level of every cluster we present two characteristics: the number of the chains containing all words of this level and the number of different words. Our analysis showed that more than 93% of all words can be assigned to 50 clusters of the heavy chains and 26 clusters of the κ chains.
Comments to Table 2
AA′ Words.
The heavy chains. Residues in all positions are from the GAS or P amino acid groups.
The κ chains. Words in the κ chains are longer by one position than words in the heavy chains.
B Words.
The heavy chains. The two keywords differ mainly at the B3 and B8 positions. It is interesting to note that residues S and T, which usually share one position and belong to the same group, are separated in these patterns. In the first pattern, S residues at the B1 and B5 positions are correlated with R or K at B3, and G or A at B8, whereas in the second pattern, T residues at B1 and B5 are correlated with S or T at B3, and I or V at B8. All patterns are characterized by hydrophobic residues at the even positions, hydrophilic residues at the odd positions, and C at the B6 position. ¹At the hydrophobic positions B2 and B4, the following pairs of residues are usually found: L-L, V-V, and L-I. ²The B7 position is occupied by residues from different amino acid groups; residues A, K, and T are found in about 90% of sequences and about 60% of different words.
The κ chains. The main motif of both keywords is the same as in the heavy chains. In contrast to the patterns of the heavy chains, the B7 position is not variable in the κ chains. R and K residues are found in 90% of the sequences. The main difference between B words of the VH and Vκ domains is at the B1 position, which is occupied by S and T in the heavy chains and by P, or positively charged R and K, in the κ chains. ¹Combinations of residues A-L-S or V-I-T at the B2, B4, and B5 positions, respectively, often are observed in the first pattern.
BC Words.
The heavy chains. Three of the four positions are conservative positions. ¹Fifteen residues are found at BC4, of which S and T, found in 50% of different words and about 90% of sequences, are the most common. The three keywords are distinguished by residues at only the BC3 position.
The κ chains. Usually each word has three positions; however, a few four-position words are observed. ¹The BC3 or BC4 positions are variable but are most frequently occupied by S.
C Words.
The heavy chains. ¹Position C2 is occupied by 16 different amino acids (S, H, N, and G are the most common).
The κ chains. The main differences between the patterns of the heavy and κ chains are at the C4 and C5 positions.
CC′ Words.
The heavy chains. ¹At the CC′1 position, A, P, M, and S residues are found most frequently. ²The combination of G and K residues usually occupies the CC′3 and CC′4 positions, as do S and R residues.
The κ chains. ¹A, P, and S residues are found at the CC′5 position in more than 90% of the chains.
C′ Words.
The heavy chains. The two patterns differ only at the C′6 position. ¹Three hydrophobic residues, V, M, and I, are found in an approximately equal number of words at the C′3 position. ²C′5 is a very variable position, which is occupied by any residue except D and P.
The κ chains. Peculiar words are found. Three hydrophobic residues are in a row, at positions C′2, C′3, and C′4. In contrast to the patterns of the heavy chains, in which the first position is occupied by negatively charged residues, the κ chains have positively charged residues at the first position. Also, κ chain words are shorter, by one position, than heavy chain words.
C"D Words.
The heavy chains. ¹Different residues occupy the C"D1 position (mainly P and D), but Q never does. ²When V occupies C"D3, then either G or S is at C"D5, but if C"D3 is taken by L, the residue at C"D5 is usually S. ³When Q is at the C"D1 position, C"D2 can be occupied by many different residues (mostly K and N, but never S). ⁴X equals A, T, D, P, or G residues at the C"D1 position.
The κ chains. ¹In most cases, positions C"D3 and C"D5 are occupied by either V and S, or I and A, respectively.
D Words.
The heavy chains. Hydrophobic residues at even positions are a common feature of the D, E, and B strands of the β-sheet. Positively charged residues, which take part in the salt bridge, are found at D1 in 85% of the heavy chains. Negatively charged residues occupy the last position (D7) in 95% of the sequences. ¹F at the D2 position always occurs with I at D4 and S at D5, and V at D2 correlates with T at D5.
The κ chains. One pattern is common to 98% of the sequences. The first three positions are similar in the heavy and κ chains, whereas the last four differ greatly in both chains. All positions from D3 on contain only G and S residues.
DE Words.
The heavy chains. ¹T, N, K, and D are the most common residues at the DE1 position (found in 322, 208, 135, and 82 sequences, respectively).
The κ chains. The keyword is two positions long.
E Words.
The heavy chains. Hydrophobic residues were at the even positions, whereas hydrophilic residues were at the odd ones, as in the B and D words. An exception was found in only 20 sequences, in which the E3 position was occupied by the hydrophobic residue V. The E5 position almost always is occupied by charged and polar residues (most frequent are Q, E, and K residues, which are found in 56%, 19%, and 13% of sequences, respectively). We discovered two regularities involving residues at the E1 and E3 positions: (i) S and T at E1 never are found together with S at the E3 position; and (ii) S at the E3 position usually correlates with Q at E1. ¹The most frequently encountered residues at E7 were S and N.
The κ chains. Main motifs are similar to those of the heavy chains. However, in hydrophilic positions many differences between the two classes of chains were observed. D and E residues usually were found at the E1 position of the κ chains, but never at this position of the heavy chains. Also, in the κ chains, at the E3 position, T was found in 91% of the chains and S was seen very rarely, whereas the reverse situation was found in the heavy chains (S occurs in 27% of the sequences and T is never found).
EF Words.
The heavy chains. Four conservative positions were found: EF1 (contains S residue), EF2 (hydrophobic), EF6 (contains D), and EF7 (contains T). In the variable EF4 position, A, S, and P are the most frequent residues (88% of the chains); generally, these residues are located at those sites where a chain is “broken”. ¹L at EF2 correlates, in almost all cases, with R or K at the EF3 position. ²Another correlation was observed between V at EF2 and T at EF3.
The κ chains. Just as in the heavy chains, EF4 is the variable position; A, S, and P are the most frequent residues at this position (88% of the chains).
F Words.
The heavy chains. Fifteen different amino acids can be found at position F6 and 17 at F7.
The κ chains. F1, F3, F4, and F5 are conservative positions. The same residues are found in the heavy and κ chains.
G Words.
The heavy chains. ¹Of the 16 residues found in the G1 position, the most common were Y, V, I, and P (84% of the chains).
Keywords in the Germ Line of the Human VH Segments
Although the number of sequences in the Kabat database is large (2), a question remains about whether the keywords obtained through analysis of these sequences will suffice for description of all VH and Vκ domains. We, therefore, checked our results by using another database that contained the sequences of the complete human repertoire of the germ-line VH and Vκ segments (14, 15). The results of our analysis can be summarized as follows: (i) it was shown that the ratio of the number of different words to the total number of words is much higher for the fragments of the germ-line database sequences than for the fragments of the Kabat sequences (Table 1); (ii) all of the Kabat database keywords were found among the germ-line words (Table 2); (iii) the set of keywords was adequate for complete description of the entire germ-line repertoire, with the exception of five words of the E and EF fragments; and (iv) comparison of the words’ distribution in the three levels of clusters revealed that the fraction of words in the c level is significantly larger for the Kabat sequences than for the germ-line sequences. Considering the fact that the germ-line database (unlike the Kabat database) contains no sequences with somatic mutations, this last observation can be interpreted as supporting the hypothesis that most words in the c level arise through somatic mutations.
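The ratio mentioned in point (i) can be recomputed directly from Table 1. The sketch below does so for three heavy-chain rows chosen purely for illustration; it assumes that the total number of words of a fragment equals the number of chains containing that fragment (one word per chain), and the names `TABLE_1_HEAVY` and `diversity_ratio` are hypothetical.

```python
# (number of chains, number of different words) copied from Table 1,
# heavy chains; germ-line values are the figures given in parentheses.
TABLE_1_HEAVY = {
    "OA": {"kabat": (743, 43), "germline": (82, 6)},
    "B": {"kabat": (811, 119), "germline": (82, 15)},
    "C'": {"kabat": (807, 135), "germline": (83, 33)},
}


def diversity_ratio(chains, different_words):
    """Ratio of different words to total words (one word per chain assumed)."""
    return different_words / chains


for fragment, row in TABLE_1_HEAVY.items():
    kabat = diversity_ratio(*row["kabat"])
    germline = diversity_ratio(*row["germline"])
    print(f"{fragment}: Kabat {kabat:.3f}, germ line {germline:.3f}")
```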
Conclusions
Division of sequences into fragments allowed us to perform alignment of all human heavy and human κ chains in the Kabat database. To describe a residue’s location in a chain we used an index that included the name of the fragment and residue’s position number therein. This permitted us to analyze residues at identical positions in different chains.
Statistical analysis of words of 17 fragments was performed. For each fragment we calculated the number of the chains containing this fragment and number of different words. The number of different words varied for each fragment, with the largest being 189 (E fragment) in the human heavy chains and 75 (G fragment) in the κ chains.
Very few keywords, or main patterns, for each fragment were found: 2–5 for the human heavy and 1–2 for the human κ chains. Each keyword served as the basis for constructing a cluster. It was shown that words of more than 90% of all sequences belong to clusters. This result demonstrates that all sequences of the variable domain can be described almost completely (with the exception of complementarity-determining regions) by a very limited number of patterns.
For the human heavy and human κ chains we suggested 50 and 26 keywords, respectively. Variable X positions, which are occupied by residues from different amino acid groups, were found in 19 keywords of the heavy chains and six keywords of the κ chains. Each of these patterns contained only one variable position (with the exception of the BC keyword of the κ chains). The small number of variable X positions demonstrates that residues within each fragment are strongly correlated.
No identical keywords were found in more than one fragment of either human heavy or human κ chains. This observation demonstrates that different fragments of the variable domains have their own individual patterns (no repeat fragments are found).
In keywords of all fragments certain positions were found that are always, in both heavy and κ chains, occupied by residues from the same amino acid group. Residues at these positions constitute the common characteristic of the fragments. The F fragment, for instance, in both chains is characterized by Ala at F1, aromatic residues at F3 and F4, and Cys at F5 position (see Table 2).
However, keywords of aligned fragments of the human heavy and κ chains differed in all but two cases (OA and A′B fragments). Thus, human heavy and κ chains can be distinguished by the patterns of (almost) any one of their fragments. Moreover, because keywords of a fragment are unique, given a small segment of residues it is possible, knowing all the keywords, to determine the exact location of that segment in the κ or heavy chains.
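The localization idea can be illustrated with a lookup from keywords to (chain, fragment) pairs. The sketch below is deliberately partial: it lists only the heavy-chain BC and DE keywords quoted in the text (the DE keyword written as X(SA)KN in the Table 2 notation), matches at the a level only, and converts each keyword to a regular expression. The names `KEYWORDS`, `keyword_to_regex`, and `locate` are hypothetical, as are the test segments.

```python
import re

# Partial keyword table: only keywords quoted in the text are listed.
KEYWORDS = {
    ("heavy", "BC"): ["SG(FY)X", "SGG(ST)", "SGDS"],
    ("heavy", "DE"): ["X(SA)KN"],
}


def keyword_to_regex(keyword):
    """Translate keyword notation to a regex: X -> any residue, (FY) -> [FY]."""
    pattern = keyword.replace("X", ".").replace("(", "[").replace(")", "]")
    return re.compile(pattern)


def locate(segment, keywords=KEYWORDS):
    """Return the (chain, fragment) pairs whose keywords match the segment."""
    hits = []
    for location, patterns in keywords.items():
        if any(keyword_to_regex(p).fullmatch(segment) for p in patterns):
            hits.append(location)
    return hits


print(locate("SGYT"))  # [('heavy', 'BC')]
print(locate("TSKN"))  # [('heavy', 'DE')]
```

A complete version would need all 50 heavy-chain and 26 κ-chain keywords from Table 2, and would probably also accept words of the b and c levels.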
Keywords obtained through an analysis of all chains in the Kabat database were used in the study of human repertoire of VH and Vκ germ-line segments. It was shown that all available words of the germ-line sequences (with the exception of several words of the E and EF fragments) belong to the existing clusters. Furthermore, analysis of the Kabat database did not disclose any superfluous keywords, i.e., at least one word from a germ-line sequence was assigned to each cluster.
Acknowledgments
We are grateful to Drs. C. Chothia, C. Kulikowski, I. Muchnik, and O. Ptitsyn for very helpful discussions. We wish to acknowledge with deep gratitude the support of the Gabriella and Paul Rosenbaum Foundation and also to thank Mrs. M. Goldman for continuous encouragement. A.E.K. is supported by the Gabriella and Paul Rosenbaum Foundation.
ABBREVIATIONS
- VH: variable domain of the heavy chain
- Vκ: variable domain of the κ (kappa) chain
References
- 1. Kabat E A. Adv Protein Chem. 1978;32:1–75.
- 2. Kabat E A, Wu T T, Perry H M, Gottesman K S, Foeller C. Sequences of Proteins of Immunological Interest. 5th Ed. Bethesda: Natl. Inst. Health; 1991. NIH Publ. No. 91–3442.
- 3. Harpaz Y, Chothia C. J Mol Biol. 1994;238:528–539. doi: 10.1006/jmbi.1994.1312.
- 4. Tramontano A, Chothia C, Lesk A. J Mol Biol. 1989;215:175–182. doi: 10.1016/S0022-2836(05)80102-0.
- 5. Chothia C, Boswell D R, Lesk A M. EMBO J. 1988;7:3745–3755. doi: 10.1002/j.1460-2075.1988.tb03258.x.
- 6. Chothia C, Lesk A M. J Mol Biol. 1987;196:901–917. doi: 10.1016/0022-2836(87)90412-8.
- 7. Bork P, Holm L, Sander C. J Mol Biol. 1994;242:309–320. doi: 10.1006/jmbi.1994.1582.
- 8. Beale D, Coadwell J. Int J Biochem. 1989;21:227–232. doi: 10.1016/0020-711x(89)90113-4.
- 9. Padlan E A. Mol Immunol. 1994;31:169–217. doi: 10.1016/0161-5890(94)90001-9.
- 10. Williams A F, Barclay A N. Annu Rev Immunol. 1988;6:381–405. doi: 10.1146/annurev.iy.06.040188.002121.
- 11. Gerstein M, Altman R B. J Mol Biol. 1995;251:161–175. doi: 10.1006/jmbi.1995.0423.
- 12. Gelfand I M, Kister A E. Proc Natl Acad Sci USA. 1995;92:10885–10889. doi: 10.1073/pnas.92.24.10884.
- 13. Gelfand I M, Kister A E, Leshchiner D. Proc Natl Acad Sci USA. 1996;93:3675–3678. doi: 10.1073/pnas.93.8.3675.
- 14. Chothia C, Lesk A M, Gherardi E, Tomlinson I M, Walter G, Marks J D, Llewelyn M B, Winter G. J Mol Biol. 1992;227:799–817. doi: 10.1016/0022-2836(92)90224-8.
- 15. Tomlinson I M, Cox J P L, Gherardi E, Lesk A M, Chothia C. EMBO J. 1995;14:4628–4638. doi: 10.1002/j.1460-2075.1995.tb00142.x.